B+ Tree Data Structure Implementation Project

The project was done using C++ under an object-oriented based approach. The project aims to store and index over a million IMDb rating records under a database management system utilising B+ trees.

Design of storage component

The above diagram highlighted in red shows the brief structure overview of the storage component implemented in this assignment.

The design of this storage component aims to depict how the hard disk would fetch data blocks into the main memory. To mimic this storage component, we have created a storage object that contains multiple block objects.

A vector of shared pointers point to the block object to keep track of them. This vector of shared pointers pointing to the block object will facilitate the leaf tree node’s access to the required blocks data. With reference to the diagram above, the block object will store multiple records objects.

This record object will contain a tuple of fields such as Tconst, avgRating and numVotes.

3.1. Reading and storing data.tsv records

The file data.tsv is read from the main.cpp file using ifstream and .open.

It then reads every record from the file sequentially and inserts it into the B+ tree for the attribute numVotes.

For visibility, it will display every 100,000th record that has completed reading/insertion. As data.tsv has about 1M+ records, this line is expected to be printed 11 times.

After going through the records, the recordNum variable will store the index of the final record (which is also the total number of records) that was read. As recordNum started from 0, the total number of records read would be recordNum -1.

The output is as follows:

3.2. Storage: storage.cpp

The Storage struct is a singleton structure representing the disk storage of a database system.

It holds multiple blocks in memory, wherein each block holds multiple records.

Attributes:

vector<std::shared_ptr<Block>> blocks

Vector of pointers to blocks

Methods:

void addBlock(std::shared_ptr<Block> block)

Adds the pointer of the block to the vector of shared pointers to block

int getNoBlocks()

returns the number of blocks stored in the storage structure

int getNoRecords()

returns the number of records stored in the storage structure

int getStorageSize()

returns the total storage size used

3.3. Data Blocks: block.cpp

In the Block.h file, block is defined as a structure. The block structure will contain the following instances and method:

Attributes:

uint16_t maxRecord

Stores the maximum number of record object stored in the block object

vector<Record> records

Vector of record object that holds the list of records

Methods:

void addRecord(Record record)

Takes in a record object as parameter and adds record object to the vector of record object

void DeleteRecord(float key)

Takes in the key as parameter and delete record object containing the key value in the vector of record object

int getNumRecords()

Returns the number of records in the block

int getBlockSize()

Returns the size of the block’s content

vector<Record> getRecord(float key)

Returns the vector of record object that contains the specified key value

void toString()

Displays the tconst value in the record object contained in the block

bool notFull()

Return a boolean expression to determine if the block is full

3.4. Record: record.cpp

Record is defined as a struct in record.h, with attributes tconst, averageRating and numVotes which correspond to the columns for each movie record in the dataset. Each struct is initialised and used to store the data of a movie record in the database.

Attributes:

string tconst

float averageRating

int numVotes

Methods:

string toString()

Returns the string representation of the record for output formatting.

int getSize

Returns the size of the record in bytes for use when inserting a record in a block.

Design of B+ tree component

4.1. Brief Overview

The above diagram highlighted in red shows the brief design overview of the B+ tree component implemented in this assignment.

The B+ Tree is created by linking multiple Nodes. Contained within each node is a vector containing all the keys stored within the node, as well as a corresponding vector that contains pointers to either other nodes, or to blocks that contain records equal to that of the key value. This depends on whether the node is an internal node or a leaf node respectively.

4.2. Implementation: bplustree.cpp

4.2.1. Node:

Node is a structure used to help maintain tree nodes in the B+ tree

Contains the attributes:

ptrs:

If isLeaf is False, it is a vector of pointers pointing to other Nodes

Otherwise it is a vector of pointers pointing to blocks containing records whose key is the same as the key value in the node

keys: Vector of floats, which will be used to store the key value

isLeaf: Boolean, set to true if the node is a leaf node, otherwise it is false

size: Unsigned integer which states the maximum number of keys a Node is supposed to have

4.2.2. BPlusTree:

Class that helps to maintain the B+ Tree structure

Contains the attributes:

root: shared_ptr that always points to the root of the tree. When a tree is first initialised, root = nullptr

size: Unsigned integer which represents the maximum number of keys in a tree node (Node) for that particular B+ tree

numNodes: Integer which keeps track of the number of nodes the B+ tree currently hass

The BPlusTree class also contains the following methods:

insertKey

Inserts a key into the B+ Tree

insertInternal

Helper function called by insertKey. Purpose is to help link a new child node to its parent node

deleteKey

Deletes a key from B+ Tree

removeInternal

Helper function called by deleteKey. Purpose is to help break the link between parent node and child node that should be deleted

findRange

Takes in 2 arguments, and will look for all the records with key value that lies within the range of the 2 arguments. If both arguments are equal in value, will find records with key equal to that value

displayNode

Prints the contents of the node

printRecordsInNode

Prints the values of all records in the node

printRecordOfKey

Helper function called by printRecordsInNode. Prints the values of a record

getRoot

Returns a pointer pointing to the root node of the B+ Tree

setRoot

Sets the attribute root in the BPlusTree class to the input pointer

getParent

Returns a pointer pointing to the parent node of the input node

getNumNodes

Returns the attribute numNodes

getHeight

Calculates and returns the height of the B+ Tree

printTreeAttributes

Prints the relevant attributes of the B+ Tree

4.2.3. Implementation of insertKey:

Case 1: Root is Null

This means that the B+ Tree is empty

We create a new node, then insert the key into the newly created node

Case 2: Root exists

Traverse the tree down to it’s leaf node by taking the pointer that corresponds to the key (search key < key value in node)

At the leaf node, we check if there is space in the leaf node

Case 2.1: Leaf node is not completely filled

If there is a duplicate key already in the leaf node, we access that key’s block vector, and insert the new key’s block pointer

Otherwise we insert the new key into the node’s key vectors, then insert the new key’s block pointer into the node’s block pointers

Case 2.2: Leaf node is full, we need to split the node

Create a new leaf node, and 2 temporary vectors for the old leaf node’s keys and block pointers

We check the temporary vector if the key we want to insert already exists

If it exists, we access the old node’s block vector and insert the new key’s block pointer

Otherwise we insert the new key into the temporary key vector, then insert the new key’s block pointer into the temporary block vector

We then split the node into two nodes

Update the parent node’s keys using a recursive function all the way up to root if needed

4.2.4. Implementation of deleteKey:

Case 1: Root is Null

This means that the B+ Tree is empty, return to main function

Case 2: Root exists

Traverse the tree down to it’s leaf node by taking the pointer that corresponds to the key (search key < key value in node)

At the leaf node, we check if there exists a key equals to the search key

Case 2.1: Key does not exist in node

Returns to main function

Case 2.2: Key found

Delete the vector of block pointers that corresponds to the key

Delete the key from the node’s key vectors

Once the deletion is completed, we then check if the node that the key was deleted from is the root node. If it is, we delete the root node

Otherwise we balance the tree, making sure the number of keys in each node is valid by borrowing and moving keys around the tree

This is done by using a recursive function till root (root need can have lesser than minimum number of keys)

4.2.5. Implementation of findRange:

Case 1: begin and end arguments is equal

Searching for records with key equal to value of begin

Case 2: start and end arguments are not equal and begin < end

Searching for records with key in the range of begin < key < end

We first start by traversing down the B+ Tree to the leaf nodes by taking the pointer that corresponds to the key (begin value < key value in node)

Once at leaf node, we scan through the leaf nodes until we find key equals to the value of end

While traversing, we count the number of leafNodes accessed, as well as the number of internalNodes accessed

Returns a vector of block pointers accessed whilst traversing the B+ Tree to find the records

Experiment Results for Block Size: 200B

Experiment 1 (Block Size = 200B)

Number of Blocks

214064

Total Disk Size Used (Size of database)

42 MB

Experiment 2 (Block Size = 200B)

Parameter n of B+ tree

15

Number of nodes in B+ tree

1516

Height of B+ tree (number of levels)

3

Content of root

951 - 2580 - 3951 - 5184 - 6418 - 8356 - 10716 - 14187 - 18473 - 29297 - 50486 - 119615

Content of first child node

96 - 180 - 262 - 356 - 466 - 616 - 740 - 849

Experiment 3 (Block Size = 200B)

Number of index nodes accessed

3

Number of data blocks accessed

110

Average of avgRating

6.73182

Content of first 5 index nodes

951 - 2580 - 3951 - 5184 - 6418 - 8356 - 10716 - 14187 - 18473 - 29297 - 50486 - 119615

96 - 180 - 262 - 356 - 466 - 616 - 740 - 849

479 - 490 - 499 - 507 - 514 - 524 - 533 - 541 - 548 - 559 - 569 - 583 - 594 - 604

499 - 500 - 501 - 502 - 503 - 504 - 505 - 506

Content of first 5 data blocks

| tt0013674 | tt0013679 | tt0013681 | tt0013682 | tt0013687 |

| tt0024558 | tt0024559 | tt0024560 | tt0024561 | tt0024562 |

| tt0028276 | tt0028277 | tt0028278 | tt0028279 | tt0028280 |

| tt0041952 | tt0041953 | tt0041954 | tt0041955 | tt0041956 |

| tt0047360 | tt0047361 | tt0047362 | tt0047363 | tt0047364 |

Experiment 4 (Block Size = 200B)

Number of index nodes accessed

3

Number of data blocks accessed

953

Average of avgRating

6.72791

Content of first 5 index nodes

951 - 2580 - 3951 - 5184 - 6418 - 8356 - 10716 - 14187 - 18473 - 29297 - 50486 - 119615

30446 - 31333 - 32807 - 34091 - 35272 - 36444 - 37530 - 38968 - 40100 - 41498 - 42764 - 44444 - 46888 - 48648

29353 - 29587 - 29620 - 29730 - 29807 - 29848 - 29975 - 30056 - 30158 - 30246 - 30354

29975 - 29978 - 29982 - 29988 - 29996 - 30022 - 30034 - 30037 - 30041 - 30049 - 30053

30056 - 30078 - 30081 - 30085 - 30090 - 30136 - 30144 - 30149

Content of first 5 data blocks

| tt0054164 | tt0054165 | tt0054166 | tt0054167 | tt0054168 |

| tt0026776 | tt0026777 | tt0026778 | tt0026779 | tt0026781 |

| tt0091824 | tt0091825 | tt0091826 | tt0091827 | tt0091828 |

| tt3361726 | tt3361740 | tt3361784 | tt3361786 | tt3361792 |

| tt1456941 | tt1456944 | tt1456946 | tt1456947 | tt1456948 |

Experiment 5 (Block Size = 200B)

Number of times that a node is deleted or merged

0

Number of nodes in updated B+ tree

1516

Height of updated B+ tree

3

Content of root node

951 - 2580 - 3951 - 5184 - 6418 - 8356 - 10716 - 14187 - 18473 - 29297 - 50486 - 119615

Content of 1st child node

96 - 180 - 262 - 356 - 466 - 616 - 740 - 849

Experiment Results for Block Size: 500B

Experiment 1 (Block Size = 500B)

Number of Blocks

89194

Total Disk Size Used (Size of database)

44MB

Experiment 2 (Block Size = 500B)

Parameter n of B+ tree

40

Number of nodes in B+ tree

514

Height of B+ tree (number of levels)

2

Content of root

746 - 1314 - 1972 - 3022 - 3799 - 4579 - 5547 - 6670 - 8221 - 10287 - 12563 - 15256 - 19538 - 23605 - 29165 - 35411 - 43720 - 54728 - 80231 - 124761 - 236325

Content of first child node

25 - 59 - 89 - 120 - 142 - 163 - 196 - 227 - 247 - 267 - 288 - 312 - 338 - 362 - 396 - 430 - 451 - 474 - 496 - 517 - 541 - 563 - 586 - 610 - 646 - 667 - 687 - 715

Experiment 3 (Block Size = 500B)

Number of index nodes accessed

2

Number of data blocks accessed

110

Average of avgRating

6.6.73182

Content of first 5 index nodes

746 - 1314 - 1972 - 3022 - 3799 - 4579 - 5547 - 6670 - 8221 - 10287 - 12563 - 15256 - 19538 - 23605 - 29165 - 35411 - 43720 - 54728 - 80231 - 124761 - 236325

25 - 59 - 89 - 120 - 142 - 163 - 196 - 227 - 247 - 267 - 288 - 312 - 338 - 362 - 396 - 430 - 451 - 474 - 496 - 517 - 541 - 563 - 586 - 610 - 646 - 667 - 687 - 715

496 - 497 - 498 - 499 - 500 - 501 - 502 - 503 - 504 - 505 - 506 - 507 - 508 - 509 - 510 - 511 - 512 - 513 - 514 - 515 - 516

Content of first 5 data blocks

| tt0013627 | tt0013629 | tt0013631 | tt0013658 | tt0013662 | tt0013668 | tt0013672 | tt0013674 | tt0013679 | tt0013681 | tt0013682 | tt0013687 |

| tt0024553 | tt0024554 | tt0024555 | tt0024558 | tt0024559 | tt0024560 | tt0024561 | tt0024562 | tt0024563 | tt0024564 | tt0024567 | tt0024568 |

| tt0028271 | tt0028272 | tt0028273 | tt0028274 | tt0028275 | tt0028276 | tt0028277 | tt0028278 | tt0028279 | tt0028280 | tt0028281 | tt0028282 |

| tt0041953 | tt0041954 | tt0041955 | tt0041956 | tt0041957 | tt0041958 | tt0041959 | tt0041961 | tt0041962 | tt0041963 | tt0041966 | tt0041967 |

| tt0047355 | tt0047356 | tt0047357 | tt0047358 | tt0047359 | tt0047360 | tt0047361 | tt0047362 | tt0047363 | tt0047364 | tt0047365 | tt0047366 |

Experiment 4 (Block Size = 500B)

Number of index nodes accessed

2

Number of data blocks accessed

953

Average of avgRating

6.72791

Content of first 5 index nodes

746 - 1314 - 1972 - 3022 - 3799 - 4579 - 5547 - 6670 - 8221 - 10287 - 12563 - 15256 - 19538 - 23605 - 29165 - 35411 - 43720 - 54728 - 80231 - 124761 - 236325

29587 - 29814 - 30034 - 30259 - 30492 - 30669 - 30833 - 31017 - 31260 - 31459 - 31674 - 31928 - 32124 - 32511 - 32859 - 33082 - 33357 - 33594 - 33850 - 34484 - 34660 - 34910 - 35163

29814 - 29818 - 29819 - 29823 - 29824 - 29828 - 29834 - 29848 - 29861 - 29869 - 29876 - 29880 - 29882 - 29900 - 29910 - 29919 - 29949 - 29954 - 29956 - 29959 - 29962 - 29974 - 29975 - 29978 - 29982 - 29988 - 29996 - 30022

30034 - 30037 - 30041 - 30049 - 30053 - 30056 - 30078 - 30081 - 30085 - 30090 - 30136 - 30144 - 30149 - 30158 - 30168 - 30175 - 30177 - 30195 - 30206 - 30221 - 30240 - 30246 - 30247 - 30248 - 30254

30259 - 30262 - 30275 - 30319 - 30326 - 30333 - 30341 - 30354 - 30361 - 30370 - 30376 - 30391 - 30395 - 30402 - 30418 - 30423 - 30431 - 30446 - 30453 - 30456 - 30457 - 30458 - 30462 - 30468

Content of first 5 data blocks

| tt0054162 | tt0054164 | tt0054165 | tt0054166 | tt0054167 | tt0054168 | tt0054169 | tt0054170 | tt0054171 | tt0054172 | tt0054173 | tt0054174 |

| tt0026769 | tt0026771 | tt0026772 | tt0026773 | tt0026774 | tt0026775 | tt0026776 | tt0026777 | tt0026778 | tt0026779 | tt0026781 | tt0026783 |

| tt0091825 | tt0091826 | tt0091827 | tt0091828 | tt0091829 | tt0091830 | tt0091831 | tt0091832 | tt0091833 | tt0091834 | tt0091835 | tt0091836 |

| tt3361702 | tt3361726 | tt3361740 | tt3361784 | tt3361786 | tt3361792 | tt3361794 | tt3361812 | tt3361814 | tt3361834 | tt3361856 | tt3361874 |

| tt1456915 | tt1456931 | tt1456937 | tt1456939 | tt1456941 | tt1456944 | tt1456946 | tt1456947 | tt1456948 | tt1456949 | tt1456950 | tt1456953 |

Experiment 5 (Block Size = 500B)

Number of times that a node is deleted or merged

0

Number of nodes in updated B+ tree

514

Height of updated B+ tree

2

Content of root node

746 - 1314 - 1972 - 3022 - 3799 - 4579 - 5547 - 6670 - 8221 - 10287 - 12563 - 15256 - 19538 - 23605 - 29165 - 35411 - 43720 - 54728 - 80231 - 124761 - 236325

Content of 1st child node

25 - 59 - 89 - 120 - 142 - 163 - 196 - 227 - 247 - 267 - 288 - 312 - 338 - 362 - 396 - 430 - 451 - 474 - 496 - 517 - 541 - 563 - 586 - 610 - 646 - 667 - 687 - 715



Back to Projects