Design of storage component
The above diagram highlighted in red shows the brief structure overview of the storage component implemented in this assignment.
The design of this storage component aims to depict how the hard disk would fetch data blocks into the main memory. To mimic this storage component, we have created a storage object that contains multiple block objects.
A vector of shared pointers point to the block object to keep track of them. This vector of shared pointers pointing to the block object will facilitate the leaf tree node’s access to the required blocks data. With reference to the diagram above, the block object will store multiple records objects.
This record object will contain a tuple of fields such as Tconst, avgRating and numVotes.
3.1. Reading and storing data.tsv records
The file data.tsv is read from the main.cpp file using ifstream and .open.
It then reads every record from the file sequentially and inserts it into the B+ tree for the attribute numVotes.
For visibility, it will display every 100,000th record that has completed reading/insertion. As data.tsv has about 1M+ records, this line is expected to be printed 11 times.
After going through the records, the recordNum variable will store the index of the final record (which is also the total number of records) that was read. As recordNum started from 0, the total number of records read would be recordNum -1.
The output is as follows:
3.2. Storage: storage.cpp
The Storage struct is a singleton structure representing the disk storage of a database system.
It holds multiple blocks in memory, wherein each block holds multiple records.
Attributes:
vector<std::shared_ptr<Block>> blocks
Vector of pointers to blocks
Methods:
void addBlock(std::shared_ptr<Block> block)
Adds the pointer of the block to the vector of shared pointers to block
int getNoBlocks()
returns the number of blocks stored in the storage structure
int getNoRecords()
returns the number of records stored in the storage structure
int getStorageSize()
returns the total storage size used
3.3. Data Blocks: block.cpp
In the Block.h file, block is defined as a structure. The block structure will contain the following instances and method:
Attributes:
uint16_t maxRecord
Stores the maximum number of record object stored in the block object
vector<Record> records
Vector of record object that holds the list of records
Methods:
void addRecord(Record record)
Takes in a record object as parameter and adds record object to the vector of record object
void DeleteRecord(float key)
Takes in the key as parameter and delete record object containing the key value in the vector of record object
int getNumRecords()
Returns the number of records in the block
int getBlockSize()
Returns the size of the block’s content
vector<Record> getRecord(float key)
Returns the vector of record object that contains the specified key value
void toString()
Displays the tconst value in the record object contained in the block
bool notFull()
Return a boolean expression to determine if the block is full
3.4. Record: record.cpp
Record is defined as a struct in record.h, with attributes tconst, averageRating and numVotes which correspond to the columns for each movie record in the dataset. Each struct is initialised and used to store the data of a movie record in the database.
Attributes:
string tconst
float averageRating
int numVotes
Methods:
string toString()
Returns the string representation of the record for output formatting.
int getSize
Returns the size of the record in bytes for use when inserting a record in a block.
Design of B+ tree component
4.1. Brief Overview
The above diagram highlighted in red shows the brief design overview of the B+ tree component implemented in this assignment.
The B+ Tree is created by linking multiple Nodes. Contained within each node is a vector containing all the keys stored within the node, as well as a corresponding vector that contains pointers to either other nodes, or to blocks that contain records equal to that of the key value. This depends on whether the node is an internal node or a leaf node respectively.
4.2. Implementation: bplustree.cpp
4.2.1. Node:
Node is a structure used to help maintain tree nodes in the B+ tree
Contains the attributes:
ptrs:
If isLeaf is False, it is a vector of pointers pointing to other Nodes
Otherwise it is a vector of pointers pointing to blocks containing records whose key is the same as the key value in the node
keys: Vector of floats, which will be used to store the key value
isLeaf: Boolean, set to true if the node is a leaf node, otherwise it is false
size: Unsigned integer which states the maximum number of keys a Node is supposed to have
4.2.2. BPlusTree:
Class that helps to maintain the B+ Tree structure
Contains the attributes:
root: shared_ptr that always points to the root of the tree. When a tree is first initialised, root = nullptr
size: Unsigned integer which represents the maximum number of keys in a tree node (Node) for that particular B+ tree
numNodes: Integer which keeps track of the number of nodes the B+ tree currently hass
The BPlusTree class also contains the following methods:
insertKey
Inserts a key into the B+ Tree
insertInternal
Helper function called by insertKey. Purpose is to help link a new child node to its parent node
deleteKey
Deletes a key from B+ Tree
removeInternal
Helper function called by deleteKey. Purpose is to help break the link between parent node and child node that should be deleted
findRange
Takes in 2 arguments, and will look for all the records with key value that lies within the range of the 2 arguments. If both arguments are equal in value, will find records with key equal to that value
displayNode
Prints the contents of the node
printRecordsInNode
Prints the values of all records in the node
printRecordOfKey
Helper function called by printRecordsInNode. Prints the values of a record
getRoot
Returns a pointer pointing to the root node of the B+ Tree
setRoot
Sets the attribute root in the BPlusTree class to the input pointer
getParent
Returns a pointer pointing to the parent node of the input node
getNumNodes
Returns the attribute numNodes
getHeight
Calculates and returns the height of the B+ Tree
printTreeAttributes
Prints the relevant attributes of the B+ Tree
4.2.3. Implementation of insertKey:
Case 1: Root is Null
This means that the B+ Tree is empty
We create a new node, then insert the key into the newly created node
Case 2: Root exists
Traverse the tree down to it’s leaf node by taking the pointer that corresponds to the key (search key < key value in node)
At the leaf node, we check if there is space in the leaf node
Case 2.1: Leaf node is not completely filled
If there is a duplicate key already in the leaf node, we access that key’s block vector, and insert the new key’s block pointer
Otherwise we insert the new key into the node’s key vectors, then insert the new key’s block pointer into the node’s block pointers
Case 2.2: Leaf node is full, we need to split the node
Create a new leaf node, and 2 temporary vectors for the old leaf node’s keys and block pointers
We check the temporary vector if the key we want to insert already exists
If it exists, we access the old node’s block vector and insert the new key’s block pointer
Otherwise we insert the new key into the temporary key vector, then insert the new key’s block pointer into the temporary block vector
We then split the node into two nodes
Update the parent node’s keys using a recursive function all the way up to root if needed
4.2.4. Implementation of deleteKey:
Case 1: Root is Null
This means that the B+ Tree is empty, return to main function
Case 2: Root exists
Traverse the tree down to it’s leaf node by taking the pointer that corresponds to the key (search key < key value in node)
At the leaf node, we check if there exists a key equals to the search key
Case 2.1: Key does not exist in node
Returns to main function
Case 2.2: Key found
Delete the vector of block pointers that corresponds to the key
Delete the key from the node’s key vectors
Once the deletion is completed, we then check if the node that the key was deleted from is the root node. If it is, we delete the root node
Otherwise we balance the tree, making sure the number of keys in each node is valid by borrowing and moving keys around the tree
This is done by using a recursive function till root (root need can have lesser than minimum number of keys)
4.2.5. Implementation of findRange:
Case 1: begin and end arguments is equal
Searching for records with key equal to value of begin
Case 2: start and end arguments are not equal and begin < end
Searching for records with key in the range of begin < key < end
We first start by traversing down the B+ Tree to the leaf nodes by taking the pointer that corresponds to the key (begin value < key value in node)
Once at leaf node, we scan through the leaf nodes until we find key equals to the value of end
While traversing, we count the number of leafNodes accessed, as well as the number of internalNodes accessed
Returns a vector of block pointers accessed whilst traversing the B+ Tree to find the records