Database Query Plan Interfacing Project

Real-world users may write SQL queries to search relational databases for different tasks. The RDBMS query optimizer will execute a query execution plan (QEP) to process each query, which is chosen from a large number of alternative query plans (AQP). Typically, the plan selected as the QEP is estimated to have least/lower cost than other AQP. One can retrieve and view these plans in visual (tree-structured view) or text format (e.g., json, XML) in an off-the-shelf DBMS software (e.g., PostgreSQL). In particular, PostgreSQL allows one to retrieve AQPs containing specific operators using the planner method configuration (https://www.postgresql.org/docs/9.2/runtime-configquery.html#RUNTIME-CONFIG-QUERY-CONSTANTS). Unfortunately, an SQL query and its query plan-related information are disconnected. In this project, the broad goal is to integrate them by retrieving relevant information from a QEP and AQP to annotate the corresponding SQL query to explain how different components of the query are executed by the underlying query processor and why the operators are chosen among other alternatives. Note that to explain the why question, it is important to retrieve representative AQPs associated with a SQL query.

Approach

1.1 Data Pre-processing & Database Setup

.1.1 Data Pre-processing

In order to make the data given to us usable for the database, we used the file preprocessing.py, which cleans the data for us, making it usable for the queries.

The .tbl files of the TPC-H dataset are converted to .csv files for easy insertion into the PostgreSQL database, this is performed by the tbl_to_csv function within preprocessing.py. The last character of each line is removed in order to parse the database correctly into .csv format. The .csv data is then inserted into the PostgreSQL database through COPY commands from a SQL script.

.1.2 Database Setup

If you already have an existing PostgreSQL database set up and want to connect it to this application, replace the parameters (host, database, user, password, port) in the get_json method inside interface.py to match the database parameters of your local database.

Figure 1 - get_json method for Database Connection

Alternatively, you can follow the instructions (db/README.md or 6.3 - Setting up the TPC-H database) to set up the TPC-H dataset and PostgreSQL database using Docker without any additional configurations.

.2 Obtaining Query Execution Plan (QEP)

We obtain the QEP from PostgreSQL 2 commands, EXPLAIN and ANALYZE. Running “EXPLAIN ANALYZE” will return the QEP in a JSON format which we can then use for further processing.

The EXPLAIN command will show the execution plan of a statement.

The ANALYZE command will carry out the command and show actual run times and other statistics.

Design of annotation.py

2.1 Functions:

generateTree(plans, parent=None)

takes in the json file of the input query

Loops through each plan in the json file and recursively retrieves all the child nodes or a previous node

First plan is the root of the tree

That is why the function is first called without parent argument

Recursive function calls has parent in its argument as the previous node

searchTree(token, root)

Does a search through built-in pre-order iteration of the tree

Used in the buildRelation() function

searchQuery(search_value, tokens, formatted_query)

Does a full text search on query

Returns a list of index tuple of matched query tokens or None if no token matched

Used in buildRelation() function

tokenizeQuery(query_formatted)

Takes in a formatted query obtained from the the sqlparse result of the formatted JSON from the EXPLAIN ANALYZE output

Returns dictionary of tokens to be used in the buildRelation function

buildRelation(query, tree)

Matches node in the tree to tokens from the tokenized query

Returns a dictionary which is used in interface.py to highlight nodes when they are hovered

Execution Process (GUI, project.py & interface.py)

3.1 Graphical User Interface:

The graphical interface is built using Tkinter from the Python standard library. The user

interface is built to be as intuitive as possible with minimal functionalities so as to not

confuse the user.

There are two main interfaces in this project, the query input interface and the results interface, stored in the main function of project.py and Frames functions of interface.py respectively.

3.2 Query Input interface (project.py)

Upon running the program, the user will first see the Welcome Screen (Figure 2), with a space to enter a SQL query input for analyzing.

Figure 2 - “Welcome Screen”

This is achieved using various tkinter functions, to control the fonts (family, size, weight, color) and background/border colors. We also controlled the default size of the interface, directions the user can scroll, as well as placeholder input, the sizes/arrangement, etc. To improve the dynamicity of the Execute button, we added a custom-written Shadow function that highlights the button when toggled or pressed. The implementation can be seen in Figure 3.

Figure 3 - Code for “Welcome Screen”

3.3 Results Interface (interface.py)

After the user submits a query, a tree-structured view is generated. The tree is generated using the json file retrieved from the database. The root of the tree is the first plan. The function “generateTree” is called recursively, retrieving the child nodes of the current parent node. Hovering over certain nodes causes the relevant components to the query to be highlighted. Clicking on the nodes will show details regarding that particular action including the total time taken so far. Figure 4 shows an example query, both the input and result.

Figure 4 - “Results Interface” (Highlights and details are that of SORT)

Limitations

The project works by obtaining the query plans from the EXPLAIN and ANALYZE commands provided by PostgreSQL. As a query can have many different query plans, it may be the case that the obtained query plans may not be the most optimal ones possible. In query processing, there are two steps, first, the selection of the optimal query plan, and the execution of the query. Though our software may not be selecting the most optimal query plan, it is able to select and execute a query plan in a reasonable amount of time, we feel that this tradeoff is agreeable as there may be many permutations of a query plan, and selecting the most optimal of them all may not be the most efficient as it can actually increase the time for query execution.



Back to Projects