ECS 165B Spring 2011 - DavisDB Project

ECS 165B Spring 2011 - Database System Implementation

DavisDB Part 4: Query Engine Component (due Sunday, 5/29 at 11:59pm)

Introduction

The fourth and final component of DavisDB that you will implement is the Query Engine component. This component implements support for executing query and update commands expressed in a fragment of SQL that includes basic select-from-where queries, as well as insert, delete, and update commands. As in Part 3, these capabilities are exposed via the provided front-end command line shell interface implemented by SystemParser. Your task in this part of the project is to implement the backing functionality, which is contained in a single class called QueryEngine. The implementation of this component will use classes and methods from the earlier parts of the project: RecordManager, IndexManager, and SystemManager. Below, we discuss the syntax of the new commands, along with the corresponding QueryEngine methods.

Additions to the Command-Line Shell

The capabilities you will implement in QueryEngine are exposed via command-line shell, implemented in the provided SystemParser. An updated version of SystemParser supporting the query engine commands described below will be added to your repositories, along with several other files mentioned in the the logistics section. Invoking the "help;" command on the updated version of the parser displays usage information incorporating the four new commands:

select <attrName>, ... , <attrName>
       from <relName>, ..., <relName>
       [ where <attrName> <cmpOp> <attrOrValue> and ... 
         and <attrName> <cmpOp> <attrOrValue> ] ;
insert into <relName> values (<value>, ..., <value>) ;
delete from <relName> [ where <attrName> <cmpOp> <attrOrValue> and
                       ... and <attrName> <cmpOp> <attrOrValue>] ;
update <relName> set <attrName> = <attrOrValue>
       [ where <attrName> <cmpOp> <attrOrValue> and ... 
         and <attrName> <cmpOp> <attrOrValue> ] ;

The parser will invoke corresponding methods on QueryEngine to actually carry out the commands. Let's examine each of the four commands in detail.

The Select Command

The syntax of the one data retrieval command in DavisDB is:

  select <attrName>, ... , <attrName>
  from <relName>, ..., <relName>
[ where <attrName> <cmpOp> <attrOrValue> and ... and <attrName> <cmpOp> <attrOrValue> ] ;

where:

Each <attrName> in the select clause is fully-qualified, i.e., it includes the relation name. (Hence you have to say, e.g., select R.A from ... rather than just select A from ....)
The list of relation names in the from clause does not contain any duplicates, as self-joins are not supported. (This must be checked by your QueryEngine implementation, see below.)
The where clause is optional, and each term in the where clause is of the form <attrName> <cmpOp> <attrOrValue> where
- <attrName> is a fully-qualified attribute name,
- <cmpOp> is one of =, <, >, <=, >=, or <>, and
- <attrOrValue> is either a fully-qualified attribute name or a constant value. Constant values may be either integers, floats (which must contain a decimal point to be recognized as floats by the parser), or strings enclosed in quotation marks.

The command has the standard SQL interpretation as described in lecture and presented in your textbook. When a select statement is entered, SystemParser will invoke QueryEngine::select to execute the query and print the query results to the console using SystemPrinter. See the Doxygen docs for QueryEngine for more details, and the suggestions section below for some guidelines and tips about how to implement this method.

The Insert Command

Tuples may be inserted into a relation one at a time (as opposed to bulk loading) using the following command:

insert into <relName> values (<value>, ..., <value>);

This command inserts into relation <relName> a new tuple with the specified attribute values. As in bulk loading, attribute values must appear in the same order as in the create table command that was executed for relation <relName>. Values are specified in the same way as in load files and the select command. String values can be of any length up to the length specified for the corresponding attribute. When an insert command is entered, SystemParser will invoke QueryEngine::insert to carry out the command.

The Delete Command

Tuples may be deleted from relations using the following command:

delete from <relName> [ where <attrName> <cmpOp> <attrOrValue> and
                       ... and <attrName> <cmpOp> <attrOrValue>] ;

This command deletes from relation <relName> all tuples satisfying the where clause. As in the select command, the conditions in the where clause compare attributes to constant values or to other attributes. In the where clause, all attribute names must be fully-qualified. Comparison operators and specification of values are the same as in the select command. If the where clause is omitted, all tuples in <relName> are deleted. When a delete command is entered, SystemParser will invoke QueryEngine::remove to carry out the command.

The Update Command

Tuples in relations may be updated using the following command:

update <relName> set <attrName> = <attrOrValue>
       [ where <attrName> <cmpOp> <attrOrValue> and ... 
         and <attrName> <cmpOp> <attrOrValue> ] ;

This command updates in relation <relName> all tuples satisfying the where clause. The where clause is exactly as in the select and delete commands described above. <attrOrValue> on the right-hand side of the equality operator is either an attribute or a constant value. An attribute may be specified as a fully-qualified attribute name, or a constant value is specified as in the select command. Each updated tuple is modified so that its attribute <attrName>gets the value <attrOrValue> (if <attrOrValue> is a constant), or the value of the updated tuple's <attrOrValue> attribute (if <attrOrValue> is an attribute). If the where clause is omitted, all tuples in <relName> are updated. When an update command is entered, SystemParser will invoke QueryEngine::update to carry out the command.

For a fuller description of each of the QueryEngine methods listed above, please read the online documentation carefully. As in Parts 1-3, you must not change any of the definitions of the public methods we have provided, as this will break our automated tests. However, you are free to add new methods to the classes (public or private) as you see fit.

Implementation Guidelines and Suggestions

Basic requirements. Your implementation of QueryEngine::select should build a physical query plan as discussed in lecture, then execute the plan. We strongly suggest you use a pipelined evaluation model, although you may use a temporary relation approach instead of you have valid reasons for doing so.

Regardless of other design decisons you make, we would like your query execution strategy to be able to take advantage of indexes whenever it is straightforward/sensible to do so. In particular:

Suppose the query includes a local selection condition of the form R.A=v, where v is a constant value. If relation R has an index on attribute A, then an IndexScan together with calls to RecordFileHandle::getRecord can be used to fetch the tuples of R, rather than using a RecordFileScan.
You may use a simple nested-loops algorithm for each join or cross-product. Suppose you are joining R with S based on attribute equality, suppose S has no local selection condition, and suppose S has an index on its join attribute A. Then you can make S the inner relation of the nested-loop join and use an index scan on S instead of a file scan. (You need not, however, attempt to reorder the relations in the query in order to make such index joins possible.)

Although the execution strategies for delete and update commands generally are simpler than for select commands, you should still build physical query plans for delete. You will probably find that some physical operators from your implementation of select can be reused. Query plans are not required for insert commands.

Printing query plans. Your system should provide a way to pretty-print physical query plans for select, delete, and update commands before the command is executed. You will probably want to render tree-structured plans using some form of indentation. Query plans should be printed only when the value of parameter queryplans is set to on. This can be controlled via another addition to SystemParser in the form of a generic command

set <param> = <value>;

The current value of a given parameter can be retrieved by calling SystemParser::getParam. For full credit in this component of the project, the TA must be able to easily understand your query execution strategy for any select, delete, or update command based upon the pretty-printed plan information. But again, please do not print plans unless the user has specified set queryplans=on;.

IQueryOperator and IScanOperator We are providing an architectural template for a query evaluation engine implementation in the form of a generic C++ interface for query operators, IQueryOperator, along with an extension IScanOperator intended for use with record file and index scans. We are also providing sample implementations of two operators, ProjectOperator and FileScanOperator. The skeleton code you will receive for QueryEngine demonstrates how to construct and execute pipelined execution plans based on this interface for projection-only queries. The use of IQueryOperator and/or IScanOperator is not required, and you are free to use or ignore these files and sample code (or rewrite them to suit your needs). Note that IScanOperator is provided in order to support a generic approach to implemented (index) nested loops joins, such that the join operator itself does not know or care whether an index is being used. See the notes for Lecture 22 and Lecture 23 for more information.

Extra Credit Opportunities

We are offering several opportunities for extra credit in this part of the project. It is recommended that you attempt to tackle these only after getting the basic project functionality working.

Small amount of extra credit. Improve on the "canonical execution plans" methodology described above by figuring out some heuristics and using them to produce more efficient plans for queries and updates. You must document what you have done in the writeup.
Large amount of extra credit. Implement a full-blown System R-style query optimizer, as described in lecture and in the textbook. There will be a variable amount of credit given for this option, depending on how much you do. This will involve considering different join orders and performing some sort of cost estimation to compare execution plans, using a dynamic programming strategy. Cost estimation will require at a minimum maintaining and using cardinality statistics for each relation.

Logistics and Testing

You will continue working on the same codebase used in Parts 1-3 of the project, using the same tools (cmake, suberversion, submit.sh, etc) described in the Record Manager logistics section. The new files needed for this part of the project - updated versions of SystemParser.{cpp|h}, the query engine class definition QueryEngine.h and its skeleton implementation QueryEngine.cpp, a sample script Queries.ddb, and the starter query evaluation engine implementation including files IQueryOperator.h, IScanOperator.h, ProjectOperator.{cpp|h}, and FileScanOperator.{cpp|h> - will be placed in your team's subversion repository. As usual, you will need to update your CMakeLists.txt to reflect the addition of these new files.

In addition to the source files, you will be provided with a sample DavisDB "script", Queries.ddb, that exercises the basic functionality of the query engine. This can be "run" using

./DavisDB < Queries.ddb

The script uses the sample data files from Part 3 (Sailors.csv, Boats.csv, and Reserves.csv). We will also provide more extensive tests (which will also be used for the DavisDB I/O contest) in a later update; stay tuned.

As usual, you will be expected to also create additional tests yourself, emphasizing corner cases and error handling.

Documentation

Again as in the earlier assignments, you will produce a short (1-2 page maximum), high-level description of your design and testing strategy in writeup.txt. Recall that the template file also asks you to answer two other questions.

Before you begin coding, please answer the first question in the text file, which asks you to estimate the total number of person-hours you think will be needed to implement your solution (including testing).

Before submitting your code using submit.sh 4, please answer the the second question, which asks how many person-hours of effort were actually required. Also, if you tackled any of the extra credit opportunities mentioned above, don't forget to describe what you did in the writeup. Be sure to give sufficient detail so that we can figure out how much credit to give you.

Submission and Grading

As stated on the main project page, the standard formula for grading project components is 80% for correctness (as verified manually and by automated testing), and 20% for the writeup and for code design and readability. For uniformity, please make sure you use SystemPrinter to print the contents of tables, as specified in the documentation for SystemManager.

Also, please remove or disable any printf debugging statements in your code before submitting, as these interfere with interpretation of the shell output. (At this point, you should be using the debugger nearly all of the time anyway!)

Acknowledgments

Major aspects of the DavisDB project are derived from the RedBase project developed by Jennifer Widom for use in CS 346 at Stanford, and used here with her permission.