ECS 165B Spring 2011 - Database System Implementation
DavisDB Part 4: Query Engine Component (due Sunday, 5/29 at 11:59pm)
Introduction
The fourth and final component of DavisDB that you will implement is
the Query Engine component. This component implements support
for executing query and update commands expressed in a fragment of SQL
that includes basic select-from-where queries, as well
as insert, delete, and update commands. As
in Part 3, these capabilities are exposed via the provided front-end command line
shell interface implemented by
SystemParser. Your
task in this part of the project is to implement the backing
functionality, which is contained in a single class
called QueryEngine.
The implementation of this component will use classes and methods from
the earlier parts of the project:
RecordManager,
IndexManager,
and SystemManager. Below, we
discuss the syntax of the new commands, along with the corresponding
QueryEngine methods.
Additions to the Command-Line Shell
The capabilities you will implement
in QueryEngine are
exposed via command-line shell, implemented in the
provided SystemParser.
An updated version
of SystemParser
supporting the query engine commands described below will be added to
your repositories, along with several other files mentioned in the
the logistics section. Invoking the
"help;" command on the updated version of the parser displays
usage information incorporating the four new commands:
select <attrName>, ... , <attrName>
from <relName>, ..., <relName>
[ where <attrName> <cmpOp> <attrOrValue> and ...
and <attrName> <cmpOp> <attrOrValue> ] ;
insert into <relName> values (<value>, ..., <value>) ;
delete from <relName> [ where <attrName> <cmpOp> <attrOrValue> and
... and <attrName> <cmpOp> <attrOrValue>] ;
update <relName> set <attrName> = <attrOrValue>
[ where <attrName> <cmpOp> <attrOrValue> and ...
and <attrName> <cmpOp> <attrOrValue> ] ;
The parser will invoke corresponding methods
on QueryEngine to
actually carry out the commands. Let's examine each of the four
commands in detail.
The Select Command
The syntax of the one data retrieval command in DavisDB is:
select <attrName>, ... , <attrName>
from <relName>, ..., <relName>
[ where <attrName> <cmpOp> <attrOrValue> and ... and <attrName> <cmpOp> <attrOrValue> ] ;
where:
- Each <attrName> in the select clause
is fully-qualified, i.e., it includes the relation name.
(Hence you have to say, e.g., select R.A from ... rather than just
select A from ....)
- The list of relation names in the from clause does not
contain any duplicates, as self-joins are not supported. (This must
be checked by your QueryEngine implementation, see below.)
- The where clause is optional, and each term in
the where clause is of the form <attrName>
<cmpOp> <attrOrValue> where
- <attrName> is a fully-qualified attribute
name,
- <cmpOp> is one
of =, <, >, <=, >=,
or <>, and
- <attrOrValue> is either a fully-qualified
attribute name or a constant value. Constant values may be either
integers, floats (which must contain a decimal point to be
recognized as floats by the parser), or strings enclosed in
quotation marks.
The command has the standard SQL interpretation as described in
lecture and presented in your textbook. When a select
statement is
entered, SystemParser
will
invoke QueryEngine::select
to execute the query and print the query results to the console
using SystemPrinter.
See the Doxygen docs
for QueryEngine for
more details, and the suggestions section
below for some guidelines and tips about how to implement this method.
The Insert Command
Tuples may be inserted into a relation one at a time (as opposed to
bulk loading) using the following command:
insert into <relName> values (<value>, ..., <value>);
This command inserts into relation <relName> a new
tuple with the specified attribute values. As in bulk loading,
attribute values must appear in the same order as in the create
table command that was executed for
relation <relName>. Values are specified in the same
way as in load files and the select command. String values can be of
any length up to the length specified for the corresponding
attribute. When an insert command is
entered, SystemParser
will
invoke QueryEngine::insert
to carry out the command.
The Delete Command
Tuples may be deleted from relations using the following command:
delete from <relName> [ where <attrName> <cmpOp> <attrOrValue> and
... and <attrName> <cmpOp> <attrOrValue>] ;
This command deletes from relation <relName> all tuples
satisfying the where clause. As in the select
command, the conditions in the where clause compare
attributes to constant values or to other attributes. In
the where clause, all attribute names must be
fully-qualified. Comparison operators and specification of values are
the same as in the select command. If the where
clause is omitted, all tuples in <relName> are deleted.
When a delete command is
entered, SystemParser
will
invoke QueryEngine::remove
to carry out the command.
The Update Command
Tuples in relations may be updated using the following command:
update <relName> set <attrName> = <attrOrValue>
[ where <attrName> <cmpOp> <attrOrValue> and ...
and <attrName> <cmpOp> <attrOrValue> ] ;
This command updates in relation <relName> all tuples satisfying the
where clause. The where clause is exactly as in
the select and delete commands described
above. <attrOrValue> on the right-hand side of the
equality operator is either an attribute or a constant value. An
attribute may be specified as a fully-qualified attribute name, or a
constant value is specified as in the select command. Each
updated tuple is modified so that its attribute
<attrName>gets the value <attrOrValue>
(if <attrOrValue> is a constant), or the value of the
updated tuple's <attrOrValue> attribute
(if <attrOrValue> is an attribute). If
the where clause is omitted, all tuples
in <relName> are updated. When an update
command is
entered, SystemParser
will
invoke QueryEngine::update
to carry out the command.
For a fuller description of each of
the QueryEngine methods
listed above, please read the online documentation carefully. As in
Parts 1-3, you must not change any of the definitions of the public
methods we have provided, as this will break our automated tests.
However, you are free to add new methods to the classes (public or
private) as you see fit.
Implementation Guidelines and Suggestions
Basic requirements. Your implementation
of QueryEngine::select
should build a physical query plan as discussed in lecture, then
execute the plan. We strongly suggest you use a pipelined evaluation
model, although you may use a temporary relation approach instead of
you have valid reasons for doing so.
Regardless of other design decisons you make, we would like your
query execution strategy to be able to take advantage of indexes
whenever it is straightforward/sensible to do so. In particular:
- Suppose the query includes a local selection condition of the
form R.A=v, where v is a constant value. If
relation R has an index on attribute A, then
an IndexScan together
with calls
to RecordFileHandle::getRecord
can be used to fetch the tuples of R, rather than using a
RecordFileScan.
-
You may use a simple nested-loops algorithm for each join or
cross-product. Suppose you are joining R with S based on attribute
equality, suppose S has no local selection condition, and suppose
S has an index on its join attribute A. Then you can
make S the inner relation of the nested-loop join and use
an index scan on S instead of a file scan. (You need not,
however, attempt to reorder the relations in the query in order to
make such index joins possible.)
Although the execution strategies for delete
and update commands generally are simpler than
for select commands, you should still build physical query
plans for delete. You will probably find that some physical
operators from your implementation of select can be reused.
Query plans are not required for insert commands.
Printing query plans. Your system should provide a way to
pretty-print physical query plans
for select, delete, and update commands
before the command is executed. You will probably want to render
tree-structured plans using some form of indentation. Query plans
should be printed only when the value of parameter queryplans
is set to on. This can be controlled via another addition
to SystemParser in the form of a generic command
set <param> = <value>;
The current value of a
given parameter can be retrieved by
calling SystemParser::getParam.
For full credit in this component of the project, the TA must be able
to easily understand your query execution strategy for
any select,
delete, or update command based upon the
pretty-printed plan information. But again, please do not print plans
unless the user has specified set queryplans=on;.
IQueryOperator and IScanOperator We are providing an architectural template
for a query evaluation engine implementation in the form of a generic
C++ interface for query
operators, IQueryOperator, along
with an extension IScanOperator intended
for use with record file and index scans. We are also providing
sample implementations of two
operators, ProjectOperator
and FileScanOperator.
The skeleton code you will receive
for QueryEngine
demonstrates how to construct and execute pipelined execution plans
based on this interface for projection-only queries. The use
of IQueryOperator
and/or IScanOperator
is not required, and you are free to use or ignore these files and
sample code (or rewrite them to suit your needs). Note that IScanOperator is
provided in order to support a generic approach to implemented (index)
nested loops joins, such that the join operator itself does not know
or care whether an index is being used. See the notes for Lecture 22 and Lecture 23 for more information.
Extra Credit Opportunities
We are offering several opportunities for extra credit in this part of
the project. It is recommended that you attempt to tackle these only
after getting the basic project functionality working.
- Small amount of extra credit. Improve on the "canonical
execution plans" methodology described above by figuring out
some heuristics and using them to produce more efficient
plans for queries and updates. You must document what you have done
in the writeup.
- Large amount of extra credit. Implement a full-blown
System R-style query optimizer, as described in lecture and in the
textbook. There will be a variable amount of credit given for this
option, depending on how much you do. This will involve considering
different join orders and performing some sort of cost estimation to
compare execution plans, using a dynamic programming strategy. Cost
estimation will require at a minimum maintaining and using
cardinality statistics for each relation.
Logistics and Testing
You will continue working on the same codebase used in Parts 1-3 of
the project, using the same tools (cmake, suberversion,
submit.sh, etc) described in the Record Manager logistics
section. The new files needed for this part of the project -
updated versions of SystemParser.{cpp|h}, the query engine
class definition QueryEngine.h and its skeleton
implementation QueryEngine.cpp, a sample script
Queries.ddb, and the starter query evaluation engine
implementation including files IQueryOperator.h,
IScanOperator.h, ProjectOperator.{cpp|h}, and
FileScanOperator.{cpp|h> - will be placed in your team's
subversion repository. As usual, you will need to update your
CMakeLists.txt to reflect the addition of these new files.
In addition to the source files, you will be provided with a
sample DavisDB "script", Queries.ddb, that exercises the
basic functionality of the query engine. This can be "run"
using ./DavisDB < Queries.ddb
The script uses the sample
data files from Part 3 (Sailors.csv, Boats.csv,
and Reserves.csv). We will also provide more extensive tests
(which will also be used for
the DavisDB I/O contest) in a later
update; stay tuned.
As usual, you will be expected to also create additional tests
yourself, emphasizing corner cases and error handling.
Documentation
Again as in the earlier assignments, you will produce a short (1-2
page maximum), high-level description of your design and testing
strategy in writeup.txt. Recall that the template file also
asks you to answer two other questions.
Before you begin coding, please answer the first question in
the text file, which asks you to estimate the total number of
person-hours you think will be needed to implement your solution
(including testing).
Before submitting your code using submit.sh 4,
please answer the the second question, which asks how many
person-hours of effort were actually required.
Also, if you tackled any of the extra credit opportunities mentioned
above, don't forget to describe what you did in the writeup. Be sure
to give sufficient detail so that we can figure out how much credit to
give you.
Submission and Grading
As stated on the main project page, the
standard formula for grading project components is 80% for correctness
(as verified manually and by automated testing), and 20% for the
writeup and for code design and readability. For uniformity, please
make sure you
use SystemPrinter to
print the contents of tables, as specified in the documentation
for SystemManager.
Also, please remove or disable any printf debugging
statements in your code before submitting, as these interfere
with interpretation of the shell output. (At this point, you should
be using the debugger nearly all of the time anyway!)
Acknowledgments
Major aspects of the DavisDB project are derived from
the RedBase
project developed
by Jennifer Widom for
use in CS 346
at Stanford, and used here with her permission.