ECS 165B Spring 2010 - Database System Implementation

DavisDB Part 2: Indexing Component (due 5/2 at 11:59pm)

Introduction

The second component of DavisDB that you will implement is the Indexing component. This component provides classes and methods for managing persistent B+ tree indices over unordered data records stored in paged files. Each data file may have a number of (single-attribute) indices associated with it. The indices ultimately will be used to speed up processing of relational queries and update operations.

Like the data records themselves, the indices are stored in paged files. Hence, in implementing the Indexing component you will use the Page File component much as you did for Part 1. In the architecture of the DavisDB system, you can think of the Indexing and Record Manager components as sitting side by side above the Page File component. The indexing technique you will implement is that of B+ trees, which have already been reviewed in class and are discussed in detail in your textbook. To make the assignment more reasonable, we allow some simplifications to the standard B+ tree implementation; see the implementation guidelines section below.

Indexing Component C++ Interface

The Indexing component interface you will implement consists of three classes: the IndexManager class, the IndexHandle class, and the IndexScan class. Header files and skeleton class implementations will be added to your repositories for you.

As in Part 1, the online documentation describes the public methods and roles of each class in detail; please read it carefully! (Start with IndexManager.) The documentation describes what you need to implement for each class. You must not change any of the definitions of the public methods we have provided, as this will break our automated tests. However, you are free to add new methods to the classes (public or private) as you see fit.

Guidelines and Suggestions

As was the case in Part 1, you must not modify the public methods of the classes that we have provided, but you may extend the interfaces by adding new public methods. The notes from lecture, and the information in your textbook, should give you plenty of background on the fundamentals of B+ trees. Here we mention some specific guidelines and constraints with respect to the DavisDB implementation.

Header page. As in Part 1, each index will be built on an underlying paged file, and as was the case with record files, each index file will likely need a header page, on which you store information about the file as a whole and information needed to find the other pages of the index.
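For instance, a header along the following lines would be enough to locate and interpret the rest of the index. This is a minimal sketch; the struct and all of its field names are illustrative assumptions, not part of the provided interface.

    // Hypothetical index file header, stored on page 0 of the index file.
    // Every name here is an assumption about typical B+ tree bookkeeping.
    struct IndexFileHeader {
        int attrType;     // type code of the indexed attribute
        int attrLength;   // size in bytes of the indexed attribute
        int rootPage;     // page number of the current root node
        int height;       // height of the tree (0 = the root is a leaf)
        int numEntries;   // total number of live entries in the index
    };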

Nodes and pages. The order of the B+ tree is not specified a priori and is not exposed through the IndexManager interface. However, for efficiency, you will probably want to store one node per page, with the order of the tree determined by the size of the attribute being indexed and PF_PAGE_SIZE.
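As a rough sketch of the arithmetic, assume each internal node holds a small fixed-size header, n keys, and n + 1 child page numbers; the maximum n can then be computed as below. Everything except PF_PAGE_SIZE (which comes from the Page File component) is an assumption about your node layout.

    const int PF_PAGE_SIZE = 4096;     // stand-in for the Page File constant
    const int NODE_HEADER_SIZE = 16;   // assumed per-node bookkeeping bytes
    const int PTR_SIZE = sizeof(int);  // child pointers are page numbers

    // n keys plus n + 1 child pointers must fit in the usable page space:
    // n * attrLength + (n + 1) * PTR_SIZE <= PF_PAGE_SIZE - NODE_HEADER_SIZE
    int maxKeys(int attrLength) {
        int usable = PF_PAGE_SIZE - NODE_HEADER_SIZE;
        return (usable - PTR_SIZE) / (attrLength + PTR_SIZE);
    }

For example, with 4-byte integer keys this yields a node capacity in the hundreds, so even large trees stay very shallow.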

Store record ids, not records. You should not store the data of the record itself in the B+ tree leaves, only pointers to the records (i.e., record ids). How exactly this is done is up to you. Recall from lecture that the two basic alternatives are to store either (a) a (key, recordId) pair for each entry in the tree, or (b) a (key, recordIdList) pair for each key in the tree. The first option is more straightforward to implement, but the second is more space-efficient when duplicate keys are common.
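Under scheme (a), a leaf entry might be laid out roughly as follows; the fixed MAX_ATTR_LENGTH bound and the (page, slot) record id representation are assumptions for illustration only.

    const int MAX_ATTR_LENGTH = 255;  // assumed upper bound on key size

    // One possible leaf-entry layout for scheme (a): key, then record id.
    struct LeafEntry {
        char key[MAX_ATTR_LENGTH];  // indexed attribute value
        int  pageNum;               // page of the record in the data file
        int  slotNum;               // slot of the record within that page
    };

In practice you would likely size entries to the actual attrLength of the index rather than a fixed maximum, so that more entries fit per leaf.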

Duplicates. You need to consider the possibility of many entries in the index having the same key. Your implementation should handle these cases gracefully, even if there are so many entries with a given key that the record ids do not fit on a single page. This has ramifications for both schemes (a) and (b) discussed above: for scheme (a), a split may produce two leaf pages that do not have a "separating key", so you will need, e.g., a special key value indicating this; for scheme (b), you will need overflow pages when the list of record ids grows too long. One way to avoid these headaches is to use (attribute value, record id) pairs as keys in the tree, rather than attribute values alone: all entries in the tree are then guaranteed to have distinct keys, and schemes (a) and (b) become the same scheme. This also makes deleting entries from the tree faster, as lists of duplicates never need to be scanned. However, it carries significant space overhead, and hence may hurt your I/O performance in the bigger picture.
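The comparison logic for such composite keys is simple: compare attribute values first, then break ties on the record id. A sketch for integer attributes (the function and its parameters are hypothetical):

    // Compare composite (attribute value, record id) keys; record ids are
    // assumed to be (page, slot) pairs. Because record ids are unique, no
    // two composite keys in the tree can ever compare equal.
    int compareComposite(int valA, int pageA, int slotA,
                         int valB, int pageB, int slotB) {
        if (valA  != valB)  return valA  < valB  ? -1 : 1;
        if (pageA != pageB) return pageA < pageB ? -1 : 1;
        if (slotA != slotB) return slotA < slotB ? -1 : 1;
        return 0;  // same entry
    }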

Deletions. Since implementing the textbook B+ tree deletion algorithm is rather tricky, you may, with no loss of credit, implement a simpler scheme involving tombstones. In this approach, when an entry is deleted it is replaced by a special marker indicating an empty slot (which may be reused later). Internal tree nodes are never deleted or merged. Extra credit: implement the textbook B+ tree deletion algorithm, using merging and redistribution. (Warning: we recommend getting things working with tombstones first, even if you plan to tackle the textbook deletion algorithm.)
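In code, a tombstone delete can be as small as the sketch below; the LeafNode layout shown is an assumption, and your node format will differ.

    // Assumed leaf layout: a fixed number of slots with live/dead flags.
    struct LeafNode {
        static const int CAPACITY = 64;  // assumed slots per leaf
        bool slotUsed[CAPACITY];         // live/dead flag per slot
        int  numLive;                    // count of live entries
    };

    // Mark the slot dead instead of restructuring the tree. A later
    // insert may reuse the slot; scans must skip dead slots.
    void tombstoneDelete(LeafNode& leaf, int slot) {
        leaf.slotUsed[slot] = false;
        leaf.numLive--;
    }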

Updates during scans. Index scans will be used by higher-level components when executing selection, join, and delete operations, as well as update operations on attributes other than the indexed attribute. Thus, each index entry scanned will either be used to fetch (and possibly update) a record, or to delete a record. While making an index scan work correctly for selection, join, and non-index-key update operations is relatively straightforward, deletion operations are more complicated even when using tombstones. You must ensure that it is possible to use an index scan to find and then delete all index entries satisfying a condition.

You may assume that during a retrieval scan (for selection, join, and update operations) no index entries will be inserted or deleted. You also may assume that during a deletion scan, no other index records will be inserted or deleted, and no other scans will be underway.
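The intended usage pattern looks roughly like the following. The method names and the CompOp, RecordId, and OK types here are assumptions patterned after common index interfaces; substitute the actual signatures from the online documentation for IndexHandle and IndexScan.

    // Hypothetical delete-during-scan loop: find every entry equal to
    // value and remove it. Deleting the current entry must leave the
    // scan able to find the remaining matches.
    void deleteAllMatching(IndexHandle& index, void* value) {
        IndexScan scan;
        scan.openScan(index, EQ_OP, value);
        RecordId rid;
        while (scan.getNextEntry(rid) == OK) {
            index.deleteEntry(value, rid);
        }
        scan.closeScan();
    }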

Order of records returned by scans. Depending on your design, it may take some extra effort for scans to always return record ids in nondecreasing order of the corresponding attribute values. This property is not required in this project; however, you may find it useful later on in Part 4, e.g., if you decide to use index scans to produce a sorted relation.

Logistics

You will continue working on the same codebase used in Part 1 of the project, using the same tools (cmake, subversion, submit.sh, etc.) described in the Record Manager logistics section. The new files needed for this part of the project - IndexManager.h, IndexHandle.h, and IndexScan.h - will be placed in your team's subversion repository, along with skeleton implementations of the classes in the appropriate cpp files. You will need to update your CMakeLists.txt to reflect the addition of these new files.

Documentation

Again as in Part 1, you will produce a short (1-2 page maximum), high-level description of your design in writeup.txt. (You can just reuse the same file from Part 1.) Recall that the template file also asks you to answer several other questions.

Before you begin coding, please answer the first question in the text file, which asks you to estimate the total number of person-hours you think will be needed to implement your solution (including testing).

Before submitting your code using submit.sh, please answer the second question, which asks how many person-hours of effort were actually required.

Grading

As stated on the main project page, the standard formula for grading project components is 80% for correctness (as verified manually and by automated testing), and 20% for the writeup and for code design and readability.

Testing and Submission

We will provide a basic set of tests in the file TestIX.cpp. As was the case with Part 1, these are just meant to get you started, and an important part of the project involves developing your own thorough tests. The grading will be done using a more comprehensive version of these basic tests. Beware, we will try hard to break your code!

Details about obtaining the code distribution, using your team's subversion repository, and submitting your code are on the main project page.

Acknowledgments

Major aspects of the DavisDB project are derived from the RedBase project developed by Jennifer Widom for use in CS 346 at Stanford, and used here with her permission.