Create a compressed file that contains the eclipse project directory and a short document that describes:
- the methods implemented
- any missing or incomplete elements of the code
- any changes made to the original API
- the amount of time spent for this assignment
- suggestions or comments if any
In this programming assignment, you need to extend a Java-based database management system called SimpleDB by implementing the core modules required to access stored data on disk. We have provided you with a set of unimplemented classes and interfaces (see those in the source code that have methods currently throwing UnsupportedOperationExceptions). You will need to write the code for these classes. Your code will be graded by running a set of system tests written using JUnit.
Details of JUnit can be found at https://junit.sourceforge.net/.
You ?rstly need to install/run Eclipse on your machine and import the "simpledb-1" project (see Appendix A). This assignment provides an eclipse project that contains source code and test suites.
Refer to Appendix B for the details of running the test suites.
The remainder of this document describes the components you need to implement. The architecture of SimpleDB is described in Appendix C.
Part 1. Fields and Tuples
Tuples in SimpleDB are relatively simple. They consist of a collection of Field objects, one per field in the Tuple. Field is an interface that different data types (e.g., integer, string) implement. Tuple objects are created by the underlying access methods (e.g., heap files, or B-trees), as described later. Tuples also have a type (or schema), called a tuple descriptor, represented by a TupleDesc object. This object consists of a collection of Type objects, one per field in the tuple, each of which describes the type of the corresponding ?eld.
In this part, you need to implement one method of the Tuple class and five methods of the TupleDesc class that throw UnsupportedOperationExceptions. Once you have completed these classes, your code will pass the unit tests TupleTest and TupleDescTest.
Part 2. Catalog
The catalog in SimpleDB consists of a list of the tables and schemas of the tables that are currently in the database. You need to implement two methods of the Catalog class that currently throw UnsupportedOperationExceptions. Associated with each table is a TupleDesc object that allows operators to determine the types and number of fields in a table.
The global catalog is a single instance of Catalog that is allocated for the entire SimpleDB process.
The global catalog can be retrieved via the method Database.getCatalog(), and the same goes for the global buffer pool (using Database.getBufferPool ()).
Once you have completed this part, your code will pass the unit tests in CatalogTest.
Part 3. BufferPool
The buffer pool is responsible for caching pages in memory that have been recently read from disk. All operators read and write pages from various files on disk through the bu?er pool. The buffer pool consists of a fixed number of pages, defined by the numPages parameter to the BufferPool constructor. For this assignment, you need to implement the BufferPool .getPage() method used by the SeqScan operator. The BufferPool should store up to numPages pages. In this assignment, if more than numPages requests are made for different pages, then instead of implementing an eviction policy, a DbException must be thrown. The Database class provides a static method, Database.getBufferPool (), that returns a reference to the single BufferPool instance for the entire SimpleDB process. We have not provided unit tests for BufferPool . The functionality you implemented will be tested in the implementation of HeapFile below. You should use the DbFile.readPage method to access pages of a DbFile.
Part 4. HeapPage access method
Access methods provide a way to read or write data from disk that is arranged in a specific way. Common access methods include heap files (unsorted files of tuples) and B-trees. For this assignment, you need to implement only the heap file access method, and we have written some of the code for you. A HeapFile object is arranged into a set of pages, each of which consists of a ?xed number of bytes for storing tuples, (defined by the constant BufferPool .PAGE SIZE), plus a header. In SimpleDB, there is one HeapFile object for each table in the database. Each page in a HeapFile is arranged as a set of slots, each of which can hold one tuple (tuples for a given table in SimpleDB are all of the same size). In addition to these slots, each page has a header that consists of a bitmap with one bit per tuple slot. If the bit corresponding to a particular tuple is 1, it indicates that the tuple is valid; if it is 0, the tuple is invalid (e.g., has been deleted or was never initialized.) Pages of HeapFile objects are of type HeapPage which implements the Page interface. Pages are stored in the buffer pool but are read and written by the HeapFile class. SimpleDB stores heap files on disk in more or less the same format they are stored in memory. Each file consists of page data arranged consecutively on disk. Each page consists of one or more 32-bit integers representing the header, followed by the BufferPool . PAGE SIZE bytes of actual page content. The number of 32-bit integers in the header is defined by the formula: ((BufferPool .PAGE SIZE / tuple size) / 32 ) +1 ), where tuple size is the size of a tuple in the page in bytes. The low (least signi?cant) bits of each integer represent the status of the slots that are earlier in the file. Hence, the lowest bit of the first integer represents whether or not the ?rst slot in the page is in use. Also, note that the high-order bits of the last such integer may not correspond to a slot that is actually in the file, since the number of slots may not be a multiple of 32. Also note that all Java virtual machines are big-endian. The page content of each page consists of floor (BufferPool .PAGE SIZE/tuple size) tuple slots. In this part, you need to implement the iterator method of the HeapPage class that iterates over the tuples in the page, which may involve an auxiliary class or data structure. At this point, your code should pass the unit tests in HeapPageReadTest.
Part 5. HeapFile access method
After you have implemented HeapPage, you will write methods for HeapFile in this assignment to read a page from the file and to fetch tuples from a file stored on disk. In particular, you need to implement the HeapFile.iterator() method, which should iterate through the tuples of each page
in the HeapFile. The iterator must use the BufferPool.getPage () method to access pages in the HeapFile.
At this point, your code should pass the unit tests in HeapFileReadTest and ScanTest.
Final Remark
Operators are responsible for the actual execution of the query plan. They implement the operations of the relational algebra. In SimpleDB, operators are iterator based; each operator implements the DbIterator interface. Operators are connected together into a plan by passing lower-level operators into the constructors of higher-level operators, i.e., by 'chaining them together.' Special access method operators at the leaves of the plan are responsible for reading data from the disk (and hence do not have any operators below them).
At the top of the plan, the program interacting with SimpleDB simply calls getNext on the root operator. This operator then calls getNext on its children, and so on, until these leaf operators are called. They fetch tuples from disk and pass them up the tree (as return arguments to getNext). Tuples propagate up the plan in this way until they are output at the root or combined or rejected by another operator in the plan.
In this assignment, we use src/simpledb/SeqScan.java, which is already implemented. This operator sequentially scans all of the tuples from the pages of the table speci?ed by the tableid in the constructor. This operator accesses tuples through the DbFile.iterator() method.
Now, let's see how the above components are connected together to process a simple query. The following code implements a simple selection query over a data file consisting of three columns of integers. In the code, the file some data file.dat is a binary representation of the pages from this ?le.
This code is equivalent to the SQL statement SELECT * FROM some data ?le.
// construct a 3-column table schema Type types[] = new Type[]{ Type.INT_TYPE,
Type.INT_TYPE, Type.INT_TYPE }; String names[] = new String[]{ "field0", "field1",
"field2" }; TupleDesc descriptor = new TupleDesc(types, names);
// create the table, associate it with some_data_file.dat // and tell the catalog about the schema of this table. HeapFile table1 = new
HeapFile(new File("some_data_file.dat"));
Database.getCatalog().addTable(table1, descriptor);
// construct the query: we use a simple SeqScan, which spoonfeeds // tuples via its iterator. TransactionId tid = new TransactionId(); SeqScan f = new
SeqScan(tid, table1.id());
// and run it
f.open();
while (f.hasNext()) {
Tuple tup = f.next();
System.out.println(tup);
}
f.close();
Database.getBufferPool().transactionComplete(tid);
The above code creates a table that has three integer fields. To express this, we create a TupleDesc object and pass it an array of Type objects, and optionally an array of String ?eld names. Once we have created this TupleDesc, we initialize a HeapFile object representing the table stored in some data file.dat. Once we have created the table, we add it to the catalog.
Once we have finished initializing the database system, we create a query plan. Our plan consists only of the SeqScan operator that scans the tuples from disk. In general, these operators are instantiated with references to the appropriate table (in the case of SeqScan) or child operator (e.g., in the case of Filter). The test program then repeatedly calls hasNext and next on the SeqScan operator. As tuples are output from the SeqScan, they are printed out on the command line.
This is all! I hope this assignment will help you understand how real database management systems would work.