Reference no: EM133753833
Algorithms and Data Structures
Purpose
The purpose of this assignment is for you to:
Improve your proficiency in C programming and your dexterity with dynamic memory allocation.
Demonstrate understanding of a concrete data structure (radix tree) and implement a set of algorithms.
Practice multi-file programming and improve your proficiency in using UNIX utilities.
Practice writing reports which systematically compare the efficiency of different data structures.
Background
In Assignment 1 you implemented a dictionary using a linked list. Many applications of dictionaries cannot assume that the user will query in the correct format. Take Google search as an example. If someone is so excited to check the recent Olympics updates, they may accidentally search something similar to this:
Thankfully, there is enough context in the original query that the intended search result is recommended back to the user. In Assignment 2 you will be extending your implementation to account for misspelled keys when querying your dictionary.
Your Task
Assignment:
You will be using the same dataset as Assignment 1.
Users will be able to query the radix tree and will get either the expected key, or the closest recommended key.
You will then write a report to analyse the time and memory complexity of your Assignment 1 linked list compared to your radix tree implementation.
Implementation:
Your programs will build the dictionary by reading data from a file. They will insert each suburb into the dictionary (either the linked list (Stage 3) or radix tree (Stage 4)).
Your programs will handle the search for keys. There are three situations that your programs must handle:
Handle exact matches: output all records that match the key (Stage 3 and 4).
Handle similar matches: if there are no exact matches, find the most similar key and output its associated records (Stage 4 only).
Handle no matches being found: if neither exact nor similar matches are found, indicate that there is no match or recommendation (Stage 3 and 4).
Your programs should also be able to populate and query the dictionary with the Assignment 1 linked list approach and the radix tree approach.
Your programs should also provide enough information to compare the efficiency of the linked list with the radix tree with a operation-based measure that ignores less important run- time factors (e.g. comparison count).
An Introduction to Tries and Radix Trees
First, it is important to establish the difference between a Trie and a Tree. This is best illustrated with an example. One example of a tree is a binary search tree (BST), where each node in the tree stores an entire string, as illustrated below. The nodes are ordered and allow easy searching. When searching in the BST, we compare the query with the entire string at each node, deciding whether to switch to the left or right subtree or stop (if the subtree is empty) based on the result of the comparison.
A trie is slightly different. It has multiple names, such as retrieval tree or prefix tree. In a trie, the traversal of the tree determines the corresponding key. For the same strings as above with one letter per node, it would look like:
Tries allow for quick lookup of strings. Querying this trie with the key "hey" would find no valid path after the "e" node. Therefore, you can determine that the key "hey" does not exist in this trie.
A radix trie is a subtype of trie, sometimes called a compressed trie. Radix trees do not store a single letter at each node, but compress multiple letters into a single node to save space. At the character level, it would look like this:
Radix tries can again be broken down into different types depending on how many bits are used to define the branching factor. In this case, we are using a radix (r) of 2, which means every node has 2 children. This is accomplished by using 1 bit of precision, so each branch would be either a 0 or 1. This type of radix trie (with r=2) is called a Patricia trie. Another example of a radix trie with r=4 would have 4 children, determined by the binary numbers 00, 01, 10, 11. The radix is related to the binary precision by r = 2x where x is the number of bits used for branching.
Radix trees benefit from less memory usage and quicker search times when compared to a regular trie.
While these visual representations are at the character level, a Patricia trie is implemented using the binary of each character. Each node in the trie stores the binary prefix at the current node, rather than the character prefix. For example, we can insert 3 binary numbers into a PATRICIA trie: 0000, 0010 and 0011.
Combining the previous worded example with the binary representation gives us a patricia tree of the form:
You should trace along each path and validate that the stored strings are the same as the example above. Each character is represented over 8 bits, and the last character is followed by an end of string 00000000 8 bit character, i.e. NULL .
It is important to note that these representations only show the prefix at each node. An actual implementation will require more information within a node struct. To see this, look at the "extra insertion example" slide.
Implementation Tips
Get a particular bit from a given character
Extract a stem from a given key number of bits from a given key.