Reference no: EM13910219
Suffix table
Please read the following text carefully and use it to reply to the six queries at the end of the problem.
Finding a given character string (a so-called pattern) in a longer character string (a text) is one of the basic applications of computer science. You can implement the search by going through the whole text from start to finish while checking whether the pattern you are searching for is present. This is a very slow way of doing it if the text is very large. You can make the search quicker by creating an index structure based on the text beforehand. With the help of the index structure, you can avoid going through the whole text. One simple and frequently used index structure is called a suffix table. This method has been used for efficient searching in large DNA data collections, for example, such as the human genome with some 3 billion characters.
A character string consists of characters following each other. You can refer to a character string with a character string variable. You can refer to certain characters in the string by giving the index of the character within square brackets following the character string variable. For instance, x[1] refers to the first character in the string x. The annotation x[i..j] refers to the partial character string in x that starts at index i and ends at index j. A partial character string that continues to the end of the character string is called a suffix. The starting index of the suffix is called the index of the suffix. We use <, ≤, =, ≥, > for the character strings x and y to compare their alphabetical order. For example, x ≤ y means that x is alphabetically smaller than or as large as y, and x = y means that x and y are the same character string.
Example 1.
If the character string x is "group" it is true that:
• x[1] = " g", x[2] = " r" and x[4] = " u".
• x[1..3] = " gro", x[3..3] = " o" and x[2..5] = " roup".
• The suffixes to the character string x given in order according to the suffix indexes, i.e. according to the start indexes 1, 2, 3, 4 and 5, are x[1..5] = " group", x[2..5] = " roup", x[3..5] = " oup", x[4..5] = " up" and x[5..5] = " p".
We refer to a text with the character string variable t and to the pattern we are looking for in the text with the string variable p. Furthermore, we assume that the length of the text t is n.
The suffix table S for text t is an integer table with n elements, which lists the indexes of the suffixes in the text in alphabetical order of the suffixes. The value of index i in table S, annotated as S[i], indicates from which position in the text the ith smallest index suffix starts: S[1] indicates the index of the smallest suffix (in alphabetical order) in the text, S[2] the index for the next suffix in alphabetical order, etc. You can also express this as follows: for index i = 2, . . . , n it is true that t[S[i - 1]..n] ≤ t[S[i]..n].
Example 2. The suffixes for text t = " abababba" are " abababba", " bababba", " ababba", " babba", " abba", " bba", " ba" and " a". Below to the left are the suf- fixes for text t in alphabetical order, and to the right is the suffix table S for text t. Please note that the values of the suffix table are the same as the suf- fix indexes to the left. For example, the value S[5] = 7 tells us that the fifth largest suffix alphabetically starts from index 7, i.e. t[7..8] = is " ba".
Suffix (alphabetical order)
a
|
Suffix index
8
|
Index i
1
|
S[i]
8
|
abababba
|
1
|
2
|
1
|
ababba
|
3
|
3
|
3
|
abba
|
5
|
4
|
5
|
ba
|
7
|
5
|
7
|
bababba
|
2
|
6
|
2
|
babba
|
4
|
7
|
4
|
bba
|
6
|
8
|
6
|
A basic suffix feature is that, if pattern p occurs anywhere in the text, then p occurs as the start of the suffix that starts at that point in the text. In that case, we say that pattern p matches the suffix in question. We can search for pattern p in the text by searching for a suffix that matches p. With the help of the suffix table, this search can be made efficient by using a so-called binary search.
The binary search maintains information on a suffix table interval that, in accordance with our current information, could contain a suffix that matches pattern p. We use the annotation start for the start index of the search interval, and end for the end index. Further, we will determine the middle index middle with the formula middle = (start + end)/2, which we will round upwards if the sum of (start + end) is uneven. At first, all the suffixes are possible, so start = 1 and end = n.
The binary search compares the pattern and suffix in the middle of the search interval t[S[middle]..n] to each other. If the pattern matches, the search can end1 . Otherwise, either p < t[S[middle]..n] or p > t[S[middle]..n] is true.
If p < t[S[middle]..n], i.e. the pattern is alphabetically smaller than the suffix in index S[middle], none of the suffixes in interval middle, . . . , end can match the pattern. This follows directly from the suffix table containing the suffixes in alphabetical order. In that case, the binary search updates the upper bound of the search interval to end = middle - 1 and compares the pattern to the suffixes in the middle of the new search interval during the next search round.
In the same way, if p > t[S[middle]..n], we can say that none of the suffixes in the interval start, . . . , middle can match the pattern. Then we can update the lower bound to start = middle + 1.
If the search interval remains empty during the search, i.e. the criterion in start > end is met, the search ends with no result: the text does not contain pattern p. Below we see the search for pattern p in text t with the help of suffix table S in more detail, step by step:
1. Set for the search interval start index start = 1 and end index end = n.
2. If start > end i.e. the search interval is empty, end the search: pattern p does not occur in the text.
3. Calculate the middle index middle between start and end index: middle = (start + end)/2, round off upwards if necessary.
4. Compare pattern p with middle suffix t[S[middle]..n] of the interval.
• If p matches the start of t[S[middle]..n], end the search with the information that pattern p was found starting at index S[middle] of the text.
• If p does not match the middle suffix t[S[middle]..n] of the interval, then:
- If p < t[S[middle]..n], update end = middle - 1 and continue searching by going back to .
- If p > t[S[middle]..n], update start = middle + 1 and continue the search by going back to 4.
Queries
Query 1. Give the steps a binary search makes when searching for the pattern p = " babaa" in the text in the example 3 t = " abbaababbababbab". Please write your answer in the same format as the example 3, i.e. for each step, write the search interval values start, end and middle.
Query 2. Write the suffix table for the character string t = " assignment".
Query 3.
a) Write the suffix table for the character string t = " aacatcgatagctagaacat".
b) Write the steps a binary search takes when searching for the pattern p = " cga" in the text in a). Please write your answer in the same format as the example 3, i.e. for each step, write the search interval values start, end and middle.
Query 4. Please write a character string consisting of characters in the English alphabet with a suffix table like the one below. This means that characters in the alphabet" a", " b", " c", . . ., " z" are acceptable.
i
|
S[i]
|
1
|
8
|
2
|
7
|
3
|
6
|
4
|
5
|
5
|
4
|
6
|
3
|
7
|
2
|
8
|
1
|
Query 5. Write a character string that contains only the characters " a" and" b" with a suffix table like the one below.
i
|
S[i]
|
1
|
16
|
2
|
13
|
3
|
3
|
4
|
14
|
5
|
11
|
6
|
9
|
7
|
4
|
8
|
6
|
9
|
15
|
10
|
12
|
11
|
2
|
12
|
10
|
13
|
8
|
14
|
5
|
15
|
1
|
16
|
7
|
Query 6. Program the algorithm given on page 3. Test your program using the example 3 and queries 2-5.