Reference no: EM132476241
MH6301 Information Retrieval and Analysis - Nanyang Technological University
Question 1. (a) You are given a collection of 3 documents, listed below. You are asked to create an inverted index for this collection of documents.
D1: SPMS offers Master of Science
D2: SPMS offers many courses
D3: MH6301 is a master course
(i) Suppose a white space tokenizer is used to identify the tokens. Briefly describe THREE (3) types of processing that can be applied to the tokens before creating the index.
(ii) Write down the resultant documents after applying the three types of processing in Q1(a)(i).
(iii) Draw the inverted index that would be built for the three documents, after applying the three types of processing on their tokens.
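As a way of checking a hand-drawn answer to Q1(a), the following minimal sketch builds an inverted index for D1 to D3. The three processing steps used here (case folding, stop-word removal, and a toy plural-stripping rule standing in for stemming) are only one possible choice; the stop list and the suffix rule are illustrative assumptions, not part of the question.

```python
from collections import defaultdict

DOCS = {
    1: "SPMS offers Master of Science",
    2: "SPMS offers many courses",
    3: "MH6301 is a master course",
}

# Hypothetical stop list chosen for this tiny collection only.
STOP_WORDS = {"of", "is", "a"}


def normalize(token):
    """Apply case folding, stop-word removal, and a toy plural rule."""
    token = token.lower()                      # 1. case folding
    if token in STOP_WORDS:                    # 2. stop-word removal
        return None
    # 3. crude stand-in for stemming: strip a trailing "s" from longer words
    if len(token) > 4 and token.endswith("s") and not token.endswith("ss"):
        token = token[:-1]
    return token


def process(text):
    """White-space tokenize a document and return its normalized terms."""
    terms = (normalize(tok) for tok in text.split())
    return [t for t in terms if t is not None]


def build_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in process(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}


INDEX = build_index(DOCS)
for term, postings in INDEX.items():
    print(term, postings)
```

A real system would normally use an established stemmer and a curated stop list; the rules above are kept deliberately small so that every step can be traced by hand.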
(b) Briefly describe the following concepts.
• Information need
• Query
• Document
• Relevant document
• TFIDF
• Edit distance
• A/B Testing
(c) The table below gives the sizes of postings lists for tokens e, k, m, and s, respectively.
| Term | e | k | m | s |
| Postings size | 313 | 27 | 107 | 271 |
(i) Recommend a query processing order for the Boolean query:
(m OR s) AND (k OR e).
(ii) Estimate the minimum and maximum possible number of results for the Boolean query:
(e OR k) AND (NOT m)
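The reasoning behind both parts of Q1(c) can be sketched with simple set-size bounds. The snippet below is a rough aid rather than a model answer: it estimates each OR group by the sum of its postings sizes (a common conservative heuristic) and assumes the collection is large enough for the postings lists to be disjoint in the best case, which the question does not state.

```python
# Postings-list sizes from the table in part (c).
SIZES = {"e": 313, "k": 27, "m": 107, "s": 271}


def or_bounds(a, b):
    """Smallest and largest possible size of (A OR B) given |A| = a, |B| = b."""
    return max(a, b), a + b


# (i) Estimate each OR group by the sum of its postings sizes,
#     then process the group with the smaller estimate first.
groups = {"(m OR s)": ("m", "s"), "(k OR e)": ("k", "e")}
estimates = {name: sum(SIZES[t] for t in terms) for name, terms in groups.items()}
order = sorted(estimates, key=estimates.get)

# (ii) Bounds for (e OR k) AND (NOT m).
or_min, or_max = or_bounds(SIZES["e"], SIZES["k"])
# Minimum: k is contained in e, and every m document also lies in (e OR k).
min_results = max(0, or_min - SIZES["m"])
# Maximum: e and k are disjoint, and no m document lies in (e OR k).
max_results = or_max

print(order, min_results, max_results)
```

Under these assumptions the sketch suggests evaluating (k OR e) before (m OR s), and gives a range of 206 to 340 results for the second query.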
(d) Discuss TWO (2) techniques to process a phrase query such as "information retrieval".
Question 2. (a) In ordinary English text, the average number of characters per word is 4.5. After indexing, the average length of a dictionary word is 8 characters. Suppose an index has a dictionary with 200,000 words. Assume each character occupies one byte.
(i) Explain why the average number of characters per word in ordinary English text is smaller than that in the index dictionary.
(ii) With index compression, the compressed index stores the dictionary as a string with a block size of 4. Estimate the space usage of this dictionary in bytes.
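The estimate in Q2(a)(ii) depends on field widths that the question does not give. The sketch below therefore assumes the widths commonly used in textbook treatments of dictionary-as-a-string with blocking: 4 bytes for the document frequency, 4 bytes for the postings pointer, a 3-byte term pointer per block, and a 1-byte length prefix per term. The function name and parameters are hypothetical, and a different set of widths would change the total.

```python
def blocked_dictionary_bytes(num_terms, avg_term_len, block_size,
                             freq_bytes=4, postings_ptr_bytes=4,
                             term_ptr_bytes=3, length_bytes=1):
    """Estimate the size of a dictionary stored as one string with blocking.

    All default field widths are assumptions, not values from the question.
    """
    # The term string: each term's characters plus its length prefix.
    string_bytes = num_terms * (avg_term_len + length_bytes)
    # Fixed-width fields still kept for every term.
    per_term_bytes = num_terms * (freq_bytes + postings_ptr_bytes)
    # One pointer into the string per block instead of one per term.
    num_blocks = -(-num_terms // block_size)  # ceiling division
    block_ptr_bytes = num_blocks * term_ptr_bytes
    return string_bytes + per_term_bytes + block_ptr_bytes


total = blocked_dictionary_bytes(num_terms=200_000, avg_term_len=8, block_size=4)
print(total)
```

With these assumed widths the total comes to 3,550,000 bytes: 1,800,000 for the string and length prefixes, 1,600,000 for the frequency and postings-pointer fields, and 150,000 for the block pointers.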
(b) A social media company runs a platform that is similar to Facebook or Twitter. Users can post freestyle messages online and their followers or friends can view the messages immediately. However, users are not allowed to comment on messages. The company has consulted you to develop a search engine to index and search for the user posted messages.
(i) With reference to the messages posted on either Facebook or Twitter, what are the potential challenges in developing this search engine?
(ii) The company explicitly requests fast responses from the search engine and accepts relatively lower-quality top-K results. Briefly describe THREE (3) techniques that can be used to satisfy the company's request.
(c) When a user issues the query "application from submission", a search engine returns: Did you mean "application form submission"? Discuss how to detect this kind of error in a query and how to suggest corrections.
(d) Briefly describe "static summary" and "dynamic summary" in the context of a Web search engine.
Question 3. (a) Suppose the four documents in Table Q3a belong to the data science category, and the three documents in Table Q3b belong to the machine learning category. Each document is treated as a training instance.
Table Q3a
| ID | Document |
| 1 | Microsoft SQL server database |
| 2 | Data management |
| 3 | Oracle database |
| 4 | Data integration |
Table Q3b
| ID | Data |
| 1 | Machine learning |
| 2 | Neural network deep learning |
| 3 | Learning by rank |
(i) Use words as features to build a multinomial Naïve Bayes Classifier.
(ii) Use the classifier to classify the following document:
Data learning
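A minimal sketch of Q3(a) follows, useful for checking hand calculations. It assumes white-space tokenization, lowercasing, and add-one (Laplace) smoothing over the combined vocabulary of both classes; none of these choices is fixed by the question, and the class labels are simply taken from its wording.

```python
import math
from collections import Counter

TRAIN = {
    "data science": [
        "Microsoft SQL server database",
        "Data management",
        "Oracle database",
        "Data integration",
    ],
    "machine learning": [
        "Machine learning",
        "Neural network deep learning",
        "Learning by rank",
    ],
}


def tokenize(text):
    """Lowercase and split on white space (an assumed preprocessing choice)."""
    return text.lower().split()


def train(train_docs):
    """Collect class priors and per-class term counts."""
    total_docs = sum(len(docs) for docs in train_docs.values())
    vocab = {t for docs in train_docs.values() for d in docs for t in tokenize(d)}
    model = {}
    for label, docs in train_docs.items():
        counts = Counter(t for d in docs for t in tokenize(d))
        model[label] = {
            "log_prior": math.log(len(docs) / total_docs),
            "counts": counts,
            "total": sum(counts.values()),
        }
    return model, vocab


def log_score(model, vocab, label, text):
    """Log of prior times smoothed term likelihoods for one class."""
    m = model[label]
    score = m["log_prior"]
    for t in tokenize(text):
        if t not in vocab:
            continue  # ignore terms never seen in training
        score += math.log((m["counts"][t] + 1) / (m["total"] + len(vocab)))
    return score


def classify(model, vocab, text):
    """Return the class with the highest log score."""
    return max(model, key=lambda label: log_score(model, vocab, label, text))


MODEL, VOCAB = train(TRAIN)
prediction = classify(MODEL, VOCAB, "Data learning")
print(prediction)
```

Under these assumptions the vocabulary has 15 terms, the data science class has 10 tokens, and the machine learning class has 9. The unnormalized scores are (4/7)(3/25)(1/25) for data science and (3/7)(1/24)(4/24) for machine learning, so the sketch assigns "Data learning" to machine learning by a narrow margin.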
(b) Suppose we have the occurrence table of the term "data" in Table Q3c. The null hypothesis is that the term "data" and the class IT are independent, tested at a significance level of 0.001. The critical value for 0.999 confidence (p = 0.001) is 10.83. Compute the χ² value and determine whether "data" and IT are independent.
Table Q3c
| | Term = data | Term ≠ data |
| Class = IT | 100 | 500 |
| Class ≠ IT | 200 | 9000 |
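The χ² computation for Table Q3c can be sketched as below, using the standard closed form for a 2×2 contingency table. The variable names are illustrative; the counts and the critical value of 10.83 come from the question.

```python
# Counts from Table Q3c.
a = 100    # class = IT, term = data
b = 500    # class = IT, term != data
c = 200    # class != IT, term = data
d = 9000   # class != IT, term != data

CRITICAL_VALUE = 10.83  # given in the question for p = 0.001

n = a + b + c + d

# Closed form for a 2x2 table:
# chi2 = N * (ad - bc)^2 / ((a + b)(c + d)(a + c)(b + d))
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

reject_independence = chi2 > CRITICAL_VALUE
print(round(chi2, 2), reject_independence)
```

The statistic comes out at roughly 398.7, far above 10.83, so under this test the hypothesis that "data" and IT are independent would be rejected.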
(c) Describe the k Nearest Neighbors algorithm for text classification.
(d) Suppose we have a hierarchy of categories, and a set of questions where each question belongs to one or more leaf categories in the hierarchy. Discuss how to build a hierarchical classification model on the given hierarchy using this set of questions.
Question 4. (a) Describe the k-means algorithm, and discuss how to choose k seed clusters for the k-means algorithm.
(b) Given a set of k clusters of documents returned by a clustering algorithm, give three approaches to generating cluster labels for each of the k clusters.
(c) What is the difference between topic-specific PageRank and general PageRank?
(d) Describe one application of learning to rank.
(f) Suppose that you are required to build a search engine for documents with geospatial information. How would you design the index?