MH6301 Information Retrieval and Analysis Assignment

Reference no: EM132476241

MH6301 Information Retrieval and Analysis - Nanyang Technological University

Question 1. (a) You are given a collection of 3 documents, listed below. You are asked to create an inverted index for this collection of documents.

D1: SPMS offers Master of Science
D2: SPMS offers many courses
D3: MH6301 is a master course

(i) Suppose a white space tokenizer is used to identify the tokens. Briefly describe THREE (3) types of processing that can be applied to the tokens before creating the index.

(ii) Write down the resultant documents after applying the three types of processing in Q1(a)(i).

(iii) Draw the inverted index that would be built for the three documents, after applying the three types of processing on their tokens.
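
One possible way to work through part (a) is sketched below. The three processing steps chosen here (case folding, stop-word removal, and a toy suffix-stripping stemmer) are assumptions, as are the stop-word list and the helper names `normalize` and `build_index`. Other choices, such as a Porter stemmer, would give a slightly different index.

```python
from collections import defaultdict

# Hypothetical stop-word list, chosen only for this illustration.
STOP_WORDS = {"of", "is", "a"}

DOCS = {
    1: "SPMS offers Master of Science",
    2: "SPMS offers many courses",
    3: "MH6301 is a master course",
}

def normalize(token):
    """Case-fold, drop stop words, and strip one trailing 's' (toy stemmer)."""
    token = token.lower()
    if token in STOP_WORDS:
        return None
    # Toy stemmer (assumption): only words longer than 4 characters are stripped.
    if len(token) > 4 and token.endswith("s") and not token.endswith("ss"):
        token = token[:-1]
    return token

def build_index(docs):
    """Map each normalized term to a sorted list of document IDs."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # White-space tokenization, as stated in the question.
        terms = {normalize(t) for t in docs[doc_id].split()}
        for term in terms - {None}:
            index[term].append(doc_id)
    return dict(sorted(index.items()))

index = build_index(DOCS)
for term, postings in index.items():
    print(f"{term:8} -> {postings}")
```

Because documents are visited in increasing ID order, each postings list comes out sorted, which is what merge-based Boolean query processing expects.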

(b) Briefly describe the following concepts.

• Information need
• Query
• Document
• Relevant document
• TFIDF
• Edit distance
• A/B Testing

Term	e	k	m	s
Postings size	313	27	107	271

(i) Recommend a query processing order for a Boolean query:

(m OR s) AND (k OR e).

(ii) Estimate the minimum and maximum possible number of results for a Boolean query:

(e OR k) AND (NOT m)
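
The reasoning behind both parts can be sketched with the postings sizes from the table above. This is only an estimate under stated assumptions: the size of a disjunction is bounded by the sum of its postings sizes, the collection is assumed large enough for the lists to be disjoint, and the helper `or_bounds` is a hypothetical name.

```python
# Postings sizes from the table above.
size = {"e": 313, "k": 27, "m": 107, "s": 271}

def or_bounds(a, b):
    """(min, max) size of a OR b: min when one list contains the other, max when disjoint."""
    return max(size[a], size[b]), size[a] + size[b]

# (i) Order for (m OR s) AND (k OR e): estimate each disjunction by its
# upper bound, then start the intersection from the smaller estimate.
estimates = {"m OR s": or_bounds("m", "s")[1], "k OR e": or_bounds("k", "e")[1]}
order = sorted(estimates, key=estimates.get)

# (ii) Bounds for (e OR k) AND (NOT m).
lo_union, hi_union = or_bounds("e", "k")
max_results = hi_union                      # e and k disjoint, m overlaps neither
min_results = max(0, lo_union - size["m"])  # k inside e, m entirely inside e

print(order, min_results, max_results)
```

The maximum assumes that no document containing m also contains e or k; the minimum assumes every document containing m or k also contains e.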

(d) Discuss TWO (2) techniques to process a phrase query such as "information retrieval".

Question 2. (a) In ordinary English text, the average number of characters per word is 4.5. After indexing, the average length of a dictionary word is 8 characters. Suppose an index has a dictionary with 200,000 words. Assume each character occupies one byte.

(i) Explain why the average number of characters per word in ordinary English text is smaller than the average length of a word in the index dictionary.

(ii) With index compression, the compressed index stores the dictionary as a single string with blocking, using a block size of 4. Estimate the space usage of this dictionary in bytes.
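
A rough estimate for part (ii) can be worked out as below. The question does not give the field widths, so the values used here (4 bytes for document frequency, 4 bytes for the postings pointer, 3 bytes for a term pointer, 1 byte for each term length) are assumptions that follow a common textbook convention. A different convention changes the total.

```python
NUM_TERMS = 200_000
AVG_TERM_LEN = 8       # bytes per term, one byte per character (from the question)
BLOCK_SIZE = 4         # terms per block (from the question)

# Assumed field widths (not given in the question):
FREQ_BYTES = 4             # document frequency, stored per term
POSTINGS_PTR_BYTES = 4     # pointer to the postings list, stored per term
TERM_PTR_BYTES = 3         # pointer into the dictionary string, one per block
LENGTH_BYTES = 1           # length prefix stored before each term in the string

# The string holds every term plus its one-byte length prefix.
string_bytes = NUM_TERMS * (AVG_TERM_LEN + LENGTH_BYTES)
# Only the first term of each block keeps a term pointer.
term_ptr_bytes = (NUM_TERMS // BLOCK_SIZE) * TERM_PTR_BYTES
# Frequency and postings pointer are still stored for every term.
per_term_bytes = NUM_TERMS * (FREQ_BYTES + POSTINGS_PTR_BYTES)

total_bytes = string_bytes + term_ptr_bytes + per_term_bytes
print(total_bytes)
```

Under these assumptions the total is 3,550,000 bytes, that is, about 3.55 MB.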

(b) A social media company runs a platform that is similar to Facebook or Twitter. Users can post freestyle messages online, and their followers or friends can view the messages immediately. However, users are not allowed to comment on messages. The company has consulted you to develop a search engine to index and search the user-posted messages.

(i) With reference to the messages posted on either Facebook or Twitter, what are the potential challenges in developing this search engine?

(ii) The company explicitly requests fast responses from the search engine and accepts relatively lower-quality top K results. Briefly describe THREE (3) techniques that can be used to satisfy the company's request.

(c) When a user issues the query "application from submission", a search engine returns: do you mean "application form submission"? Discuss how to detect this kind of error in a query and how to suggest corrections.
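
Since "from" is itself a valid word, a dictionary lookup alone cannot flag it; the error only shows up in context. One common building block is to generate, for each query word, the vocabulary terms within a small edit distance, and then compare how plausible the resulting phrases are (for example, by phrase counts in the corpus or query log, which are not shown here). The sketch below covers only the candidate-generation step. The vocabulary and the `candidates` helper are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

# Hypothetical vocabulary, for illustration only.
VOCABULARY = ["application", "form", "from", "forum", "submission"]

def candidates(word, max_distance=2):
    """Vocabulary terms within max_distance edits of word, including word itself."""
    return [v for v in VOCABULARY if edit_distance(word, v) <= max_distance]

print(edit_distance("from", "form"))
print(candidates("from"))
```

Plain Levenshtein distance counts the swap of "r" and "o" as two edits; a variant that treats a transposition as a single edit would rank "form" closer to "from".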

(d) Briefly describe "static summary" and "dynamic summary" in the context of a Web search engine.

Question 3. (a) Suppose the four documents in Table Q3a belong to the data science category, and the three documents in Table Q3b belong to the machine learning category. Each document is treated as a training instance.

Table Q3a

ID	Document
1	Microsoft SQL server database
2	Data management
3	Oracle database
4	Data integration

Table Q3b

ID	Document
1	Machine learning
2	Neural network deep learning
3	Learning by rank

(i) Use words as features to build a multinomial Naïve Bayes Classifier.

(ii) Use the classifier to classify the following document:

Data learning
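
A minimal sketch of the computation follows, assuming lowercased white-space tokens and add-one (Laplace) smoothing; the question does not specify either, so both are assumptions, as are the helper names `tokenize` and `score`. Exact fractions are used so the arithmetic can be checked by hand.

```python
from collections import Counter
from fractions import Fraction

# Training documents from Table Q3a and Table Q3b.
TRAINING = {
    "data science": ["Microsoft SQL server database", "Data management",
                     "Oracle database", "Data integration"],
    "machine learning": ["Machine learning", "Neural network deep learning",
                         "Learning by rank"],
}

def tokenize(text):
    """Lowercase and split on white space (assumption)."""
    return text.lower().split()

# Term counts per class, and the vocabulary over both classes.
counts = {c: Counter(t for d in docs for t in tokenize(d))
          for c, docs in TRAINING.items()}
vocab = set().union(*counts.values())
total_docs = sum(len(docs) for docs in TRAINING.values())

def score(cls, text):
    """Class prior times the product of add-one smoothed term likelihoods."""
    n_tokens = sum(counts[cls].values())
    result = Fraction(len(TRAINING[cls]), total_docs)
    for term in tokenize(text):
        result *= Fraction(counts[cls][term] + 1, n_tokens + len(vocab))
    return result

scores = {c: score(c, "Data learning") for c in TRAINING}
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

With these assumptions the vocabulary has 15 terms, the data science score is 4/7 × 3/25 × 1/25 = 12/4375, and the machine learning score is 3/7 × 1/24 × 4/24 = 1/336, so the document is assigned to machine learning.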

(b) Suppose the occurrence counts of the term "data" are given in Table Q3c. The null hypothesis is that the term "data" and the class IT are independent, tested at a significance level of 0.001. The critical value at the 0.999 confidence level (p = 0.001) is 10.83. Compute the χ² value and determine whether "data" and IT are independent.

Table Q3c

	Term = data	Term ≠ data
Class = IT	100	500
Class ≠ IT	200	9000
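
The χ² statistic for a 2×2 table can be checked with the standard closed form N(ad − bc)² / ((a+b)(c+d)(a+c)(b+d)). The variable names a, b, c, d below are labels chosen for this sketch; the counts come from Table Q3c.

```python
# Observed counts from Table Q3c.
a, b = 100, 500     # Class = IT:  term = data, term != data
c, d = 200, 9000    # Class != IT: term = data, term != data
n = a + b + c + d

# Closed form of the chi-square statistic for a 2x2 contingency table.
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

CRITICAL = 10.83    # critical value at p = 0.001, from the question
independent = chi2 <= CRITICAL
print(round(chi2, 2), independent)
```

The statistic comes out at about 398.68, far above 10.83, so the independence hypothesis is rejected at this level.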

(d) Suppose we have a hierarchy of categories and a set of questions, where each question belongs to one or more leaf categories in the hierarchy. Discuss how to build a hierarchical classification model on the given hierarchy using this set of questions.

Question 4. (a) Describe the k-means algorithm, and discuss how to choose the k initial seeds for the k-means algorithm.
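
As a point of reference, the standard iteration (assign each point to its nearest centroid, then recompute centroids, until assignments stop changing) can be sketched as follows. The function name `kmeans`, the toy 2-D points, and the choice to pass the seeds in explicitly are assumptions for illustration. In practice seeds are often chosen by random sampling with several restarts, or by a spread-out initialization such as k-means++.

```python
def kmeans(points, seeds, max_iters=100):
    """Alternate assignment and centroid update until assignments are stable."""
    centroids = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(max_iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(p, centroids[k])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids

# Toy 2-D points, for illustration only; in retrieval these would be document vectors.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
assignment, centroids = kmeans(points, seeds=[points[0], points[3]])
print(assignment, centroids)
```

Passing the seeds as an argument makes the effect of initialization easy to observe: rerunning with different seeds can lead to a different local optimum.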

(b) Given a set of k clusters of documents returned by a clustering algorithm, give three approaches to generating cluster labels for each of the k clusters.

(d) Describe one application of learning to rank.

(f) Suppose that you are required to build a search engine for documents with geospatial information. How would you design the index?
