MH6301 Information Retrieval and Analysis Assignment

Reference no: EM132476241

MH6301 Information Retrieval and Analysis - Nanyang Technological University

Question 1. (a) You are given a collection of 3 documents, listed below. You are asked to create an inverted index for this collection of documents.

D1: SPMS offers Master of Science
D2: SPMS offers many courses
D3: MH6301 is a master course

(i) Suppose a white space tokenizer is used to identify the tokens. Briefly describe THREE (3) types of processing that can be applied to the tokens before creating the index.

(ii) Write down the resultant documents after applying the three types of processing in Q1(a)(i).

(iii) Draw the inverted index that would be built for the three documents, after applying the three types of processing on their tokens.
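
As an illustration of parts (i) to (iii), the following is a minimal sketch of one possible pipeline: case folding, stop word removal, and a stemming stand-in, followed by index construction. The stop list and the small normalization table are assumptions made for this example; the question does not prescribe them, and a different choice of processing gives a different index.

```python
from collections import defaultdict

# Assumed for illustration only: a tiny stop list and a lookup table that
# stands in for a real stemmer or lemmatizer.
STOP_WORDS = {"of", "is", "a"}
NORMALIZE = {"offers": "offer", "courses": "course"}

def process(text):
    tokens = text.split()                                  # white space tokenizer
    tokens = [t.lower() for t in tokens]                   # case folding
    tokens = [t for t in tokens if t not in STOP_WORDS]    # stop word removal
    return [NORMALIZE.get(t, t) for t in tokens]           # stemming stand-in

def build_index(docs):
    """Map each term to a sorted list of the document IDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(process(docs[doc_id])):
            index[term].append(doc_id)
    return dict(sorted(index.items()))

docs = {
    1: "SPMS offers Master of Science",
    2: "SPMS offers many courses",
    3: "MH6301 is a master course",
}
index = build_index(docs)
for term, postings in index.items():
    print(term, "->", postings)
```

With these assumed rules, "Master" and "master" share one postings list, and "courses" and "course" do as well.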

(b) Briefly describe the following concepts.

• Information need
• Query
• Document
• Relevant document
• TFIDF
• Edit distance
• A/B Testing

(c) The table below gives the sizes of postings lists for tokens e, k, m, and s, respectively.

Term            e     k     m     s
Postings size   313   27    107   271

(i) Recommend a query processing order for a Boolean query:

(m OR s) AND (k OR e).

(ii) Estimate the minimum and maximum possible number of results for a Boolean query:

(e OR k) AND (NOT m)
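
The reasoning behind part (c) can be sketched with simple bounds on set sizes. This is a hedged illustration, not a model answer: it assumes the collection is large enough for postings lists to be completely disjoint (the question does not give the collection size), and it uses the sum of the two list sizes as the conservative estimate for each OR group.

```python
sizes = {"e": 313, "k": 27, "m": 107, "s": 271}

def or_bounds(a, b):
    """(min, max) size of the union of two postings lists of sizes a and b."""
    return max(a, b), a + b

def and_not_bounds(bounds, excluded):
    """(min, max) size of X AND NOT Y, given bounds on |X| and the size |Y|."""
    low, high = bounds
    return max(0, low - excluded), high

# (i) Estimate each OR group by its upper bound and process the smaller first.
groups = {
    "m OR s": or_bounds(sizes["m"], sizes["s"]),
    "k OR e": or_bounds(sizes["k"], sizes["e"]),
}
order = sorted(groups, key=lambda g: groups[g][1])

# (ii) Bounds for (e OR k) AND (NOT m).
result_bounds = and_not_bounds(or_bounds(sizes["e"], sizes["k"]), sizes["m"])

print(order)          # ['k OR e', 'm OR s']
print(result_bounds)  # (206, 340)
```

The minimum arises when k and m are both contained in e; the maximum arises when e, k, and m share no documents.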

(d) Discuss TWO (2) techniques for processing a phrase query such as "information retrieval".
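
One commonly used technique is a positional index, sketched below under stated assumptions. The three sample documents and the function names are hypothetical and chosen only for this example. A biword index, which stores every pair of consecutive terms as a single dictionary entry, is another option and is not shown here.

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc_id -> [positions at which the term occurs]}"""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return the sorted IDs of documents that contain the exact phrase."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Only documents containing every term can contain the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        # The k-th term must appear exactly k positions after the first term.
        for start in index[terms[0]][doc_id]:
            if all(start + k in index[t][doc_id] for k, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits

# Hypothetical sample documents.
sample_docs = {
    1: "information retrieval and analysis",
    2: "retrieval of information",
    3: "modern information retrieval systems",
}
positional_index = build_positional_index(sample_docs)
phrase_hits = phrase_search(positional_index, "information retrieval")
print(phrase_hits)  # [1, 3]
```

Document 2 contains both words but not in adjacent positions, so it is not returned.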

Question 2. (a) In ordinary English text, the average number of characters per word is 4.5. After indexing, the average length of a dictionary word is 8 characters. Suppose an index has a dictionary with 200,000 words. Assume each character occupies one byte.

(i) Explain why the average number of characters per word in ordinary English text is smaller than that in the index dictionary.

(ii) With index compression, the compressed index stores the dictionary as a string with a block size of 4. Estimate the space usage, in bytes, of this dictionary.
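
The estimate in part (ii) depends on the sizes of the per-term fields, which the question does not state. The sketch below assumes a commonly used layout: a 4-byte term frequency and a 4-byte postings pointer per term, a 1-byte length prefix per term inside the string, and one 3-byte term pointer per block. All four field sizes are assumptions, so the result changes if a different layout is intended.

```python
def blocked_dictionary_bytes(num_terms, avg_term_len, block_size,
                             freq_bytes=4, postings_ptr_bytes=4,
                             term_ptr_bytes=3, length_bytes=1):
    """Estimated size of a dictionary-as-a-string with blocked storage.

    The default field sizes are assumptions, not values from the question.
    """
    # Every term stores its frequency, postings pointer, characters, and length byte.
    per_term = freq_bytes + postings_ptr_bytes + avg_term_len + length_bytes
    # One term pointer is kept per block rather than per term.
    num_blocks = -(-num_terms // block_size)  # ceiling division
    return num_terms * per_term + num_blocks * term_ptr_bytes

total_bytes = blocked_dictionary_bytes(num_terms=200_000, avg_term_len=8, block_size=4)
print(total_bytes)  # 3550000
```

Under the same assumptions, the unblocked dictionary-as-a-string needs 19 bytes per term, or 3,800,000 bytes, so blocking with a block size of 4 saves 5 bytes per block.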

(b) A social media company runs a platform that is similar to Facebook or Twitter. Users can post freestyle messages online and their followers or friends can view the messages immediately. However, users are not allowed to comment on messages. The company has consulted you to develop a search engine to index and search the user-posted messages.

(i) With reference to the messages posted on either Facebook or Twitter, what are the potential challenges in developing this search engine?

(ii) The company explicitly requests a fast response from the search engine and accepts relatively lower-quality top K results. Briefly describe THREE (3) techniques that can be used to satisfy the company's request.

(c) When a user issues the query "application from submission", a search engine returns: do you mean "application form submission"? Discuss how to detect this kind of error in a query and how to suggest corrections.
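
Because "from" is itself a valid word, a check against the dictionary alone does not flag this query; the error only shows up in context. The sketch below combines Levenshtein edit distance with phrase frequencies to pick a replacement. The vocabulary and the phrase counts are hypothetical stand-ins for a real dictionary and query log, and the function names are chosen for this example only.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions, each at cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def suggest(query, vocab, phrase_counts, max_dist=2):
    """Try replacing one word at a time with a close vocabulary word and
    keep the variant that is most frequent in the (hypothetical) query log."""
    words = query.split()
    best, best_count = query, phrase_counts.get(query, 0)
    for i, word in enumerate(words):
        for cand in sorted(vocab):
            if cand != word and edit_distance(word, cand) <= max_dist:
                variant = " ".join(words[:i] + [cand] + words[i + 1:])
                count = phrase_counts.get(variant, 0)
                if count > best_count:
                    best, best_count = variant, count
    return best

# Hypothetical data for illustration only.
vocab = {"application", "form", "from", "submission"}
phrase_counts = {
    "application form submission": 120,
    "application from submission": 2,
}
suggestion = suggest("application from submission", vocab, phrase_counts)
print(edit_distance("from", "form"))  # 2
print(suggestion)                     # application form submission
```

Plain Levenshtein distance counts the swap of "r" and "o" as two substitutions; a variant that treats a transposition as a single edit would give a distance of 1.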

(d) Briefly describe "static summary" and "dynamic summary" in the context of a Web search engine.

Question 3. (a) Suppose the four documents in Table Q3a belong to the data science category, and the three documents in Table Q3b belong to the machine learning category. Each document is treated as a training instance.

Table Q3a

ID   Document
1    Microsoft SQL server database
2    Data management
3    Oracle database
4    Data integration

Table Q3b

ID   Data
1    Machine learning
2    Neural network deep learning
3    Learning by rank

(i) Use words as features to build a multinomial Naïve Bayes Classifier.

(ii) Use the classifier to classify the following document:

Data learning
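
The computation in parts (i) and (ii) can be sketched as follows. The sketch assumes lowercased white space tokens and add-one (Laplace) smoothing over the combined vocabulary of both classes; neither choice is stated in the question, so treat both as assumptions.

```python
import math
from collections import Counter

def train(labelled_docs):
    """Collect per-class document counts, per-class term counts, and the vocabulary."""
    class_docs = Counter()
    term_counts = {}
    vocab = set()
    for label, text in labelled_docs:
        tokens = text.lower().split()
        class_docs[label] += 1
        term_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab

def log_scores(text, model):
    """Log of prior times smoothed term likelihoods, for each class."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        counts = term_counts[label]
        denom = sum(counts.values()) + len(vocab)   # add-one smoothing (assumed)
        score = math.log(class_docs[label] / total_docs)
        for token in text.lower().split():
            if token in vocab:                      # skip words unseen in training
                score += math.log((counts[token] + 1) / denom)
        scores[label] = score
    return scores

training = [
    ("data science", "Microsoft SQL server database"),
    ("data science", "Data management"),
    ("data science", "Oracle database"),
    ("data science", "Data integration"),
    ("machine learning", "Machine learning"),
    ("machine learning", "Neural network deep learning"),
    ("machine learning", "Learning by rank"),
]
model = train(training)
scores = log_scores("Data learning", model)
predicted = max(scores, key=scores.get)
print(predicted)  # machine learning
```

Under these assumptions the vocabulary has 15 words, the data science score is 4/7 × 3/25 × 1/25 and the machine learning score is 3/7 × 1/24 × 4/24, so machine learning scores slightly higher.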

(b) Suppose we have the occurrence table of the term "data" in Table Q3c. The null hypothesis is that data and IT are independent, with a 0.001 chance of being wrong. The critical value for 0.999 confidence (p = 0.001) is 10.83. Compute the χ² value, and determine whether data and IT are independent.

Table Q3c

             Term = data   Term ≠ data
Class = IT   100           500
Class ≠ IT   200           9000
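
The χ² statistic for a 2×2 table can be computed with the shortcut formula N(ad − bc)² / ((a+b)(c+d)(a+c)(b+d)). The sketch below reads the second column of Table Q3c as the documents that do not contain "data"; that reading is an assumption about the table layout.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# Rows: class = IT, class != IT. Columns: term = data, term != data (assumed).
chi2 = chi_square_2x2(100, 500, 200, 9000)
critical_value = 10.83          # given in the question for p = 0.001
reject_independence = chi2 > critical_value
print(round(chi2, 2), reject_independence)
```

With these counts the statistic is about 398.7, far above 10.83, so the independence hypothesis is rejected at this level.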

(c) Describe the k Nearest Neighbors algorithm for text classification.


(d) Suppose we have a hierarchy of categories, and a set of questions where each question belongs to one or more leaf categories in the hierarchy. Discuss how to build a hierarchical classification model using the set of questions on the given hierarchy.

Question 4. (a) Describe the k-means algorithm, and discuss how to choose the k seed clusters for the k-means algorithm.
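
The two alternating steps of k-means, assignment and centroid update, can be sketched as below. The 2-D points and the choice of seeds are hypothetical; here the seeds are simply passed in by the caller, which makes it easy to compare seeding strategies such as random selection or running the algorithm from several different seed sets.

```python
def kmeans(points, seeds, max_iter=100):
    """Plain k-means on tuples of floats; returns (centroids, clusters)."""
    centroids = [tuple(s) for s in seeds]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:   # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two well-separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, seeds=[points[0], points[3]])
print(clusters)
```

For documents, the points would be term-weight vectors, and cosine similarity is often used in place of squared Euclidean distance.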

(b) Given a set of k clusters of documents returned by a clustering algorithm, give three approaches to generating a cluster label for each of the k clusters.

(c) What is the difference between topic-specific PageRank and general PageRank?

(d) Describe one application of learning to rank.

(f) Suppose that you are required to build a search engine for documents with geospatial information. How would you design the index?
