MH6301 Information Retrieval and Analysis Assignment

Reference no: EM132476241

MH6301 Information Retrieval and Analysis - Nanyang Technological University

Question 1. (a) You are given a collection of 3 documents, listed below. You are asked to create an inverted index for this collection of documents.

D1: SPMS offers Master of Science
D2: SPMS offers many courses
D3: MH6301 is a master course

(i) Suppose a white space tokenizer is used to identify the tokens. Briefly describe THREE (3) types of processing that can be applied to the tokens before creating the index.

(ii) Write down the resultant documents after applying the three types of processing in Q1(a)(i).

(iii) Draw the inverted index that would be built for the three documents, after applying the three types of processing on their tokens.
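
A minimal sketch of one way such an index could be built is given below. The question leaves the choice of processing open, so the three steps used here are assumptions made only for illustration: lowercasing, stop-word removal with a tiny hypothetical stop list, and a hypothetical lookup table standing in for a real stemmer.

```python
from collections import defaultdict

DOCS = {
    1: "SPMS offers Master of Science",
    2: "SPMS offers many courses",
    3: "MH6301 is a master course",
}

# Hypothetical stop list and stem table, chosen only for this example.
STOP_WORDS = {"of", "is", "a"}
STEMS = {"offers": "offer", "courses": "course"}

def process(text):
    """Whitespace-tokenize, lowercase, drop stop words, then stem."""
    tokens = [t.lower() for t in text.split()]
    return [STEMS.get(t, t) for t in tokens if t not in STOP_WORDS]

def build_index(docs):
    """Map each term to a sorted postings list of document IDs."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(process(docs[doc_id])):
            index[term].append(doc_id)
    return dict(sorted(index.items()))

INDEX = build_index(DOCS)
for term, postings in INDEX.items():
    print(f"{term:8s} -> {postings}")
```

Under these assumptions "master" points to D1 and D3, and "course" points to D2 and D3. A different choice of processing steps would give a different dictionary.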

(b) Briefly describe the following concepts.

• Information need
• Query
• Document
• Relevant document
• TFIDF
• Edit distance
• A/B Testing

(c) The table below gives the sizes of postings lists for tokens e, k, m, and s, respectively.

Term            e     k     m     s
Postings size   313   27    107   271

(i) Recommend a query processing order for a Boolean query:

(m OR s) AND (k OR e).

(ii) Estimate the minimum and maximum possible number of results for a Boolean query:

(e OR k) AND (NOT m)
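
The reasoning behind both parts can be sketched as follows. The sketch assumes the common heuristic of estimating the size of an OR by the sum of its postings sizes and processing the smaller estimate first. It also assumes the collection is large enough (at least 447 documents) for the extreme cases to be possible. The function names are hypothetical.

```python
# Postings list sizes from the table in part (c).
POSTINGS = {"e": 313, "k": 27, "m": 107, "s": 271}

def or_size_bounds(a, b):
    """Return (min, max) size of the result of `a OR b`."""
    smallest = max(POSTINGS[a], POSTINGS[b])  # one list contains the other
    largest = POSTINGS[a] + POSTINGS[b]       # the two lists are disjoint
    return smallest, largest

def processing_order(groups):
    """Order OR-groups by their conservative (upper-bound) size estimate."""
    return sorted(groups, key=lambda g: or_size_bounds(*g)[1])

def bounds_or_and_not(a, b, excluded):
    """Bounds for `(a OR b) AND (NOT excluded)`."""
    lo, hi = or_size_bounds(a, b)
    # Minimum: every document containing `excluded` also matches `a OR b`.
    # Maximum: no document containing `excluded` matches `a OR b`.
    return max(0, lo - POSTINGS[excluded]), hi

ORDER = processing_order([("m", "s"), ("k", "e")])
MIN_RESULTS, MAX_RESULTS = bounds_or_and_not("e", "k", "m")
print(ORDER, MIN_RESULTS, MAX_RESULTS)
```

With these estimates, (k OR e) is evaluated before (m OR s), and the second query returns between 206 and 340 documents.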

(d) Discuss TWO (2) techniques to process phrase query like "information retrieval".

Question 2. (a) In ordinary English text, the average number of characters per word is 4.5. After indexing, the average length of a dictionary word is 8 characters. Suppose an index has a dictionary with 200,000 words. Assume each character occupies one byte.

(i) Explain why the average number of characters per word in ordinary English text is smaller than that in the index dictionary.

(ii) With index compression, the compressed index stores the dictionary as a string with a block size of 4. Estimate the space usage of this dictionary in bytes.
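
The estimate depends on field widths that the question does not state. The sketch below assumes the usual textbook layout: 4 bytes for the term frequency, 4 bytes for the postings pointer, 3 bytes for a pointer into the dictionary string, and 1 extra byte per term to store its length once blocking is used. All four widths are assumptions.

```python
def string_dictionary_bytes(num_terms, avg_term_len,
                            freq_bytes=4, postings_ptr_bytes=4,
                            term_ptr_bytes=3):
    """Dictionary-as-a-string without blocking: one term pointer per term."""
    per_term = freq_bytes + postings_ptr_bytes + term_ptr_bytes + avg_term_len
    return num_terms * per_term

def blocked_dictionary_bytes(num_terms, avg_term_len, block_size,
                             freq_bytes=4, postings_ptr_bytes=4,
                             term_ptr_bytes=3, length_bytes=1):
    """Blocked storage: one term pointer per block, one length byte per term."""
    num_blocks = -(-num_terms // block_size)  # ceiling division
    per_term = freq_bytes + postings_ptr_bytes + length_bytes + avg_term_len
    return num_terms * per_term + num_blocks * term_ptr_bytes

UNBLOCKED = string_dictionary_bytes(200_000, 8)
BLOCKED = blocked_dictionary_bytes(200_000, 8, block_size=4)
print(UNBLOCKED, BLOCKED, UNBLOCKED - BLOCKED)
```

Under these assumptions the blocked dictionary takes 3,550,000 bytes, compared with 3,800,000 bytes without blocking. Each block of 4 terms saves three 3-byte pointers and spends four 1-byte lengths, a net saving of 5 bytes per block.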

(b) A social media company runs a platform that is similar to Facebook or Twitter. Users can post freestyle messages online and their followers or friends can view the messages immediately. However, users are not allowed to comment on messages. The company has consulted you to develop a search engine to index and search for the user posted messages.

(i) With reference to the messages posted on either Facebook or Twitter, what are the potential challenges in developing this search engine?

(ii) The company explicitly requests fast responses from the search engine and accepts relatively lower-quality top K results. Briefly describe THREE (3) techniques that can be used to satisfy the company's request.

(c) When a user issues the query "application from submission", a search engine returns: Did you mean "application form submission"? Discuss how to detect this kind of error in a query and how to suggest corrections.
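
Because "from" is itself a valid word, a dictionary lookup alone would not flag it. A context-sensitive approach typically generates nearby alternatives for each query word and then compares how plausible the resulting phrases are. The sketch below covers only the first step: a standard dynamic-programming Levenshtein distance that could be used to find nearby candidates. How the candidate phrases are then scored (for example with query-log or corpus counts) is left out, since the question provides no such data.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]

print(levenshtein("from", "form"))
```

Under plain Levenshtein distance "from" and "form" are 2 edits apart. A variant that counts a transposition of adjacent characters as a single edit (Damerau-Levenshtein) would put them 1 edit apart.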

(d) Briefly describe "static summary" and "dynamic summary" in the context of a Web search engine.

Question 3. (a) Suppose the four documents in Table Q3a belong to the data science category, and the three documents in Table Q3b belong to the machine learning category. Each document is treated as a training instance.

Table Q3a

ID   Document
1    Microsoft SQL server database
2    Data management
3    Oracle database
4    Data integration

Table Q3b

ID   Data
1    Machine learning
2    Neural network deep learning
3    Learning by rank

(i) Use words as features to build a multinomial Naïve Bayes Classifier.

(ii) Use the classifier to classify the following document:

Data learning
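
A minimal sketch of how parts (i) and (ii) could be worked through is given below. It assumes lowercased, whitespace-separated words as features and add-one (Laplace) smoothing; neither choice is specified in the question. Exact fractions are used so the two scores can be compared without rounding.

```python
from collections import Counter
from fractions import Fraction

TRAINING = {
    "data science": [
        "Microsoft SQL server database",
        "Data management",
        "Oracle database",
        "Data integration",
    ],
    "machine learning": [
        "Machine learning",
        "Neural network deep learning",
        "Learning by rank",
    ],
}

def train(training):
    """Collect the class priors, per-class word counts and the vocabulary."""
    total_docs = sum(len(docs) for docs in training.values())
    vocab = {w for docs in training.values() for d in docs for w in d.lower().split()}
    model = {}
    for label, docs in training.items():
        counts = Counter(w for d in docs for w in d.lower().split())
        model[label] = {
            "prior": Fraction(len(docs), total_docs),
            "counts": counts,
            "total": sum(counts.values()),
        }
    return model, vocab

def score(model, vocab, label, text):
    """Prior times the smoothed word likelihoods; unseen words are skipped."""
    entry = model[label]
    p = entry["prior"]
    for w in text.lower().split():
        if w in vocab:
            p *= Fraction(entry["counts"][w] + 1, entry["total"] + len(vocab))
    return p

def classify(model, vocab, text):
    return max(model, key=lambda label: score(model, vocab, label, text))

MODEL, VOCAB = train(TRAINING)
for label in MODEL:
    print(label, score(MODEL, VOCAB, label, "Data learning"))
print(classify(MODEL, VOCAB, "Data learning"))
```

With a vocabulary of 15 words, the data science score is 4/7 × 3/25 × 1/25 = 12/4375 and the machine learning score is 3/7 × 1/24 × 4/24 = 1/336, so under these assumptions the document is assigned to machine learning.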

(b) Suppose we have the occurrence table of the term "data" in Table Q3c. The null hypothesis is that the term "data" and the class IT are independent, tested at a significance level of 0.001. The critical value for 0.999 confidence (p = 0.001) is 10.83. Compute the χ² value, and determine whether "data" and IT are independent.

Table Q3c

              Term = data   Term ≠ data
Class = IT    100           500
Class ≠ IT    200           9000
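
The computation can be sketched with the closed-form χ² statistic for a 2×2 contingency table. The sketch reads the second column of Table Q3c as "Term ≠ data", which is an assumption about the table layout; the parameter names are hypothetical.

```python
def chi_square_2x2(in_class_with_term, in_class_without_term,
                   out_class_with_term, out_class_without_term):
    """Chi-square statistic for a 2x2 table of observed counts."""
    a, b = in_class_with_term, in_class_without_term
    c, d = out_class_with_term, out_class_without_term
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

CRITICAL_VALUE = 10.83  # given in the question for p = 0.001
CHI2 = chi_square_2x2(100, 500, 200, 9000)
print(round(CHI2, 2), CHI2 > CRITICAL_VALUE)
```

Under this reading the statistic is about 398.68, far above 10.83, so the null hypothesis of independence would be rejected.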

(c) Describe the k Nearest Neighbors algorithm for text classification.

(d) Suppose we have a hierarchy of categories, and a set of questions where each question belongs to one or more leaf categories in the hierarchy. Discuss how to build a hierarchical classification model on the given hierarchy using the set of questions.

Question 4. (a) Describe the k-means algorithm, and discuss how to choose the k seed clusters for the k-means algorithm.
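
For orientation, the assignment and update steps of k-means can be sketched as below. The sketch assumes points are numeric tuples compared by squared Euclidean distance, and it takes the seeds as an explicit argument, since how to pick them (random documents, several restarts, or a sample clustered hierarchically first) is exactly what the question asks about.

```python
def squared_distance(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans(points, seeds, max_iter=100):
    """Alternate assignment and centroid update until assignments stop changing."""
    centroids = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, assignment

POINTS = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
CENTROIDS, ASSIGNMENT = kmeans(POINTS, seeds=[(0, 0), (10, 10)])
print(CENTROIDS, ASSIGNMENT)
```

For text, the points would typically be length-normalized TF-IDF vectors rather than 2-D tuples; the toy coordinates here are only for illustration.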

(b) Given a set of k clusters of documents returned by a clustering algorithm, give three approaches to generating a cluster label for each of the k clusters.

(c) What is the difference between topic-specific PageRank and general PageRank?

(d) Describe one application of learning to rank.

(f) Suppose that you are required to build a search engine for documents with geospatial information. How would you design the index?


All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd