Creating the inverted index

Reference no: EM131850348

Database and Information Retrieval

Question 1: Suppose you have joined a search engine development team to design a search algorithm based on both the Vector model and the Boolean model. You are to collect unstructured documents on the following topics and apply an indexing technique to convert them into an inverted index.

Please collect 3 documents (fewer than 30 words each) on three different topics. The topics are listed below; you may also choose other topics you prefer.

  • Science
  • Computer Vision
  • Search Engine
  • Database
  • Security and privacy

An example of document:

"Google is the most widely used Web search engine in the World. It claims to be the World's most comprehensive search engine, indexing over 2.4 billion Web pages."

1. Creating the inverted index. In the process of creating the inverted index, please complete the following steps:

a. Find a stopword list on the Internet and remove all stopwords and punctuation from those three documents.

Then apply Porter's stemming algorithm to all documents. Plenty of online stemming applications are available, and you may use any of them that implements the Porter algorithm for this question. The output will be a set of stemmed terms.

b. Create a merged inverted list including the within-document frequencies for each term.

c. Use the index created in step (b) to create a dictionary and the related posting file.

d. Test the inverted index with some keywords selected from the documents. For example: google, web, search.
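Steps (a) to (d) can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not a model answer: the stopword list is a tiny made-up one, the three sample documents are invented, and `crude_stem` is a deliberately simple stand-in that is NOT Porter's algorithm (a real submission would substitute an actual Porter implementation). All function names are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stopword list; a real run would load a published list.
STOPWORDS = {"is", "the", "most", "in", "it", "to", "be", "a", "an",
             "of", "and", "for", "are", "up"}

def crude_stem(token):
    """Illustrative stand-in that strips two common suffixes. NOT Porter's algorithm."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text, stem=crude_stem):
    """Step (a): lowercase, drop punctuation and stopwords, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]

def build_inverted_index(docs):
    """Step (b): term -> list of (doc_id, within-document frequency), ordered by doc_id."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term, freq in Counter(preprocess(docs[doc_id])).items():
            index[term].append((doc_id, freq))
    return dict(index)

def split_dictionary_postings(index):
    """Step (c): one flat postings file; the dictionary keeps df and an offset into it."""
    dictionary, postings = {}, []
    for term in sorted(index):
        dictionary[term] = {"df": len(index[term]), "offset": len(postings)}
        postings.extend(index[term])
    return dictionary, postings

def lookup(term, dictionary, postings):
    """Step (d): stem the keyword the same way, then read its slice of the postings file."""
    entry = dictionary.get(crude_stem(term.lower()))
    if entry is None:
        return []
    return postings[entry["offset"]: entry["offset"] + entry["df"]]

# Made-up sample documents, each under 30 words.
docs = {
    1: "Google is the most widely used Web search engine. Web pages are indexed.",
    2: "A database index speeds up search for records.",
    3: "Web security protects the privacy of users.",
}

index = build_inverted_index(docs)
dictionary, postings = split_dictionary_postings(index)

for keyword in ("google", "web", "search"):
    print(keyword, lookup(keyword, dictionary, postings))
```

Storing the document frequency and an offset in the dictionary, with the (document, frequency) pairs in a separate flat list, is one common way to separate the dictionary from the posting file; other layouts are equally acceptable for this exercise.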

2. Boolean and Vector queries.

a. Please design three Boolean queries (for example, web AND search) and list the relevant documents for each query.

b. Please use the Vector model to query the inverted index, and compare the results with those of the Boolean model. (Hint: you can use cosine similarity and set a similarity threshold.)
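The two query models can be compared with a short Python sketch along the following lines. It assumes three hypothetical, already-stemmed term lists, TF-IDF weights with a natural-log IDF, and a similarity threshold of 0.1; none of these choices is prescribed by the question, and other weighting schemes or thresholds are equally valid.

```python
import math
from collections import Counter

# Hypothetical, already-stemmed term lists standing in for three collected documents.
docs = {
    1: ["googl", "web", "search", "engin", "web"],
    2: ["databas", "index", "search"],
    3: ["web", "secur", "privaci"],
}
ALL_DOCS = set(docs)

# Boolean model: each term maps to the set of documents that contain it.
postings = {}
for doc_id, terms in docs.items():
    for term in terms:
        postings.setdefault(term, set()).add(doc_id)

def docs_for(term):
    return postings.get(term, set())

and_result = docs_for("web") & docs_for("search")                 # web AND search
or_result = docs_for("web") | docs_for("databas")                 # web OR databas
not_result = docs_for("search") & (ALL_DOCS - docs_for("web"))    # search AND NOT web

# Vector model: TF-IDF weights (an assumed weighting scheme) and cosine similarity.
N = len(docs)
df = {term: len(ids) for term, ids in postings.items()}

def tfidf(terms):
    """Weight = term frequency * ln(N / df); terms unseen in the collection are ignored."""
    return {t: f * math.log(N / df[t]) for t, f in Counter(terms).items() if t in df}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

THRESHOLD = 0.1  # assumed cut-off; tune it for your own documents
query = tfidf(["web", "search"])
scores = {d: cosine(query, tfidf(terms)) for d, terms in docs.items()}
vector_result = {d for d, s in scores.items() if s >= THRESHOLD}
best = max(scores, key=scores.get)

print("Boolean AND:", sorted(and_result))
print("Vector (>= threshold):", sorted(vector_result), "best match:", best)
```

With this sample data, the Boolean query web AND search returns only document 1, while the vector query also admits documents 2 and 3 as partial matches and ranks document 1 highest. That difference between exact-match retrieval and ranked retrieval is the kind of observation the comparison in part (b) asks for.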

Question 2 (IR Evaluation): For this exercise, you are required to evaluate the performance of different search engines.

First, please find two search engines you are familiar with, such as Google, Bing, Yahoo!, etc.

Second, please choose a target from the following list, and design two queries to run in both search engines.

The target is determined by the last digit of your student ID. For example, if your student ID ends with 1, please choose target 1; if it ends with 0, please choose target 10.

  • Target 1: obtain the unit guide of SIT771.
  • Target 2: obtain the unit guide of SIT772.
  • Target 3: obtain the unit guide of SIT773.
  • Target 4: obtain the unit guide of SIT774.
  • Target 5: obtain the price of the new Macbook.
  • Target 6: obtain the price of the new iPhone.
  • Target 7: obtain the price of a Lenovo Laptop.
  • Target 8: obtain the install document of MongoDB.
  • Target 9: obtain the manual of MongoDB.
  • Target 10: obtain the operation guide of MongoDB.

Select the first 20 results from each search engine. If a result returns the target, mark it as a relevant document; otherwise, mark it as irrelevant. The following exercises are based on your search results.

a. List your target and designed search queries (you can use any keywords you think are related to the target).

For Search Engine 1, plot the precision versus recall curves for Query 1 and Query 2, interpolated to the 11 standard recall levels.

Also plot the average precision versus recall curve for Search Engine 1 (all three curves should be on a single chart).

b. For Search Engine 2, plot the precision versus recall curves for Query 1 and Query 2, interpolated to the 11 standard recall levels.

Also plot the average precision versus recall curve for Search Engine 2 (all three curves should be on a single chart, but a separate chart from that used in part (a)).

c. Plot the averages for Search Engine 1 and Search Engine 2 on a separate chart, and compare the algorithms in terms of precision and recall. Which search engine do you think is superior? Why?
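The interpolation to the 11 standard recall levels used in parts (a) to (c) can be sketched in Python as below. The relevance judgements are invented and shortened to 10 results for readability, and the sketch assumes that the total number of relevant documents for a query is the number of relevant results found in the retrieved list, since the true total on the Web is unknown. The function name is hypothetical.

```python
def interpolated_precision(relevance):
    """Return precision interpolated at recall levels 0.0, 0.1, ..., 1.0.

    relevance: list of booleans, one per ranked result (True = relevant).
    Assumption: the relevant set is exactly the relevant results in this list.
    """
    total_relevant = sum(relevance)
    points = []  # (recall, precision) measured at each relevant result
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    levels = [i / 10 for i in range(11)]
    # Interpolated precision at a level = highest precision at any recall >= that level.
    return [max((p for r, p in points if r >= level), default=0.0) for level in levels]

# Made-up judgements for two queries on one search engine.
query1 = [True, False, True, False, False, True, False, False, False, False]
query2 = [False, True, True, False, False, False, False, False, False, False]

curve1 = interpolated_precision(query1)
curve2 = interpolated_precision(query2)
average = [(a + b) / 2 for a, b in zip(curve1, curve2)]

for level, p1, p2, avg in zip(range(11), curve1, curve2, average):
    print(f"recall {level / 10:.1f}: Q1={p1:.2f} Q2={p2:.2f} avg={avg:.2f}")
```

The three lists `curve1`, `curve2` and `average` are the series to plot on a single chart for one search engine; repeating the calculation for the second engine gives the data for parts (b) and (c).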



Reviews

len1850348

2/5/2018 11:56:27 PM

This assessment task enables students to demonstrate their proficiency against Unit Learning Outcome 5 (ULO5): Demonstrate data retrieval skills in the context of a data processing system. Please read the rubric carefully, as it outlines the criteria your assessment will be evaluated on. Instructions: Read these instructions. Answer as many questions as possible. Place your name, ID and answers in your document. Please submit your Word file with your answers and graphs (embedded where appropriate) as a SINGLE document in the Submission Portal.

