Reference no: EM132378019
ISYS1078 Web Search Engines and Information Retrieval, RMIT University
Web Search Engines and Information Retrieval ISYS1078/1079
Overview
In this assignment, you will implement a ranked querying system, run a set of sample queries, and evaluate system performance. You will need to design and implement appropriate data structures for efficient searching.
In the first assignment, you implemented text processing and indexing. This second assignment focuses on search; the aims of the assignment are to enhance your under- standing of retrieval models and evaluation, and to give you practical experience with the implementation of search algorithms.
For this assignment, you should build on (by re-using and modifying where required) your indexing code from assignment 1. As usual, wherever code is re-used (including your own) this must be clearly identified in your submission.
The "Web Search Engines and Information Retrieval" Canvas contains further announce- ments and a discussion board for this assignment. Please be sure to check these on a regular basis - it is your responsibility to stay informed with regards to any announce- ments or changes.
Learning Outcomes
This assessment relates to the following learning outcomes:
CLO1: apply information retreival principles to locate relevant information in large collections of data
CLO3: implement features of retrieval systems for web-based and other search tasks
CLO4: analyse the performance of retrieval systems using test collections
CLO5: make practical recommendations about deploying information retrieval sys- tems in different search domains, including considerations for document manage- ment and querying
Description:
In this assignment, you will implement a ranked querying system, run a set of sample queries, and evaluate system performance. You will need to design and implement appropriate data structures for efficient searching.
In the first assignment, you implemented text processing and indexing. This second assignment focuses on search; the aims of the assignment are to enhance your understanding of retrieval models and evaluation, and to give you practical experience with the implementation of search algorithms.
General Requirements
This section contains information about the general requirements that your assignment must meet. Please read all requirements carefully before you start.
You must implement your programs in Java, C, C++, or Python. Your programs should be well written, using good coding style and including appropriate use of comments. Your markers will look at your source code, and coding style may form part of the assessment of this assignment.
You must include a plain text file called "README.txt" with your submission. This file should include the name and student ID of all team members at the top. It needs to clearly explain how to compile and run your programs on (titan|saturn|jupiter). The clarity of the instructions and usability of your programs may form part of the assessment of this assignment.
Your programs may be developed on any machine, but must compile and run on the course machines, (titan|saturn|jupiter). If your submission does not compile and run on these machines, it will not be marked.
The submitted programs must be your own code. You should not use existing external libraries that implement advanced data structures. Where libraries (or in the case of scripting languages, built-in features beyond simple low-level data types) are used for data structures such as hash tables, they must be clearly attributed, and it is up to you to demonstrate a clear understanding of how the library is implemented in the discussion in your assignment report.
• Paths should not be hard-coded.
Where your programs need to create auxiliary files, these should be stored in the current working directory.
Please ensure that your submission follows the file naming rules specified in the tasks below. File names are case sensitive, i.e. if it is specified that the file name is gryphon, then that is exactly the file name you should submit; Gryphon, GRYPHON, griffin, and anything else but gryphon will be rejected.
Programming Tasks
You will find all files needed for this assignment in the directory
/home/inforet/a2
on (titan|saturn|jupiter).csit.rmit.edu.au.
First, have a look at the file latimes. It is part of the TREC ad hoc retrieval test collection, and is comprised of 131896 newspaper documents.
You will need to write programs to perform searches on this data. As a preliminary step, you should therefore be able to index the data efficiently, for example using an inverted index. This will enable to you to easily access the various term occurrence statistics that you need to use for the searching tasks described below.
For indexing, you are encouraged to re-use your code (possibly with modifications) from Assignment 1. However, it is your responsibility to verify the accuracy of your indexing code.
Note: If you have concerns about the functionality of your indexing code from assignment 1, you may optionally make use of a provided inverted index dump file
index dump/invlist-TermFreq.txt
and document map file
index dump/map
If you choose to make use of these files, you will need to load them into memory and retrieve the term occurrence statistics using appropriate data structures. A README.txt file explaining the format is included in the same directory.
1 Ranked Retrieval
Write a program called search that implements the Okapi BM25 similarity function to rank documents in a text collection. An example of how your program should be invoked is:
./search -BM25 -q <query-label> -n <num-results> -l <lexicon> -i <invlists>
-m <map> [-s <stoplist>] <queryterm-1> [<queryterm-2> ... <queryterm-N>]
(or equivalent invocation in another programming language) where
• -BM25 specifies that the BM25 similarity function is to be used.
-q <query-label> is an integer that identifies the current query. For example, you might indicate that the current query label is "1133". This item is only used for output formatting (see below).
-n <num-results> is an integer number specifying the number of top-ranked doc- uments that should be returned as an answer.
-l <lexicon> and -i <invlists> are the inverted index lexicon and inverted list files; and -m <map> is the mapping table from internal document numbers to actual document identifiers. (See Assignment 1 for details. You are encouraged to re-use your indexing code from Assignment 1, with modifications where required.)
-s <stoplist> is an optional argument which may indicate a stoplist; note that if the inverted index was created with stopping, then queries should be stopped in exactly the same way.
<queryterm-1> [<queryterm-2> ... <queryterm-N>] are a variable number of query terms (but at least one query term) that the program will receive as command- line arguments. Each of these terms should be looked up in the lexicon, and the appropriate inverted list data fetched from the inverted list file.
Your search program should be efficient, using suitable data structures and algorithms. As a minimum, your program should:
• Process a query, one term at a time.
Use accumulators to accrue partial similarity scores as each term is processed (see Retrieval Models I lecture notes).
Use a min-heap data structure to select the top n accumulators with the highest similarity scores.
Calculate document weights at indexing time, and store these in the mapping table (map file) for fast lookup at query time.
2 Ranked Retrieval: Report
Create a PDF file called report.pdf. Create a heading called "Ranked Retrieval" in your report file In this section you should:
• Explain your implementation of the BM25 similarity function.
• Clearly describe the steps by which you process a query of one or more terms.
Explain the key algorithms and data structures that you have used. You should refer to specific sections of your code by citing line numbers in the source files to demonstrate where you have implemented these.
This section should be no longer than one and a half pages.
For all questions that involve writing a report, you must remember to cite any sources (including books, research papers, course notes, etc.) that you referred to while designing aspects of your programs.
3 Advanced Information Retrieval Feature
In this section, you will choose one advanced IR feature, which you will then implement and evaluate.
You may choose from the following list. Some of these items have been covered in lectures, while others are new.
All of these topics will require you to first carry out some research: you should aim to identify and read at least one research paper that provides a foundation for the feature that you will implement. This must be cited in your report.
Whichever advanced feature you choose, your implementation/program must be called advanced and clear details for how to run it need to be provided in a README.txt file.
1. Automatic query expansion (pseudo-relevance feedback): Implement an extension to your ranked retrieval approach that implements an automatic query expansion approach (for example, the Okapi Term Selection Value approach, or another approach).
Such an approach should include at least two parameters that can be set at run- time: R, the number of top-ranked documents that are assumed to be relevant; and E, the number of terms that should be appended to the original query.
2. Manual relevance feedback: Implement an extension to your ranked retrieval approach that implements manual relevance feedback.
Such approaches must include an interactive feedback phase, where a user is pre- sented with a list of documents in response to an initial query, and is given the opportunity to mark certain documents as relevant (and, optionally, to mark some as non-relevant).
The initial query then needs to be updated, for example using Rocchio's approach. Finally, the updated query is run to retrieve the final answer list.
3. Phrase search: Extend your ranked retrieval implementation to support phrase queries (for example, the query "cold fusion" would return all documents (and only those documents) that contain the term cold followed directly by the term fusion).
Note that to support phrase search, you will first need to extend your inverted index to also store term positions within documents.
4. Diversity ranking: Extend your ranked retrieval approach to include a diversity component. For example, for an ambiguous query such as java, a diversity approach would seek to ensure that the top answers include documents that cover various interpretations of the query (the island, the coffee, the programming language).
An example of a diversity approach is Maximal Marginal Relevance (MMR).1
5. Disk-based inverted index construction: Modern collections are often too large to fit into main memory. Constructing an inverted index therefore requires various efficiency considerations.
The lecture on indexing included information on an external sort-based approach for constructing an inverted index. Implement this as a baseline, then explore and implement a more advanced approach.
6. Document summaries: Extend your search system to produce short document summaries with each answer item returned in a search results list.
You must implement at least two summary creation approaches. One of these should be based on query-biased information, taking the user's query into account. The second should include evidence other than the user's current query (for example, a very simple choice would be to simply return the first chunk of a document).
7. Other advanced feature: If you wish to propose another advanced IR feature, you are welcome to do so. However, you must send an email to the lecturer to discuss this, and to agree on the scope of what such a feature needs to cover, and how it should be evaluted.
4 Advanced Information Retrieval Features: Report
In your report file, create a heading "Advanced IR Feature". In this section, you should explain:
Your chosen advanced IR feature. This must include a clear description of what you are aiming to achieve. You must include a reference to at least one research paper that you have referred to as part of your design.
• A detailed explanation of how you implemented the feature.
This section should be no longer than one and a half pages.
5 Evaluation: Report
You have implemented at least two aspects of an IR system. Your task now is to exper- imentally evaluate the relative performance of these. Depending on which advanced IR feature you have chosen, you should perform one of the following evaluations.
Effectiveness evaluation (advanced features 1 and 2)
The file topics contains five queries (one on each line) with the following format:
401 foreign minorities germany
402 behavioral genetics
403 osteoporosis
405 cosmic events
408 tropical storms
where the first item is a query number (identifier) followed by a space, and the subsequent terms on the line is the query that is to be submitted to your search system.
A corresponding file, qrels, contains relevance judgements for these queries and the collection that you have indexed. These judgements were made by a human assessor, who considered the query and the answer document, and decided whether the document is relevant for the given query. An example line from the file is:
401 0 LA010189-0003 0
401 0 LA010489-0026 1
401 0 LA010689-0066 0
where the first item is a query number, the second item should be ignored, the third is a document identifier, and the fourth item is a relevance value (a "1" indicates that this document is relevant for the query, and a "0" indicates not relevant). So, the example above indicates that the document LA010489-0026 is relevant for query number 401, and the documents LA010189-0003 and LA010689-0066 are not relevant.
Answer the following questions in your report:
1. Your first task is to run the queries using both of your retrieval approaches. You should generate answer lists of length 20. You must include the full answer lists for both systems in your report.
2. Compare the systems using precision at 10 documents retrieved (P@10) as a metric. Explain this performance measurement, and discuss which system is better based on the scores. (You do not need to carry out statistical significance tests.)
Is this a sensible metric for comparing the two systems that you have implemented? Why or why not? You should write no more than half a page for this section.
3. Based on your knowledge of retrieval evaluation, choose a different evaluation metric to compare the two systems.
Explain whether plain ranked retrieval or your chosen advanced approach is better according to your chosen metric. Describe why your chosen metric is appropriate for comparing the performance of these two systems. You should write no more than half a page for this section.
Effectiveness evaluation for phrase search (advanced feature 3)
Consider the set of test queries described in section 5.1. Then answer the following questions in your report:
1. Your first task is to run the queries using both of your retrieval approaches (ranked retrieval, and phrase retrieval). You should generate answer lists of length 20 for ranked retrieval, and report the full answer list for your phrase search. Include the answer lists in your report.
2. Compare the systems using precision at 10 documents retrieved (P@10) as a metric. Explain this performance measurement, and discuss which system is better based on the scores. (You do not need to carry out statistical significance tests.)
Is this a sensible metric for comparing the two systems that you have implemented? Why or why not? You should write no more than half a page for this section.
3. Propose a way to evaluate your phrase retrieval search. You should explain the proposed process, and any components that you would need, in detail. However, you do not need to create the components and carry out an actual evaluation. You should write no more than half a page for this section.
Effectiveness evaluation for diversity (advanced feature 4)
Consider the set of test queries described in section 5.1. Then answer the following questions in your report:
1. Your first task is to run the queries using both of your retrieval approaches (ranked retrieval, and diversity retrieval). You should generate answer lists of length 20 for both your ranked retrieval and diversity approaches. Include the answer lists in your report.
2. Compare the systems using precision at 10 documents retrieved (P@10) as a metric. Explain this performance measurement, and discuss which system is better based on the scores. (You do not need to carry out statistical significance tests.)
Is this a sensible metric for comparing the two systems that you have implemented? Why or why not? You should write no more than half a page for this section.
3. Propose a way to evaluate your diversity ranking. You should explain the proposed process, and any components that you would need, in detail. However, you do not need to create the components and carry out an actual evaluation. You should write no more than half a page for this section.