Reference no: EM133361773
Information Retrieval and Web Search
Programming Assignment - Text Parser
Description: An IR engine should include at least the following major components: a text parser, an indexer, and a retrieval system. Your first programming assignment is to build the first component, the text parser, which will be used by subsequent assignments. You may implement it in any programming language you are familiar with.
Note: If you decide to use C++, consider using the C++ STL (Standard Template Library), which has all the necessary classes. Get familiar with the different types of containers available in the STL, along with the methods they provide.
A text parser should include the following functionality:
Tokenizer: Reads a document into memory and tokenizes it into separate words; returns a token stream. Basic tokenization rules:
• remove numbers
• ignore any word that contains digits
• split on all non-alphanumeric characters (such as punctuation marks, spaces, hyphens, and apostrophes)
• convert to lowercase
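The tokenization rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `tokenize` is hypothetical, and "alphanumeric" is taken to mean ASCII letters and digits, which the assignment does not specify.

```python
import re


def tokenize(text: str) -> list[str]:
    """Return the token stream for one document's text."""
    tokens = []
    # Split on every run of non-alphanumeric characters
    # (assumption: ASCII letters and digits only).
    for raw in re.split(r"[^A-Za-z0-9]+", text):
        if not raw:
            continue  # empty pieces appear at the string boundaries
        if any(ch.isdigit() for ch in raw):
            continue  # drops pure numbers and words containing digits
        tokens.append(raw.lower())
    return tokens
```

Because splitting happens before the digit check, a string such as "3.5" is first split into "3" and "5", and both pieces are then dropped as numbers.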
WordDictionary: Build a dictionary that assigns each unique word/token a unique numerical ID and keeps this mapping information. A stemming algorithm should be applied, so that the dictionary stores stemmed terms.
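One possible shape for such a dictionary is sketched below. The class and method names are hypothetical, IDs are assumed to start at 1 and follow first-seen order, and `toy_stem` is a deliberately crude placeholder rather than a real stemming algorithm; a proper stemmer (for example a Porter stemmer implementation) would be passed in through the `stem` parameter instead.

```python
def toy_stem(word: str) -> str:
    """Placeholder suffix stripper. NOT a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


class WordDictionary:
    """Maps each unique stemmed term to a unique numerical ID."""

    def __init__(self, stem=toy_stem):
        self._stem = stem   # any callable str -> str
        self._ids = {}      # stemmed term -> ID

    def add(self, token: str) -> int:
        """Stem the token, register it if unseen, and return its ID."""
        term = self._stem(token)
        if term not in self._ids:
            # Assumption: IDs start at 1 and grow in first-seen order.
            self._ids[term] = len(self._ids) + 1
        return self._ids[term]

    def items(self):
        """Expose (term, ID) pairs, e.g. for writing a dictionary file."""
        return self._ids.items()
```

With the placeholder stemmer, "parsing" and "parsed" both reduce to the same term and therefore share one ID, which is the effect stemming is meant to have on the dictionary.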
FileDictionary: You also need to keep a dictionary that maps each document name to a unique numerical ID.
Data: We are using the TREC data, in which a single file contains multiple documents, each marked up with its own tags. You therefore cannot treat each file as a single document; you need to parse each file into separate documents.
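A minimal sketch of splitting one file into documents and assigning document IDs follows. It assumes the common TREC layout in which each document is wrapped in `<DOC>...</DOC>`, named by `<DOCNO>...</DOCNO>`, and carries its body in `<TEXT>...</TEXT>`; check the tags against the actual data files, since some collections differ. The names `split_trec` and `FileDictionary.add` are hypothetical.

```python
import re

# Assumed TREC tag layout; adjust if the collection uses other tags.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def split_trec(content: str):
    """Yield (document name, document text) for each <DOC> in one file."""
    for match in DOC_RE.finditer(content):
        body = match.group(1)
        docno = DOCNO_RE.search(body)
        if docno is None:
            continue  # skip malformed entries that have no name
        text = " ".join(part.strip() for part in TEXT_RE.findall(body))
        yield docno.group(1), text


class FileDictionary:
    """Maps each document name to a unique numerical ID."""

    def __init__(self):
        self._ids = {}

    def add(self, name: str) -> int:
        # Assumption: IDs start at 1 and follow first-seen order.
        return self._ids.setdefault(name, len(self._ids) + 1)
```

Reading a whole file into a string and applying `split_trec` keeps the example short; for very large files a streaming, line-by-line parser would use less memory.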
Testing: Print out the document IDs and token streams to check that the documents are parsed properly. Store the output in a file called "parser_output.txt" in the following form:
Document Preprocessing Steps:
• Tokenization to handle numbers, punctuation marks, and the case of letters (upper/lower)
• Elimination of stopwords
• Stemming of the remaining words
• Selection of terms for the term dictionary
• Creating the dictionary file (Term Dictionary and Document Dictionary)
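The preprocessing steps above can be chained as in the sketch below. Everything named here is an assumption made for illustration: the stopword list is a tiny sample rather than a standard list, `toy_stem` is a placeholder rather than a real stemmer, every surviving stem is selected as a term, and the tab-separated "term, ID" line layout is only one possible dictionary file format.

```python
import re

# Tiny illustrative stopword list; a real run would load a fuller one.
STOPWORDS = frozenset({"a", "an", "and", "in", "is", "of", "the", "to"})


def tokenize(text):
    """Split on non-alphanumerics, drop tokens with digits, lowercase."""
    pieces = re.split(r"[^A-Za-z0-9]+", text)
    return [p.lower() for p in pieces
            if p and not any(ch.isdigit() for ch in p)]


def toy_stem(word):
    """Placeholder suffix stripper. NOT a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, stopwords=STOPWORDS, stem=toy_stem):
    """Tokenize, eliminate stopwords, then stem the remaining words."""
    return [stem(tok) for tok in tokenize(text) if tok not in stopwords]


def build_dictionaries(docs):
    """Build term and document dictionaries from (name, text) pairs."""
    term_ids, doc_ids = {}, {}
    for name, text in docs:
        doc_ids.setdefault(name, len(doc_ids) + 1)
        for term in preprocess(text):
            # Assumption: every stem that survives is kept as a term.
            term_ids.setdefault(term, len(term_ids) + 1)
    return term_ids, doc_ids


def dictionary_lines(mapping):
    """Format a dictionary as 'key<TAB>id' lines, ordered by ID."""
    ordered = sorted(mapping.items(), key=lambda kv: kv[1])
    return [f"{key}\t{idx}" for key, idx in ordered]
```

The lists returned by `dictionary_lines` can be written out with an ordinary `open(path, "w")` loop to produce the term dictionary and document dictionary files.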