Build the first component: the text parser


Reference no: EM133361773

Information Retrieval and Web Search

Programming Assignment - Text Parser

Description: An IR engine should include at least the following major components: a text parser, an indexer, and a retrieval system. Your first programming assignment is to build the first component, the text parser, which will be used by subsequent assignments. You may implement it in any language you are familiar with.

Note: If you decide to use C++, consider using the C++ STL (Standard Template Library), which has all the necessary classes. Get familiar with the different types of containers available in the STL and the methods they provide.

A text parser should include the following functionalities:

Tokenizer: Reads a document into memory and tokenizes it into separate words; returns a token stream. Basic tokenization rules:

• remove numbers
• ignore a word if it contains numbers
• split on all non-alphanumeric characters (such as punctuation marks, spaces, hyphens, and apostrophes)
• convert to lower case
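
The four rules above can be sketched as a single function. This is only an illustration, not a required implementation: the function name `tokenize` is hypothetical, and the sketch assumes that "alphanumeric" means ASCII letters and digits.

```python
import re


def tokenize(text):
    """Return the token stream for one document's text."""
    # Split on runs of anything that is not an ASCII letter or digit.
    # This covers punctuation, whitespace, hyphens and apostrophes.
    raw_tokens = re.split(r"[^A-Za-z0-9]+", text)
    tokens = []
    for tok in raw_tokens:
        if not tok:
            continue  # empty strings appear at the edges of the split
        if any(ch.isdigit() for ch in tok):
            continue  # drops pure numbers and words that contain digits
        tokens.append(tok.lower())
    return tokens


if __name__ == "__main__":
    print(tokenize("The well-known B2B firm's 3 offices closed in 1999."))
```

Note that splitting on apostrophes turns "firm's" into the two tokens "firm" and "s"; whether such one-letter fragments should be kept is left open by the rules above.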

WordDictionary: Build a dictionary that assigns each unique word/token a unique numerical ID and keeps this mapping information. A stemming algorithm should be used.
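
One possible shape for such a dictionary is sketched below. The class layout, the choice to start IDs at 1, and the `toy_stem` function are all assumptions made for illustration. In particular, `toy_stem` is a crude suffix stripper standing in for a real stemming algorithm such as Porter's; it is not one.

```python
class WordDictionary:
    """Maps each unique (stemmed) term to a unique integer ID."""

    def __init__(self, stem=None):
        # `stem` is any callable str -> str; identity if none is given.
        self.stem = stem or (lambda word: word)
        self.word_to_id = {}

    def add(self, token):
        """Stem the token, assign an ID on first sight, return the ID."""
        term = self.stem(token)
        if term not in self.word_to_id:
            # Assumption: IDs start at 1 and grow in order of first appearance.
            self.word_to_id[term] = len(self.word_to_id) + 1
        return self.word_to_id[term]


def toy_stem(word):
    """Placeholder stemmer (NOT Porter): strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    words = WordDictionary(stem=toy_stem)
    for token in ["parsing", "parsed", "parser", "parsers"]:
        print(token, words.add(token))
```

Because stemming happens before the lookup, inflected forms that reduce to the same stem share one ID.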

FileDictionary: You also need to keep a dictionary that maps each document name to a unique numerical ID.

Data: We are using the TREC data, which contains multiple documents in a file and tags them separately. You therefore cannot treat each file as a single document; you need to parse each file into separate documents.
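
A minimal sketch of splitting one file into documents and building the FileDictionary follows. It assumes the usual TREC SGML-style markup, where each document is wrapped in `<DOC>...</DOC>`, named by `<DOCNO>`, and has its body in `<TEXT>`; check the tags in your copy of the data, since collections differ. The function names and the sample document names are hypothetical.

```python
import re

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def split_trec(file_content):
    """Return a list of (document name, text) pairs found in one file."""
    docs = []
    for match in DOC_RE.finditer(file_content):
        body = match.group(1)
        docno = DOCNO_RE.search(body)
        if docno is None:
            continue  # skip malformed entries that carry no document name
        # A document may contain more than one <TEXT> section.
        text = " ".join(TEXT_RE.findall(body))
        docs.append((docno.group(1), text))
    return docs


def build_file_dictionary(docs):
    """Map each document name to a unique integer ID (starting at 1)."""
    file_dict = {}
    for docno, _ in docs:
        file_dict.setdefault(docno, len(file_dict) + 1)
    return file_dict


if __name__ == "__main__":
    sample = (
        "<DOC>\n<DOCNO> DOC-0001 </DOCNO>\n<TEXT>\nFirst document.\n</TEXT>\n</DOC>\n"
        "<DOC>\n<DOCNO> DOC-0002 </DOCNO>\n<TEXT>\nSecond document.\n</TEXT>\n</DOC>\n"
    )
    documents = split_trec(sample)
    print(build_file_dictionary(documents))
```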

Testing: You should print out document IDs and token streams to check that documents are parsed properly. Store the output in a file called "parser_output.txt" in the following form:

Document Preprocessing Steps:
• Tokenization to handle numbers, punctuation marks, and the case of letters (upper/lower)
• Elimination of stopwords
• Stemming of the remaining words
• Selection of terms for the term dictionary
• Creating the dictionary file (Term Dictionary and Document Dictionary)
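
The preprocessing steps listed above can be chained as in the sketch below. It is an illustration under several assumptions: the stopword list is a tiny made-up sample rather than a standard list, stemming defaults to the identity function so that any stemmer can be plugged in, and the names `preprocess` and `build_term_dictionary` are hypothetical.

```python
import re

# Tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def preprocess(text, stem=lambda word: word):
    """Tokenize, drop stopwords, then stem what remains."""
    # Tokenization: split on non-alphanumerics, drop tokens with digits, lowercase.
    tokens = [
        tok.lower()
        for tok in re.split(r"[^A-Za-z0-9]+", text)
        if tok and not any(ch.isdigit() for ch in tok)
    ]
    # Stopword elimination happens before stemming, as in the list above.
    kept = [tok for tok in tokens if tok not in STOPWORDS]
    return [stem(tok) for tok in kept]


def build_term_dictionary(term_stream, term_dict=None):
    """Assign the next free integer ID to every term not seen before."""
    term_dict = {} if term_dict is None else term_dict
    for term in term_stream:
        term_dict.setdefault(term, len(term_dict) + 1)
    return term_dict


if __name__ == "__main__":
    stream = preprocess("The indexing of 20 documents is done.")
    print(stream)
    print(build_term_dictionary(stream))
```

Passing the same `term_dict` back in for each new document keeps term IDs consistent across the whole collection.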


Course: CSCE 5200 Information Retrieval and Web Search, University of North Texas
