Apply your IR skills to build a processing pipeline

Reference no: EM132238093

Information Retrieval Assignment - Indexing for Web Search

The assignment involves building a processing pipeline that turns a website into structured knowledge. All the instructions and questions for the task are given in the PDF file attached below. Detailed explanations are needed, and the specific requirements are set out below.

The Task - Your task is to apply your IR skills to build a processing pipeline that turns a Web site into structured knowledge (thus enhancing your chances of getting the job outlined above). Your system should take HTML pages as input, process them using the kind of techniques that we have been looking at in the module, and output an index of terms identified in the documents.

This assignment comes in stages. Marks are given for each stage. You may choose not to attempt some stages. You might also implement a system that does not strictly follow the stages but will work in the same way. The stages are as follows:

  • Engineering a Complete System - The system you develop must be able to read Web pages from a specified set of URLs and produce appropriately formatted output. The Web pages should be processed one at a time using the steps outlined below. The final system should control all the individual components, so that a single call performs all the steps outlined below.
  • HTML Parsing - Before the text can be analyzed it is necessary to get rid of the HTML tags. The result will be plain text. Note that if you simply delete all HTML tags, you will lose information such as meta tag keywords. Use an appropriate tool to perform this task.
  • Pre-processing: Sentence Splitting, Tokenization and Normalization (10%) - The next step should be to transform the input text into a normal form of your choice. This should include the identification of sentences, bullet points and cells in tables.
  • Part-of-Speech Tagging - The input should be tagged with a suitable part-of-speech tagger, so that the result can then be processed in the next steps.
  • Selecting Keywords - One aim of your system is to identify the words and phrases in the text that are most useful for indexing purposes. Your system should remove words which are not useful, such as very frequent words or stop words, and identify phrases suitable as index terms. Apply tf.idf as part of your selection and weighting step.
  • Stemming or Morphological Analysis - Writing word stems to the database rather than words allows the various inflected forms of a word to be treated in the same way, e.g. bus and busses refer to exactly the same thing even though they are different words.
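
The first stages listed above (HTML parsing that keeps the meta keywords, sentence splitting, tokenization, normalization and stop-word removal) can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the expected solution: the names `TextExtractor`, `parse_html`, `split_sentences`, `tokenize`, `remove_stop_words` and `run_pipeline` are hypothetical, the stop list is deliberately tiny, and the regular-expression sentence splitter is a crude stand-in for a proper tool.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stop list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


class TextExtractor(HTMLParser):
    """Collects visible text and the content of <meta name="keywords">."""

    SKIP = {"script", "style"}  # content of these tags is not visible text
    BREAK = {"p", "li", "td", "th", "tr", "br", "div", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.chunks = []      # visible text fragments
        self.keywords = []    # terms taken from the meta keywords tag
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.keywords += [k.strip() for k in content.split(",") if k.strip()]
        elif tag in self.BREAK:
            self.chunks.append("\n")  # keep bullet points and table cells apart

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BREAK:
            self.chunks.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def parse_html(html):
    """Return (plain_text, meta_keywords) for one HTML page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks), parser.keywords


def split_sentences(text):
    """Split on line breaks, then after '.', '!' or '?' followed by whitespace."""
    sentences = []
    for line in text.split("\n"):
        for part in re.split(r"(?<=[.!?])\s+", line.strip()):
            if part:
                sentences.append(part)
    return sentences


def tokenize(sentence):
    """Lowercase the sentence and keep alphanumeric runs (a crude normal form)."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def run_pipeline(html):
    """Single entry point that runs every stage of this sketch in order."""
    text, keywords = parse_html(html)
    sentences = split_sentences(text)
    tokens = [remove_stop_words(tokenize(s)) for s in sentences]
    return {"keywords": keywords, "sentences": sentences, "tokens": tokens}
```

Fetching each page from its URL (for example with `urllib.request`) would sit in front of `parse_html`. Part-of-speech tagging and stemming are left out of the sketch because they are usually taken from an existing open-source NLP toolkit rather than written by hand.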

The report for the Assignment should be written in Microsoft Word format with the following information:

a. Description of the implementation

b. Output produced when the system is applied to the 2 web pages given in the assignment.

c. Output produced by each stage of the processing pipeline for each of the two files.

d. Discussion of your solution focusing on functionality implemented and possible improvements/extensions.

All this information has been listed out in the assignment instructions.
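
Item (c) asks for the output of every stage for each of the two files. One possible way to produce it is to write one file per stage as the pipeline runs. The sketch below is only an assumption about how that could look: the helper `write_stage_outputs`, the JSON format and the file naming scheme are all hypothetical and are not prescribed by the assignment.

```python
import json
from pathlib import Path


def write_stage_outputs(doc_name, stages, out_dir):
    """Write one JSON file per stage and return the paths in stage order.

    `stages` is a list of (stage_name, data) pairs, where data is anything
    JSON-serialisable. File names such as page1.01_parsed.json are an
    assumed convention, chosen so the files sort in pipeline order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, (stage_name, data) in enumerate(stages, start=1):
        path = out_dir / f"{doc_name}.{number:02d}_{stage_name}.json"
        path.write_text(
            json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8"
        )
        paths.append(path)
    return paths
```

Numbering the files keeps the stages in order in a directory listing, which makes it easier to label them in the report.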

Attachment:- Assignment File.rar


Apply your IR skills to build a processing pipeline : CE306 - Information Retrieval Assignment - Indexing for Web Search, University of Essex, UK. Your task is to apply your IR skills to build processing pipeline

Reviews

len2238093

2/20/2019 2:02:41 AM

Instructions - The task involves building a processing pipeline that turns a website into structured knowledge. All the instructions and questions for the task are given in the PDF file attached below. Detailed explanations are needed, and the specific requirements must be followed. The code should be well commented, as required in the assignment instructions.

The report for the Assignment should be written in Microsoft Word format with the following information: a. Description of the implementation b. Output produced when the system is applied to the 2 web pages given in the assignment. c. Output produced by each stage of the processing pipeline for each of the two files. d. Discussion of your solution focusing on functionality implemented and possible improvements/extensions. All this information has been listed out in the assignment instructions. A simple README text file should be included explaining how I should run the code and program. Please label each piece of work in the report correctly so that the parts can be distinguished from one another. In general, all work should be labeled correctly and be easy to identify.

You will have noticed that the percentages above only add up to 70%. This is because one of the important aspects of the project is that your work should be well documented and your code well commented. 30% of your mark will come from this. In addition to the actual code you should submit: A description of your implementation: what the code does, and the software you used. You may work in pairs. If you do, you each need to submit the same report (please include information about which two reports should be treated as a pair). Both members of a pair will get the same mark unless there is reason to do otherwise.

Output produced by each stage of the processing pipeline for each of the two files, i.e. in the suggested staged architecture outlined above this would be output produced by the HTML parser, followed by the output of the sentence splitter, the tokenizer etc. Each stage should produce a separate file (however, to calculate weights such as tf.idf for the final index terms you will have to consult data derived from all documents and it is up to you how exactly you include that step in your processing pipeline). A short discussion of your solution focussing on functionality implemented and possible improvements and extensions.
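
Because tf.idf depends on how many documents contain each term, the weighting step has to run after every page has been tokenized. The sketch below illustrates this. It assumes raw term frequency and idf = log(N / df), which is only one common variant and not necessarily the one expected in the assignment. The function name `tf_idf` and the document names in the usage note are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight the terms of each document against the whole collection.

    `docs` maps a document name to its list of index terms.
    Returns {doc_name: {term: weight}} with weight = tf * log(N / df).
    """
    n_docs = len(docs)
    df = Counter()  # number of documents that contain each term
    for terms in docs.values():
        df.update(set(terms))

    weights = {}
    for name, terms in docs.items():
        tf = Counter(terms)  # raw count of each term in this document
        weights[name] = {
            term: count * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return weights
```

For `docs = {"page1": ["bus", "bus", "timetable"], "page2": ["train", "timetable"]}`, the term "timetable" occurs in both pages and so gets weight 0 under this variant, while "bus" gets 2 * log(2). With only two input pages, any term shared by both is zeroed out in this way, so a smoothed idf may be worth discussing in the report.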

You can implement your system on either the Linux or the Windows machines. Python, Java, Perl, C/C++, and shell scripts are good choices for this project, but you are by no means restricted to those languages. Identify suitable open-source tools that help you build your pipeline. Submission - The assignment, which counts for 20% of the overall mark, should be submitted as a single zip file via the electronic submission system by Friday, 22, 11:59 (mid-day). The guidelines about late assignments are explained in the students’ handbook.


All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd