Apply your IR skills to build a processing pipeline

Reference no: EM132238093

Information Retrieval Assignment - Indexing for Web Search

The assignment involves building a processing pipeline that turns a website into structured knowledge. All the instructions and questions for the task are given in the attached PDF file. Detailed explanations are needed. The specific requirements are as follows.

The Task - Your task is to apply your IR skills to build a processing pipeline that turns a Web site into structured knowledge (thus enhancing your chances of getting the job outlined above). Your system should take HTML pages as input, process them using the kind of techniques that we have been looking at in the module, and output an index of terms identified in the documents.
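As a rough illustration of what "an index of terms identified in the documents" could look like, the sketch below builds a small positional inverted index. The function name `build_index`, the page names and the term lists are hypothetical; the assignment does not prescribe an output format, so this is only one possible shape.

```python
from collections import defaultdict


def build_index(doc_terms):
    """Map each term to the documents and token positions where it occurs.

    doc_terms: dict of document name -> list of already extracted terms
    (assumed to come from earlier pipeline stages).
    """
    index = defaultdict(dict)
    for doc_name, terms in doc_terms.items():
        for position, term in enumerate(terms):
            index[term].setdefault(doc_name, []).append(position)
    # Return a plain dict with terms in alphabetical order for readable output.
    return {term: index[term] for term in sorted(index)}


if __name__ == "__main__":
    # Hypothetical term lists standing in for two processed pages.
    example = {
        "page1.html": ["web", "search", "web"],
        "page2.html": ["search"],
    }
    for term, postings in build_index(example).items():
        print(term, postings)
```

Storing positions is optional; a simpler index could keep only the set of documents per term.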

This assignment comes in stages. Marks are given for each stage. You may choose not to attempt some stages. You might also implement a system that does not strictly follow the stages but will work in the same way. The stages are as follows:

  • Engineering a Complete System - The system you develop must be able to read Web pages from a specified set of URLs and produce appropriately formatted output. The Web pages should be processed one at a time using the steps outlined below. The final system should control all the individual components, so that a single call performs all the steps outlined below.
  • HTML Parsing - Before the text can be analyzed it is necessary to get rid of the HTML tags. The result will be plain text. Note that if you simply delete all HTML tags, you will lose information such as meta tag keywords. Use an appropriate tool to perform this task.
  • Pre-processing - Sentence Splitting, Tokenization and Normalization (10%): The next step is to transform the input text into a normal form of your choice. This should include identifying sentences, bullet points and table cells.
  • Part-of-Speech Tagging - The input should be tagged with a suitable part-of-speech tagger, so that the result can then be processed in the next steps.
  • Selecting Keywords - One aim of your system is to identify the words and phrases in the text that are most useful for indexing purposes. Your system should remove words which are not useful, such as very frequent words or stop words, and identify phrases suitable as index terms. Apply tf.idf as part of your selection and weighting step.
  • Stemming or Morphological Analysis - Writing word stems to the database rather than full words allows the various inflected forms of a word to be treated in the same way, e.g. bus and busses refer to the same thing even though they are different word forms.
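The stages above can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions, not the required design: the class and function names, the tiny stop word list and the plural-stripping `toy_stem` are all hypothetical placeholders. Part-of-speech tagging is left out because it needs an external tagger, and a real system would replace `toy_stem` with an established stemmer such as the Porter stemmer.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stop word list; a real system needs a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


class TextAndKeywordExtractor(HTMLParser):
    """Collects visible text and the content of <meta name="keywords">."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.keywords = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta":
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "keywords":
                content = attr.get("content") or ""
                self.keywords += [k.strip() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_html(html):
    """Strip tags but keep the meta keywords that plain tag deletion would lose."""
    parser = TextAndKeywordExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.keywords


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(token):
    """Placeholder that only strips a plural 's'; not a real stemming algorithm."""
    if len(token) > 3 and token.endswith("s") and not token.endswith(("ss", "us")):
        return token[:-1]
    return token


def run_pipeline(html):
    """Single entry point that runs every stage and keeps each stage's output."""
    text, keywords = parse_html(html)
    tokens = tokenize(text)
    filtered = remove_stop_words(tokens)
    stems = [toy_stem(t) for t in filtered]
    return {"text": text, "keywords": keywords, "tokens": tokens,
            "filtered": filtered, "stems": stems}
```

Returning the output of every stage from one call keeps the "single call" requirement while still making the intermediate results available for inspection.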

The report for the Assignment should be written in Microsoft Word format with the following information:

a. Description of the implementation

b. Output produced when the system is applied to the 2 web pages given in the assignment.

c. Output produced by each stage of the processing pipeline for each of the two files.

d. Discussion of your solution focusing on functionality implemented and possible improvements/extensions.

All this information has been listed out in the assignment instructions.

Attachment:- Assignment File.rar

CE306 - Information Retrieval Assignment - Indexing for Web Search, University of Essex, UK.

Reviews

len2238093

2/20/2019 2:02:41 AM

Instructions - The task involves building a processing pipeline that turns a website into structured knowledge. All the instructions and questions for the task are given in the attached PDF file. Detailed explanations are needed. The code should be well commented, as required in the assignment instructions.

len2238093

2/20/2019 2:02:36 AM

The report for the Assignment should be written in Microsoft Word format with the following information: a. Description of the implementation. b. Output produced when the system is applied to the 2 web pages given in the assignment. c. Output produced by each stage of the processing pipeline for each of the two files. d. Discussion of your solution focusing on functionality implemented and possible improvements/extensions. All this information has been listed out in the assignment instructions. A simple readme text file should be included explaining how to run the code. Each part of the report should be labeled clearly so that the parts can be distinguished from one another. In general, all work should be labeled correctly and be easy to identify.

len2238093

2/20/2019 2:02:30 AM

You will have noticed that the percentages above only add up to 70%. This is because one of the important aspects of the project is that your work should be well documented and your code well commented. 30% of your mark will come from this. In addition to the actual code you should submit: A description of your implementation: what the code does, and the software you used. You may work in pairs. If you do, you each need to submit the same report (please include information about which two reports should be treated as a pair). Both members of a pair will get the same mark unless there is reason to do otherwise.

len2238093

2/20/2019 2:02:25 AM

Output produced by each stage of the processing pipeline for each of the two files, i.e. in the suggested staged architecture outlined above this would be output produced by the HTML parser, followed by the output of the sentence splitter, the tokenizer etc. Each stage should produce a separate file (however, to calculate weights such as tf.idf for the final index terms you will have to consult data derived from all documents and it is up to you how exactly you include that step in your processing pipeline). A short discussion of your solution focussing on functionality implemented and possible improvements and extensions.
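Because tf.idf depends on all documents, one option is to compute it in a final pass after every page has gone through the per-document stages. The sketch below uses one common variant (term frequency normalised by document length, idf = log(N / df)); the assignment does not prescribe a formula, and the function name `tf_idf` and the sample term lists are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute tf.idf weights for every term in every document.

    docs: dict of document name -> list of index terms for that document.
    Returns: dict of document name -> {term: weight}.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))

    weights = {}
    for name, terms in docs.items():
        tf = Counter(terms)
        weights[name] = {
            term: (count / len(terms)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return weights


if __name__ == "__main__":
    # Hypothetical term lists standing in for the two processed pages.
    sample = {
        "page1": ["index", "web", "web"],
        "page2": ["index", "search"],
    }
    for name, term_weights in tf_idf(sample).items():
        print(name, term_weights)
```

With only two pages, idf can take just two values under this variant: log(2) for a term found in one page and 0 for a term found in both, so terms shared by both pages receive zero weight.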

len2238093

2/20/2019 2:02:20 AM

You can implement your system on either the Linux or the Windows machines. Python, Java, Perl, C/C++, and shell scripts are good choices for this project, but you are by no means restricted to those languages. Identify suitable open-source tools that help you build your pipeline. Submission - The assignment, which counts for 20% of the overall mark, should be submitted as a single zip file via the electronic submission system by Friday, 22, 11:59 (mid-day). The guidelines about late assignments are explained in the students’ handbook.


All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd