Solution-CE706 Information Retrieval Assignment



Reference no: EM132801237

CE706 Information Retrieval - University of Essex

Scenario: In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 181,000 scholarly articles, including over 80,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community so that recent advances in information retrieval and other AI techniques can be applied to generate new insights in support of the ongoing fight against this infectious disease. These approaches are increasingly urgent because the rapid growth of new coronavirus literature makes it difficult for the medical research community to keep up.

Your task

This task comes in stages. Marks are given for each stage. The stages are as follows:
• Indexing (20%) The first step for you will be to obtain the dataset. Once you have done so, upload a sample of 1000 articles with full text to Elasticsearch (the simplest thing is to use the first 1000 documents). You will work with the metadata.csv file provided by the challenge.
• Sentence Splitting, Tokenization and Normalization - The next step should be to transform the input text into a normal form of your choice. This should include the identification of sentences, bullet points and cells in tables.
• Selecting Keywords - One aim of your system is to identify the words and phrases in the text that are most useful for indexing purposes. Your system should remove words which are not "useful", e.g. very frequent words or stopwords. You should also identify phrases suitable as index terms. Apply tf.idf as part of your selection and weighting step.
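
The tokenization and keyword-selection stages above can be illustrated with a minimal sketch in Python, using only the standard library. It is one possible reading of the brief, not a required design: the stopword list, the function names (split_sentences, tokenize, keywords) and the weighting tf * log(N / df) are assumptions chosen for illustration. Phrase detection and the handling of bullet points and table cells are not covered.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "for", "on", "with"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(text):
    """Lowercase the text and keep alphanumeric runs, allowing internal hyphens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def keywords(docs, top_n=3):
    """Return the top_n (term, weight) pairs per document, weighted by tf * log(N / df)."""
    tokenized = [[t for t in tokenize(d) if t not in STOPWORDS] for d in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # document frequency: count each term once per document
    n_docs = len(docs)
    result = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        # Sort by descending weight, breaking ties alphabetically for stable output.
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        result.append(ranked[:top_n])
    return result

docs = [
    "Coronavirus transmission in hospitals. Masks reduce transmission.",
    "Vaccine trials for coronavirus are ongoing.",
    "Hospitals report vaccine shortages.",
]
for doc, top in zip(docs, keywords(docs)):
    print(split_sentences(doc)[0], "->", [term for term, _ in top])
```

Terms that occur in every document receive a weight of zero under this formula, which is the intended effect of the idf factor: very frequent words drop out of the index-term candidates even when they are not in the stopword list.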

• Stemming or Morphological Analysis - Writing word stems to the database rather than words allows various inflected forms of a word to be treated in the same way, e.g. bus and busses refer to exactly the same thing even though they are different words.
• Searching (10%) Once you have indexed the collection, you want to be able to search it. You can do that on the command line, but it would be much better to have an interactive system. You could start with Kibana for that, but you are free to use other open source tools for your Graphical User Interface (GUI). Note that each article in the collection contains different fields. Make sure that a user can decide which field to search (Hint: one of the fields is the publication date of the article).
• Engineering a Complete System - The final system should allow a user to have control over all the individual components, so that the final result is a complete search engine, not disparate code.
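
As a rough illustration of the indexing and searching stages listed above, the sketch below builds request bodies for Elasticsearch without contacting a server. It assumes that metadata.csv has columns named cord_uid, title, abstract and publish_time, and that publish_time is mapped as a date in the index; the index name cord19 and the helper names bulk_payload and build_query are hypothetical. It takes the first rows of the file as they come and does not check whether full text is available.

```python
import csv
import io
import json

def bulk_payload(csv_text, index_name, limit=1000):
    """Build an NDJSON body for the Elasticsearch _bulk API from CSV text."""
    lines = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        if i >= limit:
            break
        # Assumed column: cord_uid is used as the document id.
        action = {"index": {"_index": index_name, "_id": row["cord_uid"]}}
        lines.append(json.dumps(action))  # action line
        lines.append(json.dumps(row))     # document source line
    return "\n".join(lines) + "\n"  # a bulk body must end with a newline

def build_query(field, text, date_from=None, date_to=None):
    """Match `text` in a user-chosen field, optionally filtered by publication date."""
    query = {"bool": {"must": [{"match": {field: text}}]}}
    date_range = {}
    if date_from:
        date_range["gte"] = date_from
    if date_to:
        date_range["lte"] = date_to
    if date_range:
        # Assumed field: publish_time holds the publication date.
        query["bool"]["filter"] = [{"range": {"publish_time": date_range}}]
    return {"query": query}

sample = (
    "cord_uid,title,abstract,publish_time\n"
    "abc123,Masks and transmission,Short abstract.,2020-04-01\n"
)
print(bulk_payload(sample, "cord19"), end="")
print(json.dumps(build_query("title", "masks", date_from="2020-01-01")))
```

The first body would be sent to the _bulk endpoint and the second to the _search endpoint of the index. Keeping the field name as a parameter is one way to let a user decide which field to search.
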
You will have noticed that the percentages above only add up to 80%. This is because one of the important aspects of the project is that your work should be well documented and your code well commented. 20% of your mark will come from this. The report should contain:
• Instructions for running your system
• Screenshots illustrating the functionality you have implemented
• Design and design decisions/justifications of your overall architecture
• A description of the document collection you have chosen
• Discussion of your solution focussing on functionality implemented and possible improvements and extensions.

Attachment:- Information Retrieval.rar

Attachment:- Metadata.rar
