Details of how you download your dataset

Reference no: EM132816120

The context of your task

Researchers, clinicians, and policy makers involved with the response to COVID-19 are constantly searching for reliable information on the virus and its impact. This presents a unique opportunity for the information retrieval (IR) and text processing communities to contribute to the response to this pandemic, as well as to study methods for quickly standing up information systems for similar future events.

The idea of this assignment is that you apply the information retrieval knowledge you acquired during this term and put it into practice. You are already familiar with Elasticsearch. You also know the processing steps that turn documents into a structured index, commonly applied retrieval models and you know the key evaluation approaches that are being employed in IR. Now is a good time to put it all together.

Scenario: In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 181,000 scholarly articles, including over 80,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in information retrieval and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up.

Your task

This task comes in stages. Marks are given for each stage. The stages are as follows:

• Indexing (20%) The first step for you will be to obtain the dataset. Once you have done so, upload a sample of 1000 articles with full text to Elasticsearch (the simplest approach is to use the first 1000 documents). You will work with the metadata.csv file provided by the challenge.
• Sentence Splitting, Tokenization and Normalization (10%) The next step should be to transform the input text into a normal form of your choice. This should include the identification of sentences, bullet points and cells in tables.
• Selecting Keywords (20%) One aim of your system is to identify the words and phrases in the text that are most useful for indexing purposes. Your system should remove words which are not "useful", e.g. very frequent words or stopwords. You should also identify phrases suitable as index terms. Apply tf.idf as part of your selection and weighting step.
• Stemming or Morphological Analysis (10%) Writing word stems to the database rather than words allows the various inflected forms of a word to be treated in the same way, e.g. bus and busses refer to exactly the same thing even though they are different words.
• Searching (10%) Once you have indexed the collection you want to be able to search it. You can do that on the command line, but it would be much better to have an interactive system. You could start with Kibana for that, but you are free to use other open source tools for your Graphical User Interface (GUI). Note that each article in the collection contains different fields. Make sure that a user can decide which field to search (Hint: one of the fields is the publication date of the article).
• Engineering a Complete System (10%) The final system should allow a user to have control over all the individual components, so in the final result we will have a complete search engine, not disparate code.
You will have noticed that the percentages above only add up to 80%. This is because one of the important aspects of the project is that your work should be well documented and your code well commented. 20% of your mark will come from this. The report should contain:
• Instructions for running your system
• Screenshots illustrating the functionality you have implemented
• Design and design decisions/justifications of your overall architecture
• A description of the document collection you have chosen
• Discussion of your solution focussing on functionality implemented and possible improvements and extensions.

Assignment 1

Instructions for running your system (Engineering a Complete System)
Include here instructions to run your system and control each individual component. You may include screenshots to clarify.

Indexing
Include here the details of how you downloaded your dataset and indexed it, including any issues you had and how you dealt with them. Explain which documents you selected for your experiments. You may include screenshots to clarify.
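As a starting point for this section, the selection and upload step can be sketched as follows. This is a minimal sketch, not a prescribed solution. It assumes the metadata.csv columns cord_uid, title, abstract, publish_time, pdf_json_files and pmc_json_files, and it treats a row as "having full text" when either of the last two is non-empty. The index name cord19, the helper names and the two-row inline CSV are hypothetical and exist only to make the example runnable. The output is a newline-delimited JSON body in the shape expected by the Elasticsearch bulk endpoint; sending it to a cluster is left out.

```python
import csv
import io
import json
from itertools import islice


def select_full_text_rows(csv_file, limit=1000):
    """Return the first `limit` metadata rows that reference a full-text file."""
    reader = csv.DictReader(csv_file)
    with_text = (
        row for row in reader
        # Assumption: a non-empty path column means full text is available.
        if row.get("pdf_json_files") or row.get("pmc_json_files")
    )
    return list(islice(with_text, limit))


def to_bulk_ndjson(rows, index_name="cord19"):
    """Build a bulk request body: one action line plus one source line per row."""
    lines = []
    for row in rows:
        action = {"index": {"_index": index_name, "_id": row["cord_uid"]}}
        source = {
            "title": row["title"],
            "abstract": row["abstract"],
            "publish_time": row["publish_time"],
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


# Toy stand-in for metadata.csv (made-up rows, for illustration only).
sample = io.StringIO(
    "cord_uid,title,abstract,publish_time,pdf_json_files,pmc_json_files\n"
    "a1,First paper,Some abstract,2020-03-01,document_parses/pdf_json/x.json,\n"
    "b2,No full text,Another abstract,2020-04-01,,\n"
)
rows = select_full_text_rows(sample, limit=1000)
payload = to_bulk_ndjson(rows)
print(payload)
```

With the real file, the StringIO object would be replaced by open("metadata.csv", newline="", encoding="utf-8"). Using cord_uid as the document id is a design choice that makes re-running the upload overwrite documents instead of duplicating them.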

Sentence Splitting, Tokenization and Normalization
Include here the details of how you did this step, including any issues you had and how you dealt with them. Present examples for each of the aspects where this step went well. Also include examples of where it went wrong and how you could solve it. You may include screenshots to clarify.
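One possible shape for this step, using only the Python standard library, is sketched below. It is a rule-based illustration under stated assumptions, not a recommended tokenizer: bullets are assumed to start with "•", "*" or "-", table cells are assumed to be tab-separated, and a sentence boundary is assumed wherever ".", "!" or "?" is followed by whitespace and an upper-case letter or digit. All function names are hypothetical.

```python
import re
import unicodedata

# A sentence ends at . ! or ? when followed by whitespace and a capital or digit.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9])")
# Leading bullet markers to strip from list items.
BULLET = re.compile(r"^[•*-]\s+")
# Lower-case alphanumeric tokens, keeping internal hyphens (e.g. sars-cov-2).
TOKEN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def normalize(text):
    """Apply Unicode compatibility normalization and lower-casing."""
    return unicodedata.normalize("NFKC", text).lower()


def split_units(text):
    """Split raw text into sentences, bullet items and table cells."""
    units = []
    for line in text.splitlines():
        for cell in line.split("\t"):
            cell = BULLET.sub("", cell.strip())
            if cell:
                units.extend(SENTENCE_END.split(cell))
    return units


def tokenize(unit):
    """Normalize one unit and return its tokens."""
    return TOKEN.findall(normalize(unit))


text = "SARS-CoV-2 spreads quickly. Masks help.\n• Wash hands\nAge\tCases"
units = split_units(text)
print(units)
print(tokenize(units[0]))
```

A known failure case of this rule set, of the kind the report is asked to discuss: an abbreviation followed by a number, as in "See Fig. 2 for details.", is wrongly split into two units. An abbreviation list or a trained sentence splitter would be one way to address it.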

Selecting Keywords
Include here the details of how you did this step, including any issues you had and how you dealt with them. Present examples for each of the aspects where this step went well. Also include examples of where it went wrong and how you could solve it. You may include screenshots to clarify.
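The weighting part of this step can be illustrated with a small tf.idf computation. The sketch below assumes one common variant (term frequency divided by document length, multiplied by the natural log of N over document frequency); other variants are equally valid. The stopword list is a tiny hypothetical sample, the three documents are made up, and phrase detection is not covered.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "for", "on", "with"}


def tf_idf(docs):
    """Take a list of token lists; return one {term: weight} dict per document."""
    filtered = [[t for t in doc if t not in STOPWORDS] for doc in docs]
    n_docs = len(filtered)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in filtered:
        df.update(set(doc))
    weights = []
    for doc in filtered:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


def top_keywords(doc_weights, k=3):
    """Return the k highest-weighted terms, ties broken alphabetically."""
    ranked = sorted(doc_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


docs = [
    ["the", "virus", "spreads", "in", "the", "air"],
    ["the", "vaccine", "trial", "for", "the", "virus"],
    ["masks", "reduce", "the", "spread", "of", "the", "virus"],
]
weights = tf_idf(docs)
print(top_keywords(weights[0], k=2))
```

In this toy collection "virus" occurs in every document, so its idf is log(1) = 0 and it is never selected as a keyword, even though it is not a stopword. That is the intended effect of the idf factor on very frequent words.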

Stemming or Morphological Analysis
Include here the details of how you did this step, including any issues you had and how you dealt with them. Present examples for each of the aspects where this step went well. Also include examples of where it went wrong and how you could solve it. You may include screenshots to clarify.
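To make the idea of stemming concrete, here is a deliberately simple suffix stripper. It is a toy written for illustration, not the Porter algorithm and not what any particular library does; the rule list, the min_stem threshold and the function name are all assumptions. A real system would more likely rely on an established stemmer or on an analyzer built into the search engine.

```python
# Ordered (suffix, replacement) rules; the first rule that applies wins.
SUFFIX_RULES = [
    ("sses", "ss"),
    ("ies", "y"),
    ("ing", ""),
    ("ed", ""),
    ("es", ""),
    ("s", ""),
]


def stem(word, min_stem=3):
    """Strip one suffix, unless that would leave fewer than min_stem characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if len(candidate) >= min_stem:
                return candidate
    return word


for w in ["infection", "infections", "study", "studies", "bus", "busses"]:
    print(w, "->", stem(w))
```

The pairs infection/infections and study/studies end up with the same stem, which is the behaviour this step is after. The pair bus/busses does not: "busses" becomes "buss" while "bus" is left unchanged, so the two forms are indexed as different terms. This is the kind of failure the section asks you to document, together with a possible fix such as an exception list or a dictionary-based morphological analyser.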

Searching
Include here the details of how you did this step, including any issues you had and how you dealt with them. You may include screenshots to clarify.
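Letting the user choose which field to search, with an optional publication date restriction, could look roughly like the sketch below. It only builds the request body as a Python dict in the Elasticsearch query DSL (a bool query with a match clause and a range filter); sending it to a cluster is left out. The function name, the default field and the date field name publish_time are assumptions based on the metadata.csv columns.

```python
import json


def build_query(text, field="abstract", date_from=None, date_to=None):
    """Build a search body that matches `text` in a user-chosen field."""
    date_range = {}
    if date_from:
        date_range["gte"] = date_from
    if date_to:
        date_range["lte"] = date_to

    filters = []
    if date_range:
        # Filters restrict the result set without affecting relevance scores.
        filters.append({"range": {"publish_time": date_range}})

    return {
        "query": {
            "bool": {
                "must": [{"match": {field: text}}],
                "filter": filters,
            }
        }
    }


body = build_query("vaccine trial", field="title", date_from="2020-01-01")
print(json.dumps(body, indent=2))
```

Putting the date restriction in the filter clause rather than in must is a design choice: the date then narrows the results but does not change how documents are ranked for the text query.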

Attachment:- Information Retrieval.rar



