Compare the similarity of any two documents

Assignment Help JAVA Programming
Reference no: EM132509407

Programming Assignment

Task 1:

Write a script or a program that reads a text file, pre-processes it and saves the results into a new file.

The text file contains documents, one document per line. Each document is one or several sentences. Your program should take three parameters:
input file name, output file name, stopword list
It should pre-process the documents so that they can later be used to create an inverted index. Basic pre-processing should consider:
tokenization, including punctuation
normalizing case (lower-casing or upper-casing), punctuation and numbers

stop word removal (a list will be provided, one word per line, in a file called stopwords.txt)

stemming must use one of the Porter stemmer libraries you can find here:
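
The pre-processing steps listed above could be sketched in Java roughly as follows. This is a minimal sketch under stated assumptions, not a reference solution: the class name `Preprocessor` and its method are hypothetical, tokens are split on anything that is not an ASCII letter or digit, pure numbers are dropped, and each whole input line is treated as document content. The stemmer is passed in as a function because the assignment requires a Porter stemmer library whose API is not shown here; `UnaryOperator.identity()` is only a placeholder for it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical helper class; names and signatures are illustrative assumptions.
class Preprocessor {

    // Lower-cases the text, splits on any run of characters that are not
    // ASCII letters or digits, drops pure numbers and stop words, and then
    // applies the supplied stemmer to each remaining token.
    static List<String> preprocess(String document, Set<String> stopwords,
                                   UnaryOperator<String> stemmer) {
        List<String> tokens = new ArrayList<>();
        for (String raw : document.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (raw.isEmpty() || raw.matches("[0-9]+") || stopwords.contains(raw)) {
                continue;
            }
            tokens.add(stemmer.apply(raw));
        }
        return tokens;
    }

    // Arguments: input file name, output file name, stop word list file name.
    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: Preprocessor <input> <output> <stopwords>");
            return;
        }
        Set<String> stopwords = new HashSet<>();
        for (String word : Files.readAllLines(Paths.get(args[2]))) {
            stopwords.add(word.trim().toLowerCase(Locale.ROOT));
        }
        List<String> output = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            // Placeholder: replace identity() with a call into a Porter stemmer library.
            output.add(String.join(" ", preprocess(line, stopwords, UnaryOperator.identity())));
        }
        Files.write(Paths.get(args[1]), output);
    }
}
```

Passing the stemmer as a parameter keeps the tokenizing and filtering logic testable on its own, whichever stemmer library is eventually plugged in.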

Task 2:

Write a script or a program that reads a text file of pre-processed documents, creates a Term Document Incidence Matrix (TDIM) and an inverted index, and saves them to files called TDIM.txt and InvIndex.txt. Your program should take one parameter:
input file name. The input file should be assumed to be in the same directory as the application.

The input file is made up of pre-processed documents. Each document will have an identifier/title separated from the content by a TAB character.

InvIndex.txt: Each line of the output file should contain at least the term and the titles of all documents in which the term occurs.

TDIM.txt: Each column must be separated by a TAB character. (see sample file)
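
One possible way to build both outputs is sketched below. The class name `IndexBuilder` and its methods are hypothetical. The matrix layout (a header row of document titles, then one row per term with 0/1 cells) is an assumption, since the sample file is not reproduced here; the InvIndex.txt layout (term followed by TAB-separated titles) is likewise just one format that satisfies the requirement.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper class; names and file layouts are illustrative assumptions.
class IndexBuilder {

    // Maps each term to the titles of the documents that contain it.
    // Each input line is expected to be "title<TAB>pre-processed content".
    static Map<String, Set<String>> invertedIndex(List<String> lines) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t", 2);
            if (parts.length < 2) {
                continue; // skip lines without a TAB separator
            }
            for (String term : parts[1].trim().split("\\s+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, t -> new LinkedHashSet<>()).add(parts[0]);
                }
            }
        }
        return index;
    }

    // Renders the incidence matrix as TAB-separated rows: a header row with
    // the document titles, then one row per term with 1 or 0 per document.
    static List<String> tdimRows(List<String> titles, Map<String, Set<String>> index) {
        List<String> rows = new ArrayList<>();
        rows.add("term\t" + String.join("\t", titles));
        for (Map.Entry<String, Set<String>> entry : index.entrySet()) {
            StringBuilder row = new StringBuilder(entry.getKey());
            for (String title : titles) {
                row.append('\t').append(entry.getValue().contains(title) ? '1' : '0');
            }
            rows.add(row.toString());
        }
        return rows;
    }

    // Argument: input file name (resolved against the working directory).
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        List<String> titles = new ArrayList<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab >= 0) {
                titles.add(line.substring(0, tab));
            }
        }
        Map<String, Set<String>> index = invertedIndex(lines);
        List<String> inverted = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : index.entrySet()) {
            inverted.add(entry.getKey() + "\t" + String.join("\t", entry.getValue()));
        }
        Files.write(Paths.get("TDIM.txt"), tdimRows(titles, index));
        Files.write(Paths.get("InvIndex.txt"), inverted);
    }
}
```

Building the inverted index first and deriving the matrix from it means the documents are scanned only once; the `TreeMap` keeps terms in alphabetical order in both files.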

Task 3:

Write a script or a program that reads a text file containing a TDIM as defined in the previous task and uses it to produce a TF.IDF-weighted matrix. The program should then use this matrix with the vector space model (VSM) to compare the similarity of any two documents and return their cosine similarity.

Your application should take the name of the TDIM file as an argument and the document identifiers of the two documents to be compared.
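
A rough sketch of the weighting and comparison step follows. The class name `Similarity` and its methods are hypothetical. Two assumptions are made: the TDIM file has a header row of document identifiers followed by one TAB-separated row per term, and the weight of a term in a document is tf × ln(N / df), which is one common TF.IDF variant; the assignment does not fix the exact formula.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; the TDIM layout and weighting formula are assumptions.
class Similarity {

    // Weights one term row: each cell becomes tf * ln(N / df), where N is the
    // number of documents and df the number of documents containing the term.
    static double[] tfidfRow(double[] counts) {
        int df = 0;
        for (double c : counts) {
            if (c > 0) {
                df++;
            }
        }
        double idf = df == 0 ? 0.0 : Math.log((double) counts.length / df);
        double[] weighted = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            weighted[i] = counts[i] * idf;
        }
        return weighted;
    }

    // Cosine of the angle between two vectors; 0 if either vector is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Parses the TDIM rows and returns the cosine similarity of two documents.
    static double compare(List<String> tdimRows, String docA, String docB) {
        List<String> header = Arrays.asList(tdimRows.get(0).split("\t"));
        int a = header.indexOf(docA);
        int b = header.indexOf(docB);
        if (a < 1 || b < 1) {
            throw new IllegalArgumentException("unknown document identifier");
        }
        int terms = tdimRows.size() - 1;
        double[] vectorA = new double[terms];
        double[] vectorB = new double[terms];
        for (int r = 1; r < tdimRows.size(); r++) {
            String[] cells = tdimRows.get(r).split("\t");
            double[] counts = new double[cells.length - 1];
            for (int c = 1; c < cells.length; c++) {
                counts[c - 1] = Double.parseDouble(cells[c]);
            }
            double[] weighted = tfidfRow(counts);
            vectorA[r - 1] = weighted[a - 1];
            vectorB[r - 1] = weighted[b - 1];
        }
        return cosine(vectorA, vectorB);
    }

    // Arguments: TDIM file name, first document identifier, second document identifier.
    public static void main(String[] args) throws IOException {
        List<String> rows = Files.readAllLines(Paths.get(args[0]));
        System.out.println(compare(rows, args[1], args[2]));
    }
}
```

With a binary incidence matrix the term frequency is only ever 0 or 1, so the weight of a present term reduces to its IDF. Only the two requested document columns are kept in memory, which avoids materializing the full weighted matrix for a single comparison.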

It is up to you how you manage the logic of this process, but the speed of your script/application will be measured as the total time for three separate runs and comparisons using the SAME CORPUS.

Attachment:- Programming Assignment.rar

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd