Reference no: EM132509407
Programming Assignment
Task 1:
Write a script or a program that reads a text file, pre-processes it and saves the results into a new file.
The text file contains documents, one document per line. Each document is one or several sentences. Your program should take three parameters:
input file name, output file name, stopword list
It should pre-process documents so that they can be later used to create an inverted index. Basic pre-processing should consider:
punctuation tokenization
case normalization (lower-casing or upper-casing) and handling of punctuation and numbers
stop word removal (a list will be provided, one word per line in a file called stopwords.txt)
stemming must use one of the Porter stemmer libraries you can find here:
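The pre-processing steps listed above could be sketched roughly as follows. This is a minimal illustration, not a prescribed solution: the function names (`load_stopwords`, `preprocess`, `run`) are hypothetical, and the decision to discard digits and punctuation entirely is an assumption, since the brief leaves the exact handling open. The stemmer is passed in as a plain callable so that whichever Porter stemmer library you choose can be plugged in (for example, the `stem` method of NLTK's `PorterStemmer`); the default here does no stemming at all.

```python
import re


def load_stopwords(path):
    """Read one stop word per line into a lower-cased set."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def preprocess(document, stopwords, stem=lambda token: token):
    """Lower-case, tokenize, drop stop words, then stem each token.

    `stem` is a placeholder: pass in the stem function of the Porter
    stemmer library you selected. The default leaves tokens unchanged.
    """
    # Assumption: keep alphabetic runs only, which discards punctuation
    # and numbers in the same step as tokenization.
    tokens = re.findall(r"[a-z]+", document.lower())
    return [stem(token) for token in tokens if token not in stopwords]


def run(input_path, output_path, stopword_path, stem=lambda token: token):
    """Pre-process one document per line and write one result per line."""
    stopwords = load_stopwords(stopword_path)
    with open(input_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(" ".join(preprocess(line, stopwords, stem)) + "\n")
```

To satisfy the three-parameter requirement, `run` could be called with the three command-line arguments, for instance `run(*sys.argv[1:4])` after importing `sys`. Stop words are removed before stemming here so that the provided list matches unstemmed tokens; that ordering is a design choice, not something the brief mandates.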
Task 2:
Write a script or a program that reads a text file of pre-processed documents, creates a Term-Document Incidence Matrix (TDIM) and an inverted index, and saves them to files called TDIM.txt and InvIndex.txt respectively. Your program should take one parameter:
input file name (the input file should be assumed to be in the same directory as the application)
The input file is made of pre-processed documents. Each document will have an Identifier/Title separated from the content by a TAB character.
InvIndex.txt: Each line of the output file should contain at least the term and the titles of all documents in which the term occurs.
TDIM.txt: Each column must be separated by a TAB character. (see sample file)
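One possible shape for this task is sketched below. The sample file is not reproduced here, so the layout is an assumption: terms as rows, a header row of document titles, and 0/1 incidence values, all TAB-separated. The function names (`parse_documents`, `build_inverted_index`, `build_tdim`, `write_outputs`) are hypothetical and should be adapted to match the sample file.

```python
from collections import defaultdict


def parse_documents(lines):
    """Yield (title, tokens) pairs from 'title<TAB>content' lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        title, _, content = line.partition("\t")
        yield title, content.split()


def build_inverted_index(documents):
    """Map each term to the titles it occurs in, in document order."""
    index = defaultdict(list)
    for title, tokens in documents:
        # dict.fromkeys de-duplicates while keeping first-seen order.
        for term in dict.fromkeys(tokens):
            index[term].append(title)
    return dict(index)


def build_tdim(index, titles):
    """Return rows of the form [term, 0/1, ...], one column per title."""
    rows = []
    for term in sorted(index):
        present = set(index[term])
        rows.append([term] + [1 if t in present else 0 for t in titles])
    return rows


def write_outputs(lines, tdim_path="TDIM.txt", index_path="InvIndex.txt"):
    """Write both output files (assumed layout, see lead-in)."""
    documents = list(parse_documents(lines))
    titles = [title for title, _ in documents]
    index = build_inverted_index(documents)
    with open(index_path, "w", encoding="utf-8") as out:
        for term in sorted(index):
            out.write(term + "\t" + "\t".join(index[term]) + "\n")
    with open(tdim_path, "w", encoding="utf-8") as out:
        out.write("\t".join(["term"] + titles) + "\n")
        for row in build_tdim(index, titles):
            out.write("\t".join(map(str, row)) + "\n")
```

`write_outputs` accepts any iterable of lines, so an open file handle for the input file named on the command line can be passed in directly.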
Task 3:
Write a script or a program that reads a text file containing a TDIM as defined in the previous task and uses it to produce a TF.IDF weighted matrix. It should then use the vector space model (VSM) to compare the similarity of any two documents and return their cosine similarity.
Your application should take as arguments the name of the TDIM file and the document identifiers of the two documents to be compared.
It is up to you how you manage the logic of this process, but the speed of your script/application will be measured as the total time for three separate runs and comparisons using the SAME CORPUS.
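A rough sketch of the weighting and comparison is given below. It rests on several assumptions that the brief does not fix: the TDIM file has a header row of document identifiers followed by one TAB-separated row per term, cell values are treated as term frequencies (0/1 incidence values work too), and the weight is tf × log10(N / df). Other TF.IDF variants are equally valid. The function names (`load_tdim`, `tfidf_vectors`, `cosine_similarity`) are hypothetical.

```python
import math


def load_tdim(lines):
    """Parse a TAB-separated matrix: header of titles, then term rows."""
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    titles = rows[0][1:]
    matrix = {row[0]: [float(cell) for cell in row[1:]] for row in rows[1:]}
    return titles, matrix


def tfidf_vectors(titles, matrix):
    """Return {title: [weight per term]} using tf * log10(N / df)."""
    n_docs = len(titles)
    vectors = {title: [] for title in titles}
    for term in sorted(matrix):
        counts = matrix[term]
        df = sum(1 for count in counts if count > 0)
        # A term that appears in no document gets zero weight.
        idf = math.log10(n_docs / df) if df else 0.0
        for title, count in zip(titles, counts):
            vectors[title].append(count * idf)
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # An all-zero vector has no direction; report 0.0 instead of dividing.
    return dot / norm if norm else 0.0
```

Given two identifiers from the command line, the comparison would then be `cosine_similarity(vectors[id_a], vectors[id_b])`. Because timing covers three runs on the same corpus, one option worth considering is to store the weighted vectors after the first run (for example with the standard `pickle` module) and reload them on later runs; whether that is permitted is for you to confirm.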
Attachment:- Programming Assignment.rar