Reference no: EM132578031, Length: 7 pages
Part 1:
Write a script or application that takes a start page as a command line parameter. From there it crawls a website by following links.
The program should write its output to the screen or to a file, listing each page in the site followed by the pages it links to.
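A minimal Python sketch of the crawl described in Part 1, using only the standard library. The names LinkParser, fetch, crawl and main are illustrative and not part of the brief. The sketch assumes that "the site" means pages sharing the start page's scheme and host, that pages decode as UTF-8, and that a page which cannot be fetched is listed with no outgoing links.

```python
import sys
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and decode it as text (assumption: UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch):
    """Breadth-first crawl; returns {page: [pages it links to]}."""
    site = urlparse(start_url)[:2]  # (scheme, host) of the start page
    graph = {}
    queue = deque([start_url])
    while queue:
        page = queue.popleft()
        if page in graph:
            continue  # already visited
        try:
            html = fetch_page(page)
        except OSError:
            graph[page] = []  # unreachable page: record it with no links
            continue
        parser = LinkParser()
        parser.feed(html)
        links = []
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            target = urldefrag(urljoin(page, href)).url
            if urlparse(target)[:2] == site and target not in links:
                links.append(target)
        graph[page] = links
        queue.extend(links)
    return graph


def main(argv):
    """Print each page followed by the pages it links to."""
    for page, links in crawl(argv[1]).items():
        print(page)
        for link in links:
            print("    " + link)


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv)
```

Saved under a hypothetical name such as crawler.py, this would be run as `python crawler.py <start page URL>`. Redirecting the output to a file covers the "to a file" option.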
Part 2:
In this part, you must parse each page in the site and:
Tokenise the text of each document, excluding any irrelevant details.
Create a stop word list made up of the words that appear in every document.
Use the Porter Stemmer to stem the contents.
Parse the documents and create a TF.IDF weighted index.
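The indexing steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a prescribed solution: the names TextExtractor, tokenise and build_index are hypothetical, "irrelevant details" is taken to mean markup plus script and style contents, and the weight is raw term frequency multiplied by log10(N / df), which is only one common TF.IDF variant. The stem parameter defaults to a do-nothing placeholder; a real Porter stemmer, for example `nltk.stem.PorterStemmer().stem` from the third-party NLTK package, would be passed in its place.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep visible text; skip the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def tokenise(html):
    """Return the lower-cased alphanumeric tokens of a page's text."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return re.findall(r"[a-z0-9]+", " ".join(extractor.parts).lower())


def build_index(docs, stem=lambda word: word):
    """docs maps page name -> HTML (assumed non-empty).

    Returns (weights, idf, stop_words), where weights maps
    page name -> {term: TF.IDF weight}.
    The default stem is an identity placeholder, not a Porter stemmer.
    """
    tokens = {name: tokenise(html) for name, html in docs.items()}
    # Stop words: the words that appear in every document.
    stop_words = set.intersection(*(set(t) for t in tokens.values()))
    counts = {
        name: Counter(stem(w) for w in toks if w not in stop_words)
        for name, toks in tokens.items()
    }
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for c in counts.values() for term in c)
    idf = {term: math.log10(n_docs / n) for term, n in df.items()}
    weights = {
        name: {term: tf * idf[term] for term, tf in c.items()}
        for name, c in counts.items()
    }
    return weights, idf, stop_words
```

Stop words are removed before stemming here, which is a design choice: the brief does not fix the order of these two steps.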
Once this is done, you must provide a query application that uses the vector space model to answer a query typed in by the user and return a ranked list of documents that satisfy the information need.
You must split this into two parts, a setup (indexing) step and a query step, so that the setup does not need to be redone every time a new query is run. Time is not being graded, but the program should run within an acceptable time (say, 10 minutes).
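One way to keep setup and querying separate is to save the index to disk once and reload it for each query. The Python sketch below assumes the index is a pair of dictionaries, weights ({page: {term: weight}}) and idf ({term: value}), stored as JSON. The names save_index, load_index and rank are hypothetical, and ranking by cosine similarity is one standard reading of "vector space model". Query terms are assumed to have been tokenised and stemmed in the same way as the documents.

```python
import json
import math
from collections import Counter


def save_index(path, weights, idf):
    """Setup step: write the index to disk once."""
    with open(path, "w", encoding="utf-8") as out:
        json.dump({"weights": weights, "idf": idf}, out)


def load_index(path):
    """Query step: reload the saved index instead of rebuilding it."""
    with open(path, encoding="utf-8") as src:
        data = json.load(src)
    return data["weights"], data["idf"]


def rank(query_terms, weights, idf):
    """Return [(page, cosine similarity)] sorted from best to worst.

    Pages that share no indexed term with the query are omitted.
    """
    # Query vector: term frequency in the query times the stored IDF.
    q_tf = Counter(t for t in query_terms if t in idf)
    q_vec = {t: tf * idf[t] for t, tf in q_tf.items()}
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    results = []
    for page, d_vec in weights.items():
        dot = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
        if dot > 0:
            d_norm = math.sqrt(sum(w * w for w in d_vec.values()))
            results.append((page, dot / (q_norm * d_norm)))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

A query program would then call load_index once at start-up, read the user's query, process it into terms, and print the list returned by rank.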
Part 3:
Assign the start page (mainpage.html in the sample corpus) a weight of 100 and, assuming a damping factor of 0.75, calculate the PageRank of each page in the site after 20 iterations.
Your script or application should take a start page as an argument at the command line and output a file that lists each page and the rank of that page at iteration 20.
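A hedged Python sketch of the Part 3 calculation follows. The brief does not state the exact update rule or the starting weight of the other pages, so several assumptions are made: the non-normalised form PR(p) = (1 - d) + d * sum(PR(q) / outlinks(q)) over pages q linking to p is used, every page other than the start page begins at 0, links to pages outside the graph are ignored, and a page with no outgoing links passes on no rank. The function names pagerank and write_ranks are hypothetical, and the link graph is expected in the form {page: [pages it links to]}, as a Part 1 crawl would produce.

```python
def pagerank(graph, start_page, damping=0.75, iterations=20,
             start_weight=100.0):
    """Return {page: rank} after the given number of iterations."""
    # Assumption: only the start page has a non-zero initial weight.
    ranks = {page: 0.0 for page in graph}
    ranks[start_page] = start_weight
    for _ in range(iterations):
        # Every page receives the (1 - d) base share each round.
        new_ranks = {page: 1.0 - damping for page in graph}
        for page, links in graph.items():
            targets = [t for t in links if t in graph]
            for target in targets:
                # A page splits its damped rank evenly over its links.
                new_ranks[target] += damping * ranks[page] / len(targets)
        ranks = new_ranks
    return ranks


def write_ranks(path, ranks):
    """Write one 'page<TAB>rank' line per page, highest rank first."""
    ordered = sorted(ranks.items(), key=lambda item: item[1], reverse=True)
    with open(path, "w", encoding="utf-8") as out:
        for page, rank in ordered:
            out.write(f"{page}\t{rank:.6f}\n")
```

A command-line script would crawl from the start page given as its argument, pass the resulting graph to pagerank, and call write_ranks on the result. If the module defines a different update rule or initial weights, only the body of pagerank needs to change.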
Attachment:- IRWS Repeat Assignment.rar