1-Create ir3.py based on ir2.py
2-Repeatedly prompt the user for a query (if they enter "q", then quit)
3-Find the terms in the query, and calculate the appropriate weight for each query term
• (hint:) : weight for query = log2 (total number of doc / number of times the word appear in all the Doc).
• weight for query =((log( float( len( documents) ) / docfreq [ term ] ))/log(2))
• the Output for the query ""quick brown vex zebras""should be :
Doc name
|
Term
|
Weights
|
Q
|
Quick
|
0.58
|
Q
|
Brown
|
1.58
|
Q
|
Vex
|
0.58
|
Q
|
Zebras
|
1.58
|
4-Calculate the similarity for each query/document pair
(hint:) : the similarity= Q * D1 / |Q||D1| for example :
5-List the documents in order of decreasing similarity to the query, along with their similarity value
• Your results for "quick brown vex zebras" should be:
D1.txt 0.42, D3.txt 0.33, D2.txt 0.08
7-Make sure that querying "quick brown vex zebras" a 2nd time gives the same result
8-What is the result for the query "quick brown vex lion"?
Genral Hint :
• For user Input :
while True:
querystring = raw_input( '\nEnter query (q to quit): ' )
if querystring == 'q':
print '\nGoodbye!\n'
break
...do more stuff...
• To sort a dictionary in descending order by value from operator import itemgetter
items = results.items()
items.sort( key = itemgetter(1), reverse=True )
for (document, ranking) in items:
print document, "%.2f" % ranking