CMM539 Information Retrieval Systems - Robert Gordon University
Recommender Systems
Overview
All files needed for the coursework are on Moodle in the file Coursework2Resources.zip.
You will produce one or more RapidMiner files in each of the three stages of the coursework, as well as answering the questions in a single Word document. All files should be stored in a single folder and zipped before submitting to Moodle.
Your tasks are divided into three stages:
Stage 1: crawl the web to retrieve a set of web pages [1 grade]
Use RapidMiner to crawl the web and collect web page content from the specified site.
Stage 2: build a content-based text recommender [4 grades]
First use RapidMiner to process and index the supplied set of text data, then build a content-based text recommender system, and finally answer questions 2(a) & 2(b) (~700 words).
Stage 3: build a recommender system to predict ratings [4 grades]
Use RapidMiner to create a collaborative filtering recommender system, and then answer questions 3(a) & 3(b) (~900 words).
You are expected to submit the following 5 files via Moodle:
• question answers in a Word or PDF document: questionAnswer.docx (or .pdf);
• RapidMiner exported .rmp file with your crawl process: crawl.rmp;
• RapidMiner exported .rmp file with your text processing: processing.rmp;
• RapidMiner .rmp file with your text recommender system: contentRecommend.rmp;
• RapidMiner .rmp file with your collaborative recommender: itemRecommend.rmp.
Combine the files into a zip archive named "your_matric_number-surname.zip" and submit via Moodle.
Note: The suggested word counts are approximate and do not include any diagrams, tables or graphs you may wish to include.
Detailed Instructions
Stage 1:
You will set up a loop to crawl the first 20 pages of the Hot UK Deals website (www.hotukdeals.com), process each page to generate a term vector of term frequencies, and then write the page vectors as a term/document matrix to an Excel file (a sketch of the equivalent pipeline appears after the list below). Specifically:
• you should pre-process the retrieved pages using tokenization, case folding, stemming, and stop-word removal operators, as you think appropriate;
• use term occurrence as the vector creation method, so that each page's vector records the number of occurrences of each term;
• use a filter and/or prune method to reduce the number of attributes to about 100, leaving appropriate default values for any other parameters; and
• finally, add an operator to write the term/document matrix to an Excel file named crawledPages.xlsx. (Note: you do not need to submit crawledPages.xlsx.)
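For orientation only, here is a minimal sketch of what this term-counting pipeline does, written in Python with scikit-learn and pandas rather than RapidMiner. The placeholder page texts are assumptions, stemming is omitted (scikit-learn has no built-in stemmer), and your submitted process must still be built from RapidMiner operators.

```python
# Illustrative sketch: tokenise -> lower-case -> remove stop words ->
# count term occurrences -> write the term/document matrix to Excel.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# placeholder for the text of the 20 crawled pages (assumption of the sketch)
pages = ["first crawled page text ...", "second crawled page text ..."]

# lower-casing, English stop-word removal, and pruning to ~100 attributes;
# stemming is omitted because scikit-learn has no built-in stemmer
vectorizer = CountVectorizer(lowercase=True, stop_words="english", max_features=100)
matrix = vectorizer.fit_transform(pages)

# rows = pages, columns = terms, cells = term occurrences
df = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())
df.to_excel("crawledPages.xlsx", index=False)  # requires the openpyxl package
```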
Run and test the process.
Export the crawl process as crawl.rmp and save to your submission folder.
Stage 2a:
You will create a new process to read text data from a CSV file and then process the text. Use RapidMiner to read the supplied CSV file (abstracts.csv) included in the coursework 2 resources folder, and then process the documents to generate a vector representation for each abstract. The data contains 25 sample abstracts, each with an id, title and description. Use the text from the title and description to generate the vector representation, with tf*idf as the term weighting. The text of each abstract needs pre-processing and should:
• be tokenized;
• be filtered to remove very short and very long terms;
• be transformed to lower case;
• have English stop words removed; and
• have tokens stemmed.
Place a note in the process diagram specifying any parameters chosen. A sketch of the equivalent tf*idf pipeline follows below.
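This minimal Python sketch shows the equivalent tf*idf vectorisation. The column names 'title' and 'description' follow the description above; the token-length limits are assumptions, and stemming is again omitted because scikit-learn has no built-in stemmer. Your submitted process must still use RapidMiner's text-processing operators.

```python
# tf*idf vectorisation of the 25 abstracts (sketch, not the RapidMiner process)
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

abstracts = pd.read_csv("abstracts.csv")
text = abstracts["title"] + " " + abstracts["description"]

# tokenise, keep tokens of 3-15 characters (assumed limits), lower-case,
# and remove English stop words
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english",
                             token_pattern=r"(?u)\b\w{3,15}\b")
tfidf = vectorizer.fit_transform(text)
print(tfidf.shape)  # 25 abstracts x vocabulary size
```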
Run and test the process.
Export the process as processing.rmp and add to your submission folder.
Stage 2b:
Now, applying the text-processing method from part (a), you should build a similarity-based content recommender system to recommend 9 abstracts to a user. You are given a user profile in the form of abstracts the user has already read in the file abstractsRead.csv (included in the coursework 2 resources folder), which shows that the user has already read doc5 and doc6. The 9 most similar documents identified using cosine similarity should be recommended. You should give higher importance to the title by allocating an attribute weight of 2.0 to terms from the title and a weight of 1.0 to terms from the description.
Hints: There are a number of ways of building this recommender system, but you may want to: read the CSV file and create two copies of the data; process one copy to generate a vector representation of each unread abstract, and process the other copy into a vector representation of the already-read abstracts as a user model. Calculate the similarity between the user model and each unread abstract, and then rank and recommend the 9 most similar unread abstracts. A sketch of this approach follows below.
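The sketch below illustrates the hinted approach in Python, under the same assumptions as before. Doubling the title text is used to approximate the 2.0 attribute weight on title terms; it doubles their term frequencies rather than weighting the final tf*idf attributes, so it is an approximation made by the sketch, not part of the coursework specification.

```python
# Content-based recommendation sketch: user-model centroid + cosine similarity
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstracts = pd.read_csv("abstracts.csv")
read_ids = {"doc5", "doc6"}  # taken from abstractsRead.csv

# repeat the title so its terms count roughly twice as much as the description
text = (abstracts["title"] + " ") * 2 + abstracts["description"]
tfidf = TfidfVectorizer(lowercase=True, stop_words="english").fit_transform(text)

read = abstracts["id"].isin(read_ids).to_numpy()
user_model = np.asarray(tfidf[read].mean(axis=0))  # centroid of read abstracts

# rank the unread abstracts by cosine similarity to the user model
sims = cosine_similarity(user_model, tfidf[~read]).ravel()
top9 = abstracts.loc[~read].assign(similarity=sims).nlargest(9, "similarity")
print(top9[["id", "title", "similarity"]])
```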
Export the final process as contentRecommend.rmp and add to your submission folder.
Question 2a) Traditional IR approaches focus on lexical rather than syntactic or semantic similarity. What is meant by each of these three terms, and why can ignoring syntactic or semantic similarity be a problem for retrieval systems? Discuss at least 3 approaches that could be employed to introduce syntactic and/or semantic similarity.
Question 2b) What are the key advantages and disadvantages of a content-based recommender compared to a collaborative filtering approach in relation to the abstract recommendation task encountered here?
Stage 3a:
You will create a new process in RapidMiner to predict the user rating for specified unrated items using a collaborative filtering approach. You should build and test your model on the rated items, using mean absolute error (MAE) as your performance indicator.
• You are given a sparse utility matrix (itemRatings.csv) in the coursework 2 resources folder, which shows in total about 7,000 ratings given by 21 users for 5,000+ news items.
• Each row in the data is a rating entry containing 3 attributes: rating, item, and user, in that order.
• Ratings are given in the range 1 to 5, but with a "?" for particular item/user combinations on which we want you to predict the rating.
• Initially, build an item-based recommender system using the Item k-NN operator with the Pearson correlation coefficient to identify the 3 most similar items;
• Run the process and ensure performance is measured by MAE on the predictions made for known ratings.
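To make the mechanics concrete, here is an illustrative Python sketch of item-based k-NN with the Pearson correlation coefficient, evaluated by MAE (the mean of |actual - predicted| over the evaluated ratings). The column names follow the description above; everything else is an assumption of the sketch, which is not a substitute for the required RapidMiner process and is not optimised for large item catalogues.

```python
# Item-based collaborative filtering sketch with Pearson similarity and MAE
import numpy as np
import pandas as pd

ratings = pd.read_csv("itemRatings.csv")          # columns: rating, item, user
known = ratings[ratings["rating"] != "?"].copy()  # drop the "?" entries
known["rating"] = known["rating"].astype(float)

# users x items utility matrix, NaN where unrated
util = known.pivot_table(index="user", columns="item", values="rating")

# item-item Pearson correlations over users who co-rated both items
item_sim = util.corr(method="pearson", min_periods=2)

def predict(user, item, k=3):
    """Weighted average of the user's ratings on the k most similar items."""
    # exclude the target item itself to avoid trivial self-prediction
    rated = util.loc[user].dropna().drop(index=item, errors="ignore")
    sims = item_sim.loc[item, rated.index].dropna()
    top = sims.abs().nlargest(k)
    if top.empty or top.sum() == 0:
        return rated.mean()                       # fall back to the user mean
    w = sims.loc[top.index]
    return float((w * rated.loc[top.index]).sum() / w.abs().sum())

# evaluate on the known ratings: MAE = mean |actual - predicted|
preds = np.array([predict(r.user, r.item) for r in known.itertuples()])
mae = np.mean(np.abs(known["rating"].to_numpy() - preds))
print(f"MAE: {mae:.3f}")
```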
Question 3a) What changes could you make to the design that might improve performance? Experiment with different parameters and/or designs to see if you can improve the performance of the process. What did you try, and why? What was the MAE for each experiment? Discuss and explain your results.
Question 3b) There are a number of different criteria on which to evaluate the performance of recommender systems. In this experiment we used MAE as our performance measure. What is MAE, and which evaluation criterion does it measure? Can you identify any problems with using MAE? What other evaluation criteria may also be relevant for a news article recommender system?