Reference no: EM133707674, Length: 2000 words
Machine Learning Applications
Assessment - Design a Text Retrieval System
Coding and Presentation
Your Task
Design a text retrieval system to find similar movies/shows based on the descriptions.
Assessment Description
Humans communicate in many languages, through both speech and writing, so text data is abundant in the real world. Working with natural language is nonetheless challenging. Your team lead has assigned you one such task: recommending movies based on their descriptions.
Data
A movies/shows dataset with descriptions has been curated by pre-processing the Kaggle IMDb Movies/Shows with Descriptions dataset and is provided to you in MyKBS. You are encouraged to explore the original source.
The pre-processed dataset is provided in MyKBS as two files, train.csv and test.csv, each containing the following columns:
title: Title of the movie/show.
description: Description of the movie/show.
You are required to train a text retrieval system using the train.csv file and test the system using the test.csv file.
Problem Statement
As an individual, you are required to download the datasets, i.e., the train.csv and test.csv files, from MyKBS. You must build a text retrieval system to find similar movies/shows based on the descriptions. You should approach the problem systematically by addressing the tasks below:
Load the data sets and pre-process them to fit your requirements. You must use at least two pre-processing techniques. (5 marks)
Design a text retrieval system using the TF-IDF algorithm (with an inverted file). (10 marks)
Find the top 3 matching movies/shows in train.csv for each description provided in test.csv. (5 marks)
You are to record a 5-minute video with accompanying PowerPoint slides to explain the approach and the performance of the system using relevant metric(s). The slides must be clear, concise, of the required quality, and referenced in accordance with the Kaplan Harvard Referencing style. (20 marks)
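As a rough illustration of the TF-IDF and top-3 tasks above, the following is a minimal sketch rather than a model answer. It assumes a simple regex tokenizer, a tiny hand-written stop-word list, the smoothed weighting idf = ln(N/df) + 1 (one common variant among several), and cosine similarity for ranking. The names `tokenize`, `build_index` and `search`, and the four sample descriptions, are hypothetical and are not taken from the provided dataset.

```python
import math
import re
from collections import Counter, defaultdict

# Illustrative stop-word list (an assumption; a fuller list such as NLTK's could be used).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "on", "for", "with"}


def tokenize(text):
    """Lowercase the text, keep alphabetic runs, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_index(descriptions):
    """Build an inverted file (term -> {doc_id: tf-idf weight}) plus idf and document norms."""
    term_counts = [Counter(tokenize(d)) for d in descriptions]
    n_docs = len(descriptions)

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())

    # Assumed idf variant: ln(N / df) + 1, so every indexed term has a positive weight.
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}

    index = defaultdict(dict)
    squared_norms = [0.0] * n_docs
    for doc_id, counts in enumerate(term_counts):
        for term, tf in counts.items():
            weight = tf * idf[term]
            index[term][doc_id] = weight
            squared_norms[doc_id] += weight * weight
    norms = [math.sqrt(s) for s in squared_norms]
    return index, idf, norms


def search(query, index, idf, norms, k=3):
    """Return up to k (doc_id, cosine similarity) pairs, best match first."""
    # Query terms never seen in training carry no information, so they are dropped.
    q_counts = Counter(t for t in tokenize(query) if t in idf)
    q_weights = {t: tf * idf[t] for t, tf in q_counts.items()}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    if q_norm == 0.0:
        return []

    # Only the posting lists of the query terms are visited (the point of the inverted file).
    dot_products = defaultdict(float)
    for term, q_weight in q_weights.items():
        for doc_id, d_weight in index[term].items():
            dot_products[doc_id] += q_weight * d_weight

    ranked = sorted(
        ((doc_id, dot / (q_norm * norms[doc_id])) for doc_id, dot in dot_products.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )
    return ranked[:k]


# Hypothetical stand-ins for the title and description columns of train.csv.
train_titles = ["Space Quest", "Baking Days", "Robot Wars", "Love in Paris"]
train_descriptions = [
    "An astronaut crew travels through space to save a dying planet.",
    "A young baker opens a pastry shop in a small town.",
    "Robots fight in space after a war destroys the planet.",
    "Two strangers fall in love during a rainy week in Paris.",
]

index, idf, norms = build_index(train_descriptions)
top3 = search("A crew of astronauts must save their planet from robots in space",
              index, idf, norms)
for doc_id, score in top3:
    print(train_titles[doc_id], round(score, 3))
```

Because scores are accumulated only over the posting lists of the query terms, documents that share no term with the query are never touched, so fewer than three matches may be returned for a very short or unusual description.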
Learning Objective 1: Explore programming functions to source, store and prepare data for machine learning applications.
Learning Objective 2: Design algorithmic models for the application of machine learning in information technology.
Learning Objective 3: Create advanced insights of strategic organisational value with the aid of machine learning.
You are required to follow the below guidelines:
You should write your Text Retrieval System code using the Python 3 programming language.
The use of any Python third-party package(s) is restricted to the following tasks:
Loading the datasets. E.g., Pandas.
Any necessary text pre-processing steps. E.g., Natural Language Toolkit, etc.
Performing necessary calculations during the building of the system. E.g., NumPy.
Calculating the performance of the system. E.g., scikit-learn, Matplotlib, Plotly, etc.
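To illustrate loading and pre-processing within the guidelines above, here is a minimal sketch that uses only the Python standard library. It assumes the files have a header row with the title and description columns described earlier. The two pre-processing techniques shown (lowercasing with punctuation removal, and stop-word removal) are just one possible choice. The helper names `load_rows` and `preprocess`, the stop-word list and the in-memory sample row are hypothetical. `pandas.read_csv` and an NLTK stop-word list could be substituted for the standard-library pieces.

```python
import csv
import io
import re

# Illustrative stop-word list (an assumption, not the list any particular library ships).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "on", "for", "with"}


def load_rows(handle):
    """Read (title, description) pairs from an open CSV file handle with a header row."""
    return [(row["title"], row["description"]) for row in csv.DictReader(handle)]


def preprocess(text):
    """Apply two pre-processing steps and return the remaining tokens."""
    # Step 1: lowercase and keep only alphanumeric runs, which removes punctuation.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Step 2: remove stop words.
    return [t for t in tokens if t not in STOP_WORDS]


# In-memory stand-in for train.csv so the sketch runs without the real file.
sample_csv = io.StringIO(
    "title,description\n"
    '"Space Quest","An astronaut crew travels through space, to save a dying planet!"\n'
)
rows = load_rows(sample_csv)
tokens = preprocess(rows[0][1])
print(tokens)
```

With the real data, the `io.StringIO` object would be replaced by a file opened with `open("train.csv", newline="", encoding="utf-8")`, where the encoding is itself an assumption about how the files were exported.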