Reference no: EM132315443
Assignment Task
This assignment consists of two deliverables:
• PySpark source code in Jupyter Notebook format. All Jupyter notebook files and the dataset file relating to this assignment should be contained within a folder named: Task 3- Your Name-Student Number. The folder is then to be zipped and uploaded to Blackboard.
• A report. The report must be uploaded as a separate file.
Part I - PySpark source code
Important Note: For reproducibility, your code must be self-contained. That is, it should not require any libraries beyond the PySpark environment we have used in the workshops.
In this component, you need to use Python 3 and PySpark to complete the following data analysis tasks:
1. Exploratory data analysis
2. Recommendation engine
3. Classification
4. Clustering
You need to choose a dataset from Kaggle to complete these tasks. Remember to include the dataset file in your source code submission.
Task I.1: Exploratory data analysis
This subtask requires you to explore your dataset by:
• reporting its number of rows and columns,
• cleaning the data (missing values or duplicated records), if necessary,
• summarising 3 columns with plots (e.g. bar chart, histogram, boxplot, etc.)
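As a rough illustration of the first two bullets, the following plain-Python sketch counts rows and columns, drops duplicated or incomplete records, and tallies one column as the data behind a bar chart. The sample data, column names, and helper functions are hypothetical stand-ins for a Kaggle file. In a PySpark DataFrame the corresponding operations would typically be count(), columns, dropDuplicates() and dropna(), assuming the DataFrame API was the one used in the workshops.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a Kaggle CSV file.
SAMPLE_CSV = """user_id,item_id,rating
1,10,4
1,10,4
2,11,
3,12,5
4,10,3
"""


def load_rows(text):
    """Parse CSV text into a header list and a list of row tuples."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = [tuple(row) for row in reader]
    return header, rows


def clean_rows(rows):
    """Drop exact duplicates and rows with any empty field, keeping first-seen order."""
    seen = set()
    cleaned = []
    for row in rows:
        if row in seen or any(field.strip() == "" for field in row):
            continue
        seen.add(row)
        cleaned.append(row)
    return cleaned


def column_counts(header, rows, column):
    """Count occurrences of each value in one column (the data behind a bar chart)."""
    idx = header.index(column)
    return Counter(row[idx] for row in rows)


header, rows = load_rows(SAMPLE_CSV)
cleaned = clean_rows(rows)
shape_before = (len(rows), len(header))    # (rows, columns) before cleaning
shape_after = (len(cleaned), len(header))  # (rows, columns) after cleaning
item_counts = column_counts(header, cleaned, "item_id")
```

Reporting the shape both before and after cleaning makes it clear how many records were removed and why.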
Task I.2: Recommendation engine
This subtask requires you to implement a recommender system based on collaborative filtering with the Alternating Least Squares (ALS) algorithm. You need to include:
• Model training and predictions
• Model evaluation using MSE
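To make the evaluation bullet concrete, the sketch below shows what MSE measures for a recommender: the average squared difference between predicted and actual ratings over the (user, item) pairs that appear in both sets. The function name and the sample ratings are hypothetical. In PySpark the same quantity would usually be obtained by joining the model's predictions with the held-out ratings on the (user, item) key and averaging the squared differences.

```python
def mean_squared_error(actual, predicted):
    """MSE over the (user, item) pairs present in both dictionaries."""
    shared = actual.keys() & predicted.keys()
    if not shared:
        raise ValueError("no overlapping (user, item) pairs to evaluate")
    return sum((actual[k] - predicted[k]) ** 2 for k in shared) / len(shared)


# Hypothetical held-out ratings and model predictions, keyed by (user, item).
actual = {(1, 10): 4.0, (1, 11): 2.0, (2, 10): 5.0}
predicted = {(1, 10): 3.5, (1, 11): 2.5, (2, 10): 4.0, (2, 12): 3.0}

# The pair (2, 12) has no actual rating, so it is ignored.
mse = mean_squared_error(actual, predicted)
```

A lower MSE indicates predictions closer to the observed ratings. Taking the square root gives RMSE, which is expressed in the same units as the ratings.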
Task I.3: Classification
This subtask requires you to implement a classification system using logistic regression with the LogisticRegressionWithLBFGS class. You need to include:
• Logistic Regression model training
• Model evaluation
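The brief does not name a specific evaluation metric for the classifier. As one possible reading, the plain-Python sketch below computes accuracy, precision and recall from (prediction, label) pairs for a binary problem. The function name and sample pairs are hypothetical. In PySpark such pairs would typically be produced by applying the trained model's predict method to the features of each test record.

```python
def binary_metrics(pairs):
    """Return accuracy, precision and recall for (prediction, label) pairs with 0/1 values."""
    tp = fp = tn = fn = 0
    for prediction, label in pairs:
        if prediction == 1 and label == 1:
            tp += 1
        elif prediction == 1 and label == 0:
            fp += 1
        elif prediction == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("no (prediction, label) pairs to evaluate")
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical (prediction, label) pairs from a held-out test set.
pairs = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 0), (0, 1), (1, 1)]
metrics = binary_metrics(pairs)
```

Reporting precision and recall alongside accuracy is useful when the classes are imbalanced, since accuracy alone can look good for a model that rarely predicts the minority class.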
Task I.4: Clustering
This subtask requires you to implement a clustering system using K-means. You need to include:
• Model training
• Model evaluation
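A common way to evaluate a K-means model is the within-set sum of squared errors (WSSSE): the sum, over all points, of the squared distance to the nearest cluster centre. The brief does not prescribe this metric, so treat the sketch below as one reasonable choice. The function names, points and centres are hypothetical, and the centres are assumed to come from an already trained model.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def wssse(points, centres):
    """Sum of squared distances from each point to its nearest centre."""
    return sum(min(squared_distance(p, c) for c in centres) for p in points)


# Hypothetical 2-D points and the two centres a trained model might return.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centres = [(0.0, 1.0), (10.0, 11.0)]
cost = wssse(points, centres)
```

Computing this cost for several values of k and plotting the results is a common way to choose k, looking for the point where further increases in k give little improvement.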
Part II - Report
You are required to write a report explaining the theory underlying the key concepts around the design and implementation of your code. You must also include all code in .py format in the appendices of the report. Note that the code will not count towards the word count.
Your report should use the following template:
Table of Contents
1.0 Introduction
Key System Concepts
Machine learning pipelines. Explain key steps in machine learning pipelines and how they were applied in your code.
Collaborative filtering. Explain Collaborative filtering principles and how they were applied in your code.
Logistic regression. Explain Logistic regression principles and how they were applied in your code.
K-Means. Explain K-Means principles and how they were applied in your code.
4.0 Conclusion
References
Appendices
Report Format
Your report should be 1000 to 1500 words.
The report MUST be formatted using the following guidelines:
• Title Page - Must not contain headers, footers, or page numbering. Include your name as the report's author.
• Header - Report title
• Footer - your name and the page number
• Paragraph text - 12-point Calibri, single line spacing
• Headings - Arial in an appropriate type size
• Margins - 2.5cm on all margins
• Page numbering - use conventional numerals (1, 2, 3, 4) from the introduction onwards, with the introduction starting at page 1.
• The report is to be created as a single Microsoft Word document (version 2007 or later). No other format is acceptable; submitting in another format will result in the deduction of marks.
Attachment: Data Science Practice.rar