Reference no: EM132993099
ICT707 Data Science Practice - University of the Sunshine Coast
Part I - PySpark source code
Important Note:
- For code reproduction, your code must be self-contained. That is, it should not require any libraries beyond the PySpark environment we have used this semester. The data files must be packaged with your code file.
- The data sets used in the lecture slides must not be used as the data set for the assignment. Doing so will result in a mark of 0 for the coding component.
In this component, you need to use Python 3 and PySpark to complete the following data analysis tasks:
1. Exploratory data analysis
2. Recommendation engine
3. Classification
You need to choose a dataset from Kaggle to complete these tasks.
Task I.1: Exploratory data analysis
This subtask requires you to explore your dataset by
• reporting its number of rows and columns,
• cleaning the data (handling missing values or duplicated records) if necessary,
• selecting 3 columns and drawing 1 plot (e.g. bar chart, histogram, boxplot) for each to summarise it.
Task I.2: Recommendation engine
This subtask requires you to implement a recommender system based on collaborative filtering with the Alternating Least Squares (ALS) algorithm. You need to include
• Model training and predictions
• Model evaluation using MSE
Task I.3: Classification
This subtask requires you to implement a classification system using logistic regression. You need to include
• Logistic Regression model training
• Model evaluation
Part II - Report
You are required to write a report with the following content:
• Provide a high-level survey of the advances in data science over the past 2 years.
• Compare the features of Spark version 2.4, which we used this semester, with those of the new version 3.0.
• Explain your design and implementation of the machine learning parts in your code, including the following topics:
o Background of your selected data set
o For each task, which learning algorithm is used, what its key parameters are, and how you set them up
o For each task, comments on and an evaluation of the learnt model
Your report should use the following template:
Table of Contents
1.0 Advancement of Data Science (500 words)
2.0 Comparison of Spark 2.4 and 3.0 (250 words)
3.0 Machine Learning Implementation (250 words)
3.1 Data set
3.2 Collaborative filtering
Features of the model, key parameters and configuration
Evaluation
3.3 Logistic regression
Features of the model, key parameters and configuration
Evaluation
References
Attachment: Data Science Practice.rar