Reference no: EM132548991
COS60008 Introduction to Data Science - Swinburne University of Technology
Project Overview
You are provided with a dataset, chocolate.csv, on chocolate bars. Your goal is to develop a machine learning model that takes the properties of a specific chocolate bar (e.g. the percentage of cocoa, the origin of the beans) and outputs its rating. The dataset contains the relevant information for a number of chocolate bars, along with expert ratings as the ground truth.
Data source
The dataset is from Brady Brelinski, Founding Member of the Manhattan Chocolate Society. The data is also used in a Kaggle competition.
Columns description
• Company (Maker-if known): name of the company (string).
• Specific Bean Origin: the geographical origin for the chocolate bar (string).
• REF: a value indicating when the review was entered in the database. A higher value indicates a more recent entry (integer).
• Review Year: the year the review was published (integer).
• Cocoa Percentage: cocoa percentage of the chocolate bar (string).
• Company Location: the country of the manufacturer (string).
• Rating: expert rating for the chocolate bar (float). This is the label to be predicted by the model. It is a number from 1 (lowest quality) to 5 (highest quality).
• Bean Type: the type of cocoa bean used (string).
• Broad Bean Origin: the broader geographical origin of the cocoa bean (string).
Dataset dimension
• Samples (rows): 1500
• Attributes (columns): 9 (including the target: rating)
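The column types listed above suggest a few cleaning steps before any modelling. The following is a minimal sketch, assuming pandas is available and that the CSV headers match the column names given above (the real file may differ). The four rows are made-up stand-ins for chocolate.csv, and preprocess is a hypothetical helper, not part of the assignment material.

```python
import io

import numpy as np
import pandas as pd

# Made-up stand-in rows; in practice this would be pd.read_csv("chocolate.csv").
raw = io.StringIO(
    "Company (Maker-if known),Specific Bean Origin,REF,Review Year,"
    "Cocoa Percentage,Company Location,Rating,Bean Type,Broad Bean Origin\n"
    "Maker A,Origin W,1876,2016,63%,France,3.75, ,Togo\n"
    "Maker A,Origin X,1676,2015,70%,France,2.75, ,Peru\n"
    "Maker B,Origin Y,111,2007,70%,Italy,4.0,Trinitario,Venezuela\n"
    "Maker C,Origin Z,647,2011,72%,Austria,3.5,,Peru\n"
)

# Columns described as strings in the column list (assumed header names).
TEXT_COLS = [
    "Company (Maker-if known)", "Specific Bean Origin", "Cocoa Percentage",
    "Company Location", "Bean Type", "Broad Bean Origin",
]


def preprocess(df):
    """Hypothetical helper: clean text, parse percentages, add dummy variables."""
    df = df.copy()
    # Trim stray whitespace and turn blank strings into missing values.
    for col in TEXT_COLS:
        df[col] = df[col].str.strip().replace("", np.nan)
    # Convert strings such as "70%" into the number 70.0.
    df["Cocoa Percentage"] = df["Cocoa Percentage"].str.rstrip("%").astype(float)
    # One simple way to deal with missing categories: an explicit "Unknown" level.
    cat_cols = ["Company Location", "Bean Type", "Broad Bean Origin"]
    df[cat_cols] = df[cat_cols].fillna("Unknown")
    # Create dummy (one-hot) variables for the categorical columns.
    return pd.get_dummies(df, columns=cat_cols)


clean = preprocess(pd.read_csv(raw))
print(clean.shape)
print(clean.filter(like="Bean Type").sum())
```

Filling missing categories with "Unknown" is only one option; dropping rows or imputing the most frequent value are alternatives a team may prefer.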
Tasks
Your team will need to accomplish the following tasks. You should apply the suitable techniques covered in the lectures and tutorials.
1. Perform data pre-processing. This includes but is not limited to checking for typos, dealing with missing values, and creating dummy variables.
2. Formulate the problem as a machine learning task.
3. Select three learning algorithms based on the previous task and identify the corresponding hyperparameters if any. There must be at least one hyperparameter (to be optimised in Task 5).
4. Perform data partitioning. Split the data into training data and test data. The training data will be used for model development and the test data for performance evaluation.
5. Perform model development.
o List all your learning algorithms by expanding on the hyperparameters. For example, you might select RandomForest, K-Nearest Neighbours (K-NN) and Artificial Neural Networks (ANN) as the three learning algorithms. You nominate the number of neighbours N as the hyperparameter and propose 5 possible values (e.g. 6, 7, 8, 9, 10). Hence, effectively, you will have the following algorithms:
• RandomForest (0 hyperparameters, 1 model)
• ANN (0 hyperparameters, 1 model)
• K-NN (1 hyperparameter with 5 possible values, 5 models)
» K-NN (N=6)
» K-NN (N=7)
» K-NN (N=8)
» K-NN (N=9)
» K-NN (N=10)
o Assess each learning algorithm on the training data. For a given learning algorithm L, you will assess its validation performance as follows:
• Define an n-fold cross-validation within the training data, where n is between 3 and 5.
• In each fold, identify the actual training data trData and the validation data vlData. Train L on trData and test it on vlData to get the validation performance P.
• Obtain the average of P over all folds, which is the final performance of L.
o Select the model M with the highest validation performance.
6. Perform performance assessment.
o Apply M to the test data to get the predictions.
o Calculate the accuracy and the confusion matrix.
7. Conduct further analysis, to be decided by the team members. For example:
o Identify the most predictive attributes.
o Plot the chocolate ratings geographically on a map.
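Tasks 4 to 6 can be sketched end to end as follows. This is a minimal illustration, assuming scikit-learn and treating the problem as classification (which the accuracy and confusion matrix in Task 6 imply). The data is a synthetic stand-in generated with make_classification, not the chocolate dataset, and validation_performance is a hypothetical helper name. The candidate list mirrors the RandomForest, ANN and K-NN example from Task 5.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the pre-processed features and rating classes.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# Task 4: hold out a test set that is not touched during model development.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Task 5: expand the K-NN hyperparameter into one candidate model per value.
candidates = {
    "RandomForest": RandomForestClassifier(random_state=0),
    "ANN": MLPClassifier(max_iter=1000, random_state=0),
}
for n in (6, 7, 8, 9, 10):
    candidates[f"K-NN (N={n})"] = KNeighborsClassifier(n_neighbors=n)


def validation_performance(model, X_all, y_all, n_folds=5):
    """Average validation accuracy P of one candidate over n folds."""
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    scores = []
    for tr_idx, vl_idx in folds.split(X_all, y_all):
        # trData trains a fresh copy of the model; vlData scores it.
        fitted = clone(model).fit(X_all[tr_idx], y_all[tr_idx])
        scores.append(accuracy_score(y_all[vl_idx], fitted.predict(X_all[vl_idx])))
    return float(np.mean(scores))


cv_scores = {name: validation_performance(model, X_train, y_train)
             for name, model in candidates.items()}

# Select the model M with the highest validation performance.
best_name = max(cv_scores, key=cv_scores.get)

# Task 6: refit M on all training data, then evaluate once on the test data.
best_model = clone(candidates[best_name]).fit(X_train, y_train)
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(best_name, round(test_accuracy, 3))
print(cm)
```

Refitting the selected model on the full training data before testing is a common choice rather than something the brief prescribes. The test data is used only once, after selection, so the reported accuracy is not biased by the hyperparameter search.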