Assignment
A medium-size Australian company (imaginary) has given you one year of data about the online purchases that their customers have made. They want you to analyse the data using statistical and machine learning techniques and produce:
a prediction algorithm for predicting how much money each customer is likely to spend in a year;
a classification algorithm for predicting which customers will be 'big spenders';
some recommendations on what marketing strategy they should use to attract more 'big spender' customers.
Instructions
Follow all the instructions in this notebook to complete these tasks. Note that some cells contain 'assert' statements - these will automatically mark your work so that you can check that you have done the preceding steps correctly. (If they give errors, go back and correct your previous work until the errors are fixed. Once those 'assert' cells execute without errors, you know that you have achieved the marks for that step.)
When you have finished, this notebook is the only file that you will need to submit to Blackboard.
Note: If you want some space to try out some Python code of your own, feel free to add extra cells to this notebook. Before you submit, make sure that those extra cells either execute without error or have been deleted.
Overview
You have five sections to complete in this notebook:
Part A: Load and Clean Data
Part B: Data Exploration
Part C: Predicting Spending Levels
Part D: Predicting Big Spenders
Part E: Business Recommendations
Part A: Load and Clean Data
Save your CSV data file into the same folder as this notebook.
Write Python code to load your dataset into a Pandas DataFrame called 'sales'.
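As a rough illustration of the loading step, here is a minimal sketch. The file name, the inline sample data, and the column names State, Sex and Income are assumptions (only SpendValue and Purchases are named in this notebook), and the cleaning shown (dropping duplicates and rows with missing values) is just one plausible choice; your dataset may need something different.

```python
import io

import pandas as pd

# Hypothetical stand-in for your CSV file. Only SpendValue and Purchases
# are named in the brief; State, Sex and Income are assumed column names.
csv_text = """State,Sex,Income,Purchases,SpendValue
NSW,0,52000,4,310.50
VIC,1,61000,6,455.00
QLD,1,,3,120.25
NSW,0,48000,4,310.50
NSW,0,48000,4,310.50
"""

# With a real file this would be something like:
#     sales = pd.read_csv("your_file.csv")
sales = pd.read_csv(io.StringIO(csv_text))

# Assumed cleaning: remove exact duplicate rows, then rows with missing values.
sales = sales.drop_duplicates().dropna().reset_index(drop=True)

print(sales.shape)
print(sales.dtypes)
```

Checking `sales.dtypes` after loading is a quick way to confirm that numeric columns were not read in as text.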
Part B: Data Exploration
In this section, you will explore the data statistically and visually, to get a feel for what kinds of data you have, and how much people are spending on your web site.
B.1 Data Inspection
Start by using the Pandas **describe()** function to analyse all the numeric columns of your 'sales' DataFrame. Spend some time looking at this and making sure that you understand the average (mean) and range (min and max) of each column.
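A small sketch of what this inspection might look like, using a made-up four-row table in place of your real 'sales' DataFrame (the values are purely illustrative):

```python
import pandas as pd

# Made-up data standing in for the real 'sales' DataFrame.
sales = pd.DataFrame({
    "Purchases": [2, 4, 6, 8],
    "SpendValue": [100.0, 200.0, 300.0, 400.0],
})

# describe() summarises every numeric column: count, mean, std,
# min, quartiles (25%, 50%, 75%) and max.
summary = sales.describe()

# Pull out just the rows discussed in this section.
print(summary.loc[["mean", "min", "max"]])
```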
Data Inspection Questions
In the next cell, write your observations about the "SpendValue" and "Purchases" columns. For each column, say what the average value is and discuss what that means in terms of your sales to an average person. Also discuss the min and max values.
Based on the "SpendValue" column, explain how much your "big spenders" (the top 25% of your clients) are spending each year. This will be a range of values, such as from 1000 to 2000 dollars.
Your discussion must all be in the next cell.
Add three level-2 headings in that cell to break your discussion into topics: "Purchases column", "SpendValue column", and "Big Spenders".
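One possible way to find the big-spender range is to take the 75th percentile of SpendValue as the lower bound and the maximum as the upper bound. The sketch below uses made-up values; it assumes each row of 'sales' is one customer.

```python
import pandas as pd

# Made-up spending values standing in for the real SpendValue column.
sales = pd.DataFrame({
    "SpendValue": [100.0, 200.0, 300.0, 400.0, 500.0,
                   600.0, 700.0, 800.0, 900.0],
})

# Customers at or above the 75th percentile form the top 25%.
cutoff = sales["SpendValue"].quantile(0.75)
top = sales["SpendValue"].max()

print(f"Big spenders spend between {cutoff:.2f} and {top:.2f} dollars a year")
```

The same cutoff also appears as the "75%" row in the output of describe().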
B.2 Differences between States
We want to know where most of our customers live and whether customers from certain areas spend more or less than average. Write some Pandas code to calculate and display the total **number of customers** in each Australian state (NSW, QLD, VIC, etc.) and their average **SpendValue**.
Hint: you could do this by *grouping* your 'sales' table, or by *looping* through all the states, or several other ways.
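The grouping approach from the hint could look roughly like this. The column name State is an assumption (use whatever your dataset calls it), the data is made up, and each row is assumed to be one customer.

```python
import pandas as pd

# Made-up data; "State" is an assumed column name.
sales = pd.DataFrame({
    "State": ["NSW", "NSW", "VIC", "QLD", "VIC", "NSW"],
    "SpendValue": [100.0, 300.0, 250.0, 150.0, 350.0, 200.0],
})

# Group by state, then count the customers and average their spend.
by_state = (
    sales.groupby("State")["SpendValue"]
    .agg(Customers="count", AvgSpend="mean")
    .sort_values("Customers", ascending=False)
)

print(by_state)
```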
Question:
Discuss these results and explain your conclusions.
For example, are there *significant* differences in the average spend in different states? Are our customers spread evenly across Australia, or concentrated in particular areas?
Write your answer in the next cell, and give reasons for your conclusions.
Part C: Predicting Spending Levels
Using the LinearRegression function from the Scikit-Learn library (sklearn), build a machine learning model for predicting the expected SpendValue for a customer.
Measure the performance of your model using 10-fold cross-validation with a test set size of 20% and print various measures of how accurate your predictions are.
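The sketch below shows one way to read this instruction: ten random 80/20 train/test splits using scikit-learn's ShuffleSplit. That reading is an assumption, since a strict 10-fold KFold would give 10% test sets. The feature names Sex, Income and Purchases and the synthetic data are also assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_validate

# Synthetic data standing in for the real 'sales' DataFrame.
rng = np.random.default_rng(0)
n = 200
sales = pd.DataFrame({
    "Sex": rng.integers(0, 2, n),
    "Income": rng.uniform(20000, 160000, n),
    "Purchases": rng.integers(1, 20, n),
})
sales["SpendValue"] = (
    0.002 * sales["Income"] + 30 * sales["Purchases"] + rng.normal(0, 20, n)
)

X = sales[["Sex", "Income", "Purchases"]]
y = sales["SpendValue"]

# Ten repeated random splits, each holding out 20% of the rows for testing.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = cross_validate(
    LinearRegression(), X, y, cv=cv,
    scoring=("r2", "neg_mean_absolute_error", "neg_root_mean_squared_error"),
)

# scikit-learn reports errors as negative scores, so flip the sign.
mean_r2 = scores["test_r2"].mean()
mean_mae = -scores["test_neg_mean_absolute_error"].mean()
mean_rmse = -scores["test_neg_root_mean_squared_error"].mean()

print(f"R^2:  {mean_r2:.3f}")
print(f"MAE:  {mean_mae:.2f} dollars")
print(f"RMSE: {mean_rmse:.2f} dollars")
```

Reporting MAE and RMSE in dollars alongside R^2 makes the size of a typical prediction error easier to discuss.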
Analysis of Results
Print out the linear regression coefficients for all the input features, so that you can see which ones are more significant and which ones are unimportant.
Hint 1: Since the scales of the input features are so different (0-1 for sex, 0-160000 for income, etc.), multiply each linear regression coefficient by the average value of the corresponding column to see how many dollars that column contributes to the total predicted spend.
Hint 2: Could you graph the predicted and actual SpendValue values of the test data, to see visually how good the linear regression results are?
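Hint 1 can be sketched as follows: fit the model, then multiply each coefficient by its column mean to get an approximate dollar contribution for an average customer. The feature names and synthetic data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data standing in for the real input features and target.
rng = np.random.default_rng(1)
n = 200
X = pd.DataFrame({
    "Sex": rng.integers(0, 2, n),
    "Income": rng.uniform(20000, 160000, n),
    "Purchases": rng.integers(1, 20, n),
})
y = 0.002 * X["Income"] + 30 * X["Purchases"] + rng.normal(0, 20, n)

model = LinearRegression().fit(X, y)

# Raw coefficients are in dollars per unit of each feature, so they are
# not directly comparable across features with different scales.
coefficients = pd.Series(model.coef_, index=X.columns)

# Coefficient times column mean: dollars contributed for an average customer.
contribution = coefficients * X.mean()

print(pd.DataFrame({"coefficient": coefficients, "contribution": contribution}))
```

A feature with a tiny coefficient, such as income here, can still contribute a large dollar amount once its scale is taken into account.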
Discussion:
Discuss your conclusions about this linear regression model (in the next cell). Which input features are most significant?
Part D: Predicting Big Spenders
In this section we want to build some machine learning models to predict whether a new customer is likely to be a big spender or not. This will be a binary outcome (yes or no), so we can use machine learning *classification* algorithms.
Remember that our definition of 'Big-Spender' is a client whose annual spending level (**SpendValue**) is in the top 25% of our clients. So the exact dollar cutoff for big spenders will be different for each student, as each of you is working for a different company and using a different dataset.
Choose two classification algorithms. Use each one to build and then evaluate a 'big-spender' prediction model.
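As an illustration only, the sketch below labels big spenders using the 75th percentile of SpendValue and compares two classifiers with 5-fold cross-validated accuracy. The choice of logistic regression and a decision tree, the feature names, the BigSpender column name, and the synthetic data are all assumptions; any two classification algorithms would fit the task.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the real 'sales' DataFrame.
rng = np.random.default_rng(2)
n = 400
sales = pd.DataFrame({
    "Sex": rng.integers(0, 2, n),
    "Income": rng.uniform(20000, 160000, n),
    "Purchases": rng.integers(1, 20, n),
})
sales["SpendValue"] = (
    0.002 * sales["Income"] + 30 * sales["Purchases"] + rng.normal(0, 20, n)
)

# Label the top 25% of customers by SpendValue as big spenders (1) or not (0).
cutoff = sales["SpendValue"].quantile(0.75)
sales["BigSpender"] = (sales["SpendValue"] >= cutoff).astype(int)

X = sales[["Sex", "Income", "Purchases"]]
y = sales["BigSpender"]

# Two example classifiers; logistic regression is scaled first because
# the features have very different ranges.
models = {
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

results = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, accuracy in results.items():
    print(f"{name}: mean accuracy {accuracy:.3f}")
```

Because only about 25% of customers are big spenders, a model that always answers "no" already scores about 75% accuracy, so it is worth comparing results against that baseline.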
Discussion:
Discuss your conclusions about your two classification models (in the next cell).
Which classification algorithm gives the more accurate results?
How accurate are the results from your best classifier?
Part E: Business Recommendations
The company you are doing this analysis for wants some recommendations from you about how to find new customers who are likely to be big spenders. They are wondering whether they should focus their advertising on a particular gender, or on people in a given state, such as Victoria or NSW, or on demographic groups with high or medium income levels, or whether they should pursue other strategies. What recommendations will you give them?
Write about 100 words describing your conclusions from your analysis, and your recommendations for the best strategy for attracting new big-spender customers.