Reference no: EM132304495
Background
In this assignment, you will analyse an open dataset about a marketing campaign run by a Portuguese bank and design strategies for improving future marketing campaigns. The objective of the campaign is to persuade customers to subscribe to a term deposit. The marketing campaigns were conducted over phone calls. The dataset contains the call information, with the attributes listed in Table 2.1.
Task Description
We provide one IPython notebook, Task2.ipynb, together with a csv file, bank.csv, in the data sub-folder. You are required to analyse this dataset using the IPython notebook with Spark packages, including spark.sql and pyspark.ml.
Table 2.1: Attribute information of the dataset
| Attribute | Meaning |
|---|---|
| age | age of the customer |
| job | type of job |
| marital | marital status |
| education | education level |
| default | has credit in default? |
| balance | the balance of the customer |
| housing | has housing loan? |
| loan | has personal loan? |
| contact | contact communication type |
| day | last contact day of the week |
| month | last contact month of year |
| duration | last contact duration, in seconds |
| campaign | number of contacts performed |
| pdays | number of days that passed by after a previous campaign |
| previous | number of contacts performed before this campaign |
| poutcome | outcome of the previous marketing campaign |
| deposit | has the client subscribed a term deposit? |
Python Notebook
To systematically investigate this dataset, your IPython notebook should follow these six basic procedures:
(1) Import the csv file, "bank.csv", as a Spark dataframe named df, then check and explore its individual attributes.
(2) Select important attributes from df as a new dataframe df2 for further investigation. You are required to select 13 important attributes from df: 'age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan', 'campaign', 'pdays', 'previous', 'poutcome' and 'deposit'.
(3) Remove all invalid rows from the dataframe df2 using spark.sql. A row is invalid if at least one of its attributes contains 'unknown'. For the attribute 'poutcome', the only valid values are 'failure' and 'success'.
(4) Convert all categorical attributes in df2 to numerical attributes using one-hot encoding, then apply Min-Max normalisation to each attribute.
(5) Perform unsupervised learning on df2, including k-means and PCA. For k-means, you can use the whole of df2 as both training and testing data and evaluate the clustering result using accuracy. For PCA, you can generate a scatter plot of the first two components to investigate the data distribution.
(6) Perform supervised learning on df2 using Logistic Regression, Decision Tree and Naive Bayes. For the three classification methods, use 70% of df2 as training data and the remaining 30% as testing data, and evaluate their prediction performance using accuracy.
Case Study Report
Based on your IPython notebook results, you are required to write a case study report of 500-1000 words, which should include the following information:
(1) The data attribute distribution
(2) The methods/algorithms you used for data wrangling and processing
(3) The performance of both unsupervised and supervised learning on the data
(4) The important features which affect the objective ('yes' in 'deposit') [Hint: you can refer to the coefficients generated by the Logistic Regression]
(5) Discuss the possible reasons for obtaining these analysis results and how they could be improved
(6) Describe the group activities, such as the task distribution for group members and what you have learnt during this project.
Attachment:- DATA ANALYTICS.rar