Reference no: EM132643438
BUS5PA Predictive Analytics - La Trobe University
Assignment - Building and Evaluating Predictive Models using SAS Enterprise Miner
Objectives:
a) Demonstrate knowledge of building different types of predictive models using SAS Enterprise Miner
b) Demonstrate skill and knowledge in applying predictive models in a real-life predictive analytics task
c) Relate theoretical knowledge of predictive models and best practices to application scenarios
Business Case - Predictive Model for Vehicle Price Prediction
Beta (Pvt) Ltd is an Australian online car sales platform that provides an effective car buying and selling service. To help boost sales transactions, the management of Beta is building a car price estimation system to help second-hand car sellers sell their cars at the best price.
Beta management is very keen to trial predictive modeling for this task and has gathered a historical car sales dataset from a publicly available data repository. The dataset contains 21 variables describing previously sold cars. The attributes include the selling price of cars, year, odometer reading, fuel type, condition, location, etc. The attributes and their descriptions are listed below.
Variable | Description
id | Unique Id of the record
region | Region
price | Price
year | Launch year
manufacturer | Manufacturer
model | Model
condition | Overall condition of the vehicle
cylinders | Number of cylinders
fuel | Fuel type
odometer | Odometer reading
title_status | Condition - whether the vehicle is free from accidents/repaired/rebuilt etc.
transmission | Transmission type
vin | Vehicle identification number
drive | Drive type
size | Size of the vehicle
type | Vehicle type
paint_color | Paint color
county | County
state | State
lat | Latitude
long | Longitude
The management of Beta.com Ltd. is considering engaging you as an external consulting group to develop a reliable predictive model of car selling prices, using the historical dataset described above. Beta has provided you with sample datasets of BMW, Mercedes, Toyota and Honda cars to build separate price-prediction models. They also wish to compare and contrast the pricing models of these four car brands.
PART A
You have to select one of the four datasets provided (each group member should select a different dataset). Based on the selected vehicle dataset, you are required to build different predictive models and then compare and contrast them to determine which is the best model for that dataset. You are also provided with a scoring dataset, which you need to use for price prediction.
1. Setting up the project and exploratory analysis
a. Create a new project and create a data source based on the selected dataset.
b. Carry out a data exploration by using a StatExplore Node. Explain your findings with regard to your vehicle dataset.
c. Create a Data Partition with 70% of the data for training and 30% for validation.
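For intuition only, the 70/30 split in step (c) can be sketched outside SAS Enterprise Miner. The Python below is an illustrative assumption, not part of the required workflow: the function name `partition`, the seed, and the toy records are all hypothetical, and the Data Partition node performs the actual split in the assignment.

```python
import random


def partition(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and split it into (training, validation)."""
    shuffled = list(records)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy records standing in for rows of the vehicle dataset (values are made up).
rows = [{"id": i, "price": 10000 + 250 * i} for i in range(10)]
train, valid = partition(rows)
print(len(train), len(valid))
```

Every record lands in exactly one of the two partitions, so the validation rows are never seen during training.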
2. Decision tree-based modeling and analysis
Carry out the following modeling tasks for the selected vehicle dataset.
a. Create two Decision Tree models. Use two-way and three-way splits to create the two separate decision tree models.
For each decision tree,
I. How many leaves are in the optimal tree?
II. Which variable was used for the first split?
III. What were the competing splits for this first split?
b. Which of the decision tree models appears to be better? Justify your answer.
c. Refer to the selected decision tree model in part (b) and
I. Identify leaf nodes which have good predictive performance (two leaf nodes) and poor predictive performance (two leaf nodes).
II. Provide justifications for your selections.
III. Write down the rules for the pathways leading up to each selected leaf node.
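As a rough illustration of what "rules for the pathways" means in part (c), the sketch below walks a small hand-written tree and prints one IF/THEN rule per leaf. Everything in it is hypothetical: the tree structure, the split conditions, and the predicted prices are invented for the example and do not come from any of the provided datasets or from SAS Enterprise Miner output.

```python
def leaf_rules(node, path=()):
    """Yield (rule_text, predicted_value) for every leaf below `node`."""
    if "prediction" in node:
        # A leaf: the rule is the conjunction of all conditions on the way down.
        yield " AND ".join(path) or "(root)", node["prediction"]
        return
    for condition, child in node["branches"]:
        yield from leaf_rules(child, path + (condition,))


# Hypothetical tree with made-up thresholds and predicted prices.
toy_tree = {
    "branches": [
        ("year < 2012", {"prediction": 6500.0}),
        ("year >= 2012", {
            "branches": [
                ("odometer < 60000", {"prediction": 21000.0}),
                ("odometer >= 60000", {"prediction": 14000.0}),
            ]
        }),
    ]
}

rules = list(leaf_rules(toy_tree))
for rule, predicted_price in rules:
    print(f"IF {rule} THEN predicted price = {predicted_price}")
```

Each printed line corresponds to one leaf, with the conditions listed in the order the splits are applied from the root.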
3. Regression-based modeling and analysis
a. In preparation for regression, is any missing-value imputation needed? If yes, should you do this imputation before generating the decision tree models? Why or why not?
b. Use an Impute node connected to Data Partition node to handle missing values. Which variables have been imputed?
c. Are there any ordinal variables? Use the Replacement node to assign relevant values.
d. Conduct data exploration to select the best variables for the model. Explain your findings.
Hint: You can connect the 'Variable Selection' node in the 'Explore' tab to the data source and observe which variables have been picked as the selected variables. To manually change the variables, go to 'Manual Selection' in the properties panel and adjust the role of the variables.
e. Create a Regression model using the set of variables you identified as suitable in part d. You can choose the stepwise selection and use validation error as the selection criterion.
f. Run the Regression node and view the results.
I. Which variables are included in the final model? Explain what this means to the vehicle sales organization (very briefly).
II. What is the validation ASE? What does this mean in a predictive model?
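For reference, average squared error (ASE) is the sum of squared differences between actual and predicted values divided by the number of cases. The sketch below shows the arithmetic on four invented validation records; the function name and the prices are assumptions for illustration, and the figure you report should come from the Regression node results.

```python
import math


def average_squared_error(actual, predicted):
    """Return the mean of squared prediction errors (SSE divided by N)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up validation prices and model predictions.
actual = [12000.0, 8500.0, 15000.0, 9900.0]
predicted = [11000.0, 9000.0, 14000.0, 9900.0]

ase = average_squared_error(actual, predicted)
print(ase)             # 562500.0
print(math.sqrt(ase))  # 750.0, the error expressed in price units
```

Because ASE is in squared price units, its square root is often easier to interpret when explaining the result to a business audience.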
4. Model Comparison and Scoring
a. Use the Model Comparison node to compare and contrast the results from the decision tree and regression-based analyses. Describe and justify how you ascertained the better model.
b. Compare and contrast the best model selection for the car brand you selected. Would it have been sufficient to use only one modeling technique (decision tree or regression)? Provide justifications for your answer.
c. Use the best model for your vehicle brand to score the scoring dataset. Explain the output using plots.
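One simple way to think about the comparison in part (a) is to pick the candidate with the lowest validation ASE. The sketch below assumes that criterion; the model names and ASE values are placeholders invented for the example, and other fit statistics may also be relevant to your justification.

```python
def best_model(validation_ase):
    """Return the name of the model with the smallest validation ASE."""
    return min(validation_ase, key=validation_ase.get)


# Placeholder values only; substitute the statistics from your own runs.
candidate_ase = {
    "tree_two_way": 4.1e6,
    "tree_three_way": 3.8e6,
    "stepwise_regression": 4.4e6,
}

winner = best_model(candidate_ase)
print(winner)
```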
PART B:
As an important extended step in the predictive modelling process, Beta Management is interested in comparing and contrasting different predictive models specific to different car brands.
In this exercise, your team is expected to carry out a comprehensive analysis based on the predictive models created for the four car brands provided (Toyota, Honda, BMW, Mercedes). As a team you have to compare and contrast the outcomes of predictive models for the four car brands and create a report for Beta Management.
You may consider the following points:
a. Which variables were used in the predictive models to determine the price of the four brands? Discuss further by comparing the feature importance across the four brands (e.g., what do the differences in features/variables say about buyers' interest in different models?).
b. Compare the best selected predictive models for the four brands. Do you recommend decision trees or regression? Outline reasons.
c. As a team of data analysts, what are your suggestions to improve the car price prediction models of Beta Management?
Attachment:- Assignment_Data.zip