Problem - Regression using SVR or Random Forest

Reference no: EM132400010

Problem 1: Classification of e-tailer customers (a real-world problem) using support vector machines and random forests. You can use Weka or scikit-learn (Python).

Objectives: E-commerce Customer Identification (Raw). Try to get the best performance using preprocessing, feature selection, data balancing, and parameter tuning.

The task involves binary classification to determine customers of the e-tailer. The training data contains 334 variables for a known set of 10000 customers and non-customers with a ratio of 1:10, respectively. The test data consists of a set of examples and is drawn from the same distribution as the training set.

Data: The feature data is train.csv and the label data is train_label.csv with corresponding labels for the records in train.csv. The test.csv is the test data.

Preprocessing steps to do:

You may use Excel or write a simple script to merge the feature data file with the label data file and save the result as a CSV file, which you can then import into Weka.
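The merge step can be sketched as a short pandas script. This is a minimal sketch that assumes both files have no header row and hold one record per line in the same order; the column names `f1`, `f2`, ... and `class` are hypothetical placeholders, chosen only so that the merged file has a header line. The demo uses small in-memory stand-ins for train.csv and train_label.csv.

```python
import io

import pandas as pd


def merge_features_and_labels(features_src, labels_src):
    """Join a headerless feature CSV with a headerless one-column label CSV, row by row."""
    X = pd.read_csv(features_src, header=None)
    y = pd.read_csv(labels_src, header=None)
    if len(X) != len(y):
        raise ValueError("feature and label files have different row counts")
    # Hypothetical column names; they only provide the header line.
    X.columns = [f"f{i + 1}" for i in range(X.shape[1])]
    X["class"] = y.iloc[:, 0].to_numpy()
    return X


# Tiny in-memory stand-ins for train.csv and train_label.csv.
demo_features = io.StringIO("0.1,5,3\n0.4,2,8\n0.9,7,1\n")
demo_labels = io.StringIO("0\n1\n0\n")

merged = merge_features_and_labels(demo_features, demo_labels)
merged_csv = merged.to_csv(index=False)  # pass a file path instead to write to disk
```

With the real files, the two `io.StringIO` objects would be replaced by the paths to train.csv and train_label.csv.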

Missing values: Check whether the dataset contains any missing values. If so, use Weka's missing value estimation filter to estimate them so that the data is complete.

Normalization: Since the features have very different value ranges, apply Weka's normalization procedure to make them comparable.

Attribute/Feature selection: Since there are 334 features in the dataset, it may be useful to apply feature/attribute selection to reduce the dataset before training classifiers. Select one method (weka->filters->supervised->attribute->AttributeSelection) to do feature selection. Briefly describe your selected method and explain how it works.
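For those working in scikit-learn rather than Weka, one possible counterpart is univariate selection with `SelectKBest`. The sketch below is an illustration under assumptions, not the assignment's prescribed method: it scores each feature with the ANOVA F-statistic (`f_classif`) and keeps the top `k`. The data is synthetic, and the choice of `k=5` and of which columns are informative is arbitrary.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the real feature matrix (assumption: numeric features, 0/1 labels).
rng = np.random.default_rng(0)
n_samples, n_features = 200, 20
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, 2, size=n_samples)

# Make columns 3 and 7 informative by shifting them for the positive class.
X[y == 1, 3] += 2.0
X[y == 1, 7] -= 2.0

# Score every feature against the label and keep the 5 highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=5)
X_reduced = selector.fit_transform(X, y)

# Indices of the columns that survived selection.
kept = selector.get_support(indices=True)
```

Because each feature is scored on its own, this kind of filter is fast on 334 columns but cannot detect features that are only useful in combination.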

Hint 1: After you import the merged CSV file into Weka, the class label 1/0 is treated as a numeric value rather than a nominal label. Use the weka->filters->unsupervised->attribute->NumericToNominal filter to convert that column to a nominal class (you need to specify which column is your class label to apply this conversion). Also note that Weka takes the first line as feature names, so you need to add a header line of feature names.

Hint 2: The dataset is a severely unbalanced dataset. You may want to balance the data before training the classifier.
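One simple way to balance the data is random undersampling of the majority class. The helper below is a hypothetical sketch written with NumPy only; the function name and the synthetic 1:10 data are assumptions made for illustration.

```python
import numpy as np


def undersample_majority(X, y, seed=0):
    """Randomly drop rows until every class has as many rows as the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep_idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    rng.shuffle(keep_idx)  # avoid leaving the rows grouped by class
    return X[keep_idx], y[keep_idx]


# Synthetic data with the same 1:10 customer to non-customer ratio.
y = np.array([1] * 10 + [0] * 100)
X = np.arange(len(y) * 2, dtype=float).reshape(len(y), 2)

X_bal, y_bal = undersample_majority(X, y)
```

Balancing should be applied to training data only; the test data keeps its original distribution. In scikit-learn, an alternative that discards no rows is the `class_weight="balanced"` option of `SVC` and `RandomForestClassifier`.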

Hint 3: If you applied normalization or feature selection to your training data, you must apply exactly the same transformations to the test dataset. Otherwise the feature values will be inconsistent and you will get absurd results on the test data.
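In scikit-learn, the consistency described in Hint 3 can be obtained by fitting the preprocessing steps on the training data once and reusing the fitted object on the test data. The sketch below assumes mean imputation, min-max scaling and `SelectKBest`; these are illustrative choices, and the data is synthetic.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic train/test data with a few missing values (assumption for the demo).
rng = np.random.default_rng(1)
X_train = rng.normal(loc=50.0, scale=10.0, size=(120, 8))
y_train = rng.integers(0, 2, size=120)
X_train[y_train == 1, 0] += 30.0
X_train[5, 2] = np.nan
X_test = rng.normal(loc=50.0, scale=10.0, size=(30, 8))
X_test[0, 4] = np.nan

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # fill gaps with training means
    ("scale", MinMaxScaler()),                   # map training ranges to [0, 1]
    ("select", SelectKBest(f_classif, k=3)),     # keep 3 columns chosen on training data
])

# Statistics (means, min/max, selected columns) are learned from the training set only.
X_train_p = preprocess.fit_transform(X_train, y_train)

# The test set is transformed with those same learned statistics, never refitted.
X_test_p = preprocess.transform(X_test)
```

Test values may fall slightly outside [0, 1] after scaling, since the minimum and maximum come from the training set; that is expected.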

Hint 4: The best AUC value for this problem is 0.6821. See what you can get.

Experiments to do:

1) Experiments on the training dataset

You will need to build classifiers using the SVM and Random Forest algorithms to classify the data into customers and non-customers, and evaluate their performance.

Pick one decision tree algorithm from Weka such as J48graft and describe it. (There are many decision tree algorithms)

Explain pre-processing filters in the table below. Run your decision tree algorithm with the default parameters. This is to learn how the preprocessing affects performance.

Write down the corresponding performance measures for class 1 (customer) in the following table for each preprocessing step.

All measures are based on 10-fold cross-validation results (except the last row). Put your results in Table 1 (below).
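For the scikit-learn route, the 10-fold evaluation of both classifiers can be sketched as follows. This is one possible setup, not the required one: the data is a synthetic stand-in with roughly a 1:10 class ratio, the hyperparameters are library defaults, and `class_weight="balanced"` is an assumed way of handling the imbalance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the merged training data (about 1 positive per 10 negatives).
X, y = make_classification(
    n_samples=440, n_features=20, n_informative=5,
    weights=[10 / 11], random_state=0,
)

# Precision and recall are reported for class 1, matching the table.
scoring = {
    "precision": "precision",
    "recall": "recall",
    "mcc": make_scorer(matthews_corrcoef),
    "roc_auc": "roc_auc",
}

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

models = {
    # Scaling sits inside the pipeline so it is refitted on each training fold.
    "SVM": make_pipeline(MinMaxScaler(), SVC(class_weight="balanced", random_state=0)),
    "Random Forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
}

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    results[name] = {m: float(np.mean(scores[f"test_{m}"])) for m in scoring}
```

Each entry of `results` holds the fold-averaged precision, recall, MCC and ROC area for one algorithm, which corresponds to one row of Table 1.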

2) Use the best classifier you trained in step 1 to predict the class labels for the test dataset test10000.csv. Save your predicted labels to the file predict.csv.

Write a program to calculate precision, recall, MCC using the true labels in the test10000_label.csv and the predicted labels in your predict.csv file.
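A minimal sketch of such a program, using only the Python standard library, is shown below. It assumes each label file holds one 0/1 label per line with no header row and treats class 1 (customer) as the positive class; the function names are hypothetical, and the demo uses in-memory stand-ins for test10000_label.csv and predict.csv.

```python
import csv
import io
import math


def read_labels(src):
    """Read one integer label per row from an open CSV source, skipping empty rows."""
    return [int(float(row[0])) for row in csv.reader(src) if row]


def precision_recall_mcc(y_true, y_pred, positive=1):
    """Return (precision, recall, MCC) for the given positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("label files have different lengths")
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    # Undefined ratios (zero denominators) are reported as 0.0 by convention here.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, mcc


# In-memory stand-ins for test10000_label.csv and predict.csv.
true_src = io.StringIO("1\n0\n1\n1\n0\n0\n0\n1\n")
pred_src = io.StringIO("1\n0\n0\n1\n0\n1\n0\n1\n")

precision, recall, mcc = precision_recall_mcc(read_labels(true_src), read_labels(pred_src))
print(f"precision={precision:.4f} recall={recall:.4f} MCC={mcc:.4f}")
```

For the demo labels the confusion counts are TP = 3, FP = 1, FN = 1, TN = 3, so precision and recall are both 0.75 and MCC is (3·3 − 1·1) / √(4·4·4·4) = 0.5. With real files, the `io.StringIO` objects would be replaced by `open(path, newline="")` handles.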

Table 1: Comparing performance of classifier on test dataset

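| Algorithm performance      | Precision | Recall | MCC | ROC area |
|----------------------------|-----------|--------|-----|----------|
| SVM (10-fold CV)           |           |        |     |          |
| Random Forest (10-fold CV) |           |        |     |          |
|                            |           |        |     |          |
|                            |           |        |     |          |
| Result on test data        |           |        |     |          |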

Report requirement:

1) Describe the preprocessing methods you used in the above experiments: missing value estimation, normalization, attribute selection, random forest.

2) Report the performance results in Table 1.

3) Submit the program to calculate the performance measures: Precision, Recall, MCC from two label files.

Problem 2: Regression using SVR (Support vector regression) or Random Forest

The problem here is to develop a regression model that can beat a theory model.

The attached thermal-data.xlsx contains a dataset of material thermal conductivity.

Develop two regression programs (one using SVR, the other can use Random Forest) to predict the thermal conductivity (y-exp) using all the features in the columns before it (V, M, n, np, B, G, E, v, H, B', G', ρ, vL, vS, va, Θe, γel, γes, γe, A).

Report the MSE, RMSE, MAE, and R2 from 10-fold cross-validation. Compare them with the MSE, RMSE, MAE, and R2 of the theoretical model, computed from the values in column y-theory against y-exp.
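The evaluation can be sketched in scikit-learn as below. This is a sketch under stated assumptions: the data is synthetic (the real features, y-exp and y-theory would be loaded from thermal-data.xlsx), the hyperparameters are arbitrary starting points, and `regression_report` is a hypothetical helper name.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def regression_report(y_true, y_pred):
    """Return MSE, RMSE, MAE and R2 for one set of predictions."""
    mse = float(mean_squared_error(y_true, y_pred))
    return {
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "MAE": float(mean_absolute_error(y_true, y_pred)),
        "R2": float(r2_score(y_true, y_pred)),
    }


# Synthetic stand-in: y_exp plays the measured value, y_theory a crude theory model.
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 5))
y_exp = 3.0 * X[:, 0] + np.sin(4.0 * X[:, 1]) + rng.normal(scale=0.1, size=100)
y_theory = 3.0 * X[:, 0]

cv = KFold(n_splits=10, shuffle=True, random_state=0)
models = {
    # SVR is sensitive to feature scale, so scaling is part of the pipeline.
    "SVR": make_pipeline(StandardScaler(), SVR(C=10.0, epsilon=0.05)),
    "RandomForest": RandomForestRegressor(n_estimators=200, random_state=0),
}

# The theory model needs no training, so it is scored directly against y_exp.
reports = {"theory": regression_report(y_exp, y_theory)}
for name, model in models.items():
    # Each sample is predicted by a model that never saw it during fitting.
    y_cv = cross_val_predict(model, X, y_exp, cv=cv)
    reports[name] = regression_report(y_exp, y_cv)
```

Scoring the pooled out-of-fold predictions gives one set of metrics per model that can be set side by side with the theory model's metrics.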

Try to tune your parameters of the models to achieve the best performance.
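A grid search is one common way to do this tuning. The sketch below tunes an SVR with `GridSearchCV`; the grid values are arbitrary examples rather than recommended settings, and the data is again synthetic.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the thermal conductivity data.
rng = np.random.default_rng(0)
X = rng.uniform(size=(80, 4))
y = 2.0 * X[:, 0] + np.sin(4.0 * X[:, 1]) + rng.normal(scale=0.1, size=80)

pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf"))

# Parameter names are prefixed with the pipeline step name ("svr") and two underscores.
param_grid = {
    "svr__C": [1.0, 10.0, 100.0],
    "svr__gamma": ["scale", 0.1],
    "svr__epsilon": [0.01, 0.1],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, hence the negation
    cv=KFold(n_splits=10, shuffle=True, random_state=0),
)
search.fit(X, y)

best_params = search.best_params_
best_cv_mse = -search.best_score_
```

The score of the winning configuration is somewhat optimistic, because the same folds were used both to choose the parameters and to measure them.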

Plot the final scatter plot of predicted versus experimental values for your best model/result. The more closely the points cluster around the diagonal line, the better your model is.
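Such a plot can be produced with matplotlib roughly as follows. The function name `parity_plot`, the axis labels and the synthetic predictions are assumptions made for illustration; the non-interactive `Agg` backend is selected so the sketch also runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; remove this line for interactive use

import matplotlib.pyplot as plt
import numpy as np


def parity_plot(y_true, y_pred, path=None):
    """Scatter predicted against measured values and draw the y = x diagonal."""
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(y_true, y_pred, s=15, alpha=0.7)
    lo = float(min(np.min(y_true), np.min(y_pred)))
    hi = float(max(np.max(y_true), np.max(y_pred)))
    ax.plot([lo, hi], [lo, hi], "k--", label="y = x")  # perfect-prediction line
    ax.set_xlabel("y-exp (measured)")
    ax.set_ylabel("predicted")
    ax.set_aspect("equal", adjustable="box")
    ax.legend()
    if path is not None:
        fig.savefig(path, dpi=150)
    return fig, ax


# Synthetic stand-in for cross-validated predictions of the best model.
rng = np.random.default_rng(0)
y_true = rng.uniform(1.0, 10.0, size=50)
y_pred = y_true + rng.normal(scale=0.5, size=50)

fig, ax = parity_plot(y_true, y_pred)
```

Passing a file name as `path` saves the figure to disk.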


Attachment:- Assignment File.rar

Attachment:- Data Files.rar
