CSCE 822 - Data Mining and Warehousing Assignment

Reference no: EM132400011

CSCE 822 - Data Mining & Warehousing Assignment - University of South Carolina, USA

The attached melb_data.csv file is a snapshot of Tony Pino's Melbourne Housing dataset. Carry out the following data preprocessing steps, then apply the KNN and RandomForest algorithms to classify the property prices.

1. Fill the missing values in the dataset using the imputation approaches discussed in class. You can use scikit-learn's SimpleImputer:

from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer()

data_with_imputed_values = my_imputer.fit_transform(original_data)

The default imputer fills missing values with the column mean. You can try other imputation methods as well.
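As a rough illustration of how the choice of imputation method changes the filled-in values, the sketch below compares the "mean" and "median" strategies of SimpleImputer on a tiny made-up array. The array `toy` is an assumption for demonstration only and is not taken from melb_data.csv.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numeric data with two missing entries (not from melb_data.csv).
toy = np.array([[1.0, 10.0],
                [np.nan, 20.0],
                [3.0, np.nan],
                [100.0, 30.0]])

# Each imputer learns one fill value per column and replaces the NaNs with it.
mean_filled = SimpleImputer(strategy="mean").fit_transform(toy)
median_filled = SimpleImputer(strategy="median").fit_transform(toy)

print(mean_filled[1, 0])    # mean of 1, 3 and 100
print(median_filled[1, 0])  # median of 1, 3 and 100
```

The first column contains an outlier (100), so the mean and median fills differ noticeably, while the second column gets the same fill under both strategies. SimpleImputer also accepts strategy="most_frequent", which may be a better fit for categorical columns.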

2. Replace the categorical/nominal attributes with their one-hot encodings.

You can use the Category Encoders package, which is designed for use with scikit-learn in Python.

For more data encoding approaches, read the blog post "Smarter Ways to Encode Categorical Data for Machine Learning".
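To make the idea of one-hot encoding concrete, here is a minimal sketch. It uses scikit-learn's own OneHotEncoder rather than the Category Encoders package, simply so that it depends on nothing beyond scikit-learn and NumPy. The single-column array `property_type` and its values are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative categorical column; values are assumptions for this sketch.
property_type = np.array([["h"], ["u"], ["t"], ["h"]], dtype=object)

# Each distinct category becomes its own 0/1 column.
encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform(property_type).toarray()

# Categories are stored in sorted order, one list per input column.
categories = list(encoder.categories_[0])
print(categories)
print(one_hot)
```

With handle_unknown="ignore", a category that appears only in the test data is encoded as all zeros instead of raising an error, which can be convenient when the data is split randomly many times.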

3. Install the Weka system on your computer.

Sort all the property samples by price and divide them equally into 5 categories/classes: top value, high value, medium value, low value, bottom value.
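One possible way to build these five equally sized classes in Python before loading the data into Weka is sketched below. The function name `price_classes` and the sample prices are hypothetical. When the number of samples is not divisible by 5, np.array_split makes the earlier (cheaper) classes one sample larger.

```python
import numpy as np

# Class labels ordered from cheapest to most expensive.
LABELS = ["bottom value", "low value", "medium value", "high value", "top value"]

def price_classes(prices):
    """Assign each price to one of five (near-)equal-sized classes by rank."""
    prices = np.asarray(prices, dtype=float)
    order = np.argsort(prices, kind="stable")  # indices from cheapest to dearest
    classes = np.empty(len(prices), dtype=object)
    # Cut the sorted indices into five consecutive chunks, one per label.
    for label, chunk in zip(LABELS, np.array_split(order, len(LABELS))):
        classes[chunk] = label
    return classes

# Made-up prices for illustration only.
prices = [300, 1200, 450, 800, 950, 2000, 150, 600, 700, 1500]
classes = price_classes(prices)
print(list(zip(prices, classes)))
```

Because the classes are defined by rank rather than by fixed price thresholds, every class has (almost) the same number of samples, so a classifier that guesses at random would score about 20% accuracy.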

Apply Weka's KNN algorithm with K = 5 to 10 to classify the property instances into the 5 classes. Calculate the accuracy for each value of K.

Apply Weka's RandomForest algorithm and report its performance.

Split the whole dataset into a training set (66% of the samples) and a testing set (34% of the samples). Repeat the random split 10 times and calculate the average accuracy.

from sklearn.model_selection import train_test_split

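# test_size = 0.34 gives the 66%/34% split; use a different random_state for each of the 10 repetitions
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size = 0.34, random_state = 0)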

Method	Metric	K=5	K=6	K=7	K=8	K=9	K=10
KNN	Average accuracy	...
RandomForest	Average accuracy
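The assignment asks for these experiments to be run in Weka. Purely as a sanity check, a scikit-learn analogue of the same protocol (10 random 66%/34% splits, K from 5 to 10, averaged accuracy) might look like the sketch below. The synthetic data from make_classification, the function name `average_accuracies`, and the forest size of 50 trees are all assumptions standing in for the preprocessed housing data and Weka's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed features and the 5 price classes.
x, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

def average_accuracies(x, y, k_values=range(5, 11), n_repeats=10):
    """Average test accuracy over repeated random 66%/34% splits."""
    knn_scores = {k: [] for k in k_values}
    rf_scores = []
    for seed in range(n_repeats):
        # A new seed per repetition gives a different random split each time.
        x_train, x_test, y_train, y_test = train_test_split(
            x, y, test_size=0.34, random_state=seed)
        for k in k_values:
            knn = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
            knn_scores[k].append(knn.score(x_test, y_test))
        rf = RandomForestClassifier(n_estimators=50, random_state=seed)
        rf_scores.append(rf.fit(x_train, y_train).score(x_test, y_test))
    knn_avg = {k: float(np.mean(v)) for k, v in knn_scores.items()}
    return knn_avg, float(np.mean(rf_scores))

knn_avg, rf_avg = average_accuracies(x, y)
for k, acc in knn_avg.items():
    print(f"KNN K={k}: {acc:.3f}")
print(f"RandomForest: {rf_avg:.3f}")
```

Every K is evaluated on the same ten splits, so differences between the KNN columns reflect the choice of K rather than the luck of a particular split. Numbers from this sketch are not expected to match Weka's output exactly, since the implementations and default settings differ.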

Write a report discussing the performance of KNN and RandomForest. You are encouraged to compare the performance of different missing-value imputation methods or categorical encoding methods.

Attachment:- Data Mining & Warehousing Assignment Files.rar
