Evaluate the prediction performance on the test data

Reference no: EM132364184

Machine Learning Assignment - Problem Solving Task

Learning Outcomes - This assessment assesses the following Unit Learning Outcomes (ULO) -

1. Perform linear regression and linear classification for two or more classes using a logistic regression model.

2. Perform model assessment and selection for linear and logistic regression models.

Purpose - Demonstrate your skills in applying regularized logistic regression to perform two-class and multi-class classification for real-world tasks. You also need to demonstrate your skill in recognizing underfitting and overfitting situations.

Instructions - This is an individual assessment task of at most 20 pages, including all relevant material, graphs, images and tables. Students are required to respond to a series of problem situations related to their analysis techniques. They are also required to provide evidence through articulation of the scenario, application of programming skills and analysis techniques, and to provide a rationale for their responses.

Part 1 - Binary Classification

For this problem, we will use a subset of the Wisconsin Breast Cancer dataset. Note that this dataset has some information missing.

1.1 Data Munging

Cleaning the data is essential when dealing with real-world problems. The training and testing data are stored in the "data/wisconsin_data" folder. Perform the following:

  • Read the training and testing data. Print the number of features in the dataset.
  • For the data label, print the total number of B's and M's in the training and testing data. Comment on the class distribution. Is it balanced or unbalanced?
  • Print the number of features with missing entries (a missing entry is recorded as a feature value of zero).
  • Fill the missing entries. To fill a feature, use either the mean or the median of its observed entries. Explain the reason behind your choice.
  • Normalize the training and testing data.
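The cleaning steps above can be sketched as follows. This is a minimal illustration on a toy NumPy array, not the actual Wisconsin data. The function names `fill_zeros_with_median` and `standardize` are hypothetical, the median is only one of the two allowed choices (picked here as an assumption because it is less sensitive to outliers), and z-score scaling is assumed as the normalization.

```python
import numpy as np

def fill_zeros_with_median(train, test):
    """Treat zeros as missing and fill them with the median of the
    observed (non-zero) training entries of the same feature."""
    train = train.astype(float).copy()
    test = test.astype(float).copy()
    for j in range(train.shape[1]):
        observed = train[train[:, j] != 0, j]
        fill = np.median(observed) if observed.size else 0.0
        train[train[:, j] == 0, j] = fill
        test[test[:, j] == 0, j] = fill
    return train, test

def standardize(train, test):
    """Z-score both sets using statistics from the training set only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return (train - mean) / std, (test - mean) / std

# Toy data standing in for the real feature matrices (assumption).
X_train = np.array([[1.0, 10.0], [0.0, 20.0], [3.0, 0.0], [5.0, 40.0]])
X_test = np.array([[0.0, 30.0], [2.0, 0.0]])

# Number of features that contain at least one missing (zero) entry.
n_missing_features = int((X_train == 0).any(axis=0).sum())
print("features with missing entries:", n_missing_features)

X_train_f, X_test_f = fill_zeros_with_median(X_train, X_test)
X_train_n, X_test_n = standardize(X_train_f, X_test_f)
```

Filling and scaling the test set with training-set statistics keeps information from the test data out of the preprocessing step.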

1.2 Logistic Regression

Train logistic regression models with L1 regularization (alpha = 0.1) and L2 regularization (lambda = 0.1). For each model, report accuracy, precision, recall and f1-score, and print the confusion matrix.
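One possible way to set this up is with scikit-learn, sketched below on synthetic data. The brief does not name a library, so the use of `LogisticRegression` is an assumption. So is the mapping C = 1/alpha (or 1/lambda): scikit-learn parameterizes the penalty by its inverse strength `C`, and the unit's own convention may differ. The helper `fit_and_report` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the cleaned dataset (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

def fit_and_report(penalty, strength):
    """Fit one regularized model and collect the requested test metrics."""
    # Assumption: C is taken as the inverse of the brief's alpha/lambda.
    model = LogisticRegression(penalty=penalty, C=1.0 / strength,
                               solver="liblinear")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }

reports = {
    "L1 (alpha=0.1)": fit_and_report("l1", 0.1),
    "L2 (lambda=0.1)": fit_and_report("l2", 0.1),
}
for name, report in reports.items():
    print(name)
    for key, value in report.items():
        print(f"  {key}: {value}")
```

The `liblinear` solver is used here because it supports both the L1 and the L2 penalty.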

1.3 Choosing the best hyper-parameter

A- For the L1 model, choose the best alpha value from the following set: {0.1, 1, 3, 10, 33, 100, 333, 1000, 3333, 10000, 33333} based on the performance measure P.

B- For the L2 model, choose the best lambda value from the following set: {0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 33} based on the performance measure P.

[Hints: To choose the best hyperparameter (alpha/lambda) value, you have to do the following:

  • For each value of the hyperparameter, perform 10 random splits of the training data into a training set (70%) and a validation set (30%).
  • Use these 10 splits to find the average validation performance P.
  • The best hyperparameter is the one that gives the maximum average validation performance.
  • Performance is defined as: P='accuracy' if fID=0, P='f1-score' if fID=1, P='precision' if fID=2. Calculate fID using the modulus operation fID = SID % 3, where SID is your student ID. For example, if your student ID is 356288, then fID = 356288 % 3 = 2, so use 'precision' for selecting the best value of alpha/lambda.]
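The selection procedure in the hints can be sketched as below, again assuming scikit-learn and the C = 1/strength mapping (neither is stated in the brief). `ShuffleSplit` produces the 10 random 70/30 splits. The names `METRICS`, `average_validation_score` and `select_best` are hypothetical, and the synthetic data only stands in for the real training set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score
from sklearn.model_selection import ShuffleSplit

# fID -> performance measure P, following the hint above.
METRICS = {
    0: accuracy_score,
    1: lambda y_true, y_pred: f1_score(y_true, y_pred, zero_division=0),
    2: lambda y_true, y_pred: precision_score(y_true, y_pred, zero_division=0),
}

def average_validation_score(X, y, penalty, strength, metric, n_splits=10):
    """Average P over n_splits random 70/30 train/validation splits."""
    splitter = ShuffleSplit(n_splits=n_splits, test_size=0.3, random_state=0)
    scores = []
    for train_idx, val_idx in splitter.split(X):
        # Assumption: C is the inverse of the brief's alpha/lambda.
        model = LogisticRegression(penalty=penalty, C=1.0 / strength,
                                   solver="liblinear")
        model.fit(X[train_idx], y[train_idx])
        scores.append(metric(y[val_idx], model.predict(X[val_idx])))
    return float(np.mean(scores))

def select_best(X, y, penalty, candidates, metric):
    """Return the candidate with the highest average validation P."""
    averages = [average_validation_score(X, y, penalty, c, metric)
                for c in candidates]
    return candidates[int(np.argmax(averages))], averages

student_id = 356288          # the example ID from the hint
fid = student_id % 3         # 356288 % 3 == 2, so P is precision
metric = METRICS[fid]

# Synthetic stand-in for the cleaned training data (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
alphas = [0.1, 1, 3, 10, 33, 100, 333, 1000, 3333, 10000, 33333]
lambdas = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 33]

best_alpha, alpha_scores = select_best(X, y, "l1", alphas, metric)
best_lambda, lambda_scores = select_best(X, y, "l2", lambdas, metric)
print("fID:", fid, "best alpha:", best_alpha, "best lambda:", best_lambda)
```

With very strong L1 regularization the model may predict a single class, which makes precision undefined; `zero_division=0` scores that case as 0 instead of raising a warning.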

C- Use the best alpha and lambda values to retrain your final L1- and L2-regularized models. Evaluate the prediction performance on the test data and report the following:

  • Precision and Accuracy
  • The top 5 features selected in decreasing order of feature weights.
  • Confusion matrix
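For the top-5 feature report, one reading is to rank features by the absolute value of their learned weight. That interpretation is an assumption, since the brief only says "decreasing order of feature weights". The sketch below uses a made-up weight vector and placeholder feature names in place of a fitted model's coefficients.

```python
import numpy as np

# Hypothetical feature names and weights; in practice the weights would
# come from the fitted model (for scikit-learn, model.coef_[0]).
feature_names = np.array(["f0", "f1", "f2", "f3", "f4", "f5", "f6"])
weights = np.array([0.2, -1.5, 0.0, 0.9, -0.1, 2.3, 0.4])

# Indices of the five largest weights by magnitude, largest first.
order = np.argsort(-np.abs(weights))[:5]
top5 = [(str(feature_names[i]), float(weights[i])) for i in order]

for name, weight in top5:
    print(f"{name}: {weight:+.2f}")
```

With an L1 penalty many weights are exactly zero, so the ranking also shows which features the model effectively selected.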

Part 2 - Multiclass Classification

For this experiment, we will use a small subset of the MNIST dataset of handwritten digits. This dataset has no missing data. You will have to implement a one-versus-rest scheme to perform multi-class classification using a binary classifier based on L1-regularized logistic regression.

2.1 Read and understand the data, create a default One-vs-Rest Classifier

1- Use the data from the file reduced_mnist.csv in the data directory. Begin by reading the data. Print the following information:

  • Number of data points
  • Total number of features
  • Unique labels in the data

2- Split the data into 70% training data and 30% test data. Fit a One-vs-Rest Classifier (which uses a logistic regression classifier with alpha=1) on the training data, and report accuracy, precision and recall on the test data.
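A default one-vs-rest model of this kind might look like the sketch below. Because reduced_mnist.csv is not available here, scikit-learn's bundled `load_digits` data (8x8 digit images) is used as a stand-in; that substitution, the C = 1/alpha mapping and the macro averaging of precision and recall are all assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Stand-in for reading reduced_mnist.csv (assumption).
X, y = load_digits(return_X_y=True)
n_points, n_features = X.shape
labels = np.unique(y)
print("data points:", n_points)
print("features:", n_features)
print("unique labels:", labels)

# 70% training data, 30% test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

alpha = 1.0
# One L1-regularized binary classifier is trained per class.
base = LogisticRegression(penalty="l1", C=1.0 / alpha, solver="liblinear")
ovr = OneVsRestClassifier(base).fit(X_train, y_train)
y_pred = ovr.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average="macro", zero_division=0)
recall = recall_score(y_test, y_pred, average="macro", zero_division=0)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

`OneVsRestClassifier` fits one binary classifier per label and predicts the label whose classifier is most confident, which matches the one-versus-rest scheme described in the task.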

2.2 Choosing the best hyper-parameter

1- Choose the best value of alpha from the set a={0.1, 1, 3, 10, 33, 100, 333, 1000, 3333, 10000, 33333} by observing the average training and validation performance P. On a graph, plot both the average training performance (in red) and the average validation performance (in blue) against each hyperparameter value. Comment on this graph by identifying regions of overfitting and underfitting. Print the best value of the alpha hyperparameter.

[Hints: To choose the best hyperparameter alpha value, you have to do the following:

  • For each value of the hyperparameter, perform 10 random splits of the training data into a training set (70%) and a validation set (30%).
  • Use these 10 splits to find the average training and validation performance P.
  • Select the best hyperparameter from the plot that shows both the average training and the average validation performance against the alpha value. While selecting the best alpha value, you should consider the concepts of overfitting and underfitting.
  • Performance is defined as: P='accuracy' if fID=0, P='f1-score' if fID=1, P='precision' if fID=2. Calculate fID using the modulus operation fID = SID % 3, where SID is your student ID. For example, if your student ID is 356288, then fID = 356288 % 3 = 2, so use 'precision' for selecting the best value of alpha.]
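The training-versus-validation curve described in the hints could be produced along the following lines. This sketch assumes scikit-learn and matplotlib, uses accuracy as P (the fID=0 case; the other measures would need an averaging choice for multiple classes), keeps the C = 1/alpha mapping as an assumption, and runs on small synthetic three-class data instead of the MNIST subset. The function name `average_scores` and the log-scaled axis are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit
from sklearn.multiclass import OneVsRestClassifier

def average_scores(X, y, alpha, n_splits=10):
    """Average training and validation accuracy over random 70/30 splits."""
    splitter = ShuffleSplit(n_splits=n_splits, test_size=0.3, random_state=0)
    train_scores, val_scores = [], []
    for train_idx, val_idx in splitter.split(X):
        base = LogisticRegression(penalty="l1", C=1.0 / alpha,
                                  solver="liblinear")
        model = OneVsRestClassifier(base).fit(X[train_idx], y[train_idx])
        train_scores.append(
            accuracy_score(y[train_idx], model.predict(X[train_idx])))
        val_scores.append(
            accuracy_score(y[val_idx], model.predict(X[val_idx])))
    return float(np.mean(train_scores)), float(np.mean(val_scores))

# Small synthetic three-class problem standing in for the data (assumption).
X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)
alphas = [0.1, 1, 3, 10, 33, 100, 333, 1000, 3333, 10000, 33333]

results = [average_scores(X, y, a) for a in alphas]
train_avg = [r[0] for r in results]
val_avg = [r[1] for r in results]

fig, ax = plt.subplots()
ax.plot(alphas, train_avg, color="red", marker="o", label="average training P")
ax.plot(alphas, val_avg, color="blue", marker="o", label="average validation P")
ax.set_xscale("log")  # alpha spans several orders of magnitude
ax.set_xlabel("alpha")
ax.set_ylabel("P (accuracy)")
ax.legend()

# Starting point only: the alpha with the highest average validation score.
best_alpha = alphas[int(np.argmax(val_avg))]
print("best alpha by validation score:", best_alpha)
```

A large gap between the red and blue curves at small alpha suggests overfitting, while both curves dropping together at large alpha suggests underfitting. Call `fig.savefig(...)` or `plt.show()` to view the figure.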

2- Use the best alpha and all the training data to build the final model, then evaluate the prediction performance on the test data and report the following:

  • The confusion matrix
  • Precision, recall and accuracy for each class.
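Per-class figures can be read directly off the confusion matrix, as in the sketch below. The matrix is made up for illustration, with rows as true labels and columns as predicted labels. "Accuracy for each class" is interpreted here as one-vs-rest accuracy, (TP + TN) / total, which is an assumption about what the brief intends.

```python
import numpy as np

# Made-up 3-class confusion matrix: rows = true class, columns = predicted.
cm = np.array([[5, 1, 0],
               [2, 6, 1],
               [0, 1, 4]])

tp = np.diag(cm)              # correctly predicted samples of each class
fp = cm.sum(axis=0) - tp      # predicted as the class but actually another
fn = cm.sum(axis=1) - tp      # belonging to the class but predicted as another
tn = cm.sum() - tp - fp - fn  # everything else

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / cm.sum()  # one-vs-rest accuracy per class (assumption)

for k in range(cm.shape[0]):
    print(f"class {k}: precision={precision[k]:.3f} "
          f"recall={recall[k]:.3f} accuracy={accuracy[k]:.3f}")
```

If a class is never predicted, `tp + fp` is zero and the precision for that class needs a guard against division by zero.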

3- Discuss whether there is any sign of underfitting or overfitting, with appropriate reasoning.


SIT720 Machine Learning Assignment - Problem Solving Task, Deakin University, Australia.


9/1/2019 9:53:52 PM

Submission details - Deakin University has a strict standard on plagiarism as part of Academic Integrity. The late submission penalty is 5% per 24 hours, and no submission is marked after 5 days. Be sure to downsize the photos in your report before submission so that your file uploads in time.


All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd