Prebuilt tool and implementation in a data science project

Reference no: EM133530039, Length: 10 pages

Data Cleaning and Analytics

Purpose of Assignment

This assignment aims at evaluating students' achievement of the following unit learning outcomes:

1. Explain the key concepts, techniques, and tools for cleaning the data and creating prediction models.

2. Work on feature and model selection with a bit of discovery of the prebuilt tool and implementation in a data science project.

Introduction

The dataset contains 100,000 tuples of 3 different financial credit score classes. There are 24 attributes included in the source data. We have two goals in this assignment: the first is cleaning and preparing the data for later use, and the second is building two predictive models to predict the "Credit Score" class. You are expected to follow the instructions for making your predictive model and answer questions.

Assignment Goal
This assignment aims to give students experience in cleaning a dataset, splitting the data into training and test sets, training usable predictive models, and explaining the outputs. A small discovery and research component is included to expand the students' skill set.

Assignment Task
The dataset contains messy values because it was collected from the real world. Your tasks are to clean the data and create the predictive models according to the instructions, and to answer the questions listed below. The source file is "data_2023.csv". The report should be prepared using the template and should answer the questions. A table of contents is not required.

Data Cleaning
You must follow the instructions to clean and split the given data set into training and test sets. Remember, a well-split dataset is the foundation for model training and testing. It is estimated that you will need around 30 nodes for data cleaning and partitioning before sending the partitioned data into the predictive models. Suggested nodes include "File Reader," "Column Filter," "Rule-based Row Filter," "String Manipulation," "Math Formula," "Math Formula (Multi Column)," "Rule Engine," "Missing Value," "Shuffle," "Numerical Binner," "Feature Selection Loop Start (1:1)," and "Partitioning." You may see a warning sign on the "Missing Value" node stating, "The current settings use missing value handling methods that cannot be represented in PMML 4.2." This is normal; you can ignore it because we are not using PMML in the assignment.

Naive Bayes Model
After partitioning the cleaned data into training and test sets, build a Naive Bayes classifier to predict the "Credit Score."

Random Forest Model
After partitioning the cleaned data into training and test sets, build a random forest classifier to predict the "Credit Score."

There are 100 marks on this assignment. Your proposal must address the following tasks.

Question 1. Follow the instructions to clean the data and answer the questions. If any of the nodes you use in the workflow has a random seed, set the seed to 3122 to fix the random state.

1) Our goal is to predict the credit score from the given data. There is/are one (or multiple) attribute(s) which is/are significantly irrelevant to the goal. Exclude the attribute(s) and give a persuasive rationale for that. The excluded attribute(s) is (are) ____, and the reason(s) for removing it (them) is (are) ____.

2) After removing the selected attribute(s), let's start to remove tuples containing missing values.
Remove tuples only if any of the attributes listed below have missing values: "Month," "Age," "Occupation," "Annual Income," "Num Bank Accounts," "Num Credit Card," "Interest Rate," "Num of Loan," "Delay from due date," "Changed Credit Limit," "Credit Mix," "Outstanding debt," "Credit Utilization Ratio," "Credit History Age," "Payment of Min Amount," "Total EMI per month," "Amount invested monthly," and "Payment Behaviour." Moreover, some tuples with infeasible values in the attributes, such as "Monthly Inhand Salary" < 0, "Num Bank Accounts" < 0, "Num Credit Card" < 0, and "Changed Credit Limit" contains " ", should also be removed. List the node(s) (in sequence) and the corresponding command(s) used in this process.
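
As a cross-check outside KNIME, the row-removal rule can be sketched in plain Python. This is only an illustration under assumptions: the column lists are shortened, the sample rows are made up, and missing values are represented as None. It is not the node configuration the question asks for.

```python
# Columns whose missing values trigger row removal (shortened for brevity).
REQUIRED = ["Month", "Age", "Occupation", "Annual Income"]
# Columns whose negative values are treated as infeasible.
NON_NEGATIVE = ["Monthly Inhand Salary", "Num Bank Accounts", "Num Credit Card"]

def keep_row(row):
    """Return True if the row passes the missing-value and feasibility checks."""
    if any(row.get(col) is None for col in REQUIRED):
        return False
    for col in NON_NEGATIVE:
        value = row.get(col)
        # A missing value here is allowed; only negative numbers are rejected.
        if value is not None and value < 0:
            return False
    return True

# Made-up sample rows, for illustration only.
base = {"Month": "January", "Age": "23", "Occupation": "Scientist",
        "Annual Income": "19114.12", "Monthly Inhand Salary": 1824.8,
        "Num Bank Accounts": 3, "Num Credit Card": 4}
rows = [
    base,
    {**base, "Age": None},                    # missing in a required column
    {**base, "Num Bank Accounts": -1},        # infeasible value
    {**base, "Monthly Inhand Salary": None},  # missing, but not in the required list
]
kept = [row for row in rows if keep_row(row)]
```

Note that "Monthly Inhand Salary" is not in the list of attributes that trigger removal when missing, so the last sample row is kept.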

3) Check for the "Age" attribute to eliminate symbols that are not numbers to recover the data into the usual number format. Moreover, drop the tuples whose "Age" value is lower than or equal to 0 or greater than 120. List the node(s) (in sequence) and the corresponding command(s) used in this process.
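
The intended effect of the "Age" cleaning can be sketched in Python as follows. The function name clean_age is hypothetical, and keeping a leading minus sign (so that negative ages are dropped by the range check instead of being turned into positive numbers) is an assumption about how the rule should be read.

```python
import re

def clean_age(raw):
    """Strip non-numeric symbols from an Age string and validate the range.

    Returns the age as an int, or None if the tuple should be dropped.
    """
    # Keep digits and minus signs only; everything else is treated as noise.
    text = re.sub(r"[^0-9-]", "", raw)
    if not re.fullmatch(r"-?\d+", text):
        return None  # nothing usable is left after stripping
    age = int(text)
    # Valid ages are greater than 0 and at most 120.
    return age if 0 < age <= 120 else None
```

The same strip-then-convert pattern applies to the other numeric columns in the later steps, with float() in place of int() where the double format is required.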

4) Remove the non-numerical symbol in the "Annual Income" column and convert it to the double format. List the node(s) (in sequence) and the corresponding command(s) used in this process.

5) Convert the " " in the "Occupation" attribute to Null. Please note that Null is different from an empty string. Remove the non-numerical symbol in "Num of Loan" and convert it to integer data type. Take absolute values of attributes "Num Bank Accounts" and "Num Credit Card." Set values to 0 for the "Num of Loan" attribute if the original values are negative. Remove the non-numerical symbol in "Num of Delayed Payment" and convert it into integer format. Set the "Credit Mix" value to "Unknow" if the original value is " ". Remove the non-numerical symbol in "Outstanding Debt" and convert it into the double format. List the node(s) (in sequence) and the corresponding command(s) used in this process.

6) Convert the "Credit History Age" to the count of months and store it in the integer format. For example, if the original value from a tuple is "22 Years and 1 Months", the value will be 265 after the conversion (22 * 12 + 1 = 265). Store the converted result in a new attribute called "Total CHA." List the node(s) (in sequence) and the corresponding command(s) used in this process.
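
The conversion arithmetic can be checked with a short Python sketch. The function name total_cha and the regular expression are assumptions made for illustration; the pattern expects the "N Years and M Months" wording shown in the example.

```python
import re

# Matches strings such as "22 Years and 1 Months".
CHA_PATTERN = re.compile(r"(\d+)\s+Years?\s+and\s+(\d+)\s+Months?")

def total_cha(raw):
    """Convert a Credit History Age string to a total number of months."""
    match = CHA_PATTERN.fullmatch(raw.strip())
    if match is None:
        return None  # value does not follow the expected wording
    years, months = (int(group) for group in match.groups())
    return years * 12 + months

print(total_cha("22 Years and 1 Months"))  # 265
```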

7) Remove the non-numerical symbol in "Amount invested monthly" and convert it to the double format. Set the value to "Unknow" if the original value in "Payment Behaviour" attribute starts with "! @". Remove the non-numerical symbol in "Monthly Balance" and convert it to the double format. Convert "Changed Credit Limit" into the double format. List the node(s) (in sequence) and the corresponding command(s) used in this process.

8) Use the "Missing Value" node and use the "Next Value" option to replace missing values in all string type attributes. Use the "Previous Value" option in the same node to replace missing values in any numerical format. If the value of "Monthly Balance" is negative, replace the value with 0. Screenshot the pop-up window with the correct settings.
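
What the two replacement strategies do to a column can be sketched in Python. The function names are hypothetical, and leaving a value missing when no neighbour exists (a leading gap for "Previous Value", a trailing gap for "Next Value") is an assumption, not a statement about the node's exact behaviour.

```python
def fill_previous(values):
    """Replace each None with the closest non-missing value above it."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled

def fill_next(values):
    """Replace each None with the closest non-missing value below it."""
    # Filling forwards on the reversed column is the same as filling backwards.
    return fill_previous(values[::-1])[::-1]

def clamp_balance(value):
    """Replace a negative Monthly Balance with 0."""
    return max(value, 0)
```

For example, fill_next(["a", None, "b"]) gives ["a", "b", "b"], and fill_previous([1, None, 3]) gives [1, 1, 3].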

9) Simplify the "Type of Loan" attribute. If the original content has more than one type separated by a comma, keep only the first part. Otherwise, keep the full description if there is no comma included. For example, "Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan" will become "Auto Loan", "Credit-Builder Loan" will still be "Credit-Builder Loan", and "Not Specified, Auto Loan, and Student Loan" will become "Not Specified" after the process. List the node(s) (in sequence) and the corresponding command(s) used in this process.
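
The three examples above can be reproduced with a one-line Python sketch; the function name first_loan_type is hypothetical.

```python
def first_loan_type(raw):
    """Keep only the text before the first comma, or the whole string if there is none."""
    return raw.split(",", 1)[0].strip()

examples = [
    "Auto Loan, Credit-Builder Loan, Personal Loan, and Home Equity Loan",
    "Credit-Builder Loan",
    "Not Specified, Auto Loan, and Student Loan",
]
simplified = [first_loan_type(text) for text in examples]
```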

10) Bin the "Changed Credit Limit" attribute with six bins of ranges: (-∞, -3.0), [-3.0, 0), [0, 3.0), [3.0, 6.0), [6.0, 7.5), and [7.5, ∞) and put the result into a new attribute called "Changed Credit Limit binned". Screenshot the pop-up window with the correct settings of your binner.
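
The boundary behaviour of the six bins (closed on the left, open on the right) can be checked with a small Python sketch. It assumes the edges -3.0, 0, 3.0, 6.0, and 7.5; the label strings and the function name are made up for illustration.

```python
from bisect import bisect_right

# Interior bin edges, in ascending order.
EDGES = [-3.0, 0.0, 3.0, 6.0, 7.5]
LABELS = ["(-inf, -3.0)", "[-3.0, 0)", "[0, 3.0)",
          "[3.0, 6.0)", "[6.0, 7.5)", "[7.5, inf)"]

def bin_changed_credit_limit(value):
    """Return the label of the left-closed, right-open bin containing value."""
    # bisect_right counts the edges that are <= value, so a value equal to
    # an edge falls into the bin that starts at that edge.
    return LABELS[bisect_right(EDGES, value)]
```

A value of exactly -3.0 therefore lands in [-3.0, 0), and exactly 7.5 lands in [7.5, inf).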

11) Remove all temporarily created or useless attributes. Use the "Feature Selection Loop Start (1:1)" node to select the features. The class label should be excluded from the features in the feature selection node. The Genetic Algorithm is specified to be the feature selection strategy with default population size and the maximum number of generations. Again, 3122 should be used as the static random seed. After selecting features, shuffle the data with seed 3122. The data should be partitioned by "Linear sampling", with 75% data in the training set and 25% in the test set. How many tuples and attributes (excluding the class label) are in the training set at the end? [5 marks]
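
The idea behind a 75/25 split with evenly spaced rows can be sketched in Python. This is an approximation written for illustration and is not claimed to reproduce the Partitioning node's exact row choice; the function name linear_partition is hypothetical. The shuffle step is left out because a Python random generator seeded with 3122 would not reproduce KNIME's ordering.

```python
def linear_partition(n_rows, train_fraction=0.75):
    """Split row indices into evenly spaced training rows and the remaining test rows.

    The first and the last row always go to the training set.
    """
    k = round(n_rows * train_fraction)  # size of the training set
    if k < 2 or k > n_rows:
        raise ValueError("need at least two training rows and at most n_rows")
    # Integer arithmetic keeps the picks distinct: the spacing is at least 1.
    train = [(i * (n_rows - 1)) // (k - 1) for i in range(k)]
    picked = set(train)
    test = [i for i in range(n_rows) if i not in picked]
    return train, test

train_idx, test_idx = linear_partition(8)
```

With 8 rows this keeps 6 rows for training and 2 for testing; the number of training tuples in the assignment depends on how many rows survive the cleaning steps.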

Question 2. Build a Naive Bayes classifier using the training and test sets created in the previous task. Answer the following questions after completing the model training and test.
1) Give a screenshot of the Naive Bayes classifier in the KNIME workflow. You can take the screenshot starting from the partitioning node output to the scorer at the end of the Naive Bayes classifier part.
2) The default probability should be 0.0001, the minimum standard deviation is 0.0001, the threshold standard deviation is 0, and the maximum number of unique nominal values per attribute should be set to 600 in the classifier. Screenshot the setting dialogue of your Naive Bayes Learner.
3) Screenshot the confusion matrix and the Accuracy statistics of the test result. If the bank wants to minimise the risk of lending money to customers, the "Good" in "Credit Score" should be the major target. Based on the current result, does the classifier perform satisfactorily?
4) Which measurement should we look at to interpret your conclusion in this case?
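
To make the per-class measurements concrete, the sketch below computes precision and recall from a confusion matrix in Python. The matrix values are toy numbers, not results from this dataset; the class names "Standard" and "Poor" are assumptions (only "Good" is named in the brief); and the layout (rows are actual classes, columns are predicted classes) is also an assumption.

```python
def per_class_metrics(matrix, labels):
    """Compute precision and recall per class.

    matrix[i][j] is the number of rows with actual class i predicted as class j.
    """
    metrics = {}
    for k, label in enumerate(labels):
        true_positive = matrix[k][k]
        predicted_as_k = sum(row[k] for row in matrix)  # column total
        actually_k = sum(matrix[k])                     # row total
        precision = true_positive / predicted_as_k if predicted_as_k else 0.0
        recall = true_positive / actually_k if actually_k else 0.0
        metrics[label] = {"precision": precision, "recall": recall}
    return metrics

labels = ["Good", "Standard", "Poor"]  # "Standard" and "Poor" are assumed names
toy_matrix = [
    [80, 15, 5],     # actual Good
    [20, 150, 30],   # actual Standard
    [0, 25, 75],     # actual Poor
]
toy_metrics = per_class_metrics(toy_matrix, labels)
```

Precision for "Good" is the share of customers predicted "Good" who really are "Good"; recall for "Good" is the share of actual "Good" customers that the model finds.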

Question 3. Build a random forest classifier using the training and test sets created in the previous task. Answer the following questions after completing the model training and test. Use the information gain ratio as the split criterion and 3122 as the static random seed to build the random forest model.

1) Give a screenshot of the random forest classifier in the KNIME workflow. You can take the screenshot starting from the partitioning node output to the scorer at the end of the random forest classifier part.

2) Screenshot the confusion matrix and the Accuracy statistics of the test result.

3) If the bank wants to minimise the risk of lending money to customers, the "Good" in "Credit Score" should be the major target. Compare the measurements between random forest results and Naive Bayes results. Which model presents a more suitable result? Which measure should be used to make the comparison?

4) On which class does the built random forest model perform best? What measurement(s) should we look at to find the answer?



All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd