What is the predictor that first splits the tree

Reference no: EM131963691

Assignment: Decision Tree and Naïve Bayes

Build Decision Tree Model

Packages required: Install and load the C50, caret, and rminer packages.

Data: The data are taken from Shmueli et al. (2010). The data set consists of 2201 airplane flights in January 2004 from the Washington DC area into the NYC area. The characteristic of interest (the response) is whether or not a flight has been delayed by more than 15 min.

The explanatory variables include three different arrival airports (Kennedy, Newark, and LaGuardia); three different departure airports (Reagan, Dulles, and Baltimore); eight carriers; a categorical variable for 16 different hours of departure (6 am to 10 pm); weather condition (0 = good and 1 = bad); and day of week (1 = Monday, 2 = Tuesday, 3 = Wednesday, ... , 6 = Saturday and 7 = Sunday). The objective here is to identify flights that are likely to be delayed.

Tasks:

1) Import and explore data

a. Open FlightDelay.csv and store the results in a data frame, e.g., called datFlight. All of the character values should be imported as factors. Convert specific numeric variables, such as weather condition, day of week and day of month, into factors.

b. Use the str() and summary() commands to provide a listing of the imported columns and their basic statistics. Make sure that the data types are imported as expected.

2) Prepare data for classification

a. Using a seed of 100, randomly select 60% of the rows into training (e.g. called traindata). Divide the other 40% of the rows evenly into two holdout test/validation sets (e.g., called testdata1 and testdata2).

b. Inspect (show) the distributions of the target variable in the subsets. They should preserve the distribution of the target variable in the whole data set.
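The assignment itself is meant to be done in R (for example with caret); the following is only a minimal Python sketch of the idea behind tasks 2a and 2b, namely a stratified 60/20/20 split followed by a check that each subset keeps the class proportions. The label values, the function names `stratified_split` and `class_proportions`, and the 800/200 class counts are hypothetical and are not taken from FlightDelay.csv. A seed of 100 here will not reproduce the rows that R selects with the same seed.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=100):
    """Return one list of row indices per fraction, sampled within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    parts = [[] for _ in fractions]
    for indices in by_class.values():
        rng.shuffle(indices)
        start = 0
        for k, frac in enumerate(fractions):
            if k == len(fractions) - 1:
                end = len(indices)  # last part takes the remainder, so no row is dropped
            else:
                end = start + round(frac * len(indices))
            parts[k].extend(indices[start:end])
            start = end
    return parts


def class_proportions(labels, indices):
    """Share of each class among the given row indices."""
    counts = Counter(labels[i] for i in indices)
    total = sum(counts.values())
    return {cls: counts[cls] / total for cls in sorted(counts)}


# Hypothetical labels (NOT the real data): 800 ontime and 200 delay rows.
labels = ["ontime"] * 800 + ["delay"] * 200
traindata, testdata1, testdata2 = stratified_split(labels)

print("all", len(labels), class_proportions(labels, range(len(labels))))
for name, part in [("traindata", traindata), ("testdata1", testdata1), ("testdata2", testdata2)]:
    print(name, len(part), class_proportions(labels, part))
```

Because sampling is done separately inside each class, the proportions in the three subsets match the full data set up to rounding, which is what task 2b asks you to verify.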

3) C5.0 decision tree classifiers

a. Build/train a tree model

i. Build the tree using the C5.0 function from the C50 package with default settings.

ii. Show the (textual) model/tree.

iii. How many leaves are in the tree? (Note: In C50, the size of tree is the number of leaves. In J48, the size of the tree is the number of nodes, and J48 also provides the number of leaves.)

iv. What is the predictor that first splits the tree?

b. Find rules (paths) in the tree

i. Find one path in the tree to a leaf node that is classified to ontime. Starting with the condition on the first (or top) branch of the path, write down the conditions on the tree branches belonging to this path. Enclose a condition in a pair of parentheses and precede it with "If" - e.g.

If (house <= 600469),..., and (income <57578), then STAY

ii. How many conditions and how many unique predictors are in your selected rule?

iii. What is this rule's misclassification error rate (e.g., 20/50 misclassified)?

iv. Similarly, describe a rule that classifies an instance to delay.

v. What is this rule's misclassification error?

vi. Find a rule for ontime that is shorter or longer (has fewer or more conditions) than the previous rules. Repeat this for delay. Show these two rules and their misclassification errors.

vii. What are the reasons that long rules are included in a decision tree model?

viii. What is the disadvantage of a long rule?

c. Apply and evaluate the trained model to two hold-out testing sets, one set at a time. The process for each data set includes:

i. Generate predictions (i.e. estimations) of the values of the target variable for the testing instances.

ii. Generate a confusion matrix that shows the counts of true-positive, true-negative, false-positive and false-negative predictions for both testdata1 and testdata2. Consider Ontime as positive class.

iii. Generate seven performance metrics: accuracy (percent of all testing instances that are correctly classified), plus precision (percent of instances predicted to belong to a class that actually belong to it), recall (also called true positive rate) and F-measure (also called F-score) for Ontime and for Delay, respectively. (Note: References to performance metrics in the rest of the assignment refer to these seven metrics or to a set of metrics that includes them.)

iv. Report all performance differences in the same performance metric between the two data sets that are more than 10%. Does this tree generalize well over these two testing sets? Explain the reason for your answer.
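As a way to check the arithmetic in tasks 3c.ii to 3c.iv by hand, here is a minimal Python sketch (the assignment itself expects R, for example rminer). It derives the seven metrics from the four confusion-matrix counts with Ontime as the positive class, then lists the metrics that differ by more than 10 percentage points between two test sets. The function names and all counts are hypothetical, and reading "more than 10%" as an absolute difference of 0.10 is an assumption.

```python
def class_metrics(hits, false_alarms, misses):
    """Precision, recall and F1 for one class."""
    predicted = hits + false_alarms
    actual = hits + misses
    precision = hits / predicted if predicted else 0.0
    recall = hits / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def seven_metrics(tp, tn, fp, fn):
    """Ontime is positive: tp = ontime predicted ontime, fn = ontime predicted delay,
    fp = delay predicted ontime, tn = delay predicted delay."""
    total = tp + tn + fp + fn
    p_on, r_on, f_on = class_metrics(tp, fp, fn)
    # For the Delay class the roles swap: tn are its hits, fn its false alarms, fp its misses.
    p_de, r_de, f_de = class_metrics(tn, fn, fp)
    return {
        "Accuracy": (tp + tn) / total,
        "Precision_Ontime": p_on, "Recall_Ontime": r_on, "F1_Ontime": f_on,
        "Precision_Delay": p_de, "Recall_Delay": r_de, "F1_Delay": f_de,
    }


def large_gaps(metrics_a, metrics_b, threshold=0.10):
    """Metrics whose absolute difference between two test sets exceeds the threshold."""
    return {
        name: abs(metrics_a[name] - metrics_b[name])
        for name in metrics_a
        if abs(metrics_a[name] - metrics_b[name]) > threshold
    }


# Hypothetical confusion-matrix counts for two test sets (NOT real results).
m1 = seven_metrics(tp=320, tn=40, fp=50, fn=30)
m2 = seven_metrics(tp=300, tn=20, fp=70, fn=50)

for name in m1:
    print(f"{name:17s} test1={m1[name]:.3f} test2={m2[name]:.3f}")
print("gaps over 10 points:", sorted(large_gaps(m1, m2)))
```

With an imbalanced target, the Delay metrics can move a lot between test sets even when accuracy barely changes, which is why the task asks for per-class metrics rather than accuracy alone.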

4) C50 pruning

a. Build another C50 tree using the train set by changing the confidence factor to 0.05 (i.e. CF=0.05 in C50 function's control).

b. Describe the size of the tree built.

c. Generate predictions, confusion matrices and performance metrics using two test sets.

d. Report all performance differences in the same performance metric between the two data sets that are more than 10%. Does this tree generalize well over these two testing sets? Explain the reason for your answer.

e. Would you adopt this pruning setting? Why or why not?

5) Returning to the default pruning setting, build another C50 tree with only two predictors of your choice.

a. Build a tree using the predictors of your choice in the train set.

b. Describe the size of the tree built.

c. Generate predictions, confusion matrices and performance metrics using two test sets.

d. Report all performance differences in the same performance metric between the two data sets that are more than 10%. Does this tree generalize well over these two testing sets?

Build Naïve Bayes Model

1) e1071 naiveBayes classifiers

a. Prepare FlightDelay for building and evaluating Naïve Bayesian classifiers. Load the caret package. Using seeds of 100, 500 and 900, randomly select 67% of the rows three times into three training sets and save the other 33% in three testing sets, respectively. Calculate the average number of examples in the testing sets.

b. Use a for loop to build and understand e1071 naiveBayes models with all predictors for delay.

i. Load the e1071 and rminer packages.

ii. Build a Naïve Bayesian model using the naiveBayes function in e1071 with each training set.

iii. Show each model. What are the values of A-priori probabilities - P(Delay) for the delay class and P(Ontime) for the ontime class for each model?

iv. Generate predictions (i.e. estimations) of the values of the target variable for instances in each testdata.

v. Save the values of TP, TN, FP, FN and calculate the average of these four values after the loop.

vi. Save the value of performance metrics of three models on their corresponding testing samples and fill out the following table.

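| Model | Accuracy | Precision_Delay | Precision_Ontime | Recall_Delay | Recall_Ontime | F1(Delay) | F1(Ontime) |
|---|---|---|---|---|---|---|---|
| Model1 | | | | | | | |
| Model2 | | | | | | | |
| Model3 | | | | | | | |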
Cost Sensitive Learning

1) Imbalanced target variable class distribution

a. What is the class distribution (proportions) of the target variable in the original FlightDelay dataset? Which one is the majority class (more instances) and which one is the minority class (fewer instances)?

b. A simple majority_rule model always classifies all instances as the majority class, which is the class that has more instances in a data set. This rule is a heuristic (man-made) rule. (No code is needed for this question.)

i. Use the majority_rule model to classify all of the instances in FlightDelay.csv. How many TP, TN, FP and FN will this model generate? What is the accuracy rate of applying this model to FlightDelay.csv?
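The task says no code is needed, but the counting logic can be sketched in a few lines of Python to check a hand calculation. The function name `majority_rule_confusion` and the 700/300 class counts are hypothetical; substitute the counts you actually find in FlightDelay.csv. Ontime is treated as the positive class, as in the earlier tasks.

```python
def majority_rule_confusion(class_counts, positive="ontime"):
    """Confusion counts when every instance is predicted as the majority class."""
    majority = max(class_counts, key=class_counts.get)
    total = sum(class_counts.values())
    n_pos = class_counts[positive]
    n_neg = total - n_pos

    if majority == positive:
        # Everything is predicted positive: all positives are TP, all negatives are FP.
        tp, fn, fp, tn = n_pos, 0, n_neg, 0
    else:
        # Everything is predicted negative: all negatives are TN, all positives are FN.
        tp, fn, fp, tn = 0, n_pos, 0, n_neg

    return {
        "majority": majority,
        "TP": tp, "FN": fn, "FP": fp, "TN": tn,
        "accuracy": (tp + tn) / total,
    }


# Hypothetical class counts (NOT the real distribution of FlightDelay.csv).
result = majority_rule_confusion({"ontime": 700, "delay": 300})
print(result)
```

The accuracy of the majority rule always equals the majority-class proportion, which makes it a useful baseline for judging the trained classifiers.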

2) Cost-benefit calculations and cost-sensitive models using all of the predictors

a. Using the mean values of TP, FP, TN and FN from three C50 classifier testing results and the average number of test instances over all three test sets, calculate and print the average net-benefit per flight over all three testing results. Assume the following cost and benefit factors.

o Cost of sending notification message to a classified delay (Predicted as Delay): $50 per flight.
o Loss of delay waiting: $1000 for providing food and hotel service for customers per flight.
o Benefits of predicting a correct delay flight: $500
o No additional benefits from correctly classifying actual on time flight.

| Predicted as | Actual: On Time | Actual: Delay |
|---|---|---|
| On Time | 0 | -1000 |
| Delay | -50 | 500 |

b. Using the mean values of TP, FN, TN and FP from three Naïve Bayes classifier testing results and the average number of test instances over all three test sets, calculate and print the average net-benefit per customer over all three testing results. Assume the same cost and benefit factors.

c. Create a cost matrix that sets the cost of misclassifying a delayed flight as an on time flight to be 10 times the cost of misclassifying an on time flight as a delayed flight.

d. In a for loop, build, predict and evaluate C50 classifiers using this cost matrix with three pairs of train and test sets. These are C50 cost-sensitive classifiers. Print the performance metrics for each testing set as well as the average value of each performance metric over the three testing sets. Generate a confusion matrix for each test set. Calculate and save the mean values of TP, FP, TN and FN over the three confusion matrices of testing results.

e. Using the mean values of TP, FN, TN and FP from three C50 cost-sensitive classifier testing results and the average number of test instances over all three test sets, calculate and print the average net-benefit per customer over all three testing results. Assume the same cost and benefit factors.
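The net-benefit calculation in tasks 2a, 2b and 2e is the same arithmetic each time, so it can be sketched once. This is a minimal Python illustration (the assignment itself expects R): the payoff values come from the cost and benefit table above, while the function names and the three confusion matrices are hypothetical. It assumes Ontime is the positive class, so FN means an on time flight predicted as delay and FP means a delayed flight predicted as on time.

```python
# Payoff per flight, keyed by (predicted, actual), taken from the cost/benefit table.
PAYOFF = {
    ("ontime", "ontime"): 0,      # TP: no additional benefit
    ("ontime", "delay"): -1000,   # FP: delay was not predicted, food and hotel costs
    ("delay", "ontime"): -50,     # FN: notification sent for a flight that was on time
    ("delay", "delay"): 500,      # TN: delay correctly predicted
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def net_benefit_per_flight(confusions):
    """Average net benefit per flight from several confusion matrices (Ontime positive)."""
    mean_tp = mean(c["TP"] for c in confusions)
    mean_fn = mean(c["FN"] for c in confusions)
    mean_fp = mean(c["FP"] for c in confusions)
    mean_tn = mean(c["TN"] for c in confusions)
    mean_n = mean(c["TP"] + c["FN"] + c["FP"] + c["TN"] for c in confusions)

    total = (
        mean_tp * PAYOFF[("ontime", "ontime")]
        + mean_fp * PAYOFF[("ontime", "delay")]
        + mean_fn * PAYOFF[("delay", "ontime")]
        + mean_tn * PAYOFF[("delay", "delay")]
    )
    return total / mean_n


# Hypothetical confusion matrices from three test sets (NOT real results).
confusions = [
    {"TP": 500, "FN": 50, "FP": 80, "TN": 70},
    {"TP": 510, "FN": 40, "FP": 90, "TN": 60},
    {"TP": 490, "FN": 60, "FP": 70, "TN": 80},
]
benefit = net_benefit_per_flight(confusions)
print(f"average net benefit per flight: {benefit:.2f}")
```

Running the same function on the results of the default C50, Naïve Bayes and cost-sensitive C50 classifiers gives three directly comparable figures. Because a missed delay costs far more than an unnecessary notification, a model with lower accuracy can still produce a higher net benefit.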
