Evaluate menu to determine which is the best tool

Reference no: EM132655837

Classification Using Rattle

A hypothetical Melbourne suburb was surveyed with a view to its redevelopment potential. In particular there was interest in finding adjacent properties for more intensive redevelopment. Only 2887 (2.5%) of more than 45,000 properties were redeveloped between 2004 and 2009, making this a relatively rare event. Our goal is to predict which 2004 properties were redeveloped between 2004 and 2009 based on various 2004 variables and recent changes in the immediate neighbourhood. The data should be partitioned 70:15:15 for training, validation and testing. Only a few of the available variables are considered in this assignment. Please include all outputs for each question.

| Names of Variables | Measurement Scale | Example Property (*) | Description |
|---|---|---|---|
| DwellingsConstructed_200m | Interval | 4 | Number of dwellings constructed within 200m between 2000 and 2004 |
| NetDwellingIncrease_200m | Interval | 3 | Increase in number of dwellings within 200m between 2000 and 2004 |
| redevPotIndex_2004 | Interval | .025 | 2004 assessment of redevelopment potential based on property dimensions |
| strata | Binary | 0 | Strata housing (1=yes, 0=no) |
| BuildingProjects_200m | Interval | 2 | Number of building projects within 200m between 2000 and 2004 |
| Demolitions_200m | Interval | 1 | Number of demolitions within 200m between 2000 and 2004 |
| Road Frontage(m) | Interval | 20 | Length of road frontage |
| Redeveloped 2004-2009 | Binary | ? | The response/target variable, coded equal to one for properties redeveloped between 2004 and 2009, 0 otherwise. |

a) The redevelop.csv data contains data for a random sample of the properties that were not redeveloped and all the properties that were redeveloped, resulting in a data set containing a total of only 7409 properties.
i) Why was only a random sample of the properties that were not redeveloped between 2004 and 2009 chosen?
ii) What else could have been done to achieve a similar effect?
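As background for part (a), the class balance in the supplied sample can be checked with simple arithmetic. The sketch below uses only the two counts quoted in this brief (2887 redeveloped properties, 7409 properties in redevelop.csv); the variable names are illustrative and the calculation is done in Python rather than in Rattle.

```python
# Counts quoted in the assignment brief.
redeveloped = 2887          # all redeveloped properties are kept
total_in_sample = 7409      # rows in redevelop.csv

# The remainder must be the random sample of non-redeveloped properties.
not_redeveloped_sampled = total_in_sample - redeveloped

# Share of redeveloped properties in the sample, as a percentage.
share_redeveloped = 100 * redeveloped / total_in_sample

print("Non-redeveloped properties in sample:", not_redeveloped_sampled)
print("Redeveloped share of sample (%):", round(share_redeveloped, 1))
```

Comparing this share with the 2.5% quoted for the full suburb shows how strongly the sampling changes the class proportions the models will see.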

b) Open R and include the rattle package. What instructions did you use to do this?

c) Read the redevelop.csv data into Rattle and assign appropriate roles to your variables. Note that the partition is 70% for training, 15% for validation and 15% for testing.

What is the target variable?
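Rattle performs the 70:15:15 partition for you, but it may help to see what such a partition amounts to. The following is a minimal sketch in Python, not Rattle's own code: the function name `partition_indices` and the seed value are assumptions made for illustration, and the row count 7409 is the size of redevelop.csv given in part (a).

```python
import random

def partition_indices(n_rows, fractions=(0.70, 0.15, 0.15), seed=42):
    """Shuffle row indices and cut them into train/validation/test parts.

    Hypothetical helper for illustration; the seed is an arbitrary choice.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # reproducible random order
    n_train = int(n_rows * fractions[0])
    n_valid = int(n_rows * fractions[1])
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]     # remainder, so no row is lost
    return train, valid, test

train, valid, test = partition_indices(7409)
print(len(train), len(valid), len(test))
```

Each row lands in exactly one of the three parts, so performance on the test part is measured on data the model has not seen during fitting or tuning.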

d) Produce suitable plots to visualise the differences in the distributions of the input variables for properties that were and were not redeveloped. Try to show at least six different types of plot.

e) Fit a classification tree for redeveloped properties assuming a loss matrix with losses half as big for a false negative (Redeveloped="No" when it should be Redevelop="Yes") as a false positive (Redeveloped="Yes" when it should be Redevelop="No"). Assume no losses when a correct decision is made. Answer the following questions after drawing your tree for the training data. Be sure to maximise your tree window before drawing your tree (again).

i. Complete the above loss matrix.

ii. What are the rules for the terminal node with the smallest error rate?

iii. How many splits if we want to minimise the cross-validation error? Explain your answer.

iv. Consider node 2 of your drawn tree. How many training observations for node 2 and what are the rules for node 2?

v. At node 2 in the training data what is the average loss per property if we make a Redevelopment="Yes" decision? What is the average loss per property if we make a Redevelopment = "No" decision? Which is the better decision for this node?

vi. Repeat (v) for some other node where the better decision is unexpected. Explain why the better decision is unexpected.
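The average-loss calculation asked for in (v) and (vi) can be sketched as follows. This is an illustration only: the brief fixes just the ratio of the two losses, so the values 2 and 1 below are an assumption, and the node counts (600 redeveloped, 400 not redeveloped) are made up and do not come from the data.

```python
# Assumed loss values: only the 1:2 ratio is given in the brief.
LOSS_FALSE_POSITIVE = 2.0   # decide "Yes" when the truth is "No"
LOSS_FALSE_NEGATIVE = 1.0   # decide "No" when the truth is "Yes"

def average_losses(n_yes, n_no):
    """Average loss per property at a node for each possible decision."""
    n = n_yes + n_no
    loss_if_yes = LOSS_FALSE_POSITIVE * n_no / n   # every "No" case is a false positive
    loss_if_no = LOSS_FALSE_NEGATIVE * n_yes / n   # every "Yes" case is a false negative
    return loss_if_yes, loss_if_no

def better_decision(n_yes, n_no):
    """Return the decision with the smaller average loss."""
    loss_if_yes, loss_if_no = average_losses(n_yes, n_no)
    return "Yes" if loss_if_yes < loss_if_no else "No"

# Hypothetical node: 600 redeveloped and 400 not redeveloped properties.
print(average_losses(600, 400))
print(better_decision(600, 400))
```

With these made-up counts the majority class is "Yes", yet the smaller average loss comes from deciding "No". This is the kind of node that part (vi) calls unexpected.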

f) Run a random forest with your data with 500 trees, randomly selecting three input variables from which to choose your split variable at each node. Please include all outputs for each question.

i. What is the OOB estimate of the error rate and what does OOB mean?
ii. What is the error rate for the Redevelopment = "Yes" predictions with the test data and what is the error rate for Redevelopment = "No" predictions with the test data?
iii. Which are the top 3 predictor variables according to the Gini measure of variable importance and how is this measure defined?
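Two ideas behind part (f) can be illustrated outside Rattle. The sketch below, in Python with hypothetical function names, computes the Gini impurity of a two-class node and the decrease in impurity produced by one split, which is the quantity a Gini-based importance measure accumulates for the split variable. It also computes the chance that an observation is left out of a bootstrap sample of size n, which is why each tree has out-of-bag cases available. The class counts used are made up.

```python
import math

def gini(n_yes, n_no):
    """Gini impurity of a two-class node: 2 * p * (1 - p)."""
    n = n_yes + n_no
    if n == 0:
        return 0.0
    p = n_yes / n
    return 2 * p * (1 - p)

def gini_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children.

    Each argument is a (n_yes, n_no) pair of class counts.
    """
    n_parent = sum(parent)
    weighted_children = (sum(left) * gini(*left) + sum(right) * gini(*right)) / n_parent
    return gini(*parent) - weighted_children

def out_of_bag_probability(n):
    """Chance that one observation is never drawn in n draws with replacement."""
    return (1 - 1 / n) ** n

# Made-up counts: a perfectly separating split removes all impurity.
print(gini_decrease((50, 50), (50, 0), (0, 50)))
# For a large sample this probability is close to exp(-1).
print(out_of_bag_probability(5186), math.exp(-1))
```

A split that leaves both children with the same class mix as the parent gives a decrease of zero, so variables that never separate the classes contribute nothing to the total.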

g) Now try Boosting. Please include all outputs for each question.

i. Interpret the term Gain and explain why this measure provides a reliable measure of Variable Importance.

ii. What does the Error Plot suggest as the optimum number of trees?

h) Now try a neural network with two and then three hidden nodes. Use the Evaluate menu error matrix to answer the following questions. Please include all outputs for each question.

i. Is it necessary to transform any of the input variables? What transformations have you chosen and why?
ii. What is the error rate for properties that actually were redeveloped? Consider only the test data, assuming first 2 and then 3 hidden nodes.
iii. What is the error rate for properties that were not actually redeveloped? Consider only the test data, assuming first 2 and then 3 hidden nodes.
iv. Which is better, a 2 hidden node or a 3 hidden node solution? Why?
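The class-specific error rates asked for in (ii) and (iii) are read off the error matrix by conditioning on the actual class. A minimal sketch, assuming a 2x2 error matrix with the four counts below; the counts are made up for illustration and are not results from any fitted model.

```python
def class_error_rates(tn, fp, fn, tp):
    """Error rates conditional on the actual class, plus the overall rate.

    tn: actual "No" predicted "No"     fp: actual "No" predicted "Yes"
    fn: actual "Yes" predicted "No"    tp: actual "Yes" predicted "Yes"
    """
    return {
        "actual_no": fp / (tn + fp),      # not redeveloped but predicted "Yes"
        "actual_yes": fn / (fn + tp),     # redeveloped but predicted "No"
        "overall": (fp + fn) / (tn + fp + fn + tp),
    }

# Hypothetical test-set counts.
rates = class_error_rates(tn=600, fp=80, fn=130, tp=302)
for name, value in rates.items():
    print(name, round(value, 3))
```

Comparing these three rates for the 2 and 3 hidden node networks is one way to support the choice asked for in (iv).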

i) Use the Evaluate menu to determine which is the best tool for modelling your data: a single tree, a random forest, boosting or a neural network. Why have you chosen this one method over the other three methods?

j) For this best tool show the ROC, sensitivity, risk and lift charts for the test data ONLY.

k) Explain the axes for each of the above four charts.
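To make the axes concrete, the sketch below computes the points of an ROC curve (false positive rate against true positive rate) and the lift in the top-scoring fraction of cases. It uses the standard textbook definitions, so the axis labels Rattle prints may be worded differently. The labels and scores are made up, the function names are hypothetical, and tied scores are not treated specially.

```python
def roc_points(labels, scores):
    """(false positive rate, true positive rate) after each case, best score first."""
    ranked = sorted(zip(scores, labels), reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def lift_at(labels, scores, fraction):
    """Positive rate among the top-scoring fraction divided by the overall rate."""
    ranked = sorted(zip(scores, labels), reverse=True)
    k = max(1, round(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate

# Made-up test cases: 1 = redeveloped, 0 = not redeveloped.
labels = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

points = roc_points(labels, scores)
lift = lift_at(labels, scores, 0.25)
print(points)
print(lift)
```

A lift above 1 in the top fraction means the model finds redeveloped properties there more often than random selection would.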

l) Which is the best method for choosing the most important predictor of Redevelopment = "Yes"; plots, a single tree, a random forest, boosting, a neural network? Why have you chosen this one method over the other four methods?

m) Do any of the above models appear to be worth commercialising? For what purpose?

Attachment:- Exercise.rar





All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd