Reference no: EM132569340
Task - Exploratory Data Analysis and Linear Regression Analysis
Carefully study the Data Dictionary for the Boston Housing Data Set (see Table 1) and the accompanying description of each variable. It is important to understand this data set, as it is used for Task 2 and Task 3 in Assignment 2. Each record in the housing.csv data set describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970.
Variable | Description | Type
crim | per capita crime rate by town | real
zn | proportion of residential land zoned for lots over 25,000 sq. ft. | real
indus | proportion of non-retail business acres per town | real
chas | Charles River dummy variable (1 if tract bounds river; else 0) | integer
nox | nitric oxides concentration (parts per 10 million) | real
rm | average number of rooms per dwelling | real
age | proportion of owner-occupied units built prior to 1940 | real
dis | weighted distances to five Boston employment centres | real
rad | index of accessibility to radial highways | integer
tax | full-value property-tax rate per $10,000 | real
ptratio | pupil-teacher ratio by town | real
bk | 1000(Bk - 0.63)^2, where Bk is the proportion of African Americans by town | real
lstat | % lower status of the population | real
medv | median value of owner-occupied homes in $1000s | real
Note: You should conduct some desktop research to identify the determinants (drivers) of housing prices, so that you can fully understand and interpret the key findings of the exploratory data analysis (EDA) and Linear Regression models for the housing.csv data set in Task 2, and the visual presentation of the housing.csv data set in Task 3.
Task 2.1) Conduct and report on an exploratory data analysis (EDA) of the housing.csv data set using the RapidMiner Studio data mining tool. Note that this will require the use of a number of RapidMiner operators.
Provide the following for Task 2.1:
(i) A screen capture of your final EDA process, with a brief description of your EDA process.
(ii) Summarise the key results of your exploratory data analysis in Table 2.1, Results of Exploratory Data Analysis for housing.csv. Table 2.1 should include the key characteristics of each variable in the housing.csv data set, such as maximum and minimum values, average, standard deviation, most frequent value (mode), missing values, and invalid values.
(iii) Discuss the key results of the exploratory data analysis presented in Table 2.1 and provide a rationale for selecting the top 5 variables for predicting median house value (medv). Focus in particular on the relationships of the independent variables with each other and with the dependent variable medv, drawing on the results of your EDA and the relevant literature on determinants of house prices.
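The submission itself must be produced in RapidMiner, but the figures in Table 2.1 and the correlation-based variable ranking can be cross-checked with a short script. The following is a minimal sketch in standard-library Python, not a prescribed method. The inline rows are made up for illustration and are not taken from housing.csv, and the helper names (`load_columns`, `summarise`, `pearson`, `rank_by_correlation`) are hypothetical.

```python
import csv
import io
import math
import statistics

# Made-up rows for illustration only; NOT taken from housing.csv.
SAMPLE_CSV = """rm,lstat,medv
6.0,10.0,20.0
6.5,8.0,24.0
7.0,5.0,30.0
5.5,15.0,16.0
6.0,,21.0
"""


def load_columns(text):
    """Read CSV text into {column: [float or None]}; blank cells become None."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, cell in row.items():
            columns[name].append(float(cell) if cell.strip() else None)
    return columns


def summarise(values):
    """Descriptive statistics of the kind requested for Table 2.1."""
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "stdev": statistics.stdev(present),  # sample standard deviation
        "mode": statistics.mode(present),
        "missing": len(values) - len(present),
    }


def pearson(xs, ys):
    """Pearson correlation over the rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    px = [x for x, _ in pairs]
    py = [y for _, y in pairs]
    mean_x, mean_y = statistics.mean(px), statistics.mean(py)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x in px)
    syy = sum((y - mean_y) ** 2 for y in py)
    return sxy / math.sqrt(sxx * syy)


def rank_by_correlation(columns, target):
    """Rank predictors by the absolute value of their correlation with target."""
    scores = [(name, pearson(values, columns[target]))
              for name, values in columns.items() if name != target]
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)


columns = load_columns(SAMPLE_CSV)
summary = {name: summarise(values) for name, values in columns.items()}
ranking = rank_by_correlation(columns, "medv")

for name, stats in summary.items():
    print(name, stats)
for name, r in ranking:
    print(f"{name}: r = {r:.3f}")
```

Ranking by absolute correlation with medv is only one possible starting point for choosing the top 5 variables. Predictors that are strongly correlated with each other should also be checked, since including both adds little information to a regression model.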
Task 2.2) Build and report on a Linear Regression model for predicting medv, using a RapidMiner data mining process with an appropriate set of data mining operators and the reduced set of variables from the housing.csv data set determined by your exploratory data analysis in Task 2.1.
Provide the following for Task 2.2:
(i) A screen capture of your final Linear Regression model process, with a brief description of the process.
(ii) Table 2.2, Results of Final Linear Regression Model for Task 2.2, for the housing.csv data set.
(iii) Discuss the results of the final Linear Regression model for the housing.csv data set, drawing on the key outputs (coefficients, standardised coefficients, t-statistics, p-values, significance levels, etc.) for predicting median house value (medv) and on relevant supporting literature on the interpretation of a Linear Regression model.
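To help interpret the outputs listed in (iii), the sketch below shows how a coefficient, its standard error, its t-statistic, and its standardised coefficient relate to one another. It is a simplified illustration with a single predictor in standard-library Python, not the RapidMiner operator and not a multi-variable model. The data values are made up and are not taken from housing.csv, and `simple_ols` is a hypothetical helper name.

```python
import math
import statistics


def simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Sum of squared residuals around the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    residual_variance = sse / (n - 2)
    std_error = math.sqrt(residual_variance / sxx)

    return {
        "intercept": intercept,
        "slope": slope,
        "std_error": std_error,
        # t-statistic: the coefficient divided by its standard error.
        "t_stat": slope / std_error,
        # Standardised coefficient: the slope rescaled by the ratio of
        # the standard deviations of the predictor and the response.
        "std_coef": slope * statistics.stdev(x) / statistics.stdev(y),
        "r_squared": 1 - sse / syy,
        "df": n - 2,
    }


# Made-up values for illustration only; NOT taken from housing.csv.
rm = [5.5, 6.0, 6.0, 6.5, 7.0]
medv = [16.0, 20.0, 21.0, 24.0, 30.0]

result = simple_ols(rm, medv)
for key, value in result.items():
    print(f"{key}: {value:.4f}")
```

The p-value is left out because the Python standard library has no t-distribution function. A two-sided p-value would be obtained by comparing `t_stat` against a t distribution with `df` degrees of freedom. With a single predictor the standardised coefficient equals the Pearson correlation between predictor and response; this does not hold in general once several predictors are included.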