Reference no: EM132343841
Assignment
Answer the following questions. Upload your answers as a neat, easy-to-read Word document. Move all graphs, charts, and tables into that single document. Do not upload spreadsheets. Be sure to read this week's written lecture for links and other helpful information. The necessary datasets are linked. These questions ask you to explain, describe, or outline something in addition to providing the program output. The essay parts count for about 50 percent of the points in your answer, so be sure to include well-considered, detailed explanation and discussion in your own words. Use APA-style references and citations if needed. Copying and pasting, or similar plagiarism or cheating, will result in zero points on the entire assignment. These questions are from Chapter 6 of Shmueli, Bruce, and Patel.
1. The file BostonHousing (see attached) contains census information concerning housing in Boston, MA. The dataset has information on 506 housing tracts. It contains 12 predictor variables and one outcome variable, MEDV, the median house price. See the text for a table of variable descriptions, or run Analytic Solver (XLMiner) or similar software to view the data description. The following questions refer to this dataset.
2. Why is the data partitioned into training and validation sets as part of the data mining process? What is the purpose of each?
3. Fit a multiple linear regression model to the median house price as a function of CRIM, CHAS, and RM using Analytic Solver or SPSS Modeler. Use the coefficient table in the output to write the linear equation that predicts the median house price.
4. Examine the various predictor variables. Which predictors are likely to be measuring the same thing? Discuss the relationships among INDUS, NOX, and TAX.
5. Compute the correlation table for the numerical predictors and look for highly correlated pairs. These could cause multicollinearity. Which ones should be removed?
6. Use exhaustive search to reduce the remaining predictors. Choose the top three models. Fit each on the training set and compare their accuracy on the validation set. Compare RMSE, average error, and lift charts. Describe the best model.
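
The workflow described in the questions above (partition the data, fit a regression on the training set, inspect predictor correlations, search predictor subsets, then score on the validation set) can be sketched in Python. This is only an illustration under stated assumptions: it uses synthetic data rather than the BostonHousing file, the 60/40 split and the three unnamed predictors are hypothetical choices, and NumPy least squares stands in for Analytic Solver or SPSS Modeler. Ranking subsets by adjusted R-squared on the training set is one common criterion, not necessarily the one your software reports.

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)

# ASSUMPTION: synthetic stand-in data, NOT the BostonHousing file.
# 506 rows to mirror the number of tracts; 3 made-up numeric predictors.
n = 506
X = rng.normal(size=(n, 3))
y = 20.0 + X @ np.array([-0.5, 2.0, 4.0]) + rng.normal(scale=1.0, size=n)

# Partition: shuffle row indices, 60% training, 40% validation (hypothetical split).
idx = rng.permutation(n)
n_train = int(0.6 * n)
train_idx, valid_idx = idx[:n_train], idx[n_train:]


def add_intercept(A):
    """Prepend a column of ones so the model has an intercept term."""
    return np.column_stack([np.ones(len(A)), A])


def fit_ols(A, b):
    """Ordinary least squares coefficients (intercept first)."""
    coefs, *_ = np.linalg.lstsq(add_intercept(A), b, rcond=None)
    return coefs


# Fit on the training rows only, then score on the held-out validation rows.
beta = fit_ols(X[train_idx], y[train_idx])
errors = y[valid_idx] - add_intercept(X[valid_idx]) @ beta
rmse = float(np.sqrt(np.mean(errors ** 2)))
avg_error = float(np.mean(errors))

# Correlation table for the predictors (training rows), to spot collinear pairs.
corr = np.corrcoef(X[train_idx], rowvar=False)


def adjusted_r2(A, b):
    """Adjusted R-squared of an OLS fit of b on the columns of A."""
    resid = b - add_intercept(A) @ fit_ols(A, b)
    r2 = 1.0 - resid @ resid / np.sum((b - b.mean()) ** 2)
    m, p = A.shape
    return 1.0 - (1.0 - r2) * (m - 1) / (m - p - 1)


# Exhaustive search: every non-empty predictor subset, ranked on the training set.
ranked = sorted(
    (
        (adjusted_r2(X[train_idx][:, list(cols)], y[train_idx]), cols)
        for k in range(1, X.shape[1] + 1)
        for cols in combinations(range(X.shape[1]), k)
    ),
    reverse=True,
)

print("coefficients (intercept first):", np.round(beta, 3))
print("validation RMSE:", round(rmse, 3), "average error:", round(avg_error, 3))
print("top three subsets by adjusted R^2:", [cols for _, cols in ranked[:3]])
```

The coefficient vector maps directly onto the linear equation asked for in question 3: the first entry is the intercept and each remaining entry multiplies its predictor. With the real data, the top three subsets would each be refit on the training rows and compared on the validation rows, as in question 6.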
Attachment:- Assignment & Data Files.rar