Reference no: EM133122877
Access the repository and use the dataset provided by the instructor.
Describe a predictive model you could build using this dataset and present it, including the analysis of the residuals.
Verify the assumptions of the linear regression using plots. Explain each test, comment on the R code, and interpret the test result for:
a) Linear relationship: There is a linear relationship between each predictor variable x_i and the outcome variable y. Use the plot function with which = 1;
b) Independence: The residuals are independent. Use the durbinWatsonTest function in the car package;
c) Homoscedasticity: The residuals have constant variance at every level of x. Use the plot function with which = 3; and
d) Normality: The residuals of the model are normally distributed. Use the plot function with which = 2.
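The four checks above can be sketched as follows. This is only an illustration under stated assumptions: the built-in mtcars data set and the model mpg ~ wt + hp are stand-ins, not the instructor's dataset or a recommended model. The Durbin-Watson statistic is also computed by hand so that the output of durbinWatsonTest can be related to its definition.

```r
# Stand-in data and model (assumptions): mtcars is NOT the assignment dataset.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Graphical checks on the fitted model
par(mfrow = c(1, 3))
plot(fit, which = 1)  # a) residuals vs fitted: look for a flat trend line
plot(fit, which = 3)  # c) scale-location: look for an even vertical spread
plot(fit, which = 2)  # d) normal Q-Q: points should follow the reference line
par(mfrow = c(1, 1))

# b) Durbin-Watson statistic by hand: sum of squared successive differences
#    of the residuals divided by the residual sum of squares.
#    Values near 2 suggest no first-order autocorrelation.
e <- residuals(fit)
dw_manual <- sum(diff(e)^2) / sum(e^2)
print(dw_manual)

# The same statistic, with a p-value, from the car package (if installed).
# The default p-value is obtained by simulation, so it can vary between runs.
if (requireNamespace("car", quietly = TRUE)) {
  print(car::durbinWatsonTest(fit))
}
```

Note that the Durbin-Watson test depends on the order of the rows, so it is most meaningful when the observations have a natural ordering such as time.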
Verify the assumption of homoscedasticity computationally using the Non-Constant Variance (NCV) score test (ncvTest function in the car package). Explain the test, the R code, and interpret the test results. Explain the relationship between the NCV test and the Durbin-Watson test.
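A minimal sketch of the NCV test, again assuming mtcars and mpg ~ wt + hp as stand-ins for the real data and model. The score statistic is first rebuilt in base R to show what it measures (whether the scaled squared residuals trend with the fitted values) and then obtained from ncvTest. The null hypothesis is constant error variance, so a small p-value is evidence of heteroscedasticity.

```r
# Stand-in data and model (assumptions): mtcars is NOT the assignment dataset.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Score statistic by hand: scale the squared residuals by the
# maximum-likelihood variance estimate, regress them on the fitted values,
# and take half of the regression sum of squares.
e <- residuals(fit)
u <- e^2 / mean(e^2)
aux <- lm(u ~ fitted(fit))
chisq <- anova(aux)[["Sum Sq"]][1] / 2
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)
print(c(chisq = chisq, p_value = p_value))

# The same test from the car package (if installed)
if (requireNamespace("car", quietly = TRUE)) {
  ncv <- car::ncvTest(fit)
  print(ncv)
}
```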
Assess the presence of significant outliers and explain their potential impact. Use the plot function with which = 5, and Cook's distance.
Explain how you would mitigate the impact of outliers. Then implement measures to mitigate their impact (e.g., removal or other transformations of the data).
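One possible way to flag and then remove influential observations is sketched below. The mtcars data, the model, and the 4/n cutoff are all assumptions made for illustration; 4/n is a common rule of thumb, not a requirement of the assignment, and removal should be justified case by case rather than applied mechanically.

```r
# Stand-in data and model (assumptions): mtcars is NOT the assignment dataset.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Residuals vs leverage plot with Cook's distance contours
plot(fit, which = 5)

# Cook's distance for every observation
cd <- cooks.distance(fit)

# Rule-of-thumb cutoff (assumption): flag points with distance above 4/n
threshold <- 4 / nobs(fit)
influential <- names(cd)[cd > threshold]
print(influential)

# Mitigation by removal: refit on the rows that were not flagged
keep <- cd <= threshold
fit_clean <- lm(mpg ~ wt + hp, data = mtcars[keep, ])

# Compare the coefficients before and after to see the impact of the removal
print(rbind(original = coef(fit), cleaned = coef(fit_clean)))
```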
Discuss what other tests or transformations you could use if there are no outliers. Could a log transformation be used to reduce skewness? Explain.
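As an illustration of the log-transformation question, the sketch below compares the sample skewness of a right-skewed variable before and after taking logs. The variable mtcars$hp is a stand-in, and skewness() is a small helper defined here for illustration (base R has no built-in skewness function). A log transformation only applies to strictly positive values.

```r
# Helper defined for this example (assumption): moment-based sample skewness
skewness <- function(x) {
  mean((x - mean(x))^3) / sd(x)^3
}

# Stand-in variable (assumption): horsepower from the built-in mtcars data
hp <- mtcars$hp
stopifnot(all(hp > 0))  # log() requires strictly positive values

skew_raw <- skewness(hp)
skew_log <- skewness(log(hp))
print(c(raw = skew_raw, logged = skew_log))

# Visual comparison of the two distributions
par(mfrow = c(1, 2))
hist(hp, main = "hp")
hist(log(hp), main = "log(hp)")
par(mfrow = c(1, 1))
```

A skewness closer to zero after the transformation suggests a more symmetric distribution, but the regression assumptions concern the residuals, so the diagnostics should be rerun on the refitted model.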
Repeat steps 4-5, using the "cleaner" dataset.
Test collinearity using the Variance Inflation Factor. Use the vif function in R.
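A sketch of the collinearity check, assuming mtcars and the predictors wt, hp, and disp as stand-ins. Each VIF is first computed from its definition, 1 / (1 - R^2), where R^2 comes from regressing one predictor on the others, and then with the vif function from the car package. Cutoffs such as 5 or 10 are conventions, not fixed rules.

```r
# Stand-in data and model (assumptions): mtcars is NOT the assignment dataset.
predictors <- c("wt", "hp", "disp")
fit <- lm(reformulate(predictors, response = "mpg"), data = mtcars)

# VIF by hand: regress each predictor on the remaining predictors
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = mtcars))$r.squared
  1 / (1 - r2)
})
print(vif_manual)

# The same values from the car package (if installed)
if (requireNamespace("car", quietly = TRUE)) {
  print(car::vif(fit))
}
```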
Assess the overall validity of the regression model and present your final assessment regarding its readiness and suitability for making predictions.
Given your analysis, describe the recommendations you would make to a researcher intent on using this data in a multiple linear regression model.