Reference no: EM132268103
Advanced Data Analysis Assignment - Problem Set
Problem set to be completed with RStudio software.
1. In Sections 5.3.2 and 5.3.3, we saw that the cv.glm() function can be used in order to compute the LOOCV test error estimate. Alternatively, one could compute those quantities using just the glm() and predict.glm() functions, and a for loop. You will now take this approach in order to compute the LOOCV error for a simple logistic regression model on the Weekly data set. Recall that in the context of classification problems, the LOOCV error is given in (5.4).
(a) Fit a logistic regression model that predicts Direction using Lag1 and Lag2.
(b) Fit a logistic regression model that predicts Direction using Lag1 and Lag2 using all but the first observation.
(c) Use the model from (b) to predict the direction of the first observation. You can do this by predicting that the first observation will go up if P(Direction="Up"|Lag1, Lag2) > 0.5. Was this observation correctly classified?
(d) Write a for loop from i = 1 to i = n, where n is the number of observations in the data set, that performs each of the following steps:
i. Fit a logistic regression model using all but the ith observation to predict Direction using Lag1 and Lag2.
ii. Compute the posterior probability of the market moving up for the ith observation.
iii. Use the posterior probability for the ith observation in order to predict whether or not the market moves up.
iv. Determine whether or not an error was made in predicting the direction for the ith observation. If an error was made, then indicate this as a 1, and otherwise indicate it as a 0.
(e) Take the average of the n numbers obtained in (d)iv in order to obtain the LOOCV estimate for the test error. Comment on the results.
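The loop described in (d) and the average in (e) can be sketched as follows. This is only an illustrative outline, not a reference solution: to keep it runnable without extra packages it uses a simulated data frame `dat` as a stand-in for Weekly (the simulated coefficients and the sample size are assumptions), so the resulting error rate says nothing about the real data.

```r
# Simulated stand-in for the Weekly data (an assumption, used only so the
# sketch runs on its own); with the real data, replace dat by Weekly.
set.seed(1)
n <- 200
dat <- data.frame(Lag1 = rnorm(n), Lag2 = rnorm(n))
p_up <- plogis(0.2 + 0.5 * dat$Lag2)          # hypothetical true model
dat$Direction <- factor(ifelse(runif(n) < p_up, "Up", "Down"),
                        levels = c("Down", "Up"))

errors <- numeric(n)
for (i in seq_len(n)) {
  # i. Fit the logistic regression on all but the i-th observation.
  fit <- glm(Direction ~ Lag1 + Lag2, data = dat[-i, ], family = binomial)
  # ii. Posterior probability of "Up" for the held-out observation.
  prob_up <- predict(fit, newdata = dat[i, ], type = "response")
  # iii. Predict "Up" when that probability exceeds 0.5.
  pred <- ifelse(prob_up > 0.5, "Up", "Down")
  # iv. Record 1 if the prediction is wrong, 0 otherwise.
  errors[i] <- as.integer(pred != dat$Direction[i])
}

# (e) The LOOCV estimate of the test error is the mean of the indicators.
loocv_error <- mean(errors)
print(loocv_error)
```

Because "Down" is the first factor level, glm() models the probability of "Up", which is what the 0.5 threshold is applied to.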
2. We will now perform cross-validation on a simulated data set.
(a) Generate a simulated data set as follows:
> set.seed(1)
> x=rnorm(100)
> y=x-2*x^2+rnorm(100)
In this data set, what is n and what is p? Write out the model used to generate the data in equation form.
(b) Create a scatter-plot of X against Y. Comment on what you find.
(c) Set a random seed, and then compute the LOOCV errors that result from fitting the following four models using least squares:
i. Y = β0 + β1X + ε
ii. Y = β0 + β1X + β2X^2 + ε
iii. Y = β0 + β1X + β2X^2 + β3X^3 + ε
iv. Y = β0 + β1X + β2X^2 + β3X^3 + β4X^4 + ε
Note you may find it helpful to use the data.frame() function to create a single data set containing both X and Y.
(d) Repeat (c) using another random seed, and report your results. Are your results the same as what you got in (c)? Why?
(e) Which of the models in (c) had the smallest LOOCV error? Is this what you expected? Explain your answer.
(f) Comment on the statistical significance of the coefficient estimates that result from fitting each of the models in (c) using least squares. Do these results agree with the conclusions drawn based on the cross-validation results?
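One possible way to obtain the four LOOCV errors in (c) is sketched below. It uses the data-generating code from (a) and, instead of cv.glm(), the leave-one-out shortcut that holds for least squares fits (the mean of the squared residuals, each divided by one minus its leverage, squared). The helper name `loocv_lm` is hypothetical, and the choice of the shortcut over an explicit loop is an assumption made for brevity.

```r
# Data from part (a).
set.seed(1)
x <- rnorm(100)
y <- x - 2 * x^2 + rnorm(100)
dat <- data.frame(x = x, y = y)

# Hypothetical helper: LOOCV error of a degree-d polynomial least squares
# fit, computed with the leverage shortcut mean((e_i / (1 - h_i))^2).
loocv_lm <- function(degree, data) {
  fit <- lm(y ~ poly(x, degree, raw = TRUE), data = data)
  h <- hatvalues(fit)
  mean((residuals(fit) / (1 - h))^2)
}

cv_err <- sapply(1:4, loocv_lm, data = dat)
names(cv_err) <- paste0("degree_", 1:4)
print(cv_err)

# Coefficient tables for part (f), e.g. for the quartic model:
print(summary(lm(y ~ poly(x, 4, raw = TRUE), data = dat))$coefficients)
```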
3. We will now consider the Boston housing data set, from the MASS library.
(a) Based on this data set, provide an estimate for the population mean of medv. Call this estimate μ^.
(b) Provide an estimate of the standard error of μ^. Interpret this result. Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.
(c) Now estimate the standard error of μ^ using the bootstrap. How does this compare to your answer from (b)?
(d) Based on your bootstrap estimate from (c), provide a 95% confidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv). Hint: You can approximate a 95% confidence interval using the formula [μ^ - 2SE(μ^), μ^ + 2SE(μ^)].
(e) Based on this dataset, provide an estimate, μ^med, for the median value of medv in the population.
(f) We now would like to estimate the standard error of μ^med. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.
(g) Based on this data set, provide an estimate for the tenth percentile of medv in Boston suburbs. Call this quantity μ^0.1. (You can use the quantile() function.)
(h) Use the bootstrap to estimate the standard error of μ^0.1. Comment on your findings.
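The bootstrap standard errors asked for in (c), (f) and (h) all follow the same pattern: resample the observations with replacement, recompute the statistic, and take the standard deviation of the recomputed values. A minimal base-R sketch is given below. The helper `boot_se` and the simulated vector `v` are hypothetical stand-ins (with the real data one would pass Boston$medv from the MASS library instead), and the boot() function from the boot package would be an alternative route.

```r
# Hypothetical helper: bootstrap standard error of stat(x) from B resamples.
boot_se <- function(x, stat, B = 1000) {
  reps <- replicate(B, stat(sample(x, replace = TRUE)))
  sd(reps)
}

# Simulated stand-in for medv (an assumption so the sketch runs on its own).
set.seed(1)
v <- rnorm(500, mean = 22, sd = 9)

# (b) Formula-based standard error of the sample mean.
se_formula <- sd(v) / sqrt(length(v))

# (c), (f), (h): bootstrap standard errors of the mean, the median and the
# tenth percentile.
se_boot_mean   <- boot_se(v, mean)
se_boot_median <- boot_se(v, median)
se_boot_q10    <- boot_se(v, function(z) quantile(z, 0.1, names = FALSE))

# (d) Approximate 95% confidence interval for the mean.
ci_mean <- mean(v) + c(-2, 2) * se_boot_mean

print(c(formula = se_formula, boot_mean = se_boot_mean,
        boot_median = se_boot_median, boot_q10 = se_boot_q10))
print(ci_mean)
```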
4. In this exercise, we will generate simulated data, and will then use this data to perform best subset selection.
(a) Use the rnorm() function to generate a predictor X of length n = 100, as well as a noise vector ε of length n = 100.
(b) Generate a response vector Y of length n = 100 according to the model
Y = β0 + β1X + β2X^2 + β3X^3 + ε,
where β0, β1, β2, and β3 are constants of your choice.
(c) Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors X, X^2, ..., X^10. What is the best model obtained according to Cp, BIC, and adjusted R^2? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained. Note you will need to use the data.frame() function to create a single data set containing both X and Y.
(d) Repeat (c), using forward stepwise selection and also using backwards stepwise selection. How does your answer compare to the results in (c)?
(e) Now fit a lasso model to the simulated data, again using X, X^2, ..., X^10 as predictors. Use cross-validation to select the optimal value of λ. Create plots of the cross-validation error as a function of λ. Report the resulting coefficient estimates, and discuss the results obtained.
(f) Now generate a response vector Y according to the model
Y = β0 + β7X^7 + ε
and perform best subset selection and the lasso. Discuss the results obtained.
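As a starting point for this exercise, the simulated data and the single data frame holding Y together with X, X^2, ..., X^10 might be built as in the sketch below. The seed, the coefficient values in `beta` and the column names `x1` to `x10` are arbitrary assumptions. The selection step itself is only indicated in a comment, since it relies on the leaps package.

```r
set.seed(1)
n <- 100
x <- rnorm(n)        # (a) predictor
eps <- rnorm(n)      # (a) noise vector

# (b) Response from the cubic model; these coefficients are arbitrary choices.
beta <- c(2, 3, -1, 0.5)
y <- beta[1] + beta[2] * x + beta[3] * x^2 + beta[4] * x^3 + eps

# Columns x1, ..., x10 hold X, X^2, ..., X^10.
powers <- sapply(1:10, function(k) x^k)
colnames(powers) <- paste0("x", 1:10)
dat <- data.frame(y = y, powers)

print(dim(dat))
print(head(dat[, 1:4]))

# With the leaps package installed, best subset selection could then be run
# along the lines of:
#   library(leaps)
#   fit <- regsubsets(y ~ ., data = dat, nvmax = 10)
#   summary(fit)$cp; summary(fit)$bic; summary(fit)$adjr2
```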
5. In this exercise, we will predict the number of applications received using the other variables in the College data set.
(a) Split the data set into a training set and a test set.
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
6. We have seen that as the number of features used in a model increases, the training error will necessarily decrease, but the test error may not. We will now explore this in a simulated data set.
(a) Generate a data set with p = 20 features, n = 1,000 observations, and an associated quantitative response vector generated according to the model
Y = X'β + ε = Σ_{j=1}^{p} X_j β_j + ε,
where β has some elements that are exactly equal to zero.
(b) Split your dataset into a training set containing 100 observations and a test set containing 900 observations.
(c) Perform best subset selection on the training set, and plot the training set MSE associated with the best model of each size.
(d) Plot the test set MSE associated with the best model of each size.
(e) For which model size does the test set MSE take on its minimum value? Comment on your results. If it takes on its minimum value for a model containing only an intercept or a model containing all of the features, then play around with the way that you are generating the data in (a) until you come up with a scenario in which the test set MSE is minimized for an intermediate model size.
(f) How does the model at which the test set MSE is minimized compare to the true model used to generate the data?
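Parts (a) and (b) could be set up roughly as shown below. This is a sketch under stated assumptions: the seed, the standard normal features, and which five coefficients are set to zero are all arbitrary choices. The final lines compare training and test MSE for the full least squares model only, as a base-R illustration. The best subset search in (c) would need regsubsets() from the leaps package.

```r
set.seed(1)
n <- 1000
p <- 20

# (a) Features, a coefficient vector with some exact zeros, and the response.
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
beta <- rnorm(p)
beta[c(3, 7, 11, 15, 19)] <- 0        # arbitrary choice of zero coefficients
eps <- rnorm(n)
y <- as.vector(X %*% beta + eps)      # Y = sum_j X_j * beta_j + eps

# (b) 100 training observations, 900 test observations.
train_idx <- sample(n, 100)
train <- data.frame(y = y[train_idx], X[train_idx, ])
test  <- data.frame(y = y[-train_idx], X[-train_idx, ])

# Training and test MSE of the full model with all 20 features.
fit <- lm(y ~ ., data = train)
train_mse <- mean(residuals(fit)^2)
test_mse  <- mean((test$y - predict(fit, newdata = test))^2)
print(c(train = train_mse, test = test_mse))
```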
7. We will now try to predict per capita crime rate in the Boston data set.
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
(c) Does your chosen model involve all of the features in the data set? Why or why not?