Fit a logistic regression model that predicts Direction

Reference no: EM132268103

Advanced Data Analysis Assignment - Problem Set

Complete this problem set using RStudio.

1. In Sections 5.3.2 and 5.3.3, we saw that the cv.glm() function can be used in order to compute the LOOCV test error estimate. Alternatively, one could compute those quantities using just the glm() and predict.glm() functions, and a for loop. You will now take this approach in order to compute the LOOCV error for a simple logistic regression model on the Weekly data set. Recall that in the context of classification problems, the LOOCV error is given in (5.4).

(a) Fit a logistic regression model that predicts Direction using Lag1 and Lag2.

(b) Fit a logistic regression model that predicts Direction using Lag1 and Lag2 using all but the first observation.

(c) Use the model from (b) to predict the direction of the first observation. You can do this by predicting that the first observation will go up if P(Direction="Up"|Lag1, Lag2) > 0.5. Was this observation correctly classified?

(d) Write a for loop from i = 1 to i = n, where n is the number of observations in the data set, that performs each of the following steps:

i. Fit a logistic regression model using all but the ith observation to predict Direction using Lag1 and Lag2.

ii. Compute the posterior probability of the market moving up for the ith observation.

iii. Use the posterior probability for the ith observation in order to predict whether or not the market moves up.

iv. Determine whether or not an error was made in predicting the direction for the ith observation. If an error was made, then indicate this as a 1, and otherwise indicate it as a 0.

(e) Take the average of the n numbers obtained in (d)iv in order to obtain the LOOCV estimate for the test error. Comment on the results.
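
The loop in (d) and the average in (e) can be sketched as follows. This is a minimal illustration, not a reference solution: the Weekly data ships with the ISLR package, which may not be installed, so the sketch runs on `sim_weekly`, a hypothetical simulated stand-in with the same three columns. The helper name `loocv_error` is also an assumption made for this example. Passing `Weekly` (after `library(ISLR)`) instead of `sim_weekly` should give the quantity the exercise asks for.

```r
# Minimal LOOCV sketch for a logistic regression, base R only.
# `sim_weekly` is a hypothetical stand-in for the Weekly data set.
set.seed(1)
n <- 200
sim_weekly <- data.frame(Lag1 = rnorm(n), Lag2 = rnorm(n))
p_up <- plogis(0.2 + 0.3 * sim_weekly$Lag2)
sim_weekly$Direction <- factor(ifelse(runif(n) < p_up, "Up", "Down"),
                               levels = c("Down", "Up"))

loocv_error <- function(data) {
  n <- nrow(data)
  errors <- numeric(n)
  for (i in seq_len(n)) {
    # i. Fit on all observations except the i-th.
    fit <- glm(Direction ~ Lag1 + Lag2, data = data[-i, ], family = binomial)
    # ii. Posterior probability of "Up" for the held-out observation.
    prob_up <- predict(fit, newdata = data[i, ], type = "response")
    # iii. Classify with the 0.5 threshold.
    pred <- ifelse(prob_up > 0.5, "Up", "Down")
    # iv. Record 1 for a misclassification, 0 otherwise.
    errors[i] <- as.numeric(pred != as.character(data$Direction[i]))
  }
  list(errors = errors, rate = mean(errors))
}

result <- loocv_error(sim_weekly)
result$rate  # LOOCV estimate of the test error rate
```

Because "Up" is the second factor level, `glm()` models the probability of "Up", so `type = "response"` returns the posterior probability required in step ii.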

2. We will now perform cross-validation on a simulated data set.

(a) Generate a simulated data set as follows:

> set.seed(1)
> x = rnorm(100)
> y = x - 2*x^2 + rnorm(100)

In this data set, what is n and what is p? Write out the model used to generate the data in equation form.

(b) Create a scatter-plot of X against Y. Comment on what you find.

(c) Set a random seed, and then compute the LOOCV errors that result from fitting the following four models using least squares:

i. Y = β₀ + β₁X + ε

ii. Y = β₀ + β₁X + β₂X² + ε

iii. Y = β₀ + β₁X + β₂X² + β₃X³ + ε

iv. Y = β₀ + β₁X + β₂X² + β₃X³ + β₄X⁴ + ε

Note you may find it helpful to use the data.frame() function to create a single data set containing both X and Y.

(d) Repeat (c) using another random seed, and report your results. Are your results the same as what you got in (c)? Why?

(e) Which of the models in (c) had the smallest LOOCV error? Is this what you expected? Explain your answer.

(f) Comment on the statistical significance of the coefficient estimates that result from fitting each of the models in (c) using least squares. Do these results agree with the conclusions drawn based on the cross-validation results?
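
One possible way to set up part (c) is sketched below, assuming the `boot` package (which ships with standard R distributions) is available. The names `sim` and `loocv_mse` are chosen for this example only. The sketch fits each polynomial with `glm()`, whose default gaussian family is a least squares fit, and reads the LOOCV estimate from `cv.glm()`.

```r
library(boot)  # provides cv.glm()

set.seed(1)
x <- rnorm(100)
y <- x - 2 * x^2 + rnorm(100)
sim <- data.frame(x = x, y = y)

# LOOCV error for polynomial fits of degree 1 to 4.
loocv_mse <- numeric(4)
for (d in 1:4) {
  fit <- glm(y ~ poly(x, d), data = sim)       # least squares via glm()
  loocv_mse[d] <- cv.glm(sim, fit)$delta[1]    # raw LOOCV estimate of test MSE
}
names(loocv_mse) <- paste0("degree_", 1:4)
loocv_mse
```

With the default `K = n`, `cv.glm()` leaves out each observation exactly once, so the folds involve no random partitioning. That is worth keeping in mind when answering (d).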

3. We will now consider the Boston housing data set, from the MASS library.

(a) Based on this data set, provide an estimate for the population mean of medv. Call this estimate μ̂.

(b) Provide an estimate of the standard error of μ̂. Interpret this result. Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.

(c) Now estimate the standard error of μ̂ using the bootstrap. How does this compare to your answer from (b)?

(d) Based on your bootstrap estimate from (c), provide a 95% confidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv). Hint: You can approximate a 95% confidence interval using the formula [μ̂ - 2SE(μ̂), μ̂ + 2SE(μ̂)].

(e) Based on this data set, provide an estimate, μ̂_med, for the median value of medv in the population.

(f) We now would like to estimate the standard error of μ̂_med. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.

(g) Based on this data set, provide an estimate for the tenth percentile of medv in Boston suburbs. Call this quantity μ̂_0.1. (You can use the quantile() function.)

(h) Use the bootstrap to estimate the standard error of μ̂_0.1. Comment on your findings.
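
The bootstrap steps in this problem can be sketched along the following lines, assuming the `MASS` and `boot` packages (both distributed with standard R installations). The function names `mean_fn`, `median_fn`, and `q10_fn`, the seed, and the choice of `R = 1000` resamples are all assumptions made for this example.

```r
library(MASS)  # Boston data set
library(boot)  # boot() for the bootstrap

medv <- Boston$medv
n <- length(medv)

mu_hat <- mean(medv)                  # sample mean of medv
se_formula <- sd(medv) / sqrt(n)      # hint formula: s / sqrt(n)

# Statistic functions take the data and a vector of resampled indices.
mean_fn   <- function(data, idx) mean(data[idx])
median_fn <- function(data, idx) median(data[idx])
q10_fn    <- function(data, idx) quantile(data[idx], 0.1)

set.seed(1)
boot_mean   <- boot(medv, mean_fn, R = 1000)
boot_median <- boot(medv, median_fn, R = 1000)
boot_q10    <- boot(medv, q10_fn, R = 1000)

# Bootstrap standard errors: standard deviation of the replicates.
se_boot   <- sd(boot_mean$t[, 1])
se_median <- sd(boot_median$t[, 1])
se_q10    <- sd(boot_q10$t[, 1])

# Approximate 95% interval from the hint, and the t-based interval.
ci_approx <- c(mu_hat - 2 * se_boot, mu_hat + 2 * se_boot)
ci_t <- t.test(medv)$conf.int
```

Taking the standard deviation of the replicates in `$t` matches the "std. error" column that printing a `boot` object reports.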

4. In this exercise, we will generate simulated data, and will then use this data to perform best subset selection.

(a) Use the rnorm() function to generate a predictor X of length n = 100, as well as a noise vector ε of length n = 100.

(b) Generate a response vector Y of length n = 100 according to the model

Y = β₀ + β₁X + β₂X² + β₃X³ + ε,

where β₀, β₁, β₂, and β₃ are constants of your choice.

(c) Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors X, X², ..., X¹⁰. What is the best model obtained according to C_p, BIC, and adjusted R²? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained. Note that you will need to use the data.frame() function to create a single data set containing both X and Y.

(d) Repeat (c), using forward stepwise selection and also using backwards stepwise selection. How does your answer compare to the results in (c)?

(e) Now fit a lasso model to the simulated data, again using X, X², ..., X¹⁰ as predictors. Use cross-validation to select the optimal value of λ. Create plots of the cross-validation error as a function of λ. Report the resulting coefficient estimates, and discuss the results obtained.

(f) Now generate a response vector Y according to the model

Y = β₀ + β₇X⁷ + ε

and perform best subset selection and the lasso. Discuss the results obtained.

5. In this exercise, we will predict the number of applications received using the other variables in the College data set.

(a) Split the data set into a training set and a test set.

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

6. We have seen that as the number of features used in a model increases, the training error will necessarily decrease, but the test error may not. We will now explore this in a simulated data set.

(a) Generate a data set with p = 20 features, n = 1,000 observations, and an associated quantitative response vector generated according to the model

Y = X'β + ε = ∑_{j=1}^{p} X_j β_j + ε,

where β has some elements that are exactly equal to zero.

(b) Split your dataset into a training set containing 100 observations and a test set containing 900 observations.

(c) Perform best subset selection on the training set, and plot the training set MSE associated with the best model of each size.

(d) Plot the test set MSE associated with the best model of each size.

(e) For which model size does the test set MSE take on its minimum value? Comment on your results. If it takes on its minimum value for a model containing only an intercept or a model containing all of the features, then play around with the way that you are generating the data in (a) until you come up with a scenario in which the test set MSE is minimized for an intermediate model size.

(f) How does the model at which the test set MSE is minimized compare to the true model used to generate the data?
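
Parts (a) and (b) might be set up as in the sketch below, which uses base R only. The particular coefficient vector (8 non-zero entries drawn uniformly between 1 and 3), the seed, and all variable names are assumptions made for illustration. The final lines fit ordinary least squares on all 20 features as a reference point; this is not best subset selection, which parts (c) and (d) still require.

```r
set.seed(1)
n <- 1000
p <- 20

# (a) Feature matrix and a sparse coefficient vector.
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(X) <- paste0("X", 1:p)

# Hypothetical choice: 8 non-zero coefficients, 12 exact zeros.
beta <- rep(0, p)
nonzero <- sample(p, 8)
beta[nonzero] <- runif(8, min = 1, max = 3)

y <- as.vector(X %*% beta + rnorm(n))
dat <- data.frame(y = y, X)

# (b) 100 training observations, 900 test observations.
train_idx <- sample(n, 100)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Full least squares fit as a reference point (not best subset selection).
full_fit  <- lm(y ~ ., data = train)
train_mse <- mean((train$y - predict(full_fit, train))^2)
test_mse  <- mean((test$y - predict(full_fit, test))^2)
```

Keeping `beta` and `nonzero` around makes part (f) easier, since the features selected at the best model size can be compared directly with the true non-zero positions.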

7. We will now try to predict per capita crime rate in the Boston data set.

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

ECON 556X Advanced Data Analysis Assignment - Problem Set, Binghamton University, USA.

Reviews

len2268103

3/27/2019 11:19:28 PM

Problem Set with "R Studio" software. For the homework, it states in the PDF that I need the R code too, just making it clear. Note: please submit hard copies of your answers and send me your code via email. Submissions without code will be considered incomplete and will only receive partial credit even if all answers are correct. Reminder: Collaboration in homework assignments is encouraged. However, you are required to work out details by yourself. Identical assignments are not allowed and will be penalized. Late answers cannot be accepted.
