Compare the various estimator-predictors

Reference no: EM131550237

Econometrics and Big Data - Victor Chernozhukov and Christian Hansen

Final Problems

1. Use the data in nettfa.csv to answer this question. The goal of this exercise is simply to use machine learning/nonparametric modeling to build a model for prediction. The data consist of 9915 observations on 9 variables defined in the file nettfa readme.txt. Before answering parts a.-h. below, remove 3915 observations which will be used for an out-of-sample comparison in part i.
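One possible way to set aside the hold-out observations is a seeded random split of row indices. The sketch below is only an illustration: it is written in Python with the standard library rather than the R used by the course files, the helper name `split_holdout` and the seed are hypothetical, and it assumes the rows of nettfa.csv are indexed 0 to 9914.

```python
import random

def split_holdout(n_obs, n_holdout, seed=0):
    """Return (train_idx, holdout_idx) as sorted lists of row indices.

    Hypothetical helper: shuffles the indices 0..n_obs-1 with a fixed seed
    so the split is reproducible, then reserves the first n_holdout of them.
    """
    rng = random.Random(seed)
    idx = list(range(n_obs))
    rng.shuffle(idx)
    holdout = sorted(idx[:n_holdout])
    train = sorted(idx[n_holdout:])
    return train, holdout

# 9915 observations in total, 3915 reserved for the comparison in part i.
train_idx, holdout_idx = split_holdout(9915, 3915, seed=0)
print(len(train_idx), len(holdout_idx))  # 6000 3915
```

All tuning in parts a.-h., including cross-validation, would then use only the rows in `train_idx`, so that the hold-out rows remain untouched until part i.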

a. Estimate E[net_tfa].

b. Estimate E[net_tfa|X = {age, inc, fsize, educ, db, marr, male, twoearn}] using linear regression.

c. Estimate E[net_tfa|X = {age, inc, fsize, educ, db, marr, male, twoearn}] using k-nn with the number of neighbors chosen by cross-validation. Carefully explain how you define the distance between observations. Comment on how many neighbors you chose and how this relates to the sample size.

d. Estimate E[net_tfa|X = {age, inc, fsize, educ, db, marr, male, twoearn}] using series with basis elements chosen by cross-validation. Carefully document the basis you are using and how you chose to add elements to the expansion along the path considered for cross-validation. Comment on which terms you end up selecting.

e. Estimate E[net_tfa|X = {age, inc, fsize, educ, db, marr, male, twoearn}] by (1) lasso, (2) ridge, and (3) elastic net with penalty parameters chosen by cross-validation. Use the same dictionary of approximating functions for each of the three methods. Carefully explain how you construct the dictionary of approximating functions and your motivation for the functional forms considered. Comment on the estimated models.

f. Estimate E[net_tfa|X = {age, inc, fsize, educ, db, marr, male, twoearn}] using a CART with cost-complexity chosen by cross-validation. Comment on the final tree structure.

g. Estimate E[net_tfa|X = {age, inc, fsize, educ, db, marr, male, twoearn}] using a random forest. Note how many bootstrap replications you use and any other tuning you do. Which variables seem most important in the forest fit?

h. Estimate E[net_tfa|X = {age, inc, fsize, educ, db, marr, male, twoearn}] using boosted regression trees with number of boosting iterations chosen by cross-validation. Comment on the tree depth you use and how you made this choice. Which variables seem most important in the boosted tree fit?

i. Use the 3915 observations you held out to compare the estimates obtained in parts a.-h. Specifically, let $\hat{g}_j(x)$ for $j \in \{a, b, c, d, e(1), e(2), e(3), f, g, h\}$ be the estimator of the conditional expectation obtained in the part of the question corresponding to $j$. Calculate the mean square forecast error as $\frac{1}{3915} \sum_{i \in \text{hold-out}} (\hat{g}_j(x_i) - y_i)^2$. Which procedure is best?

Do the performance discrepancies seem large? [Note: Assuming independent sampling, you can compute a standard error for the mean square forecast error.]
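The note on standard errors can be made concrete with a short sketch. Under independent sampling, the squared forecast errors can be treated as i.i.d. draws, so the mean square forecast error is a sample mean and its standard error is the sample standard deviation of the squared errors divided by the square root of the number of hold-out observations. The function name `msfe_with_se` and the toy numbers are hypothetical, and the code is plain Python rather than the R used elsewhere in the course.

```python
import math

def msfe_with_se(predictions, outcomes):
    """Mean square forecast error and its standard error.

    Assumes independent sampling, so the squared errors are treated as
    i.i.d. and SE = (sample std. dev. of squared errors) / sqrt(n).
    """
    sq_err = [(p - y) ** 2 for p, y in zip(predictions, outcomes)]
    n = len(sq_err)
    msfe = sum(sq_err) / n
    var = sum((e - msfe) ** 2 for e in sq_err) / (n - 1)
    return msfe, math.sqrt(var / n)

# Toy hold-out sample (made-up numbers, not from nettfa.csv).
y_holdout = [1.0, 2.0, 3.0, 4.0]
g_hat = [1.0, 3.0, 3.0, 2.0]

msfe, se = msfe_with_se(g_hat, y_holdout)

# Normal-approximation 95% interval for the MSFE.
ci_low, ci_high = msfe - 1.96 * se, msfe + 1.96 * se
print(f"MSFE = {msfe:.3f}, SE = {se:.3f}")
```

Comparing the gap between two procedures' MSFEs with these standard errors gives a rough sense of whether the discrepancy is large relative to sampling noise.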

2. To answer this question, use the R file "Simmons Reanalysis.R" which analyzes an example from Simmons, Nelson, and Simonsohn (2011, Psychological Science) meant to illustrate the potential impacts of unprincipled variable selection. The key variables are "age" which is simply the respondent's age and serves as the dependent variable and "when64" which is a randomly assigned treatment variable. Specifically, the experiment randomly assigned subjects to the treatment group, where subjects listened to "When I'm 64" by the Beatles, or the control group, where subjects listened to a song unrelated to age.

a. What is estimated in block 1 of the code? Why is this a sensible object to look at in the context of the randomized control trial? Interpret the results obtained by running block 1 of the code.

b. What is estimated in block 2 of the code? (Note that "dad" is age of respondent's father.) In principle, why is this a sensible object to look at in the context of a randomized control trial? (You can essentially restate the final part of this question to answer this.) How does your answer to the second question change when you learn that the control was selected by looking for the variable that led to the largest decrease in the p-value for testing the null hypothesis of no treatment effect? How do you respond to the argument that "1) random assignment of the treatment means that all controls are independent of the treatment variable and thus can be included as a control without introducing bias, 2) the reason to include controls is to reduce residual variation and therefore increase precision in learning the treatment effect, and 3) we should thus search for the variable(s) that lead to the largest increase in precision and use these"?

c. What is estimated in block 3 of the code? In principle, why is this a sensible object to look at in the context of a randomized control trial? Explain how the procedure in block 3 addresses the problem raised in block 2 (discussed in part (b)). That is, highlight the key distinction(s) between the mechanism for selecting control variables in block 3 and block 2.

3. This exercise is intended to have you compare the various estimators/predictors that we discussed in class. It is deliberately kept broad and somewhat vague - feel free to do more work than what's asked for.

a. Assume that we're interested in predicting the growth rate of a country, based on the country characteristics. Download the Barro-Lee data (accessible via the hdm package). Why does this correspond to the "big p" case? Why should we worry about overfitting in this case?

b. Consider several predictors that would allow us to potentially get rid of the overfitting problem. These include, but are not limited to,
- OLS with fewer, carefully chosen regressors ("small OLS")
- Lasso (with the penalty level λ chosen via plug-in method)
- post-Lasso (with the penalty level λ chosen via plug-in method)
- Lasso (with the penalty level λ chosen via cross-validation)
- Ridge Estimator (with the penalty level λ chosen via cross-validation)
- Elastic Net (with the penalty level λ chosen via cross-validation)
- Random Forests
- Pruned Trees

Which one do you think would perform better? (i.e. Would you expect this model to be dense or sparse? How would you pick the regressors in small OLS? Do we really know how Random Forests work?) Speculate.

c. Split the data into training and test samples, estimate coefficients using the training sample, and run predictions for the test sample. Compute the out-of-sample performance of your predictors by computing the MSE for prediction on the test sample. Calculate the 95% confidence intervals for the MSE. How do the predictors compare? Discuss.

d. Now, let's get causal. Assume we're interested in estimating the effect of initial level of per-capita GDP on the growth rate (known as the infamous "convergence hypothesis"). The specification is

$$y_i = \alpha_0 d_i + \sum_{j=1}^{p} \beta_j x_{ij} + \epsilon_i,$$

where $y_i$ is the growth rate of GDP over a specified decade in country $i$, $d_i$ is the log of GDP at the beginning of the decade, and the $x_{ij}$ are country characteristics at the beginning of the decade. The convergence hypothesis holds that $\alpha_0 < 0$. Test the convergence hypothesis via the Frisch-Waugh-Lovell partialling out using Lasso, post-Lasso, and Random Forest. Give intuition.
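The mechanics of partialling out can be illustrated on simulated data. The sketch below is not the requested analysis: it uses a single control and plain OLS residualization in place of the Lasso, post-Lasso, or Random Forest first-step fits, the data are simulated rather than Barro-Lee, and all names (`residualize`, `alpha_fwl`, `alpha_full`) are hypothetical. It shows only the Frisch-Waugh-Lovell logic: regress the outcome and the treatment on the controls, then regress the outcome residuals on the treatment residuals.

```python
import random

def demean(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def residualize(v, x):
    """Residuals from an OLS regression of v on a constant and x."""
    v_c, x_c = demean(v), demean(x)
    slope = dot(x_c, v_c) / dot(x_c, x_c)
    return [vi - slope * xi for vi, xi in zip(v_c, x_c)]

# Simulated data with a true coefficient of -0.5 on d (an assumption made
# for illustration only).
rng = random.Random(0)
n = 200
x = [rng.gauss(0, 1) for _ in range(n)]
d = [0.8 * xi + rng.gauss(0, 1) for xi in x]
y = [-0.5 * di + 1.0 * xi + rng.gauss(0, 1) for di, xi in zip(d, x)]

# Step 1: partial the control out of both the outcome and the treatment.
y_res = residualize(y, x)
d_res = residualize(d, x)

# Step 2: regress the outcome residuals on the treatment residuals.
alpha_fwl = dot(d_res, y_res) / dot(d_res, d_res)

# Check: coefficient on d from the full regression y ~ 1 + d + x,
# obtained from the two-regressor normal equations in closed form.
y_c, d_c, x_c = demean(y), demean(d), demean(x)
alpha_full = (dot(x_c, x_c) * dot(d_c, y_c) - dot(d_c, x_c) * dot(x_c, y_c)) / (
    dot(d_c, d_c) * dot(x_c, x_c) - dot(d_c, x_c) ** 2
)

print(alpha_fwl, alpha_full)
```

With OLS in both steps the two estimates coincide up to floating-point error, which is the Frisch-Waugh-Lovell theorem. Replacing `residualize` with a Lasso or Random Forest fit keeps the same two-step structure while allowing many controls.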

Attachment:- data.rar

