Question 1. Suppose the process that determines child test score in Kindergarten is given by
test score_i = β0 + β1 preschool_i + β2 income_i + ε_i
where β1, β2 > 0 and preschool_i is a continuous measure of "preschool quality."
Preschool quality can be purchased with cash, and is purchased in cash according to the linear model
preschool_i = α0 + α1 income_i + η_i
You should interpret both of these linear models as structural models of human behavior. In other words, if you gave a random family another dollar, the family would indeed purchase α1 more units of preschool quality, and the family would purchase other children's stuff that has β2 additional effect on test scores. This means that the total effect on test scores of giving the family another dollar is:
(a) α0β1
(b) α1
(c) β2 + β1α1
(d) α1β1 + β2
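One way to organize the reasoning is to substitute the preschool equation into the test score equation. The sketch below uses only the two structural equations above and assumes that the extra dollar leaves the error terms η_i and ε_i unchanged.

```latex
\begin{align*}
\text{test score}_i
  &= \beta_0 + \beta_1\bigl(\alpha_0 + \alpha_1\,\text{income}_i + \eta_i\bigr)
     + \beta_2\,\text{income}_i + \epsilon_i \\
  &= (\beta_0 + \beta_1\alpha_0)
     + (\beta_1\alpha_1 + \beta_2)\,\text{income}_i
     + \beta_1\eta_i + \epsilon_i ,
\end{align*}
\[
\frac{\partial\,\text{test score}_i}{\partial\,\text{income}_i}
  = \beta_1\alpha_1 + \beta_2 .
\]
```

The first term is the indirect channel (income buys preschool quality, which raises scores) and the second is the direct channel.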
Question 2. Let's try to simulate this model in Stata. This will generate a fake dataset of 1000 students for this problem assuming that α0 = β0 = 0, α1 = 0.5, and β1 = β2 = 1.
(You can easily play around by modifying these parameter assumptions.)
Regress testscore on preschool. As an estimator for β1, this regression's coefficient is
(a) biased upward
(b) biased downward
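The course's Stata simulation code is not reproduced here, so the following is only a rough Python stand-in under stated assumptions: income, η, and ε are all drawn as independent standard normals (an assumption, not something the assignment specifies), and `simulate_slope` is a hypothetical helper name. It regresses test score on preschool alone and returns the slope.

```python
import random


def simulate_slope(n=1000, alpha1=0.5, beta1=1.0, beta2=1.0, seed=0):
    """Simulate both structural equations (alpha0 = beta0 = 0) and return
    the OLS slope from regressing test score on preschool only."""
    rng = random.Random(seed)
    preschool, testscore = [], []
    for _ in range(n):
        income = rng.gauss(0, 1)  # assumed distribution
        eta = rng.gauss(0, 1)     # assumed distribution
        eps = rng.gauss(0, 1)     # assumed distribution
        p = alpha1 * income + eta
        t = beta1 * p + beta2 * income + eps
        preschool.append(p)
        testscore.append(t)
    mean_p = sum(preschool) / n
    mean_t = sum(testscore) / n
    # Simple-regression slope: sample covariance over sample variance.
    cov_pt = sum((p - mean_p) * (t - mean_t) for p, t in zip(preschool, testscore))
    var_p = sum((p - mean_p) ** 2 for p in preschool)
    return cov_pt / var_p


slope = simulate_slope()
print("slope on preschool, income omitted:", round(slope, 3))
```

Under these assumed distributions the population value of this slope is β1 + β2·α1·Var(income)/Var(preschool) = 1 + 0.5/1.25 = 1.4, so the sample slope should land near 1.4 rather than near the true β1 = 1. Calling `simulate_slope(alpha1=0.0)` removes the link between income and preschool, and the slope should then sit close to 1.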
Question 3. A big city government is thinking about implementing a program that will raise a family's preschool quality by 1 unit. An analyst uses the regression you ran in the previous question to make the case that this program is a highly effective way to improve student test scores; in particular, it is better than direct income supports.
You counter that students who go to good preschools probably come from wealthier backgrounds, so at the very least the analyst should be controlling for family income. The analyst retorts: "These kinds of families wouldn't spend a dime of their own income on preschool!"
The analyst is proposing the testable null hypothesis that at least for the families he is considering,
(a) β2 = 0
(b) β1 = 0
(c) α1 = 0
(d) α0 = 0
Question 4. Were the analyst's claim true, his estimator for β1 would be
(a) unbiased and consistent
(b) biased and inconsistent
Question 5. You gather appropriate data and show that even in the analyst's population, α1 is significantly positive. The analyst then suggests that now that we have the nice data you collected, why don't we run the regression
test score_i = β0 + β1 preschool_i + β2 income_i + ε_i
and then check whether β1 exceeds β2c, where c is the cost of a unit of government-provided preschool in dollars?
He reasons that if β1 > β2c, the government should provide preschool since the test score return per dollar for preschool exceeds that for income supports, otherwise the government should provide income supports.
Why might the analyst be wrong?
(a) The estimator for β1 is inconsistent because of the inclusion of income on the RHS of the regression
(b) One effect of income supports may be to increase family preschool purchases, thus the effect of income supports is probably greater than β2 (too-many variables bias).
Multicollinearity.
Question 6. Now we want to know whether income supports during preschool or before preschool are better.
We consider the model
test score_i = β0 + β1 income_before_i + β2 income_during_i + ε_i
Try part II of the simulation. There, income_during is income_before plus a very small income change.
Regress test score on income_before and income_during. You'll notice that at N = 100 observations, the confidence intervals for these parameters are very loose. Try N = 1000 observations by changing the -set obs 100- code to -set obs 1000-. Then try N = 10000. Notice that you need a ton of observations before the confidence intervals really start to tighten. Why?
(a) Hard to distinguish income_before from income_during.
(b) The OLS estimators for β1 and β2 are inefficient.
(c) The OLS estimators for β1 and β2 are biased.
(d) Hard to distinguish income_during from the constant.
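As a rough illustration of how nearly identical regressors affect precision, the Python sketch below computes the variance inflation factor 1/(1 − r²), where r is the sample correlation between income_before and income_during. Part II of the course simulation is not shown here, so the distributions (standard normal income_before, a normal income change with standard deviation 0.1) and the name `vif_income` are assumptions for illustration only.

```python
import random


def vif_income(n=1000, change_sd=0.1, seed=0):
    """Variance inflation factor for two regressors where
    income_during = income_before + a small random change.

    Under homoskedasticity, Var(beta_j_hat) = sigma^2 / (SST_j * (1 - R_j^2)),
    so 1 / (1 - R_j^2) is the factor by which collinearity inflates the variance.
    With two regressors, R_j^2 is the squared correlation between them.
    """
    rng = random.Random(seed)
    before = [rng.gauss(0, 1) for _ in range(n)]                # assumed distribution
    during = [b + rng.gauss(0, change_sd) for b in before]      # assumed small change
    mean_b = sum(before) / n
    mean_d = sum(during) / n
    s_bb = sum((b - mean_b) ** 2 for b in before)
    s_dd = sum((d - mean_d) ** 2 for d in during)
    s_bd = sum((b - mean_b) * (d - mean_d) for b, d in zip(before, during))
    r_squared = s_bd ** 2 / (s_bb * s_dd)
    return 1.0 / (1.0 - r_squared)


print("VIF with a small income change:", round(vif_income(), 1))
print("VIF with a large income change:", round(vif_income(change_sd=1.0), 1))
```

With these assumed numbers the population factor is about 101 for the small change and 2 for the large one, which is why the sample size has to grow by orders of magnitude before the confidence intervals tighten.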
Heteroskedasticity and weighted least squares.
The above analysis assumed that you could get individual-level data on income, preschool enrollment, and Kindergarten test scores. You might be able to do this in a survey dataset like the ECLS-K, but more generally you might find yourself working with averages at the school district level, for instance.
Question 7. We might consider weighting by total district population/enrollment in the previous regressions because
(a) larger school districts are more important than smaller school districts
(b) the dependent and independent variables, which are averages, will be estimated more precisely for larger school districts
(c) population is an omitted variable
(d) all of the above
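A short worked step may help with the precision argument. The notation here is an assumption for illustration: district d has n_d students, and the individual-level errors ε_i are taken to be independent with common variance σ².

```latex
\[
\bar{\epsilon}_d = \frac{1}{n_d}\sum_{i \in d} \epsilon_i ,
\qquad
\operatorname{Var}(\bar{\epsilon}_d) = \frac{\sigma^2}{n_d} .
\]
```

Under that assumption the error in a regression on district averages is heteroskedastic, with variance shrinking in n_d, so weighted least squares with weights proportional to n_d gives the more precisely measured districts more influence.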
Question 8. In the regression following this question, r_1 is fall K reading test score, incthous_1 is household income in thousands in fall K, and age_1 is the child's age in months at the time of taking the test.
Richard's friend Kyle is 3 months younger than him, but attended the same school and was in the same cohort, and their families have about the same total household income. If they took this exam on the same day in Kindergarten, what do you expect Kyle's score would be relative to Richard's? (For your information, a standard deviation of r_1 is about 10.)
(a) about 1.14 scaled points more (or 11.4% of a standard deviation)
(b) about 0.38 scaled points less (or 3.8% of a standard deviation)
(c) about 0.38 scaled points more (or 3.8% of a standard deviation)
(d) about 1.14 scaled points less (or 11.4% of a standard deviation)
Question 9. Download the dataset kindergarten_version2.dta from the course website.
Generate a new variable equal to the child's growth on the math exam score from Fall K to Spring K, m_2 minus m_1. Generate a new variable equal to the child's age growth in months, age_2 minus age_1. Regress math growth on age growth and enroll_1. Use robust standard errors.
Let βE be the coefficient for enroll_1 in this regression. Consider the hypothesis test
H0 : βE = 0
HA : βE < 0
What's the smallest significance level at which you can reject the null hypothesis? (Careful: this is a one-tailed test!)
(a) about 1.2%
(b) about 11.8%
(c) about 5.9%
(d) about 4.1%
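Regression output typically reports a two-sided p-value, so a conversion is needed for a one-tailed test. The Python sketch below shows that conversion for the alternative βE < 0, assuming a symmetric reference distribution for the t statistic; the function name and the numbers in the example are hypothetical, not taken from the dataset.

```python
def one_sided_p_lower(two_sided_p, estimate):
    """p-value for H0: beta = 0 against HA: beta < 0, given the two-sided
    p-value reported for the same coefficient (symmetric t distribution)."""
    if estimate < 0:
        # The estimate falls on the side the alternative predicts.
        return two_sided_p / 2
    # The estimate falls on the wrong side, so the one-sided p-value is large.
    return 1 - two_sided_p / 2


# Hypothetical numbers: two-sided p = 0.10 with a negative coefficient.
print(one_sided_p_lower(0.10, -0.5))
```

The smallest significance level at which the null can be rejected is exactly this one-sided p-value.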
Question 10. The ECLS-K uses a complicated sampling scheme, and to account for this the National Center for Education Statistics (NCES) includes sampling weights sample_weight which they recommend we use in estimation.
Re-run your previous regression using these sample weights (put "[w=sample_weight]" before the comma in your regression).
With this new specification, what's the smallest significance level at which you can reject the null hypothesis?
(a) about 0.1%
(b) about 10.4%
(c) about 5.2%
(d) about 3.4%
Question 11. Finally, the ECLS-K is a clustered sample. This means that the NCES first samples schools and then samples students within schools. This sampling approach violates OLS assumption 2 (simple random sample), since the NCES is not "shaking up the whole country" and drawing children at random. Because child outcomes are probably positively correlated within school, standard errors that ignore the clustering are likely understated, so the precision of the estimates is overstated.
One very general (and in many ways "hands-free") way to control for this is to use "cluster-robust" standard errors. As the name implies, these standard errors are robust to heteroskedasticity, and also take into account within-cluster correlation.
Try it: replace the "robust" option in your current regression with "vce(cluster schlid)". This will tell Stata to calculate cluster-robust standard errors, where the clusters are school IDs.
With this new specification, what's the smallest significance level at which you can reject the null hypothesis?
(a) about 100%
(b) about 15%
(c) about 30%
(d) about 10%
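To get a feel for how much within-school correlation can matter, the Python sketch below evaluates the Moulton approximation, in which ignoring clustering understates standard errors by roughly sqrt(1 + (m − 1)·ρ_e·ρ_x). This is a back-of-the-envelope formula that assumes equal cluster sizes m, not what Stata computes for cluster-robust standard errors, and the numbers used are hypothetical rather than taken from the ECLS-K.

```python
import math


def moulton_se_inflation(cluster_size, rho_error, rho_regressor=1.0):
    """Approximate ratio of the clustering-adjusted standard error to the
    standard error that ignores clustering (Moulton factor).

    cluster_size  : students per school, assumed equal across schools
    rho_error     : within-school correlation of the regression errors
    rho_regressor : within-school correlation of the regressor
                    (1.0 for a variable that is constant within school)
    """
    return math.sqrt(1 + (cluster_size - 1) * rho_error * rho_regressor)


# Hypothetical example: 20 students per school, error correlation 0.1.
inflation = moulton_se_inflation(20, 0.1)
print(round(inflation, 2))
```

Even a modest within-school correlation inflates the standard error noticeably when the regressor varies little within schools, which is why the p-value can move so much after clustering.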
Question 12. Open the kindergarten_version2.dta dataset, and plot a histogram of income_1. There are some crazily large incomes. We know that OLS and other expectation-based analyses do not behave well when there are very large outliers. What to do?
One approach is to log very right-skewed variables like this. Apparently the income_1 variable is never less than 1, so this will work in this case: -gen logincome_1 = log(income_1)-. A histogram of logincome_1 is much closer to normal, especially in the upper tail (you can assess this using -qnorm-, as you learned in the last assignment).
Recall that the test scores were also very right-skewed. Log the math test score, creating a new variable logm_1. Then regress log math score on log income. In other words, fit the model
log(math score) = β0 + β1 log(income) + ε
The standard approach to interpret this regression is to differentiate both sides w.r.t. income, treating math score as a function of income.
d log(math score)/d income = β1 · d log(income)/d income
I'm guessing it makes sense to assume ε is not a function of income under OLS assumption 1.
If you play around with this expression you'll get
%Δmath score = β1%Δincome
or
%Δmath score/%Δincome = β1
thus β1 is interpreted as an elasticity. Which is lovely and very economic.
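For readers who want the intermediate steps spelled out, here is one way the "playing around" can go, using the chain rule on each side and treating ε as unrelated to income as assumed above.

```latex
\begin{align*}
\frac{d \log(\text{math score})}{d\,\text{income}}
  &= \frac{1}{\text{math score}}\,\frac{d\,\text{math score}}{d\,\text{income}},
&
\beta_1\,\frac{d \log(\text{income})}{d\,\text{income}}
  &= \frac{\beta_1}{\text{income}}, \\[4pt]
\frac{d\,\text{math score}/\text{math score}}{d\,\text{income}/\text{income}}
  &= \beta_1 .
\end{align*}
```

The left-hand side of the last line is a proportional change in score divided by a proportional change in income, which is the definition of an elasticity.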
This approach (using differentiation) has for some reason never struck me as intuitive, because I cannot see it with discrete changes in income using the original conditional expectations model. Nevertheless, you should remember that in a log-log regression like this, we give β1 the interpretation of an elasticity: it's the % change in the outcome variable expected from a 1% increase in the RHS variable. For example, if β1 = 3, then a 3% increase in math score is expected from a 1% increase in income.
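The discrete-change version can be checked numerically. Holding ε fixed, the log-log model implies that multiplying income by (1 + g) multiplies the score by (1 + g)^β1, so the elasticity reading is an approximation that is good for small changes. The Python sketch below uses a hypothetical helper name and the β1 = 3 example from the text.

```python
def pct_change_outcome(beta1, pct_change_x):
    """Exact % change in the outcome implied by a log-log model when the
    RHS variable changes by pct_change_x percent, holding the error fixed."""
    growth = 1 + pct_change_x / 100
    return (growth ** beta1 - 1) * 100


# A 1% income increase with beta1 = 3: very close to the 3% rule of thumb.
print(pct_change_outcome(3, 1))
# A 50% income increase with beta1 = 3: far from the naive 150%.
print(pct_change_outcome(3, 50))
```

For a 1% change the exact figure is about 3.03%, while for a 50% change it is 237.5%, which shows why the elasticity interpretation is stated for small percentage changes.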
According to your estimates, a 1% increase in income is associated with about a
(a) 0.12% increase in math test score
(b) 1.2% increase in math test score
(c) 12% increase in math test score
(d) 1.9% increase in math test score