Part I. Interpreting regression results
All of the analyses in this section use a data set containing measurements on 40 married couples and the first child born to each couple (you have seen analyses of these data previously in lecture). The variables include the height of the mother and the height of the father at the time of the birth of their first child, the gender of the child, and the eventual height of the child at age 18. These variables are named mother, father, male, and child, respectively. Relevant output follows (all heights in inches).
For each of the linear/logistic regression analyses conducted with STATA shown below, answer the questions that appear immediately below the output, citing the evidence from the output that supports your answers. This point is very important: none of the questions below can be answered without citing information from the STATA output.
HELPFUL HINTS FOR PART I - For any slope measuring the association between a categorical EV with 2 categories and the RV in linear regression, the value of the slope is equal to the value of the RV mean in one EV category minus the value of the RV mean in the second EV category. Whenever STATA computes these slopes, it always displays the slope as the value of the RV mean in the EV category with the highest numerical code minus the RV mean in the EV category with the lowest numerical code. So if the 2 categories of an EV are coded with the values 1 and 0, the slope will be equal to the value of the RV mean in EV category 1 minus the value of the RV mean in EV category 0.
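The hint above can be checked numerically. The following Python sketch is illustrative only: the heights are made-up values (not the course data), and the helper names `mean` and `ols_slope` are hypothetical. It shows that the least-squares slope on a 0/1 EV equals the RV mean in category 1 minus the RV mean in category 0, and that the intercept equals the RV mean in category 0.

```python
# Hypothetical heights (inches) for illustration only; not the course data.
heights = [64.0, 66.5, 65.0, 68.0, 70.5, 69.0, 71.0]
male = [0, 0, 0, 1, 1, 1, 1]  # 0 = female, 1 = male


def mean(values):
    return sum(values) / len(values)


def ols_slope(x, y):
    """Least-squares slope of y on x, computed as cov(x, y) / var(x)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


mean_cat1 = mean([h for h, m in zip(heights, male) if m == 1])
mean_cat0 = mean([h for h, m in zip(heights, male) if m == 0])

slope = ols_slope(male, heights)
intercept = mean(heights) - slope * mean(male)
mean_difference = mean_cat1 - mean_cat0

print(slope, mean_difference)  # the two values agree up to floating-point rounding
print(intercept, mean_cat0)    # the intercept matches the mean of the category coded 0
```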
Simple linear regression of child height predicted by child gender:
reg child male
Y=child (child height)
X1=male (child gender 0=female 1=male)
  Source |          SS    df          MS   Number of obs =      40
---------+------------------------------   F(  1,    38) =   26.59
   Model |  114.022619     1  114.022619   Prob > F      =  0.0000
Residual |  162.952381    38  4.28822055   R-squared     =  0.4117
---------+------------------------------
   Total |     276.975    39  7.10192308
------------------------------------------------------------------------------
   child |      Coef.  Std. Err.         t   P>|t|        [95% Conf. Interval]
---------+--------------------------------------------------------------------
    male |   3.380952   .6556652     5.157   0.000       2.053628     4.708277
   _cons |         66   .4750745   138.926   0.000       65.03826     66.96174
------------------------------------------------------------------------------
1) Are the male children taller than the female children in this population? If so, by how much?
2) What is the range of values within which the true value of the difference in height between 18-year-old boys and girls is likely to reside?
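As a reading aid for the [95% Conf. Interval] columns, the Python sketch below rebuilds the interval for the _cons row of the output above as coefficient plus or minus a critical t value times the standard error. The critical value 2.0244 (two-sided 95%, 38 degrees of freedom) is an assumption taken from a standard t table; it does not appear in the output.

```python
# Values copied from the _cons row of the "reg child male" output.
coef = 66.0
std_err = 0.4750745

# Assumption: two-sided 95% critical value of t with 38 df, from a t table.
t_crit = 2.0244

margin = t_crit * std_err
ci_lower = coef - margin
ci_upper = coef + margin

print(round(ci_lower, 3), round(ci_upper, 3))
```

The same arithmetic applies to any slope row: substitute that row's Coef. and Std. Err. values.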
Simple linear regression of mother's height predicted by father's height:
reg mother father
  Source |          SS    df          MS   Number of obs =      40
---------+------------------------------   F(  1,    38) =   41.53
   Model |  86.6284483     1  86.6284483   Prob > F      =  0.0000
Residual |  79.2715517    38  2.08609347   R-squared     =  0.5222
---------+------------------------------
   Total |      165.90    39  4.25384615
------------------------------------------------------------------------------
  mother |      Coef.  Std. Err.         t   P>|t|        [95% Conf. Interval]
---------+--------------------------------------------------------------------
  father |   .6831897   .1060176     6.444   0.000       .4685683      .897811
   _cons |   18.84159   7.329374     2.571   0.014       4.004054     33.67914
------------------------------------------------------------------------------
3) Do taller men tend to marry taller women in this population?
Multiple linear regression of child's height predicted by child's gender and mother's height:
. reg child male mother
Y=child (child height)
X1=male (child gender)
X2=mother (mother's height)
  Source |          SS    df          MS   Number of obs =      40
---------+------------------------------   F(  2,    37) =   91.70
   Model |  230.477536     2  115.238768   Prob > F      =  0.0000
Residual |  46.4974635    37   1.2566882   R-squared     =  0.8321
---------+------------------------------
   Total |     276.975    39  7.10192308
------------------------------------------------------------------------------
   child |      Coef.  Std. Err.         t   P>|t|        [95% Conf. Interval]
---------+--------------------------------------------------------------------
    male |   2.788491   .3602383     7.741   0.000       2.058579     3.518403
  mother |   .8503315    .088333     9.626   0.000       .6713517     1.029311
   _cons |   10.14665   5.807782     1.747   0.089      -1.621034     21.91433
------------------------------------------------------------------------------
4) For every one-inch increase in mother's height, by how much would you expect child's height to increase, after controlling for possible confounding due to child gender? Is this adjusted association statistically significant?
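A fitted regression equation can also be used to compute predicted values. The Python sketch below is a minimal illustration using the coefficients from the regression of child height on child gender and mother's height. The function name `predict_child_height` and the 65-inch mother are hypothetical choices made only for the example.

```python
# Coefficients copied from the "reg child male mother" output.
B_CONS = 10.14665
B_MALE = 2.788491
B_MOTHER = 0.8503315


def predict_child_height(male, mother_height):
    """Predicted height at age 18 (inches): _cons + b1*male + b2*mother."""
    return B_CONS + B_MALE * male + B_MOTHER * mother_height


# Hypothetical inputs for illustration: a son and a daughter of a 65-inch mother.
son = predict_child_height(male=1, mother_height=65)
daughter = predict_child_height(male=0, mother_height=65)

print(round(son, 2), round(daughter, 2))
print(round(son - daughter, 6))  # the gap equals the coefficient on male
```

Holding mother's height fixed, the two predictions differ by exactly the slope for male, which is what "controlling for" means in this model.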
Multiple linear regression of child's height predicted by child's gender, mother's height and father's height:
. reg child male mother father
Y=child (child height)
X1=male (child gender)
X2=mother (mother's height)
X3=father (father's height)
  Source |          SS    df          MS   Number of obs =      40
---------+------------------------------   F(  3,    36) =   65.33
   Model |  233.994448     3  77.9981494   Prob > F      =  0.0000
Residual |  42.9805517    36  1.19390421   R-squared     =  0.8448
---------+------------------------------
   Total |     276.975    39  7.10192308
------------------------------------------------------------------------------
   child |      Coef.  Std. Err.         t   P>|t|        [95% Conf. Interval]
---------+--------------------------------------------------------------------
    male |   2.901731   .3572694     8.122   0.000       2.177155     3.626307
  mother |   .6907186   .1267338     5.450   0.000       .4336906     .9477467
  father |   .2026242    .118058     1.716   0.095      -.0368085     .4420569
   _cons |   6.628294   6.020587     1.101   0.278      -5.582023     18.83861
------------------------------------------------------------------------------
5) For every one-inch increase in mother's height, by how much would you expect child's height to increase, controlling for child gender and father's height?
6) What is the predicted height for the daughter of a 70-inch mother and a 74-inch father?
Multiple logistic regression of child's gender predicted by mother's height and father's height:
logistic male mother father
Y=male (child gender - 0=female 1=male)
X1=mother (mother's height)
X2=father (father's height)
Logit estimates                                   Number of obs   =         40
                                                  LR chi2(2)      =       2.60
                                                  Prob > chi2     =     0.2731
Log likelihood = -26.377775                       Pseudo R2       =     0.0469
------------------------------------------------------------------------------
    male | Odds Ratio  Std. Err.         z   P>|z|        [95% Conf. Interval]
---------+--------------------------------------------------------------------
  mother |   1.469965   .3722595     1.521   0.128       .8948401     2.414729
  father |   .7659505   .1771988    -1.153   0.249       .4867203     1.205374
------------------------------------------------------------------------------
7) Can giving birth to a boy (RV=1) be predicted at a greater-than-chance level in the population from the parents' heights combined?
8) Is either mother's or father's height individually associated with the probability that one's first child will be a boy?
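When output is shown as odds ratios, the z statistics still test the underlying logit coefficients. The Python sketch below shows how the two scales relate, assuming the usual relations coefficient = ln(odds ratio) and standard error of the odds ratio = odds ratio times standard error of the coefficient. The function name `z_from_odds_ratio` is hypothetical.

```python
import math

# (odds ratio, std. err. of the odds ratio) copied from the logistic output.
rows = {
    "mother": (1.469965, 0.3722595),
    "father": (0.7659505, 0.1771988),
}


def z_from_odds_ratio(odds_ratio, se_odds_ratio):
    """Recover the z statistic of the logit coefficient from the OR row."""
    coef = math.log(odds_ratio)             # coefficient on the log-odds scale
    se_coef = se_odds_ratio / odds_ratio    # assumed relation between the two std. errors
    return coef / se_coef


z_values = {name: z_from_odds_ratio(*values) for name, values in rows.items()}
for name, z in z_values.items():
    print(name, round(z, 3))
```

The recovered values can be compared with the z column of the output; an odds ratio of 1 corresponds to a coefficient of 0, which is the null value being tested.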
Part II - ANOVA through regression
You addressed the vital issue of whether different cookie types lead to more or less milk consumption among children enrolled in public elementary schools in the Los Angeles Unified School District (LAUSD). The goal was to generate knowledge that would help prevent LAUSD from going bankrupt due to excessive spending on milk (imagine a tax increase for this reason...).
You conducted a test of the hypothesis that cookie type (a 3-category EV) is associated with milk consumption (measured in ounces) in the population of LAUSD elementary school students using a sample of students obtained from LAUSD, in which each student was randomly assigned to eat 1 of 3 cookie types (chips ahoy, oreos, fig newtons). You recorded how much milk each child consumed while eating the cookies. The hypothesis-testing method you used was analysis of variance (ANOVA) with follow-up tests of differences in milk consumption between all possible pairs of cookie categories using Tukey's multiple comparison procedure.
You have learned that you can test the association between a 3-category EV and a quantitative RV using multiple linear regression. This requires creating g-1 dummy variables (where g is the number of groups, or categories of the EV) and including the dummy variables as EVs in a multiple regression. This "ANOVA through regression" is preferred to traditional ANOVA with follow-up tests if 2 conditions are met:
- you can test your hypothesis while limiting the number of comparisons of pairs of group means to g-1 (here, 2 comparisons among the 3 group means)
- each of the comparisons you test has the same reference group, or comparison group
So you are going to repeat the analysis you completed testing for differences in milk consumption between/among cookie types. This time, you will assume that school district administrators hypothesized that children who ate fig newtons would consume significantly less milk than children who ate chips ahoy and children who ate oreos. So these are the 2 comparisons that you will test statistically, and you will do so by creating 2 dummy variables and including them as EVs in a multiple linear regression analysis.
Here are the sample means for each cookie type (not all of these will appear in your regression output), and the null and alternative hypothesis statements:
EV - cookie type (3 categories)   |  RV - milk consumption (in ounces)
----------------------------------+-----------------------------------
Chocolate chip (n=18)             |  X-bar = 12.0 ounces
Oreo (n=18)                       |  X-bar = 10.1 ounces
Fig Newton (n=18)                 |  X-bar = 4.2 ounces
H0: µ1 (chips ahoy) = µ2 (oreo) = µ3 (fig newton)
Ha: at least one µ ≠ another µ
Use α = .05
Using the data from the original problem #8 on the course website, create two dummy variables from the 3-category COOKIETYPE variable. Be very careful when deciding which cookie categories should receive values of 1 on each of the two dummy variables, and which cookie categories should receive values of 0 on the two dummy variables. Also think carefully about which cookie represents the reference category. This is the cookie that should receive codes of 0 on both dummy variables.
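The coding logic described above can be sketched generically. The Python example below uses hypothetical category labels ("A", "B", "C") and a hypothetical reference category rather than the actual COOKIETYPE codes, which are not shown here. It only illustrates that g-1 dummy variables are created and that the reference category is coded 0 on every dummy; in the assignment itself the dummies are created in STATA.

```python
# Hypothetical 3-category variable; labels and reference choice are assumptions.
category = ["A", "B", "C", "A", "C", "B"]
REFERENCE = "C"  # receives 0 on every dummy variable

levels = sorted(set(category))
dummy_levels = [level for level in levels if level != REFERENCE]  # g - 1 levels

# One 0/1 column per non-reference level.
dummies = {
    f"is_{level}": [1 if value == level else 0 for value in category]
    for level in dummy_levels
}

for name, column in dummies.items():
    print(name, column)
```

Each dummy variable then compares its own category with the reference category when both are entered as EVs in the regression.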
You have not completed a lab exercise in which you created dummy variables and included them as EVs in a multiple linear regression, but you can create these dummy variables using either the GENERATE (GEN) or RECODE functions in STATA. Instructions from previous lab sections will help you do this. After creating the 2 dummy variables:
9) Run the multiple linear regression analysis with milk consumption as the RV and the 2 dummy variables included as EVs using STATA. Paste your output below.
10) Interpret the value of the y-intercept in your output.
11) Interpret the value of the slope measuring the association between your FIRST dummy variable and milk consumption. Does the t-test for this slope indicate a significant difference in milk consumption between the fig newton group and another group? Report the t-statistic and p-value in support of your answer.
12) Interpret the value of the slope measuring the association between your SECOND dummy variable and milk consumption. Does the t-test for this slope indicate a significant difference in milk consumption between the fig newton group and another group? Report the t-statistic and p-value in support of your answer.
Part III. Choosing Statistical Tests, conducting those tests and interpreting the results of those tests
We noted on the midterm that an important skill necessary for conducting applied social science research is to be able to translate a research question into the correct choice of a statistical test. It is just as important to be able to subsequently conduct the statistical test and interpret the results of that test.
For each of the research questions below, you will not only choose the appropriate statistical hypothesis testing procedure, but also run the test using data that we provide and interpret the results. Consider each research question and answer the questions that follow:
13) Do the annual incomes of men and women differ after holding years of education, hours worked per week and years employed at current job constant?
There is a data file on the CCLE course website named FINALPARTIII13 containing GSS data that you will use to conduct the statistical test that you choose. The following variables are in the file:
CONRINC - annual income of the respondent
SEX - gender of the respondent 1=male 2=female
EDUC - number of years of education completed by the respondent
HRS1 - number of hours worked in the last week (we are using this as a more general measure of hours worked per week)
YEARSJOB - years employed at current job
a) identify the EV
b) identify the RV
c) identify the CV(s)
d) state a hypothesis about the direction of the expected association between the EV and the RV that you identified in a and b above. You are not being asked to state null and alternative hypotheses here. Just state the hypothesis that you would propose as the researcher conducting this study
e) identify the statistical test that will appropriately test whether or not the EV and RV are associated in the population after controlling for the CV(s), and explain why you chose this particular test
f) state the null and alternative hypotheses for the expected association between the EV and RV you identified in a and b above, after holding constant all CVs that you identified in part c above
g) conduct the statistical test you chose in part e above in STATA using the data provided on the CCLE course website. Paste the output below
h) write a short summary of the results in which you state your decision to reject or retain the null hypothesis you stated in f above. In your summary, identify the value of the association between all EVs/CVs combined and the RV and its test statistic/p-value. Also identify the values of the individual adjusted EV/RV association, the adjusted CV/RV associations, their test statistics and p-values
i) what is your main conclusion about the presence or absence of an EV/RV association in the population after holding the CV(s) constant?
14) Does the probability that someone reports being happy (HAPPY) differ for whites and non-whites (WHITE) after holding constant other indicators of quality of life like health status (GOODHEALTH), feelings of safety (SAFE), and poverty level (NOPOVERTY)?
There is also a data file on the CCLE course website named FINALPARTIII14 containing GSS data that you will use to conduct the statistical test that you choose. The following variables are in the file:
HAPPY - responses to question asking how happy a respondent is with his/her life. 1=respondent reports being "very happy" or "happy", 0 =respondent reports being "not so happy"
WHITE - race/ethnicity 1=non-Hispanic White, 0=all others
GOODHEALTH - responses to question asking the respondent to rate his/her overall health. 1=respondent reports "very good" or "good" health, 0=respondent reports "fair" or "poor" health
SAFE - responses to question asking whether or not respondent feels afraid when walking in his/her neighborhood at night. 1=no 0=yes
NOPOVERTY - 1=respondent does not live in poverty, 0=respondent lives in poverty
On the variables GOODHEALTH, SAFE and NOPOVERTY, values of 1 indicate higher quality of life and values of 0 indicate lower quality of life.
a) identify the EV
b) identify the RV
c) identify the CV(s)
d) state a hypothesis about the direction of the expected association between the EV and the RV that you identified in a and b above. Again, you are not being asked to state null and alternative hypotheses. Just state the hypothesis that you would propose as the researcher conducting this study
e) identify the statistical test that will appropriately test whether or not the EV and RV are associated in the population after controlling for the CV(s), and explain why you chose this particular test
f) state the null and alternative hypotheses for the expected association between the EV and RV you identified in a and b above, after holding constant all CVs that you identified in part c above
g) conduct the statistical test you chose in part e above in STATA using the data provided on the CCLE course website. Paste the output below
h) write a short summary of the results in which you state your decision to reject or retain the null hypothesis you stated in f above. In your summary, identify the values of the individual adjusted EV/RV association and CV/RV associations, their test statistics and p-values
i) what is your main conclusion about the presence or absence of an EV/RV association in the population after holding the CV(s) constant?
Part IV. Multiple logistic regression: Main effects and interaction models
The Western Collaborative Group Study (WCGS) is a 25-year longitudinal study of Coronary Heart Disease (CHD) among men that began in 1960. Participants were aged 39-59 at the beginning of the study and were determined not to have heart disease at that time.
A distinctive feature of the WCGS was that it investigated behavioral and personality risk factors for heart disease, with a focus on the Type A personality. The question of whether the Type A personality was a risk factor for CHD was central to the study.
The Type A personality reflects a lifestyle characterized by taking on more than one can handle, multitasking, and high stress levels in all areas of life. The Type B personality reflects a much more relaxed lifestyle. Type A/B classification is determined by responses to questions in a questionnaire called the Jenkins Activity Survey. The original sample was contacted annually for follow-up interviews and measurements of whether or not each participant acquired CHD.
The following STATA output is from a logistic regression analysis predicting whether a participant had acquired CHD by 1969. The explanatory variables include Type A/Type B classification and other risk factors for CHD that could be confounding variables of any observed association between Type A personality and CHD. These possible confounding variables included age, blood pressure, cholesterol, smoking status, and BMI. A summary of all variables appears below, followed by the results of the main effects logistic regression analysis:
Variable            Label
chd69               RV: Any CHD event (heart attack or angina) (0=no CHD, 1=CHD)
typea               EV: Type A/B classification (0=Type B, 1=Type A)
age                 CV: Participant age in years
chol240             CV: Cholesterol above/below 240 (0= <240 "good", 1= >=240 "bad")
sbp140              CV: Systolic BP above/below 140 (0= <140 "good", 1= >=140 "bad")
overweightorobese   CV: BMI above/below 25 (0= <25 normal, 1= >=25 overweight/obese)
smoke               CV: Smoking status (0=non-smoker, 1=smoker)
Main effects model for CHD at 10 years (1969)
Logistic regression                               Number of obs   =       3142
                                                  LR chi2(6)      =     152.29
                                                  Prob > chi2     =     0.0000
Log likelihood = -813.45397                       Pseudo R2       =     0.0856
-------------------------------------------------------------------------------
        chd69 |      Coef.  Std. Err.        z   P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
        typea |   .7440875   .1433427     5.19   0.000     .4631409    1.025034
          age |   .0595851   .0118483     5.03   0.000     .0363629    .0828074
      chol240 |   .7380767   .1352749     5.46   0.000     .4729427    1.003211
       sbp140 |   .5555681   .1451587     3.83   0.000     .2710623    .8400738
overweighto~e |   .2184401   .1378186     1.58   0.113    -.0516795    .4885597
        smoke |   .5939591   .1386625     4.28   0.000     .3221856    .8657325
        _cons |  -6.579158   .5891206   -11.17   0.000    -7.733814   -5.424503
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
        chd69 | Odds Ratio  Std. Err.        z   P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
        typea |    2.10452   .3016676     5.19   0.000     1.589057     2.78719
          age |   1.061396   .0125757     5.03   0.000     1.037032    1.086333
      chol240 |   2.091908   .2829827     5.46   0.000     1.604709    2.727023
       sbp140 |   1.742931   .2530015     3.83   0.000     1.311357    2.316538
overweighto~e |   1.244134   .1714649     1.58   0.113     .9496332    1.629967
        smoke |   1.811145   .2511378     4.28   0.000     1.380141    2.376746
        _cons |    .001389   .0008183   -11.17   0.000     .0004378    .0044073
-------------------------------------------------------------------------------
15) Refer to the main effects model results above and interpret the adjusted odds ratios measuring the association between the main EV (Type A/B personality) and the RV (CHD after 10 years), and between each CV and the RV. You don't have to interpret the values of the LR coefficients in the output above.
Interaction model for CHD at 10 years (1969)
The interactive effect of Type A/B classification (EV1) and overweight/normal weight classification (EV2) on CHD 10 years later (RV) was tested because you were interested in determining if the association between Type A personality and CHD 10 years later differed between overweight and normal weight individuals. So Type A personality is the main EV.
Logistic regression                               Number of obs   =       3142
                                                  LR chi2(7)      =     152.70
                                                  Prob > chi2     =     0.0000
Log likelihood = -813.25227                       Pseudo R2       =     0.0858
-------------------------------------------------------------------------------
        chd69 |      Coef.  Std. Err.        z   P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
        typea |    .552525   .1925011     2.87   0.004     .1752254     .929825
          age |    .059724   .0118595     5.04   0.000     .0364799    .0829681
      chol240 |   .7373236   .1353085     5.45   0.000     .4721237    1.002523
       sbp140 |   .5543145   .1452926     3.82   0.000     .2695463    .8390826
overweighto~e |    .095618   .2381595     0.40   0.688    -.3711662    .5624021
        smoke |    .594825   .1387015     4.29   0.000     .3229751    .8666748
typea_by_over |   .4056084   .1988276     2.04   0.041     .0159063    .7953105
        _cons |  -6.531349   .5936486   -11.00   0.000    -7.694879   -5.367819
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
        chd69 | Odds Ratio  Std. Err.        z   P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
        typea |   1.737551   .3729809     2.87   0.004     1.191515    2.534066
          age |   1.061544   .0125893     5.04   0.000     1.037153    1.086507
      chol240 |   2.090333     .28284     5.45   0.000     1.603396     2.72515
       sbp140 |   1.740747   .2529176     3.82   0.000      1.30937    2.314243
overweighto~e |   1.100339   .2620561     0.40   0.688     .6899293    1.754883
        smoke |   1.812714    .251426     4.29   0.000     1.381231    2.378987
typea_by_over |   1.500215     .34537     2.04   0.041     1.016033     2.21513
        _cons |    .001457    .000865   -11.00   0.000     .0004552    .0046643
-------------------------------------------------------------------------------
16) Refer to the interaction model and interpret the test of the Type A personality-by-BMI category interaction. If it is significant, you will need to describe thoroughly how the size of the effect of Type A personality on CHD within 10 years is modified by BMI. In your interpretation:
- report and interpret two ORs. The first OR should measure the association between Type A/B personality and getting CHD for the normal weight group. The second OR should measure the association between Type A/B personality and getting CHD for the overweight group (recall that one of these is already in the table, and the other has to be calculated from the relevant logistic regression coefficients in the output above)
- cite evidence from the STATA output indicating whether or not these 2 ORs differ significantly
Here is the coding of the relevant variables shown at the beginning of Part IV for your reference:
CHD: 0=absent, 1=present
Type A/B classification: 0=Type B, 1=Type A
BMI category: 0=normal weight, 1=overweight or obese
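For reference, the general arithmetic for group-specific odds ratios in a model with a product term between two 0/1 EVs can be sketched as follows. The coefficient values in this Python example are hypothetical placeholders, not the values in the output above, and the variable names are illustrative. With 0/1 coding, the EV1 coefficient alone describes the group coded 0 on EV2, and the sum of the EV1 and product-term coefficients describes the group coded 1 on EV2.

```python
import math

# Hypothetical logit coefficients for illustration only (not from the output).
b_ev1 = 0.50          # EV1 coefficient: effect of EV1 when EV2 = 0
b_interaction = 0.30  # coefficient on the EV1-by-EV2 product term

# Odds ratio for EV1 within each category of EV2.
or_when_ev2_is_0 = math.exp(b_ev1)
or_when_ev2_is_1 = math.exp(b_ev1 + b_interaction)

# The ratio of the two ORs equals the exponentiated interaction coefficient.
ratio_of_ors = or_when_ev2_is_1 / or_when_ev2_is_0

print(round(or_when_ev2_is_0, 3), round(or_when_ev2_is_1, 3), round(ratio_of_ors, 3))
```

Because the ratio of the two group-specific ORs is the exponentiated product-term coefficient, the test of that coefficient is the test of whether the two ORs differ.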