PUBHBIO 7220: Applied Logistic Regression, Ohio State University, USA
Questions
I. Using the Cleveland data set, fit a logistic regression model with HD_diag as the response and sex, trestbps, chol, and exang as the covariates. Include your output here.
A. Evaluate the goodness of fit of the model using both the Hosmer-Lemeshow test (using deciles of risk) and the Pearson chi-square statistic. Assess whether the results of the two tests are consistent.
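As a rough illustration of what the deciles-of-risk test computes, the sketch below groups subjects by fitted probability and sums the squared differences between observed and expected events. The function names `hosmer_lemeshow` and `chi2_sf_even_df` are hypothetical helpers, not part of Stata or any library, and the handling of ties at group boundaries may differ from your software, so small numerical differences are to be expected.

```python
import math

def hosmer_lemeshow(y, p, groups=10):
    """Return (statistic, degrees of freedom) for a deciles-of-risk style test.

    y: list of 0/1 outcomes; p: list of fitted probabilities (same order).
    """
    # Sort subjects by fitted probability, then cut into near-equal groups.
    order = sorted(range(len(p)), key=lambda i: p[i])
    n = len(order)
    stat = 0.0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        n_g = len(idx)
        if n_g == 0:
            continue  # skip empty groups (fewer subjects than groups)
        observed = sum(y[i] for i in idx)
        expected = sum(p[i] for i in idx)
        mean_p = expected / n_g
        if mean_p <= 0.0 or mean_p >= 1.0:
            continue  # term is undefined when the group variance is zero
        stat += (observed - expected) ** 2 / (n_g * mean_p * (1 - mean_p))
    return stat, groups - 2

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive, even df")
    total, term = 0.0, 1.0
    for i in range(df // 2):
        total += term
        term *= (x / 2) / (i + 1)
    return math.exp(-x / 2) * total
```

With ten groups the reference distribution has 8 degrees of freedom, so `chi2_sf_even_df(stat, df)` gives an approximate p-value for the statistic returned by `hosmer_lemeshow(y, p)`.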
II. On the basis of the logistic model with sex, trestbps, chol, and thalach as covariates, estimate the sensitivity and specificity of classifying subjects as having or not having a heart disease diagnosis, using a cut-off value of 0.5 for the probability of heart disease.
A. Repeat the previous exercise using the cut-off points specified below and fill in the table.
| Cut-off | Sensitivity | Specificity |
| 0 | | |
| 0.1 | | |
| 0.2 | | |
| 0.3 | | |
| 0.4 | | |
| 0.6 | | |
| 0.8 | | |
| 1 | | |
Draw the ROC curve by hand using the values from the table.
B. Use Stata to obtain the ROC curve. What is the discriminatory power of the model?
C. Suppose someone had fraudulently accessed your computer and altered the data of the dependent variable in such a way that the coefficients of the model would remain the same. However, the predicted probabilities of the outcome would be largely affected. What would happen to the goodness of fit statistics?
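For the cut-off table and ROC curve in question II, the following sketch shows how sensitivity, specificity, and a trapezoidal area under the curve follow from a list of outcomes and fitted probabilities. The data are hypothetical values chosen for illustration, not the Cleveland data, and the rule "classify as positive when the probability is at or above the cut-off" is an assumption you should check against your software.

```python
def sens_spec(y, p, cutoff):
    """Classify as positive when p >= cutoff; return (sensitivity, specificity)."""
    tp = sum(1 for yi, pi in zip(y, p) if yi == 1 and pi >= cutoff)
    fn = sum(1 for yi, pi in zip(y, p) if yi == 1 and pi < cutoff)
    tn = sum(1 for yi, pi in zip(y, p) if yi == 0 and pi < cutoff)
    fp = sum(1 for yi, pi in zip(y, p) if yi == 0 and pi >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)

def auc_from_points(points):
    """Trapezoidal area under ROC points given as (1 - specificity, sensitivity)."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# Hypothetical outcomes and fitted probabilities, for illustration only.
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
p = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]

cutoffs = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 1]
table = [(c, *sens_spec(y, p, c)) for c in cutoffs]   # (cutoff, sens, spec)
roc = [(1 - spec, sens) for _, sens, spec in table]   # points to plot by hand
auc = auc_from_points(roc)

for c, sens, spec in table:
    print(f"{c:>4}  sensitivity={sens:.2f}  specificity={spec:.2f}")
print(f"AUC (trapezoidal) = {auc:.2f}")
```

Plotting sensitivity against 1 - specificity for each row gives the hand-drawn ROC curve; with only a few cut-offs the trapezoidal area can understate the area reported by software that uses every distinct fitted probability.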
III. Fit a model with sex and fbs as covariates. Assess the overall fit of the model and its discriminatory power by conducting the Hosmer-Lemeshow goodness of fit test and calculating the area under the ROC curve.
A. Estimate the predicted probability of the outcome.
The data in hyponatremia.dta derive from an epidemiological study of hyponatremia (a life-threatening condition) among runners of the 2002 Boston Marathon. Hyponatremia is defined as an electrolyte disturbance in which the serum sodium concentration is lower than normal (<135 mmol/l). The aim of the study was to determine whether a runner experienced hyponatremia and to identify the principal risk factors.
Participants in the 2002 Boston Marathon completed a survey including demographic and anthropometric characteristics (BMI) one or two days before the race. After the race, runners provided a blood sample in order to measure their serum sodium concentration and completed a questionnaire detailing their urine output during the race. Prerace and postrace weights were also recorded. Use the hyponatremia dataset for the following exercises.
IV. Run a logistic regression model with nas135 as the dependent variable and female and urinat3p as covariates and estimate the predicted probability of the outcome. Conduct the Hosmer-Lemeshow goodness of fit test and calculate the area under the ROC curve.
Include your output here.
V. Make a frequency table with nas135, female and urinat3p.
A. Open a new Stata session in which you create a dataset with these variables only in aggregated form. Generate a variable named freq which is the frequency of each cell in the table. The new dataset will have 8 rows, 1 for each combination of nas135, female and urinat3p. Run a logistic regression model with nas135 as the dependent variable and female and urinat3p as covariates. Include your output here.
B. What are the estimated predicted probabilities of the outcome?
C. Compare the coefficients and estimated predicted probabilities of the outcome for this model to those of the model using the original dataset.
D. Alter the odds of the outcome for each of the 4 female and urinat3p combinations:
Create a new variable, named fakefreq, that has the value of 31 (nas135=1, female=0, urinat3p=0), 45 (nas135=1, female=1, urinat3p=0), 6 (nas135=1, female=0, urinat3p=1), and 6 (nas135=1, female=1, urinat3p=1). The total number of observations in each of the 4 subgroups should not change; therefore, the frequencies for the nas135=0 cells should change accordingly.
E. Fit a model with female and urinat3p using fakefreq instead of freq as weight. Compute the estimated probabilities of the outcome and compare them with those estimated from the original data.
F. Conduct the Hosmer-Lemeshow goodness of fit test and calculate the area under the ROC curve. Compare both statistics with those obtained from the original data.
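The bookkeeping in part D can be sketched as follows: each nas135=1 cell takes the stated value, and the matching nas135=0 cell absorbs the difference so that the subgroup total is unchanged. The counts in `freq` are hypothetical placeholders, not the real hyponatremia frequencies; substitute the values from your own table.

```python
# Hypothetical cell counts keyed by (nas135, female, urinat3p).
# These are placeholders; use the frequencies from your own table.
freq = {
    (0, 0, 0): 120, (1, 0, 0): 10,
    (0, 1, 0): 60,  (1, 1, 0): 20,
    (0, 0, 1): 150, (1, 0, 1): 5,
    (0, 1, 1): 70,  (1, 1, 1): 8,
}

# Event counts stated in part D, keyed by (female, urinat3p).
fake_events = {(0, 0): 31, (1, 0): 45, (0, 1): 6, (1, 1): 6}

fakefreq = {}
for (female, urinat3p), events in fake_events.items():
    # Subgroup total from the original frequencies must be preserved.
    total = freq[(0, female, urinat3p)] + freq[(1, female, urinat3p)]
    if events > total:
        raise ValueError("fake event count exceeds the subgroup total")
    fakefreq[(1, female, urinat3p)] = events
    fakefreq[(0, female, urinat3p)] = total - events

for cell in sorted(freq):
    print(cell, "freq =", freq[cell], "fakefreq =", fakefreq[cell])
```

Because the subgroup totals are held fixed, only the observed proportion of events within each female by urinat3p combination changes, which is what alters the odds of the outcome.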
VI. Fit the model with runtime, wtdiff, bmi and bmi2 as covariates using the original dataset, where bmi2=bmi*bmi. Compute the leverage h, the change in chi-square Δχ², the change in deviance ΔD, and the influence diagnostic Δβ. Plot each of these versus the fitted probabilities.
A. Identify the covariate pattern with the highest values of Δχ² and Δβ. On the basis of this logistic model, estimate the sensitivity and specificity of classifying subjects as having or not having hyponatremia using a cut-off value of 0.5 for the probability of hyponatremia.
B. Delete, one at a time, covariate patterns 423, 22, and 334, and assess the change in model coefficients.
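For reference, the diagnostics named in question VI are commonly defined per covariate pattern j as written below. The notation is an assumption, not taken from the assignment: m_j is the number of subjects sharing pattern j, y_j the number of events among them, \hat\pi_j the fitted probability, d_j the deviance residual, and V the diagonal matrix with entries m_j \hat\pi_j (1 - \hat\pi_j). The expression for the change in deviance is the usual approximation; your textbook or software may use a slightly different form.

```latex
\begin{align*}
r_j &= \frac{y_j - m_j\hat\pi_j}{\sqrt{m_j\hat\pi_j(1-\hat\pi_j)}}, &
h_j &= m_j\hat\pi_j(1-\hat\pi_j)\,
       \mathbf{x}_j^{\top}\bigl(\mathbf{X}^{\top}\mathbf{V}\mathbf{X}\bigr)^{-1}\mathbf{x}_j, \\
\Delta\chi^2_j &= \frac{r_j^2}{1-h_j}, &
\Delta D_j &\approx \frac{d_j^2}{1-h_j}, \\
\Delta\hat\beta_j &= \frac{r_j^2\,h_j}{(1-h_j)^2}.
\end{align*}
```

Here r_j is the Pearson residual, so a pattern can score high on Δχ² because it is poorly fitted (large r_j), while a high Δβ also requires appreciable leverage h_j.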
Using the ICU data:
VII. Consider the variable level of consciousness at ICU admission (LOC) as a covariate and vital status (STA) as the outcome variable. Compare the estimates of the odds ratios obtained from the cross-classification of STA by LOC and the logistic regression of STA on LOC.
Use LOC=0 as the reference group for both methods. How well did the logistic regression program deal with the zero cell?
A. What strategy would you adopt for modeling LOC in future analyses?
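To see why a zero cell is a problem in question VII, the sketch below computes a cross-classification odds ratio with and without a continuity correction. The counts are hypothetical, not the ICU data, and adding 0.5 to every cell is only one common ad hoc device, shown here as an illustration rather than as the intended answer.

```python
import math

def odds_ratio(a, b, c, d, correction=0.0):
    """Odds ratio (a*d)/(b*c).

    a, b: events and non-events in the comparison group;
    c, d: events and non-events in the reference group.
    correction is added to all four cells before computing the ratio.
    """
    a, b, c, d = (x + correction for x in (a, b, c, d))
    if b == 0 or c == 0:
        # Denominator is zero: the ratio is infinite, or undefined if the
        # numerator is zero as well.
        return math.inf if a * d > 0 else math.nan
    return (a * d) / (b * c)

# Hypothetical counts: comparison level has 5 deaths and 0 survivors,
# reference level has 20 deaths and 80 survivors.
raw = odds_ratio(5, 0, 20, 80)
corrected = odds_ratio(5, 0, 20, 80, correction=0.5)

print("raw odds ratio:", raw)
print("with 0.5 added to each cell:", round(corrected, 2))
```

An infinite sample odds ratio corresponds to a logistic coefficient with no finite maximum likelihood estimate, which is why the fitted coefficient and standard error for that level deserve scrutiny; collapsing levels of LOC is another option to weigh when answering part A.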