Part A -
This first part is based on the "IPEDS" dataset, a national database of institutions of higher learning. The data you'll examine includes a random sample of four-year, non-profit institutions in 2016. The output you need to answer these questions is given after the last question of Part A. For these and all questions, please type your answers directly into this document.
IPEDS Data:
grad.rate: the percent of students who graduate within four years. Values are integers that represent percents. For example, 23 means the graduation rate is 23%.
sector: indicates whether the school is public or private.
Average.loan: in dollars, the average student debt at graduation.
SAT.read.25p: the 25th percentile of the SAT reading score for enrolled students.
SAT.math.25p: the 25th percentile of the SAT math score for enrolled students.
SFR: student-faculty ratio (number of students per full-time faculty member).
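As a rough illustration of how the two kinds of tables referred to in the questions (a summary table and an anova table) are produced in R, here is a minimal sketch. It does not reproduce the exam output: the data frame `ipeds` is a simulated stand-in with made-up values, only the variable names come from the list above, and the order of terms in the model formula is an assumption.

```r
# Simulated stand-in for the IPEDS sample. All values are made up;
# only the variable names are taken from the variable list.
set.seed(1)
n <- 200
ipeds <- data.frame(
  sector       = factor(sample(c("Private", "Public"), n, replace = TRUE)),
  Average.loan = round(rnorm(n, mean = 25000, sd = 4000)),
  SAT.read.25p = round(rnorm(n, mean = 480, sd = 60)),
  SAT.math.25p = round(rnorm(n, mean = 490, sd = 60)),
  SFR          = round(runif(n, min = 8, max = 25))
)
ipeds$grad.rate <- with(ipeds,
  5 + 0.05 * SAT.math.25p + 0.04 * SAT.read.25p - 0.4 * SFR + rnorm(n, sd = 10))

# The order of the terms below is an assumption, not the exam's model.
fit <- lm(grad.rate ~ SAT.math.25p + SAT.read.25p + Average.loan + SFR + sector,
          data = ipeds)

summary(fit)  # coefficient table with t-tests
anova(fit)    # sequential (Type I) ANOVA table, terms in formula order
```

Reordering the terms in the formula and rerunning both calls is a quick way to see which of the two tables depends on term order.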
1. According to the output, do public schools and private schools differ in mean graduation rates, controlling for the other variables? Give an answer ("yes" or "no") and explain how you reached this conclusion. Use a 5% significance level.
2. Why do the p-values for SFR differ in the anova and summary tables? Choose the best:
a) The anova table is testing the null hypothesis that the slope for SFR is 0, and the summary table is testing the null hypothesis that the slope for SFR is 1.
b) The anova table is testing the null hypothesis that the slope for SFR is 0 given that SAT.math.25p, SAT.read.25p, and Average.loan are all in the model, while the summary table is testing that the slope is 0 given that SAT.math.25p, SAT.read.25p, Average.loan, and sector are in the model.
c) The anova table is based on the F statistic while the summary table is based on the t-statistic.
d) The anova table is testing the null hypothesis that the slope for SFR is 0 given that all of the other variables are included in the model, and the summary table is testing the null hypothesis that the slope for SFR is 0 given that no other variables are included in the model.
3. Which of the following is the best interpretation of the coefficient for sector? (Assume the model is valid.) Indicate the best choice.
a) Among all schools with similar loan amounts, similar SAT reading and math 25th percentiles, and similar student-faculty ratios, the graduation rate at public universities is about 2.6 percentage points lower, on average, than at private universities.
b) The graduation rate at public universities is about 2.6 percentage points lower than at private universities.
c) The mean graduation rate at public universities is about 2.6 percentage points lower than at private universities.
4. What is the interpretation of a 5% significance level in the context of testing whether public and private schools differ in mean graduation rates? Indicate the best interpretation from among these:
a) The probability that we will conclude that public and private schools differ in mean graduation rates when, in fact, they are the same, is 5%.
b) If public and private schools do not differ in mean graduation rates, then the probability of getting a test statistic as extreme or more extreme than 0.976 is 5%.
c) The probability that public and private schools differ in mean graduation rates is 5%.
d) The probability that public and private schools have the same mean graduation rates is 5%.
5. Notice that the last line in the anova table has been removed. SYY = 355,246. Give the values to fill in the rest of the table:
Df:
Sum Sq:
Mean Sq:
F value:
Pr(>F):
6. A politician sees this analysis and notes that the coefficient for student-faculty ratio is negative and statistically significant. He says, "This analysis shows that if we lower student-faculty ratios, then graduation rates will increase." This interpretation is:
a) Valid
b) Invalid
7. The p-value of 0.04710 for SFR is best interpreted as the probability that the null hypothesis is correct. Is this a valid statement?
a) Valid
b) Invalid
8. The p-value of 0.04710 for SFR is best interpreted as the probability the null hypothesis is wrong. Is this a valid statement?
a) Valid
b) Invalid
9. Suppose you have fit two different models to predict the salary of a worker in the U.S. based on a number of different predictor variables. Model 1 has an R² of 90%, and its residual plot shows a trend. The other diagnostic plots look good. Model 2 has an R² of 60%, and all of its diagnostic plots look good. Which model should you use?
a) Model 1
b) Model 2
10. Explain your choice for (9).
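For question 5, it may help to recall how the rows of an anova table fit together. The following is a sketch under two assumptions: that the removed line is the Residuals row (as is usual for R's anova() output for a linear model), and that SYY denotes the total sum of squares of grad.rate about its mean. Here n is the number of institutions in the sample, and the sums run over the terms j that are still listed in the table.

```latex
\begin{aligned}
\mathrm{SS}_{\text{Residuals}} &= \mathrm{SYY} - \sum_{j} \mathrm{SS}_{j}, \\
\mathrm{Df}_{\text{Residuals}} &= n - 1 - \sum_{j} \mathrm{Df}_{j}, \\
\mathrm{MS}_{\text{Residuals}} &= \frac{\mathrm{SS}_{\text{Residuals}}}{\mathrm{Df}_{\text{Residuals}}}.
\end{aligned}
```

R leaves the F value and Pr(>F) entries blank in the Residuals row.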
Part B -
In this part, please load the provided data set into R. The data set is in "FinalData.csv".
You are expected to turn in a .R file that includes all commands you used to prepare your answers for Part B. (You need not include any calculations you may have performed for part A.)
The variable size is the sum of the variables ThoraxLength, ClawLength, and ClawHeight.
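A minimal sketch of how the start of a Part B script might look is given here. It runs on a simulated stand-in rather than the real file, so every number it prints is made up. The column names (PinchingForce, Weight, Sex, ThoraxLength, ClawLength, ClawHeight) are taken from the question text and are assumed to match the columns in "FinalData.csv".

```r
# Real data (column names assumed to match the question text):
# crabs <- read.csv("FinalData.csv")

# Simulated stand-in so the sketch runs on its own; values are made up.
set.seed(2)
n <- 60
crabs <- data.frame(
  Sex          = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  ThoraxLength = runif(n, min = 25, max = 60)
)
crabs$ClawLength    <- 1.2 * crabs$ThoraxLength + rnorm(n, sd = 3)
crabs$ClawHeight    <- 0.5 * crabs$ThoraxLength + rnorm(n, sd = 2)
crabs$Weight        <- 12 * crabs$ThoraxLength + rnorm(n, sd = 40)

# size as defined above: sum of the three length measurements.
crabs$size <- crabs$ThoraxLength + crabs$ClawLength + crabs$ClawHeight
crabs$PinchingForce <- exp(3 + 0.02 * crabs$size + rnorm(n, sd = 0.3))

# Basic model: no transformations, no higher-order terms.
basic <- lm(PinchingForce ~ size + Weight + Sex, data = crabs)
summary(basic)

# Default diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage.
par(mfrow = c(2, 2))
plot(basic)
```

The first three plots correspond, in order, to the linear trend, normality, and constant variance conditions.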
1. Fit a basic linear model using only size, Weight, and Sex to predict pinching force. Do not use any transformations or higher-order terms.
a) Write the equation of the model: Predicted_PinchingForce=
b) Comment on the model validity with respect to these three conditions. Type the word "is" or "isn't" and then give your reason.
Linear trend condition [is or isn't?] satisfied
Constant Variance condition [is or isn't?] satisfied
Normal distribution of errors condition [is or isn't?] satisfied
2. Create an Inverse Response Plot. What transformation of PinchingForce provides the lowest residual sum of squares?
3. What transformation of PinchingForce is suggested by the Box-Cox transformation?
4. Fit the model using the transformation for PinchingForce based on the Box-Cox power transform (using the "Rounded" power). Which model do you think is better, in terms of model validity: the "basic" model in question 1 or this model? Explain.
5. At the midterm, we found that the pinching force for male crabs was greater than for female crabs. Explain why this is not the case with the current model. (Hint: note that male crabs tend to be bigger and heavier than females.)
6. Give the variance inflation factors for each variable for the transformed model from question 4:
Sex
Weight
Size
7. What do these VIF values tell us in this context?
8. Perform best subsets regression, forward stepwise, and backward stepwise to develop the "best" model, using BIC as the criterion. Use your transformed version of PinchingForce. Include these predictors to start: Weight, ThoraxLength, Sex, ClawLength, ClawHeight, ClawWeight. Note that you may get a different model from each of these three approaches. Choose the one with the lowest BIC. Be sure to state the BIC value for your choice. Use this model to answer these questions:
a) Give the equation for the final model you chose:
b) BIC for final model:
c) Suppose we had just caught a coconut crab with these measurements:
ThoraxLength: 52
Weight: 615
Sex: Male
ClawLength: 67
ClawHeight: 26
ClawWeight: 34
Predict its pinching force at a 95% level (give the appropriate interval).
9. Consider the output shown in the attached file:
a) What null and alternative hypotheses does the F-statistic test?
b) What do you conclude based on the p-value (using a significance level of 0.05)?
c) Extra Credit: What's going on here?
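The mechanics behind questions 3 through 8 can be sketched in R as follows. This is an illustration under stated assumptions, not the intended solution. It uses a simulated stand-in for "FinalData.csv" (made-up values), and `MASS::boxcox` rather than the function that reports a "Rounded" power. The rounding of the estimated power to the nearest 0.5 is a hypothetical rule chosen for the sketch, and the VIFs are computed from the inverse correlation matrix of the predictors. Best subsets regression (for example with `leaps::regsubsets`) is left out to keep the sketch free of add-on packages.

```r
library(MASS)  # boxcox(); MASS ships with standard R installations

# Simulated stand-in so the sketch runs without FinalData.csv (made-up values).
set.seed(3)
n <- 80
crabs <- data.frame(Sex = factor(sample(c("Female", "Male"), n, replace = TRUE)),
                    ThoraxLength = runif(n, min = 25, max = 60))
crabs$ClawLength    <- 1.2 * crabs$ThoraxLength + rnorm(n, sd = 3)
crabs$ClawHeight    <- 0.5 * crabs$ThoraxLength + rnorm(n, sd = 2)
crabs$ClawWeight    <- 0.6 * crabs$ThoraxLength + rnorm(n, sd = 3)
crabs$Weight        <- 12 * crabs$ThoraxLength + rnorm(n, sd = 40)
crabs$size          <- crabs$ThoraxLength + crabs$ClawLength + crabs$ClawHeight
crabs$PinchingForce <- exp(3 + 0.02 * crabs$size + rnorm(n, sd = 0.3))

# Box-Cox profile likelihood for the response of the basic model.
basic <- lm(PinchingForce ~ size + Weight + Sex, data = crabs)
bc <- boxcox(basic, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
lam <- round(lambda_hat * 2) / 2   # hypothetical rounding rule (nearest 0.5)

# Transformed response: log for power 0, otherwise the power itself.
crabs$tPF <- if (lam == 0) log(crabs$PinchingForce) else crabs$PinchingForce^lam
fit_t <- lm(tPF ~ size + Weight + Sex, data = crabs)

# Variance inflation factors: diagonal of the inverse correlation matrix
# of the predictor columns (Sex enters as a 0/1 dummy).
X <- model.matrix(fit_t)[, -1]
vif_values <- diag(solve(cor(X)))
print(vif_values)

# Backward and forward stepwise selection with the BIC penalty k = log(n).
full  <- lm(tPF ~ Weight + ThoraxLength + Sex + ClawLength + ClawHeight + ClawWeight,
            data = crabs)
null  <- lm(tPF ~ 1, data = crabs)
k_bic <- log(nrow(crabs))
back  <- step(full, direction = "backward", k = k_bic, trace = 0)
fwd   <- step(null, scope = formula(full), direction = "forward", k = k_bic, trace = 0)
best  <- if (BIC(back) <= BIC(fwd)) back else fwd
print(c(backward = BIC(back), forward = BIC(fwd)))

# 95% prediction interval for one new crab, on the transformed scale.
new_crab <- data.frame(ThoraxLength = 52, Weight = 615,
                       Sex = factor("Male", levels = levels(crabs$Sex)),
                       ClawLength = 67, ClawHeight = 26, ClawWeight = 34)
pred <- predict(best, newdata = new_crab, interval = "prediction", level = 0.95)
print(pred)
```

The interval printed at the end is on the scale of the transformed response. To report it as a pinching force, apply the inverse of the chosen power to the endpoints.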
Attachment:- Assignment Files.rar