Reference no: EM132352704
Assignment - Answer all questions.
Instructions: You must use SAS to obtain appropriate analyses of the data where required, and use the SAS output in answering the questions that follow the problem statement. Use a 5% level of significance (α = 0.05) where appropriate unless otherwise specified. Use 3 decimal places in your calculations and answers. You must also submit your SAS program file.
The Problem - Air Pollution
A climatologist is interested in predicting air quality in cities in the USA. Air quality is measured by the mean concentration of sulphur dioxide (SO2) in the air. Information on 7 possible explanatory variables was gathered over a 3-year period. Complete data were collected for 41 US cities.
Data collected:
Variable | Description |
Sulphur dioxide (SO2) | Average sulphur dioxide content of the air in micrograms per cubic metre |
Temp | Average annual temperature in degrees Fahrenheit |
Factory | Number of manufacturing enterprises employing 20 or more workers |
Pop | Population in thousands in 1999 |
Wind | Average annual wind speed |
Precip | Average annual precipitation in inches |
Dayrain | Average number of days with precipitation per year |
Dust | Average concentration of dust particles in ppm (parts per million) |
The data is available on Canvas as 'Pollution.txt'.
Use SAS to run a full multiple linear regression analysis on the data, in order to answer the questions that follow.
Questions -
(i) Copy the ANOVA table for the full model into the table below. Explain how the degrees of freedom for each component of variation are calculated.
Source | df | Sum of squares | Mean square | F |
Model | | | | |
Error | | | | |
Total | | | | |
(ii) Use an F test to test the global significance of the full model applied to the data. Include your hypotheses, test statistics (i.e. F and critical F) and your full conclusion.
(iii) State the value of the coefficient of determination for the full model and explain how it relates to the SSError.
(iv) Examine the SAS output and test the significance of each of the individual explanatory variables. You should include your hypotheses, and explain how you have drawn your conclusions.
(v) Use a stepwise procedure to obtain the optimal (best fit) model. State the least squares estimate of this model, explaining each of the terms in your model.
(vi) State the R², adjusted R² and Cp statistics, and describe how you would interpret them to identify the optimal model obtained in part (v).
(vii) Use the optimal model to predict SO2 for a city where the average temperature is 60.7°F, there are 350 enterprises with 20 or more workers, a population of 580,000, an average wind speed of 9.5, average precipitation of 30.0 inches, 150 days of rain on average and a dust concentration of 7.0 ppm. State and interpret the 95% prediction interval for this city.
(viii) Explain how you would calculate residuals for the optimal model. You may obtain and use residuals from the SAS output to aid your explanation.
(ix) Use the residual plots provided in the SAS output to assess the validity of assumptions that underlie the model. Refer specifically to the distribution of the residuals, the mean of the residuals and homoscedasticity.
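The arithmetic behind questions (i) to (iii), (vi), (vii) and (viii) can be hand-checked outside SAS. The following is a minimal Python sketch, not part of the required SAS program. It assumes only n = 41 cities and k = 7 explanatory variables, both taken from the problem statement. The function names (anova_summary, residual, predict) and the new_city dictionary are hypothetical helpers introduced here for illustration. Every sum of squares and coefficient must come from your own SAS output.

```python
# Hand-check of the full-model ANOVA arithmetic (not a substitute for SAS output).
n = 41  # cities with complete data (from the problem statement)
k = 7   # explanatory variables in the full model (from the problem statement)

df_model = k          # one degree of freedom per explanatory variable
df_error = n - k - 1  # observations minus estimated parameters (k slopes + intercept)
df_total = n - 1      # observations minus one for the overall mean


def anova_summary(ss_model, ss_error, n, k):
    """Return mean squares, F, R-squared and adjusted R-squared from the sums of squares."""
    ss_total = ss_model + ss_error
    ms_model = ss_model / k
    ms_error = ss_error / (n - k - 1)
    return {
        "ms_model": ms_model,
        "ms_error": ms_error,
        "f": ms_model / ms_error,
        "r2": 1 - ss_error / ss_total,
        "adj_r2": 1 - ms_error / (ss_total / (n - 1)),
    }


def residual(observed, predicted):
    """Residual = observed SO2 minus the SO2 value fitted by the model."""
    return observed - predicted


# Predictor values for the city in question (vii), in the units of the data table.
new_city = {
    "Temp": 60.7, "Factory": 350, "Pop": 580,  # 580,000 people = 580 thousand
    "Wind": 9.5, "Precip": 30.0, "Dayrain": 150, "Dust": 7.0,
}


def predict(intercept, coefficients, city):
    """Point prediction: intercept plus each coefficient times its predictor value.

    `coefficients` maps variable names to estimates read from the SAS output;
    only the variables kept in the chosen model need to appear in it.
    """
    return intercept + sum(b * city[name] for name, b in coefficients.items())


print(df_model, df_error, df_total)  # 7 33 40
```

Passing the Model and Error sums of squares from the SAS ANOVA table to anova_summary should reproduce the F value and R-Square that SAS reports, which is a quick way to catch transcription errors. Pop is entered as 580 because the variable is recorded in thousands.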