Reference no: EM132163733

Forecasting Assignment - Regression and Box-Jenkins Methodology
Question 1 - Save the file sexratio.sas7bdat in your TSWORK library. This data set consists of annual observations of the Australian sex ratio, recorded as a percentage, from 1796 to 2007. For example, a SexRatioPCT of 101 means 101 men for every 100 women.
(a) Run the following SAS code:
data temp; set tswork.sexratio;
proc print; run;
proc gplot;
plot SexRatioPCT*Year; run;
Produce the time series plot of the data.
Describe the general patterns you observe in this series.
Some would argue that the large values observed in the first 100 or so observations have little relevance to future values of this series. With reference to the history of British settlement of Australia, explain why this might be the case.
(b) Run the following SAS code:
proc gplot;
where Year > 1900 ;
plot SexRatioPCT*Year; run;
Produce the time series plot of the data.
There are two significant troughs in the data. Explain what could have caused them.
(c) Run the following SAS code, which forecasts using Exponential Smoothing:
Proc forecast data = temp method=expo trend=2
out = tswork.exp outfull;
where Year > 1900 ;
var SexRatioPCT;
id Year;
run;
proc gplot data=tswork.exp;
where Year > 1900 ;
symbol i=line v="*" h=1;
plot SexRatioPCT*Year=_TYPE_; run;
Produce the SAS output plot generated from the above code, which contains both the observed and forecasted series, and the 95% CI associated with the forecasts.
What do your forecasts look like? Do you think this is a reasonable forecast (in particular, pay attention to the last 5 observations of your data)?
(d) Modify the above SAS code so that it uses the LAST 10 observations only (Hint: change the line of code containing "Year > ....").
Produce the SAS output plot generated from the modified code.
Compared with your result in part (c), what significant differences do you observe? Do you think this forecast is more reliable than the previous one?
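Note: METHOD=EXPO with TREND=2 in PROC FORECAST requests double exponential smoothing. The following is a minimal hand-computed sketch of Brown's double exponential smoothing, offered only to illustrate the idea. The data set name des, the variables s1, s2, level, trend and onestep, and the weight w = 0.2 are all hypothetical choices, and the simple start-up used here differs from the way PROC FORECAST initialises and weights the smoother, so the numbers will not match the procedure's output.

```sas
/* Illustrative sketch only: Brown's double exponential smoothing by hand. */
/* All names and the weight below are assumptions, not part of the task.   */
data des;
  set temp;
  where Year > 1900;
  retain s1 s2;
  w = 0.2;                                /* hypothetical smoothing weight */
  if _n_ = 1 then do;                     /* naive start-up: both smoothers */
    s1 = SexRatioPCT;                     /* begin at the first observation */
    s2 = SexRatioPCT;
  end;
  else do;
    s1 = w*SexRatioPCT + (1 - w)*s1;      /* first smoothing pass           */
    s2 = w*s1 + (1 - w)*s2;               /* smooth the smoothed series     */
  end;
  level   = 2*s1 - s2;                    /* estimated current level        */
  trend   = (w/(1 - w))*(s1 - s2);        /* estimated slope per year       */
  onestep = level + trend;                /* forecast for the next year     */
run;
```

Because the forecast m years ahead is level + m*trend, the forecast path is a straight line whose slope depends on how the most recent observations are weighted.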
Question 2 - Save the SAS data file ARRDEP.sas7bdat in your TSWORK library. This data consists of three sets of monthly Australian arrivals and departures data. Specifically, there are permanent departures (PD) and arrivals (PA), long-term departures (LTD) and arrivals (LTA), and short-term departures (STD) and arrivals (STA). The data runs from 1976 to 2011. In this question we will use only the short-term departures (STD) data.
(a) View the data through Explorer. You will see that there is a "month" variable and twelve dummy variables representing the twelve months of the year. Would you expect there to be trend and seasonality in the STD series? Why?
(b) Run the following SAS code and describe the general patterns that you see.
data temp; set tswork.ARRDEP;
symbol1 i=line;
proc gplot; plot STD*month; run;
(c) Run the following SAS code, which calculates the log transform of the STD series and plots the transformed series. What does this transformation achieve?
data temp; set tswork.ARRDEP;
logSTD = log(STD);
proc gplot; plot logSTD*month; run;
(d) Run the following SAS code, which fits a regression model to the log-transformed STD series. It specifies "month" and the seasonal dummy variables as predictors and requests the Durbin-Watson statistic. Comment on your coefficient estimates and explain whether they are consistent with your expectations.
data temp; set tswork.ARRDEP;
logSTD = log(STD);
proc reg;
model logSTD = month jan feb mar apr may jun jul aug sep oct nov
dec/dw; run;
(e) Use the Durbin-Watson statistic and a suitable table of critical values to test for first-order serial correlation in the residuals. What does this value tell you about the reliability of your regression equation and of any forecasts obtained using this model?
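For reference, the Durbin-Watson statistic is computed from the regression residuals e_t, t = 1, ..., n, as shown below. Here r_1 denotes the lag-1 sample autocorrelation of the residuals; the symbols d, e_t, n and r_1 are notation introduced for this note only.

```latex
d \;=\; \frac{\sum_{t=2}^{n} \left(e_t - e_{t-1}\right)^2}{\sum_{t=1}^{n} e_t^{2}}
\;\approx\; 2\,(1 - r_1), \qquad 0 \le d \le 4 .
```

Values of d near 2 correspond to little first-order autocorrelation, values towards 0 to positive autocorrelation, and values towards 4 to negative autocorrelation. The formal test still requires the tabulated lower and upper critical values for your sample size and number of predictors.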
(f) Run the following SAS code, which produces the ACF and PACF of the log-transformed STD series at successive levels of differencing. Based on the ACF, comment on the stationarity of the series at each step.
data temp; set tswork.ARRDEP;
logSTD = log(STD);
/* no differencing */
proc arima data=temp;
identify var=logSTD; run;
/* 1st differenced */
proc arima data=temp;
identify var=logSTD(1); run;
/* both 1st and seasonally differenced */
proc arima data=temp;
identify var=logSTD(1,12); run;
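The lags in parentheses in the IDENTIFY statements above request differencing: logSTD(1) is the first difference, and logSTD(1,12) applies a lag-12 difference to the first difference. The following is a minimal sketch of the same operations in a DATA step, assuming the rows of tswork.ARRDEP are already in time order. The data set name diffcheck and the variables d1 and d1_12 are hypothetical names used only for this illustration.

```sas
/* Illustrative sketch only: reproduce the differencing by hand. */
data diffcheck;
  set tswork.ARRDEP;
  logSTD = log(STD);
  d1     = dif(logSTD);    /* (1 - B) logSTD: change from the previous month */
  d1_12  = dif12(d1);      /* (1 - B**12)(1 - B) logSTD                      */
run;

/* The first 13 values of d1_12 are missing because of the lags used. */
proc print data=diffcheck(obs=15);
  var month logSTD d1 d1_12;
run;
```

Plotting d1 and d1_12 against month gives a visual check of the series whose ACF and PACF are reported by PROC ARIMA.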
(g) Based on the ACF and PACF from the last step of part (f), select a SARIMA model. Justify the values of p, d, q, P, D and Q you have chosen by commenting on the patterns observed in ACF and PACF.
(h) Now fit your selected model and forecast values for the suitably transformed series for 1 year ahead. Use blue for the forecasts and red for the prediction intervals. The skeleton of the code has been provided for you below, and you are to fill in the missing parts represented by "???".
data temp; set tswork.ARRDEP;
logSTD = log(STD);
proc arima data=temp;
identify var=logSTD(???,???);
estimate p=(???) q= (???) plot;
forecast lead=??? id=month out=forecast; run;
proc print data=forecast; run;
symbol1 i=line v=none c=???;
symbol2 i=line v=none c=???;
symbol3 i=line v=none c=???;
proc gplot data=forecast;
plot (U95 forecast L95)*month/overlay; run;
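As background for the ESTIMATE statement in the skeleton above, a multiplicative seasonal ARIMA model for monthly data can be written with the backshift operator B (where B y_t = y_{t-1}) as follows. This is the general textbook form with the constant term omitted, not a recommended answer. The symbols y_t (the log-transformed series), e_t (white noise) and the polynomial names are notation assumed for this note.

```latex
\phi(B)\,\Phi(B^{12})\,(1 - B)^{d}\,(1 - B^{12})^{D}\, y_t
  \;=\; \theta(B)\,\Theta(B^{12})\, e_t ,
\qquad
\begin{aligned}
\phi(B)      &= 1 - \phi_1 B - \dots - \phi_p B^{p},        &
\Phi(B^{12}) &= 1 - \Phi_1 B^{12} - \dots - \Phi_P B^{12P}, \\
\theta(B)    &= 1 - \theta_1 B - \dots - \theta_q B^{q},    &
\Theta(B^{12}) &= 1 - \Theta_1 B^{12} - \dots - \Theta_Q B^{12Q}.
\end{aligned}
```

In PROC ARIMA, each parenthesised list in p= or q= defines one factor of the corresponding polynomial. For instance, a purely illustrative specification such as q=(1)(12) is read as the product (1 - theta_1 B)(1 - Theta_1 B^12). Note also that PROC ARIMA estimates a constant term (MU) unless the NOCONSTANT option is given, so include it in your equation if it appears in your output.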
(i) Run the code you completed in part (h), and write down the estimated equation for your transformed series using the backshift operator.
(j) Consider the ACF plot of your model residuals, e(t). Include this plot in your assignment and explain what it means. Do you think there is a need to revise your model?
(k) Finally, take the forecasts and data from part (h) and back-transform them (i.e., undo the log transformation by taking the exponential of the data). Then plot your forecasts and prediction intervals. Use blue for the forecasts and red for the prediction intervals, and include the original data as black star points. The skeleton of the code has been provided for you below, and you are to fill in the missing parts represented by "???". Include the final plot in your answer.
data temp; set forecast;
expU95=???;
expL95=???;
expforecast = ???;
STD=exp(logSTD);
symbol1 i=line v=none c=???;
symbol2 i=line v=none c=???;
symbol3 i=line v=none c=???;
symbol4 i=none v=??? c=???;
proc gplot data=temp;
plot (expU95 expforecast expL95 STD)*month/overlay;
run;
Question 3 - Continue with the ARRDEP data set used in Question 2.
(a) Write SAS code to plot the two short-term series on the same graph, using black for arrivals (STA) and red for departures (STD). To obtain better resolution on the more recent data, plot only the observations after month=250.
(b) Run the code you have written and produce the plot.
(c) Describe any significant patterns observed in the data. Would you expect these two series to follow each other closely? Is there any sort of lag effect between the series? Why?
(d) In real-life consulting work, you will often be asked at the end of a project to deliver a full set of code covering the entire modelling process. You are also often required to comment the code throughout, so that whoever takes over the project after you can follow what you have done and make changes if need be. In this question, you are required to write a full set of SAS code that does the following for the permanent departures (PD) series:
1) Plot PD against time;
2) Log-transform PD (call it logPD);
3) Plot logPD against time;
4) Generate ACF and PACF for the log-transformed data on:
a. The Raw series
b. The 1st differenced series
c. The 1st and seasonally differenced series
5) Select a SARIMA model appropriately chosen based on patterns observed in the ACF and PACF;
6) Estimate the chosen SARIMA model;
7) Forecast for 2 years into the future;
8) Save your forecast output into a file called PDforecast;
9) Back-transform the data and your forecasts;
10) Plot the observed data, the forecasts, and the confidence interval.
You must provide an appropriate amount of comments at each step of the code to help the reader understand it. You must ensure that there are no errors in your code and that I can run it on my machine.
Attachment: Assignment Files.rar