INF 504 Data Mining And Machine Learning - Northern Arizona University
Assignment 1
Question 1. The data for this assignment are available in BbLearn in the file hw1.rds. Show your code and output to complete the following.
(i) Use the function readRDS to read the data into a data frame object that you name hw1.df. See help(readRDS) if you need help.
(ii) Display the first 5 and last 5 rows of your data set (no more and no less, please), thereby giving some indication that the data have been read correctly.
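For example, a minimal sketch of the reading and display steps (assuming hw1.rds has been downloaded to the working directory) might look like the following.
> hw1.df <- readRDS("hw1.rds")   ## read the data into a data frame object
> head(hw1.df, n = 5)            ## first 5 rows
> tail(hw1.df, n = 5)            ## last 5 rows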
Question 2. Create an appropriate scatter plot matrix, similar to that seen in our notes. You may use the pairs function or a similar function in R. Show your code and plot. No discussion required.
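One possible sketch, using the base pairs function on the full data frame (assuming hw1.df from Question 1):
> pairs(hw1.df)   ## scatter plot matrix of all variables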
Question 3. How many linear models are possible using all possible subsets of the inputs, including the model consisting only of the intercept, without any inputs? NOTE: If it's not obvious by now, the variable names are suggestive of output and inputs!
Question 4. Use the function regsubsets in the leaps package to perform an exhaustive search for the best model/subset of inputs of each possible size, and use the result to create a 'matrix' of plots, one for each of RSS, R2, adjusted R2 (R2a), CP, and BIC. Each plot should indicate the (same) single best model for each of the possible sizes; the difference among these k = 8 best models (one of each size) across plots is, of course, the different values of the different criteria, which we use to select across size. (Very similar to plots shown in our notes.) For the plot of CP vs. number of inputs, add the reference line CP = 1 + P (i.e., CP = p, where p is the number of parameters in the model), just like in our notes. (A sketch of one possible setup follows item (ii) below.)
(i) What relatively simple criterion or criteria can regsubsets use to select among models of the same size? (Granted, you are requested to report only the best of each size, along with computing each of the above criteria values.)
(ii) Show your code and plots, appropriately annotated.
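As a hedged sketch of one way to set this up (assuming the output is named y, that the remaining eight columns of hw1.df are the inputs, and using the object name hw1.reg that appears in the code later in this assignment):
> library(leaps)
> hw1.reg <- regsubsets(y ~ ., data = hw1.df, nvmax = 8)   ## exhaustive search is the default
> hw1.sum <- summary(hw1.reg)   ## contains rss, rsq, adjr2, cp, bic for the best model of each size
> par(mfrow = c(2, 3))
> plot(1:8, hw1.sum$rss,   type = "b", xlab = "Number of inputs", ylab = "RSS")
> plot(1:8, hw1.sum$rsq,   type = "b", xlab = "Number of inputs", ylab = "R2")
> plot(1:8, hw1.sum$adjr2, type = "b", xlab = "Number of inputs", ylab = "Adjusted R2")
> plot(1:8, hw1.sum$cp,    type = "b", xlab = "Number of inputs", ylab = "CP")
> abline(a = 1, b = 1)   ## reference line CP = 1 + P (P = number of inputs)
> plot(1:8, hw1.sum$bic,  type = "b", xlab = "Number of inputs", ylab = "BIC")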
Question 5. Identify the best model(s), over all sizes, as selected by each of the criteria BIC, R2, and CP (only these three, please). This will result in one or more models, depending on how these criteria agree, disagree, or are ambiguous. To be sure, please indicate the best model's input names and the least squares estimates of the corresponding parameters (i.e., the learned weights, in machine learning lingo). Discuss how these criteria agree or not, and which model(s) you might select and why.
Question 6. Suppose we want to predict the unobserved output, y∗, for a "typical" input, x∗, for which we will use the medians of the training set inputs for expediency of this item. Use your best fitted model according to BIC, above (regardless of which model(s) you may have chosen there), to compute the predicted value, along with a (nominal) 95% prediction interval for y∗. To do this, use the predict function with the interval='predict' option. See our INF 511 notes (available in BbLearn), help(predict.lm), or the Web for examples of the predict function. Show your code/output and summarize briefly your prediction interval. In particular, comment on the validity of the nominal 95% coverage rate and the interval width with unbiasedness and overfitting in mind.
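For instance, a minimal sketch of the prediction step (the formula y ~ x1 + x3 is purely a placeholder; substitute whatever inputs your BIC search actually selects, and note that interval='predict' in the prompt partially matches "prediction"):
> ## Hypothetical best-BIC model; replace the right-hand side with your selected inputs.
> best.bic.lm <- lm(y ~ x1 + x3, data = hw1.df)
> ## x* = medians of all training inputs (columns not in the model are simply ignored by predict).
> xstar <- as.data.frame(lapply(hw1.df[, setdiff(names(hw1.df), "y")], median))
> predict(best.bic.lm, newdata = xstar, interval = "prediction", level = 0.95)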
Question 7. Why does it not seem plausible to use the validation set approach, using MSPR (i.e., prediction error, expression (2.3) in our notes, an unbiased estimate of generalization error, expression (2.1) in our notes), for this data set? Please try to be concise.
Question 8. Despite concerns raised by the previous item, let's try the validation approach anyway to see how it works. I prepare and randomly divide the data into n = 20 training observations and n∗ = 13 validation observations for you.
> ## Randomly choose n=20 training cases, with remaining n*=13 for
> ## validation.
>set.seed(24601+711) ## Jean Valjean gets a Big Gulp (reproducibility)
>(ntot<-dim(hw1.df)[1])
>n<-20 ##<-- training set size
>trainindx<-sample(x=1:ntot,size=n,replace=FALSE)
>train.df<-hw1.df[trainindx,]
>val.df<-hw1.df[-trainindx,]
Here, we compare the validation set approach to the previous criteria using the previous eight models obtained from regsubsets: the best of each size (according to RSS), and the best across sizes according to whichever of the previous criteria. (We get the best models according to our previous criteria, but we may be missing the best according to the validation set approach by restricting ourselves to these eight models; we would have to conduct the validation set approach on all 2^8 = 256 models to be sure (not bad), but we ignore this for expediency of this item and just use the eight models previously obtained for our comparison.) I get you started by obtaining for you the fitted/predicted values ŷi∗ = µ̂(xi∗), i∗ = 1, ..., n∗ = 13, for each of the eight models on the validation set.
> atemods <- coef(hw1.reg, id = 1:8)
> atemuhats <- sapply(atemods, function(bhat, val) {
+   X <- as.matrix(val[, names(bhat)[-1], drop = FALSE])  ## validation inputs used by this model
+   B <- as.matrix(bhat[-1])                              ## estimated weights (intercept dropped)
+   muhat <- as.vector(bhat[1] + X %*% B)                 ## fitted values on the validation set
+   muhat
+ }, val = val.df)
> dim(atemuhats)   ## one column per model: 13 x 8
Now, use these predicted values to estimate generalization error for each of the eight models, as follows.
(i) Estimate generalization error (expression (2.1) in our notes) by computing MSPR (expression (2.3) in our notes) for each of the eight models. (Loosely and popularly, we'd just say "compute prediction error on the validation set for each of the eight models.") (10 points)
(ii) Create a plot of MSPR vs. model complexity.
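A sketch of (i) and (ii), assuming the atemuhats matrix above (13 validation predictions in each of 8 columns) and that the output column of val.df is named y:
> mspr <- apply(atemuhats, 2, function(muhat) mean((val.df$y - muhat)^2))   ## MSPR for each of the 8 models
> mspr
> plot(1:8, mspr, type = "b", xlab = "Number of inputs (model size)", ylab = "MSPR")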
Question 9. What is the best predicting model according to MSPR (prediction error on the validation set)? Give the fitted model coefficients and corresponding input names of that selected model and compare to our previous results, commenting briefly.
Question 10. For this item, instead of using the validation set approach of the previous item, we will use K-fold cross-validation (CV(K)) as the criterion to select a best predicting model. In particular, use K = n, i.e., n-fold CV, CV(n), or LOOCV. That is, K = n folds produce n training sets of size n_i = n - 1, each used to predict on the corresponding validation set of size n*_i = 1, thereby producing CV(n), an estimate of generalization error, one for each of our eight models that we've been considering. (Here i indexes folds, as in the expression on the bottom of page 73 of our notes.) (You will find various recommendations as to the best value of K to use, e.g., K = 5 or K = 10, etc., but we use LOOCV for illustration.)
We will leave the details of computing CV(n) to the function, boot::cv.glm; this will compute the expression on the bottom of page 73 of our notes, with K = n, of course. (You may have to download/install boot...?) (Given that you (should have) computed MSPR in the previous problem, it would not be difficult, though it may be a bit tedious, to compute CV(n) yourselves from the expression in our notes.) To use this function, we need to fit our models using the function, stats::glm (stats is installed by default). For example, the code, below, uses glm to fit a linear model, just as lm does, and just as regsubsets does (we don't exploit the ‘g'eneralized in glm; we will in INF 512). The code then performs CV(n) by default using cv.glm. In the example code, CV(n) is contained in the first element of the list component named delta in the list object named eg.cv.n, returned by cv.glm (and is put into cv.n).
> ## This is not likely one of your eight models! Just an example!
>eg.glm<-glm(y~x1+x2+x3,data=hw1.df)
>eg.cv.n<-boot::cv.glm(data=hw1.df,glmfit=eg.glm)
>(cv.n<-eg.cv.n$delta[1]) ## CV(n) for this model
Using the above code to get you started, perform CV(n) on the eight models obtained from regsubsets, analogous to the validation set approach above, but using the entire data set in the CV approach, just as we did with the originally considered criteria that don't require a separate validation set, right!? Once you get your models from regsubsets (perhaps similar to the way I showed above), you should (re-)fit each model "manually" in glm, or you may devise a more automated way to (re-)fit these eight models using glm; a sketch of one such automated approach follows the items below.
Show your code (i) for getting the eight models from regsubsets (perhaps like I did above), (ii) for fitting each of the eight models using glm and (iii) for performing CV(n) using cv.glm. Also, (iv) print the value of CV(n) for each model and indicate the size of the best model according to CV(n).
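A hedged sketch of one automated way to do (i)-(iv), assuming hw1.reg from the earlier regsubsets fit and that the output is named y (the formula is built at the top level so that cv.glm can re-evaluate each glm call when it refits on the folds):
> library(boot)
> atemods <- coef(hw1.reg, id = 1:8)   ## (i) coefficients of the best model of each size
> cv.n <- numeric(8)
> for (m in 1:8) {
+   fmla <- as.formula(paste("y ~", paste(names(atemods[[m]])[-1], collapse = " + ")))
+   fit.m <- glm(fmla, data = hw1.df)                            ## (ii) re-fit model m with glm
+   cv.n[m] <- cv.glm(data = hw1.df, glmfit = fit.m)$delta[1]    ## (iii) LOOCV estimate, CV(n)
+ }
> cv.n              ## (iv) CV(n) for each of the eight models
> which.min(cv.n)   ## size (number of inputs) of the best model according to CV(n)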
Question 11. Using the results from the previous item, prepare a plot of CV(n) vs. model size. Show your code and plot.
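For example, continuing from the sketch above (assuming the cv.n vector holds CV(n) for model sizes 1 through 8):
> plot(1:8, cv.n, type = "b", xlab = "Number of inputs (model size)", ylab = "CV(n)")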
Question 12. What is the best predicting model according to CV(n)? Give the fitted model coefficients and corresponding input names.
Question 13. A previous problem asked for a prediction from the best fitted model according to BIC. Repeat that problem now using the best fitted model according to CV(n). Again, use the median values of the inputs for x∗ (not all of which are used).