INF 504 Data Mining And Machine Learning Assignment

Reference no: EM133078373

INF 504 Data Mining And Machine Learning - Northern Arizona University

Assignment 1

Question 1. The data for this assignment are available in BbLearn in the file hw1.rds. Show your code and output to complete the following.
(i) Use the function readRDS to read the data into a data frame object that you name hw1.df. See help(readRDS) if you need help.
(ii) Display the first 5 and last 5 rows of your data set (no more and no less, please), thereby giving some indication that the data have been read correctly.
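As a minimal sketch of the read-and-inspect pattern, the snippet below writes a small simulated data frame to a temporary .rds file and reads it back. The file contents and the name demo.df are hypothetical stand-ins; the real hw1.rds is not reproduced here.

```r
## Hypothetical stand-in for hw1.rds: simulated data written to a temp file.
set.seed(1)
tmp <- tempfile(fileext = ".rds")
saveRDS(data.frame(y = rnorm(33), x1 = rnorm(33)), file = tmp)

## Read the file back into a data frame and show exactly 5 + 5 rows.
demo.df <- readRDS(tmp)
head(demo.df, n = 5)
tail(demo.df, n = 5)
```

For the assignment itself, the same calls would presumably be applied to the path of hw1.rds, with the result named hw1.df.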

Question 2. Create an appropriate scatter plot matrix, similar to that seen in our notes. You may use the pairs function or a similar function in R. Show your code and plot. No discussion required.

Question 3. How many linear models are possible using all possible subsets of the inputs, including the model consisting only of the intercept, without any inputs? NOTE: If it's not obvious by now, the variable names are suggestive of output and inputs!
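The counting argument can be checked numerically. The sketch below assumes P = 8 inputs, which is the figure the later questions use; it sums the number of subsets of each size and compares the total with 2^P.

```r
## Assumed number of inputs (taken from the later questions).
P <- 8

## Number of subsets of each size 0, 1, ..., P (size 0 is intercept only).
by.size <- choose(P, 0:P)

## Total number of candidate models; each input is either in or out.
n.models <- sum(by.size)
n.models == 2^P
```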

Question 4. Use the function regsubsets in the leaps package to perform an exhaustive search for the best model/subset of inputs of each possible size, and use the result to create a "matrix" of plots, one for each of RSS, R2, R2a, CP, and BIC. Each plot should indicate the (same) single best model for each of the possible sizes; the k = 8 best models, one per size, are the same in every plot, and what differs across plots is, of course, the value of each criterion, which we use to select across size. (Very similar to the plots shown in our notes.) For the plot of CP vs. number of inputs, add the reference line CP = 1 + P, where P is the number of inputs (i.e., CP = p, where p is the number of parameters in the model), just like in our notes.

(i) What relatively simple criterion or criteria can regsubsets use to select models of the same size? (Granted, you are requested to report only the best of each size, along with computing each of the above criteria values.)

(ii) Show your code and plots, appropriately annotated.
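To illustrate what an exhaustive search does, here is a brute-force sketch in base R. It is not regsubsets: it simply fits every subset with lm on simulated data with four hypothetical inputs, keeps the smallest-RSS model of each size, and computes Mallows' CP with the full-model variance estimate. All data and names (sim.df, best, crit) are assumptions made for illustration.

```r
## Simulated stand-in data: 4 hypothetical inputs, only x1 and x2 matter.
set.seed(1)
n <- 40
sim.df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
sim.df$y <- 1 + 2 * sim.df$x1 - sim.df$x2 + rnorm(n)

inputs <- setdiff(names(sim.df), "y")
full <- lm(y ~ ., data = sim.df)
sigma2 <- summary(full)$sigma^2   # error variance estimate from the full model

## For each size k, fit every subset of k inputs and keep the smallest-RSS fit.
best <- lapply(seq_along(inputs), function(k) {
  subsets <- combn(inputs, k, simplify = FALSE)
  fits <- lapply(subsets, function(s) lm(reformulate(s, response = "y"), data = sim.df))
  rss <- sapply(fits, function(f) sum(resid(f)^2))
  fits[[which.min(rss)]]
})

## Criteria for the best model of each size.
crit <- data.frame(
  size  = seq_along(best),
  rss   = sapply(best, function(f) sum(resid(f)^2)),
  adjr2 = sapply(best, function(f) summary(f)$adj.r.squared)
)
crit$p  <- crit$size + 1                        # parameters = inputs + intercept
crit$cp <- crit$rss / sigma2 - n + 2 * crit$p   # Mallows' CP

## CP vs. number of inputs, with the reference line CP = 1 + P.
plot(crit$size, crit$cp, type = "b", xlab = "Number of inputs", ylab = "CP")
abline(a = 1, b = 1, lty = 2)
```

Within a fixed size, every candidate has the same number of parameters, so ranking by RSS is all this search needs; the other criteria only come into play when comparing across sizes.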

Question 5. Identify the best model(s), over all sizes, as selected by each of the criteria BIC, R2, and CP (only these three, please). This will result in one or more models, depending on how these criteria (dis)agree or are ambiguous. To be sure, please indicate best model input names and least squares estimates of corresponding parameters (i.e., the learned weights, in machine learning lingo). Discuss how these criteria agree or not and which model(s) you might select and why.

Question 6. Suppose we want to predict the unobserved output, y∗, for a "typical" input, x∗, for which we will use the medians of the training set inputs for expediency of this item. Use your best fitted model, according to BIC, above (regardless of which model(s) you may have chosen there), to compute the predicted value, along with a (nominal) 95% prediction interval for y∗. To do this, use the predict function with the interval="predict" option. See our INF 511 notes (available in BbLearn) or help(predict.lm) or the Web for examples of the predict function. Show your code/output and summarize briefly your prediction interval. In particular, comment on the validity of the nominal 95% coverage rate and the interval width with unbiasedness and overfitting in mind.
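A minimal sketch of the predict call on simulated data follows. The model y ~ x1 + x2 and the data are hypothetical and are not the BIC-selected model; the point is only how the medians are assembled into a one-row newdata frame.

```r
## Simulated stand-in data with two hypothetical inputs.
set.seed(2)
sim.df <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
sim.df$y <- 3 + 1.5 * sim.df$x1 + rnorm(30)

fit <- lm(y ~ x1 + x2, data = sim.df)

## "Typical" input x*: the median of each input, as a one-row data frame.
xstar <- as.data.frame(lapply(sim.df[c("x1", "x2")], median))

## Point prediction with a nominal 95% prediction interval (fit, lwr, upr).
pred <- predict(fit, newdata = xstar, interval = "prediction", level = 0.95)
pred
```

It is harmless for newdata to carry medians of inputs the model does not use, since predict only looks up the columns named in the formula.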

Question 7. Why does it not seem plausible to use the validation set approach, using MSPR (i.e., prediction error, expression (2.3) in our notes, an unbiased estimate of generalization error, expression (2.1) in our notes), for this data set? Please try to be concise.

Question 8. Despite the concerns raised by the previous item, let's try the validation approach anyway to see how it works. I prepare and randomly divide the data into n = 20 training observations and n∗ = 13 validation observations for you.

> ## Randomly choose n=20 training cases, with remaining n*=13 for
> ## validation.
> set.seed(24601+711) ## Jean Valjean gets a Big Gulp (reproducibility)
> (ntot <- dim(hw1.df)[1])
> n <- 20 ##<-- training set size
> trainindx <- sample(x=1:ntot, size=n, replace=FALSE)
> train.df <- hw1.df[trainindx, ]
> val.df <- hw1.df[-trainindx, ]

Here, we compare the validation set approach to the previous criteria using the previous eight models obtained from regsubsets, best of each size (according to RSS), and best across size according to whichever of the previous criteria. (We get the best models according to our previous criteria, but we may be missing the best according to the validation set approach by restricting ourselves to these eight models. We would have to conduct the validation set approach on all 2^8 = 256 models to be sure, which is not bad, but we ignore this for expediency of this item and just use the eight models previously obtained for our comparison.) I get you started by obtaining for you the fitted/predicted values ŷ_i∗ = µ̂(x_i∗), i∗ = 1, ..., n∗ = 13, for each of the eight models on the validation set.

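> atemods <- coef(hw1.reg, id=1:8)
> atemuhats <- sapply(atemods, function(bhat, val) {
+   ## Columns of val for the inputs in this model (drop the intercept name)
+   X <- as.matrix(val[, names(bhat)[-1], drop=FALSE])
+   B <- as.matrix(bhat[-1])
+   ## Intercept plus matrix product gives the predictions on val
+   muhat <- as.vector(bhat[1] + X %*% B)
+ }, val=val.df)
> dim(atemuhats)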

(i) Estimate generalization error (expression (2.1) in our notes) by computing MSPR (expression (2.3) in our notes) for each of the eight models. (Loosely and popularly, we'd just say "compute prediction error on the validation set for each of the eight models.") (10 points)

(ii) Create a plot of MSPR vs. model complexity.
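The validation computation can be sketched on simulated data as below. This assumes MSPR is the mean squared prediction error over the n∗ validation cases, which is the usual definition; expression (2.3) of the notes is not reproduced here. The three nested models and all names are hypothetical.

```r
## Simulated stand-in data, split 20/13 like the assignment.
set.seed(3)
ntot <- 33
n <- 20
sim.df <- data.frame(x1 = rnorm(ntot), x2 = rnorm(ntot), x3 = rnorm(ntot))
sim.df$y <- 1 + 2 * sim.df$x1 + rnorm(ntot)
idx <- sample(ntot, size = n, replace = FALSE)
train <- sim.df[idx, ]
val <- sim.df[-idx, ]

## Three hypothetical candidate models of increasing size, fit on train only.
forms <- list(y ~ x1, y ~ x1 + x2, y ~ x1 + x2 + x3)
muhats <- sapply(forms, function(f) predict(lm(f, data = train), newdata = val))

## MSPR per model: average squared error over the validation rows.
mspr <- colMeans((val$y - muhats)^2)
mspr

## MSPR vs. model complexity (number of inputs).
plot(seq_along(mspr), mspr, type = "b", xlab = "Number of inputs", ylab = "MSPR")
```

Here muhats plays the role of atemuhats: one column of validation predictions per model, so the column means of the squared errors give one MSPR per model.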

Question 9. What is the best predicting model according to MSPR (prediction error on the validation set)? Give the fitted model coefficients and corresponding input names of that selected model and compare to our previous results, commenting briefly.

Question 10. For this item, instead of using the validation set approach of the previous item, we will use K-fold cross-validation (CV(K)) as the criterion to select a best predicting model. In particular, use K = n, i.e., n-fold CV, CV(n), or LOOCV. That is, K = n folds produce n training sets of size ni = n - 1 to predict on each of the corresponding n validation sets of size n∗i = 1 thereby producing CV(n), an estimate of generalization error, one for each of our eight models that we've been considering. (i is indexing folds now as in the expression on the bottom of page 73 of our notes.) (You will find various recommendations as to the best value of K to use, e.g., K = 5 or K = 10, etc., but we use LOOCV for illustration.)

We will leave the details of computing CV(n) to the function boot::cv.glm; this will compute the expression on the bottom of page 73 of our notes, with K = n, of course. (You may have to download/install boot...?) (Given that you (should have) computed MSPR in the previous problem, it would not be difficult, though it may be a bit tedious, to compute CV(n) yourselves from the expression in our notes.) To use this function, we need to fit our models using the function stats::glm (stats is installed by default). For example, the code below uses glm to fit a linear model, just as lm does, and just as regsubsets does (we don't exploit the "g"eneralized in glm; we will in INF 512). The code then performs CV(n) by default using cv.glm. In the example code, CV(n) is contained in the first element of the list component named delta in the list object named eg.cv.n, returned by cv.glm (and is put into cv.n).

> ## This is not likely one of your eight models! Just an example!
> eg.glm <- glm(y~x1+x2+x3, data=hw1.df)
> eg.cv.n <- boot::cv.glm(data=hw1.df, glmfit=eg.glm)
> (cv.n <- eg.cv.n$delta[1]) ## CV(n) for this model
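For intuition about what CV(n) measures, the base-R sketch below computes it by hand on simulated data: once with an explicit leave-one-out loop and once with the standard leverage shortcut for least squares. The data and model are hypothetical. For a Gaussian linear model with the default squared-error cost, this quantity should correspond to delta[1] from cv.glm, though that is stated here as an expectation rather than something checked against the course data.

```r
## Simulated stand-in data.
set.seed(4)
sim.df <- data.frame(x1 = rnorm(33), x2 = rnorm(33))
sim.df$y <- 1 + 2 * sim.df$x1 + rnorm(33)
n <- nrow(sim.df)

## Explicit LOOCV: refit without case i, then predict case i.
loo.err <- sapply(seq_len(n), function(i) {
  fit.i <- lm(y ~ x1 + x2, data = sim.df[-i, ])
  sim.df$y[i] - predict(fit.i, newdata = sim.df[i, ])
})
cv.loop <- mean(loo.err^2)

## Shortcut for least squares: LOO residual = e_i / (1 - h_ii).
fit <- lm(y ~ x1 + x2, data = sim.df)
cv.hat <- mean((resid(fit) / (1 - hatvalues(fit)))^2)

all.equal(cv.loop, cv.hat)
```

The shortcut avoids the n refits entirely, which is why LOOCV is cheap for linear models even though K = n sounds expensive.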

Using the above code to get you started, perform CV(n) on eight models obtained from regsubsets, analogous to the validation set approach, above, but using the entire data set in the CV approach, just as we did using the originally considered criteria that don't require a separate validation set, right!? Once you get your models from regsubsets (perhaps similar to the way I showed above), you should (re-)fit each model "manually" in glm or you may devise a more automated way to (re-)fit these eight models using glm.

Show your code (i) for getting the eight models from regsubsets (perhaps like I did above), (ii) for fitting each of the eight models using glm and (iii) for performing CV(n) using cv.glm. Also, (iv) print the value of CV(n) for each model and indicate the size of the best model according to CV(n).
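One possible way to automate the refitting is sketched below. It assumes a logical matrix shaped like the which component of summary(regsubsets(...)), with one row per model size and one column per term; here that matrix is written out by hand for three hypothetical inputs on simulated data, so the selected subsets are illustrative only. The boot package must be installed.

```r
## Simulated stand-in data with three hypothetical inputs.
set.seed(5)
sim.df <- data.frame(x1 = rnorm(33), x2 = rnorm(33), x3 = rnorm(33))
sim.df$y <- 1 + 2 * sim.df$x1 - sim.df$x3 + rnorm(33)

## Hand-written stand-in for summary(regsubsets(...))$which (an assumption).
which.mat <- rbind(
  "1" = c(TRUE, TRUE, FALSE, FALSE),
  "2" = c(TRUE, TRUE, FALSE, TRUE),
  "3" = c(TRUE, TRUE, TRUE,  TRUE)
)
colnames(which.mat) <- c("(Intercept)", "x1", "x2", "x3")

## Refit each row's model with glm and record CV(n) from cv.glm.
cv.n <- numeric(nrow(which.mat))
for (k in seq_len(nrow(which.mat))) {
  vars <- setdiff(colnames(which.mat)[which.mat[k, ]], "(Intercept)")
  form <- reformulate(vars, response = "y")
  fit.k <- glm(form, data = sim.df)
  cv.n[k] <- boot::cv.glm(data = sim.df, glmfit = fit.k)$delta[1]
}
cv.n
which.min(cv.n)   # size of the model with the smallest CV(n)
```

A plain for loop is used rather than a nested function because cv.glm re-evaluates the stored glm call, and keeping form and sim.df in the calling environment avoids scoping surprises.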

Question 11. Using the results from the previous item, prepare a plot of CV(n) vs. model size. Show your code and plot.

Question 12. What is the best predicting model according to CV(n)? Give the fitted model coefficients and corresponding input names.

Question 13. A previous problem asked for a prediction from the best fitted model according to BIC. Repeat that problem now using the best fitted model according to CV(n). Again, use the median values of the inputs for x∗ (not all of which are used).
