Reference no: EM132297770
Hypothesis Testing
We have the same spike data as for Assignment 1.
spikes <- c(0.220136914, 1.252061356, 0.943525370, 0.907732787, 1.157388806, 0.342485956,
0.291760012, 0.556866189, 0.738992636, 0.690779640, 0.425849738, 0.876344116,
1.248761245, 0.697514552, 0.174445203, 1.376500202, 0.731507303, 0.483036515,
0.650835440, 1.106788259, 0.587840538, 0.978983532, 1.179754064, 0.941462421,
0.749840071, 0.005994156, 0.664525928, 0.816033621, 0.483828371, 0.524253461)
You are to test the hypothesis that the spike distribution is Weibull with shape=2.5 and scale=1.0. You will create a hypothesis testing algorithm by sampling. So:
H0 : spikes ~ Weibull(shape=2.5, scale=1.0)
The test will consist of generating samples of size n = 30 and seeing how often their likelihood is less than the likelihood of the spikes sample above. To avoid rejecting H0 at 90% confidence, the generated samples' likelihood should be less than the spikes likelihood more than 10% of the time.
Generation
Write the sample generator and the likelihood function in R. The likelihood function should input a single data vector like spikes and return its likelihood under H0.
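As a hedged starting point, the two pieces might look like the following sketch (the names loglik.H0 and gen.sample are placeholders, not given in the assignment; working with log-likelihoods is equivalent here, since log is monotone and only comparisons matter, and it avoids numerical underflow):

```r
# Log-likelihood of a data vector under H0: Weibull(shape=2.5, scale=1.0)
loglik.H0 <- function(x) sum(dweibull(x, shape=2.5, scale=1.0, log=TRUE))

# Generate one sample of size n under H0
gen.sample <- function(n=30) rweibull(n, shape=2.5, scale=1.0)
```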
Confidence in sampling
Make sure you generate enough samples so that you can be 95% confident of accept/reject. Come up with a reasonable justification using confidence intervals. How many samples are needed and why?
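One standard justification, sketched via the normal approximation to the binomial: the estimated proportion has a 95% confidence interval of roughly p ± 1.96·sqrt(p(1−p)/N), so solve for N at the decision threshold p = 0.10 with a chosen margin (0.01 below is an assumption, not a value from the assignment):

```r
# N such that the 95% CI half-width for a proportion near p is <= margin
mc.samples.needed <- function(p=0.10, margin=0.01, z=1.96) {
  ceiling(z^2 * p * (1 - p) / margin^2)
}
mc.samples.needed()  # → 3458
```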
Hypothesis testing
Run the sampler for enough samples to evaluate the hypothesis. Report your results.
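Putting the pieces together, one self-contained sketch of the whole test (the 0.10 threshold comes from the 90% confidence statement above; N = 4000 is an assumed Monte Carlo budget consistent with a confidence-interval argument; `spikes` is the data vector given earlier):

```r
loglik.H0 <- function(x) sum(dweibull(x, shape=2.5, scale=1.0, log=TRUE))

set.seed(1)                    # reproducible Monte Carlo estimate
L.spikes <- loglik.H0(spikes)  # log-likelihood of the observed data
N <- 4000
p.hat <- mean(replicate(N, loglik.H0(rweibull(30, 2.5, 1.0)) < L.spikes))
if (p.hat > 0.10) print("do not reject H0") else print("reject H0")
```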
Logistic Regression
We're going to explore the Indian Liver Patient dataset from Kaggle.
The first 400 cases are used as training data, and the remaining as test data. The dataset and features are explained at the Kaggle site. In this question, we will explore the data with R's glm().
Loading the data
First load the data into a data frame using:
il <- read.csv("indian_liver_patient_train.csv", header=TRUE, sep=",")
summary(il)
What issues can you see that could cause problems with using R on this data, for instance to use glm() to predict the presence of liver disease given by the Dataset variable? How would you overcome them? Write a short R script to do this.
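A sketch of such a script, assuming the column names from the Kaggle file (Dataset coded 1 = liver patient, 2 = not; Gender read as a character column; a few missing Albumin_and_Globulin_Ratio values). The name LD is chosen to match the target used in the handout's loop code further down:

```r
il <- read.csv("indian_liver_patient_train.csv", header=TRUE, sep=",")
# recode the 1/2 target into the 0/1 form glm() expects
il$LD <- ifelse(il$Dataset == 1, 1, 0)
il$Dataset <- NULL
# character column -> factor, so glm() builds a dummy variable for it
il$Gender <- as.factor(il$Gender)
# drop the few rows with missing values
il <- na.omit(il)
```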
Run glm() on the data
Try building a basic model with glm() with something like:
fit <- glm(target ~ ., family=binomial, data=il)
where target is the variable indicating liver disease. Print the summary of the model to get the R diagnostics. Briefly explain the statistics in the summary, e.g. Z-value, standard error, etc. What does this imply about the features for your model?
Compare predictions
Then get fitted probabilities for the model using:
pld <- predict(fit, type="response")
Compare the fitted probabilities with the truth values graphically. What do you see?
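One simple graphical check, sketched here (it assumes il$LD is the 0/1 target as set up earlier): box plots of the fitted probabilities split by the true class show how well the two classes separate.

```r
boxplot(pld ~ il$LD,
        xlab = "true class (LD)", ylab = "fitted probability",
        main = "Fitted probabilities by true class")
```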
Stepwise regression with backwards selection
Having built a model with all features, let us modify this by removing one feature at a time.
R provides all sorts of sophisticated routines for doing this like step(), but we will ignore these and write our own.
You will have to identify a way of selecting which feature to remove, or whether to stop removing and quit. For this purpose you could run an additional glm() model, or you could grab diagnostics using R and test them somehow. The following are available, where fit is the returned structure from glm():
• run names(fit), and see variables such as fit$deviance
• values printed by summary(fit) can be obtained
• try expressions such as summary(fit)$coefficients["Age", 2]
• or summary(fit)$coefficients["Total_Proteins", 3]
To help you do this, code to run through all the features and test a formula is as follows:
# here is a DUMMY random function to select a feature to remove
# => you should write a better one!
selectVar <- function(fset) {
  # fset = the existing set of features
  if ( runif(1,0,1)>0.9 ) {
    # your decision of when to stop in here
    return("STOP")
  }
  # your method to select the best feature here
  d <- as.integer(runif(1,0,length(fset)))
  # return the d-th item in the list
  return(fset[d])
}
# this is the active predictor list
nil <- names(il)
# delete the target variables from it
nil <- nil[(!nil=="LD") & (!nil=="Dataset")]
print(paste("STARTING: ", paste(nil,collapse="+")))
for (loop in 1:10) {
  if ( length(nil) == 0) {
    # removed everything so quit
    break
  }
  # var is to be tested
  var <- selectVar(nil)
  if ( var == "STOP") {
    break
  }
  # remove from list
  nil <- nil[!nil==var]
  print(paste("REMOVED: ", var))
  # report
  print(paste("RUNNING: ", paste(nil,collapse="+")))
  # now run with modified list (family=binomial for a logistic model)
  fit <- glm(paste("LD ~ ",paste(nil,collapse="+")), family=binomial, data=il)
  sum.fit <- summary(fit)
  # do something, print some metric, but what?
  print(paste("GLM out: ", sum.fit$df.residual))
}
# report the final fit
summary(fit)
Your task is to rewrite the function selectVar and anything else needed to get the backward selection working. You may want to print diagnostics at the end of the loop.
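One possible replacement for selectVar, sketched below (the 0.05 p-value threshold is an assumption, and il is read from the enclosing environment as in the handout code): refit the model on the current feature set and drop the feature with the largest Wald p-value, stopping when every remaining p-value is under the threshold.

```r
selectVar <- function(fset) {
  fit <- glm(paste("LD ~", paste(fset, collapse="+")),
             family=binomial, data=il)
  p <- summary(fit)$coefficients[, 4]   # the Pr(>|z|) column
  p <- p[names(p) != "(Intercept)"]
  if (max(p) < 0.05) return("STOP")     # everything left looks useful
  worst <- names(p)[which.max(p)]
  # coefficient names for factors (e.g. GenderMale) extend the feature
  # name, so pick the feature whose name is a prefix of `worst`
  fset[which.max(sapply(fset, function(v) startsWith(worst, v)))]
}
```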
Discussion
Report on the initial and final fits produced by your algorithm. The initial fit used all predictors, but reported in the diagnostics that not all were useful. Compare these with your final model. How do the predictors actually used compare with their statistics in the initial model?
Read in the test data "indian_liver_patient_test.csv" and test the resultant models on the test data as well, and report the error rate of prediction. Has the backward selection worked? If not, how wrong was it, and why?
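A sketch of the test-set evaluation (the 0.5 probability cutoff and the recoding of Dataset mirror the training setup and are assumptions, not requirements of the question):

```r
il.test <- read.csv("indian_liver_patient_test.csv", header=TRUE, sep=",")
il.test$LD <- ifelse(il.test$Dataset == 1, 1, 0)  # same recoding as training
il.test$Gender <- as.factor(il.test$Gender)
p <- predict(fit, newdata=il.test, type="response")
# classification error rate at a 0.5 probability cutoff
err <- mean((p > 0.5) != il.test$LD)
print(paste("test error rate:", err))
```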
Linear Regression
We will repeat the exercise with linear regression. So read in the training dataset afresh, and this time the target to predict is Total_Proteins.
Run lm() on the data
Try building a basic model with lm() with something like:
fit <- lm(Total_Proteins ~ ., data=il)
Print the summary of the model to get the R diagnostics. Explain the statistics in the summary, e.g. standard error, t value. Explain how the probability (a p-value) in the table is computed. What does this imply about the features for your model? What does the value of multiple R-squared imply?
Compare predictions
Then get fitted values from the model using:
predict.Total_Proteins <- predict(fit, type="response")
Compare the fitted values with the truth values graphically. What do you see?
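A minimal sketch of the comparison: plot fitted against actual values; points on the diagonal correspond to perfect predictions.

```r
plot(il$Total_Proteins, predict.Total_Proteins,
     xlab = "actual Total_Proteins", ylab = "fitted value")
abline(0, 1, col = "red")   # y = x reference line
```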
Single variable model
Consider using just one predictor, so you have a model like Total_Proteins ~ Albumin or Total_Proteins ~ Total_Bilirubin.
What is a way to pick the best single predictor for this purpose? Explain an approach (but you don't have to code it or run it).
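One reasonable approach, sketched in code purely for illustration (the question does not require running it): for a one-predictor linear model, R-squared equals the squared correlation between predictor and target, so rank the numeric predictors by absolute correlation with Total_Proteins.

```r
num <- sapply(il, is.numeric)
cors <- cor(il[, num], use = "complete.obs")["Total_Proteins", ]
# largest absolute correlation (excluding the target itself) wins
sort(abs(cors[names(cors) != "Total_Proteins"]), decreasing = TRUE)
```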
Stepwise regression with forwards selection
Starting with a model with no features, let us modify this by adding one feature at a time.
R provides all sorts of sophisticated routines for doing this like step(), but we will ignore this and write our own.
You will have to identify a way of selecting which feature to add, or whether to stop adding and quit. For this purpose you could run an additional lm() model, or you could grab diagnostics using R and test them somehow. The following are available, when fit is the returned structure from lm():
• values printed by summary(fit) can be obtained
• see names(summary(fit))
• try expressions such as summary(fit)$coefficients["Age", 2]
• or summary(fit)$r.squared
To help you do this, code to run through all the features and test a formula is as follows:
# here is a DUMMY random function to select a feature to add
# => you should write a better one!
selectAddVar <- function(fset) {
  # fset = the existing set of features to select from
  if ( runif(1,0,1)>0.9 | length(fset)==0) {
    # your decision of when to stop in here
    return("STOP")
  }
  # your method to select the best feature here
  d <- as.integer(runif(1,0,length(fset)))
  return(fset[d])
}
# this is the active predictor list
nil <- names(il)
# delete the target variables from set
nil <- nil[(!nil=="LD") & (!nil=="Total_Proteins")]
print(paste("STARTING: ", paste(nil,collapse="+")))
# start with the empty set of features for lm()
fset <- NULL
for (loop in 1:10) {
  if ( length(nil)==0 ) {
    # quit if none more to add
    break
  }
  # var is to be tested
  var <- selectAddVar(nil)
  if ( var == "STOP") {
    # quit as told to stop
    break
  }
  # remove from list
  nil <- nil[!nil==var]
  fset <- append(fset,var)
  print(paste("ADDED: ", var))
  # report
  print(paste("RUNNING: ", paste(fset,collapse="+")))
  # now run with modified list
  fit <- lm(paste("Total_Proteins ~ ",paste(fset,collapse="+")), data=il)
  # do something, print some metric, but what?
  sum.fit <- summary(fit)
  print(paste("LM out: ", sum.fit$r.squared))
}
# report the final fit
summary(fit)
Your task is to rewrite the function selectAddVar and anything else needed to get the forward selection working.
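One possible selectAddVar, sketched under the handout's loop conventions (it reads the currently selected set fset and the data il from the enclosing environment; adjusted R-squared is the assumed criterion): try each remaining candidate, keep the one that most improves adjusted R-squared, and stop when nothing improves it.

```r
selectAddVar <- function(cand) {
  base <- if (length(fset) == 0) "1" else paste(fset, collapse="+")
  cur <- summary(lm(paste("Total_Proteins ~", base), data=il))$adj.r.squared
  # adjusted R-squared for each candidate addition
  r2 <- sapply(cand, function(v)
    summary(lm(paste("Total_Proteins ~", paste(c(fset, v), collapse="+")),
               data=il))$adj.r.squared)
  if (max(r2) <= cur) return("STOP")   # no candidate improves the fit
  cand[which.max(r2)]
}
```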
Analysis and Discussion
Report on the initial and final fits produced by your algorithm, their diagnostics, as well as the final set of variables selected. How do they compare?
Read in the test data "indian_liver_patient_test.csv" and test the resultant models on the test data as well, and report the mean square error. Has the forward selection worked? If not, how wrong was it, and why?
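A sketch of the test-set evaluation (column names assumed as in the training file):

```r
il.test <- read.csv("indian_liver_patient_test.csv", header=TRUE, sep=",")
pred <- predict(fit, newdata=il.test)
# mean square error on the held-out data
mse <- mean((pred - il.test$Total_Proteins)^2)
print(paste("test MSE:", mse))
```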
Attachment:- Modelling for data analysis.rar