Reference no: EM132297770
Hypothesis Testing
We have the same spike data as for Assignment 1.
spikes <- c(0.220136914, 1.252061356, 0.943525370, 0.907732787, 1.157388806, 0.342485956,
0.291760012, 0.556866189, 0.738992636, 0.690779640, 0.425849738, 0.876344116,
1.248761245, 0.697514552, 0.174445203, 1.376500202, 0.731507303, 0.483036515,
0.650835440, 1.106788259, 0.587840538, 0.978983532, 1.179754064, 0.941462421,
0.749840071, 0.005994156, 0.664525928, 0.816033621, 0.483828371, 0.524253461)
You are to test the hypothesis that the spike distribution is Weibull with shape=2.5 and scale=1.0. You will create a hypothesis testing algorithm by sampling. So:
H0 : spikes ~ Weibull(shape=2.5, scale=1.0)
The test will consist of generating samples of size n = 30 and seeing how often their likelihood is less than the likelihood of the spikes sample above. To avoid rejecting H0 at 90% confidence, the generated samples' likelihood should be less than the spikes likelihood more than 10% of the time.
Generation
Write the sample generator and the likelihood function in R. The likelihood function should input a single data vector like spikes and return its likelihood under H0.
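As a hedged starting point, the two pieces might look like the following sketch (the names loglik.H0 and gen.sample are placeholders, not given in the assignment; working with log-likelihoods is equivalent here, since log is monotone and only comparisons matter, and it avoids numerical underflow):

```r
# Log-likelihood of a data vector under H0: Weibull(shape=2.5, scale=1.0)
loglik.H0 <- function(x) sum(dweibull(x, shape=2.5, scale=1.0, log=TRUE))

# Generate one sample of size n under H0
gen.sample <- function(n=30) rweibull(n, shape=2.5, scale=1.0)
```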
Confidence in sampling
Make sure you generate enough samples so that you can be 95% confident of accept/reject. Come up with a reasonable justification using confidence intervals. How many samples are needed and why?
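One standard justification, sketched via the normal approximation to the binomial: the estimated proportion has a 95% confidence interval of roughly p ± 1.96·sqrt(p(1−p)/N), so solve for N at the decision threshold p = 0.10 with a chosen margin (0.01 below is an assumption, not a value from the assignment):

```r
# N such that the 95% CI half-width for a proportion near p is <= margin
mc.samples.needed <- function(p=0.10, margin=0.01, z=1.96) {
  ceiling(z^2 * p * (1 - p) / margin^2)
}
mc.samples.needed()  # → 3458
```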
Hypothesis testing
Run the sampler for enough samples to evaluate the hypothesis. Report your results.
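Putting the pieces together, one self-contained sketch of the whole test (the 0.10 threshold comes from the 90% confidence statement above; N = 4000 is an assumed Monte Carlo budget consistent with a confidence-interval argument; `spikes` is the data vector given earlier):

```r
loglik.H0 <- function(x) sum(dweibull(x, shape=2.5, scale=1.0, log=TRUE))

set.seed(1)                    # reproducible Monte Carlo estimate
L.spikes <- loglik.H0(spikes)  # log-likelihood of the observed data
N <- 4000
p.hat <- mean(replicate(N, loglik.H0(rweibull(30, 2.5, 1.0)) < L.spikes))
if (p.hat > 0.10) print("do not reject H0") else print("reject H0")
```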
Logistic Regression
We're going to explore the Indian Liver Patient dataset from Kaggle.
The first 400 cases are used as training data, and the remaining as test data. The dataset and features are explained at the Kaggle site. In this question, we will explore the data with R's glm().
Loading the data
First load the data into a data frame using:
il <- read.csv("indian_liver_patient_train.csv", header=TRUE, sep=",")
summary(il)
What issues can you see that could cause problems with using R on this data, for instance to use glm() to predict the presence of liver disease given by the Dataset variable? How would you overcome them? Write a short R script to do this.
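A sketch of such a script, assuming the column names from the Kaggle file (Dataset coded 1 = liver patient, 2 = not; Gender read as a character column; a few missing Albumin_and_Globulin_Ratio values). The name LD is chosen to match the target used in the handout's loop code further down:

```r
il <- read.csv("indian_liver_patient_train.csv", header=TRUE, sep=",")
# recode the 1/2 target into the 0/1 form glm() expects
il$LD <- ifelse(il$Dataset == 1, 1, 0)
il$Dataset <- NULL
# character column -> factor, so glm() builds a dummy variable for it
il$Gender <- as.factor(il$Gender)
# drop the few rows with missing values
il <- na.omit(il)
```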
Run glm() on the data
Try building a basic model with glm() with something like:
fit <- glm(target ~ ., family=binomial, data=il)
where target is the variable indicating liver disease. Print the summary of the model to get the R diagnostics. Briefly explain the statistics in the summary, e.g. Z-value, standard error, etc. What does this imply about the features for your model?
Compare predictions
Then get fitted probabilities for the model using:
pld <- predict(fit, type="response")
Compare the fitted probabilities with the truth values graphically. What do you see?
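One simple graphical check, sketched here (it assumes il$LD is the 0/1 target as set up earlier): box plots of the fitted probabilities split by the true class show how well the two classes separate.

```r
boxplot(pld ~ il$LD,
        xlab = "true class (LD)", ylab = "fitted probability",
        main = "Fitted probabilities by true class")
```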
Stepwise regression with backwards selection
Having built a model with all features, let us modify this by removing one feature at a time.
R provides all sorts of sophisticated routines for doing this like step(), but we will ignore these and write our own.
You will have to identify a way of selecting which feature to remove, or whether to stop removing and quit. For this purpose you could run an additional glm() model, or you could grab diagnostics using R and test them somehow. The following are available, where fit is the returned structure from glm():
• run names(fit), and see variables such as fit$deviance
• values printed by summary(fit) can be obtained
• try expressions such as summary(fit)$coefficients["Age", 2]
• or summary(fit)$coefficients["Total_Proteins", 3]
To help you do this, code to run through all the features and test a formula is as follows:
# here is a DUMMY random function to select a feature to remove
# => you should write a better one!
selectVar <- function(fset) {
  # fset = the existing set of features
  if ( runif(1,0,1)>0.9 ) {
    # your decision of when to stop in here
    return("STOP")
  }
  # your method to select the best feature here
  d <- as.integer(runif(1,0,length(fset)))
  # return the d-th item in the list
  return(fset[d])
}
# this is the active predictor list
nil <- names(il)
# delete the target variables from it
nil <- nil[(!nil=="LD") & (!nil=="Dataset")]
print(paste("STARTING: ", paste(nil,collapse="+")))
for (loop in 1:10) {
  if ( length(nil) == 0) {
    # removed everything so quit
    break
  }
  # var is to be tested
  var <- selectVar(nil)
  if ( var == "STOP") {
    break
  }
  # remove from list
  nil <- nil[!nil==var]
  print(paste("REMOVED: ", var))
  # report
  print(paste("RUNNING: ", paste(nil,collapse="+")))
  # now run with modified list (family=binomial for a logistic model)
  fit <- glm(paste("LD ~ ",paste(nil,collapse="+")), family=binomial, data=il)
  sum.fit <- summary(fit)
  # do something, print some metric, but what?
  print(paste("GLM out: ", sum.fit$df.residual))
}
# report the final fit
summary(fit)
Your task is to rewrite the function selectVar and anything else needed to get the backward selection working. You may want to print diagnostics at the end of the loop.
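One possible replacement for selectVar, sketched below (the 0.05 p-value threshold is an assumption, and il is read from the enclosing environment as in the handout code): refit the model on the current feature set and drop the feature with the largest Wald p-value, stopping when every remaining p-value is under the threshold.

```r
selectVar <- function(fset) {
  fit <- glm(paste("LD ~", paste(fset, collapse="+")),
             family=binomial, data=il)
  p <- summary(fit)$coefficients[, 4]   # the Pr(>|z|) column
  p <- p[names(p) != "(Intercept)"]
  if (max(p) < 0.05) return("STOP")     # everything left looks useful
  worst <- names(p)[which.max(p)]
  # coefficient names for factors (e.g. GenderMale) extend the feature
  # name, so pick the feature whose name is a prefix of `worst`
  fset[which.max(sapply(fset, function(v) startsWith(worst, v)))]
}
```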
Discussion
Report on the initial and final fits produced by your algorithm. The initial fit used all predictors, but reported in the diagnostics that not all were useful. Compare these with your final model. How do the predictors actually used compare with their statistics in the initial model?
Read in the test data "indian_liver_patient_test.csv" and test the resultant models on the test data as well, and report the error rate of prediction. Has the backward selection worked? If not, how wrong was it, and why?
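A sketch of the test-set evaluation (the 0.5 probability cutoff and the recoding of Dataset mirror the training setup and are assumptions, not requirements of the question):

```r
il.test <- read.csv("indian_liver_patient_test.csv", header=TRUE, sep=",")
il.test$LD <- ifelse(il.test$Dataset == 1, 1, 0)  # same recoding as training
il.test$Gender <- as.factor(il.test$Gender)
p <- predict(fit, newdata=il.test, type="response")
# classification error rate at a 0.5 probability cutoff
err <- mean((p > 0.5) != il.test$LD)
print(paste("test error rate:", err))
```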
Linear Regression
We will repeat the exercise with linear regression. So read in the training dataset afresh, and this time the target to predict is Total_Proteins.
Run lm() on the data
Try building a basic model with lm() with something like:
fit <- lm(Total_Proteins ~ ., data=il)
Print the summary of the model to get the R diagnostics. Explain the statistics in the summary, e.g. standard error, t value. Explain how the probability (a p-value) in the table is computed. What does this imply about the features for your model? What does the value of multiple R-squared imply?
Compare predictions
Then get fitted values from the model using:
predict.Total_Proteins <- predict(fit, type="response")
Compare the fitted values with the truth values graphically. What do you see?
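A minimal sketch of the comparison: plot fitted against actual values; points on the diagonal correspond to perfect predictions.

```r
plot(il$Total_Proteins, predict.Total_Proteins,
     xlab = "actual Total_Proteins", ylab = "fitted value")
abline(0, 1, col = "red")   # y = x reference line
```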
Single variable model
Consider using just one predictor, so you have a model like Total_Proteins ~ Albumin or Total_Proteins ~ Total_Bilirubin.
What is a way to pick the best single predictor for this purpose? Explain an approach (but you don't have to code it or run it).
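One reasonable approach, sketched in code purely for illustration (the question does not require running it): for a one-predictor linear model, R-squared equals the squared correlation between predictor and target, so rank the numeric predictors by absolute correlation with Total_Proteins.

```r
num <- sapply(il, is.numeric)
cors <- cor(il[, num], use = "complete.obs")["Total_Proteins", ]
# largest absolute correlation (excluding the target itself) wins
sort(abs(cors[names(cors) != "Total_Proteins"]), decreasing = TRUE)
```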
Stepwise regression with forwards selection
Starting with a model with no features, let us modify this by adding one feature at a time.
R provides all sorts of sophisticated routines for doing this like step(), but we will ignore this and write our own.
You will have to identify a way of selecting which feature to add, or whether to stop adding and quit. For this purpose you could run an additional lm() model, or you could grab diagnostics using R and test them somehow. The following are available, when fit is the returned structure from lm():
• values printed by summary(fit) can be obtained
• see names(summary(fit))
• try expressions such as summary(fit)$coefficients["Age", 2]
• or summary(fit)$r.squared
To help you do this, code to run through all the features and test a formula is as follows:
# here is a DUMMY random function to select a feature to add
# => you should write a better one!
selectAddVar <- function(fset) {
  # fset = the existing set of features to select from
  if ( runif(1,0,1)>0.9 | length(fset)==0) {
    # your decision of when to stop in here
    return("STOP")
  }
  # your method to select the best feature here
  d <- as.integer(runif(1,0,length(fset)))
  return(fset[d])
}
# this is the active predictor list
nil <- names(il)
# delete the target variables from set
nil <- nil[(!nil=="LD") & (!nil=="Total_Proteins")]
print(paste("STARTING: ", paste(nil,collapse="+")))
# start with the empty set of features for lm()
fset <- NULL
for (loop in 1:10) {
  if ( length(nil)==0 ) {
    # quit if none more to add
    break
  }
  # var is to be tested
  var <- selectAddVar(nil)
  if ( var == "STOP") {
    # quit as told to stop
    break
  }
  # remove from list
  nil <- nil[!nil==var]
  fset <- append(fset,var)
  print(paste("ADDED: ", var))
  # report
  print(paste("RUNNING: ", paste(fset,collapse="+")))
  # now run with modified list
  fit <- lm(paste("Total_Proteins ~ ",paste(fset,collapse="+")), data=il)
  # do something, print some metric, but what?
  sum.fit <- summary(fit)
  print(paste("LM out: ", sum.fit$r.squared))
}
# report the final fit
summary(fit)
Your task is to rewrite the function selectAddVar and anything else needed to get the forward selection working.
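One possible selectAddVar, sketched under the handout's loop conventions (it reads the currently selected set fset and the data il from the enclosing environment; adjusted R-squared is the assumed criterion): try each remaining candidate, keep the one that most improves adjusted R-squared, and stop when nothing improves it.

```r
selectAddVar <- function(cand) {
  base <- if (length(fset) == 0) "1" else paste(fset, collapse="+")
  cur <- summary(lm(paste("Total_Proteins ~", base), data=il))$adj.r.squared
  # adjusted R-squared for each candidate addition
  r2 <- sapply(cand, function(v)
    summary(lm(paste("Total_Proteins ~", paste(c(fset, v), collapse="+")),
               data=il))$adj.r.squared)
  if (max(r2) <= cur) return("STOP")   # no candidate improves the fit
  cand[which.max(r2)]
}
```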
Analysis and Discussion
Report on the initial and final fits produced by your algorithm, their diagnostics, as well as the final set of variables selected. How do they compare?
Read in the test data "indian_liver_patient_test.csv" and test the resultant models on the test data as well, and report the mean square error. Has the forward selection worked? If not, how wrong was it, and why?
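A sketch of the test-set evaluation (column names assumed as in the training file):

```r
il.test <- read.csv("indian_liver_patient_test.csv", header=TRUE, sep=",")
pred <- predict(fit, newdata=il.test)
# mean square error on the held-out data
mse <- mean((pred - il.test$Total_Proteins)^2)
print(paste("test MSE:", mse))
```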
Attachment:- Modelling for data analysis.rar