Plot the receiver operating characteristic

Reference no: EM131443621

Abalone data for Q1

In this problem we analyze a dataset of 4177 observations with 8 predictor variables and a ring count (Rings), and try to predict whether the number of rings of an abalone is greater than 9. The complete dataset description can be found at https://archive.ics.uci.edu/ml/datasets/Abalone. The variables in the dataset are:

- Sex: nominal variable - takes levels M, F, and I (infant)
- Length: continuous variable (mm) - longest shell measurement
- Diameter: continuous variable (mm) - perpendicular to length
- Height: continuous variable (mm) - with meat in shell
- Whole weight: continuous variable (grams) - whole abalone
- Shucked weight: continuous variable (grams) - weight of meat
- Viscera weight: continuous variable (grams) - gut weight (after bleeding)
- Shell weight: continuous variable (grams) - after being dried
- Rings: integer

We are interested in predicting whether the Rings variable is greater than 9, so you need to create a binary response based on it:

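# Read the raw file; it has no header, so columns are named V1..V9 (V9 = Rings)
faba <- read.table("abalone.data",sep=",")

# Binary response: y = 1 if V9 > 8, else 0
faba$y <- ifelse(faba$V9>8,1,0)

head(faba)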

##   V1    V2    V3    V4     V5     V6     V7    V8 V9 y
## 1  M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15 1
## 2  M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070  7 0
## 3  F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210  9 1
## 4  M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10 1
## 5  I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055  7 0
## 6  I 0.425 0.300 0.095 0.3515 0.1410 0.0775 0.120  8 0

Ships data for Q2

We are interested in the number of accidents per month for a sample of ships (a classic example given by McCullagh & Nelder, 1989). The data can be found in the file "ships.csv", which contains 40 observations on 14 variables. The response variable is called ACC. The explanatory variables are:

- TYPE: there are 5 ship types, labelled 1 to 5. TYPE is a categorical variable, coded by the 5 dummy variables TA, TB, TC, TD, TE.
- CONSTRUCTION YEAR: the ships were constructed in one of four periods, leading to the dummy variables T6064, T6569, T7074, T7579.
- MONTHS: a measure of the number of service months that the ship has already carried out.

ships <- read.table("ships.csv",header=T,sep=",")
str(ships)

head(ships)

##   TYPE TA TB TC TD TE T6064 T6569 T7074 T7579 O6074 O7579 MONTHS ACC
## 1    1  1  0  0  0  0     1     0     0     0     1     0    127   0
## 2    1  1  0  0  0  0     1     0     0     0     0     1     63   0
## 3    1  1  0  0  0  0     0     1     0     0     1     0   1095   3
## 4    1  1  0  0  0  0     0     1     0     0     0     1   1095   4
## 5    1  1  0  0  0  0     0     0     1     0     1     0   1512   6
## 6    1  1  0  0  0  0     0     0     1     0     0     1   3353  18

Q1. Binary classification of Abalone data.

(1a) We are going to use the first 3133 samples to train the model, and the rest will be used as the test set. Show your R code to get the training data and testing data. Find the mean and standard error of the continuous variables (V2-V8). Standardize all the continuous predictors (V2-V8) in the training set using the formula (X - mean(X))/sd(X). Use the mean and sd from the training set to standardize the corresponding predictor in the testing data set.

xtrain <- faba[1:3133,1:8]
ytrain <- as.factor( faba[1:3133,10] )
xtest <- faba[-c(1:3133),1:8]
ytest <- as.factor( faba[-c(1:3133),10] )
# continue to write your code
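The standardization step in (1a) can be sketched as follows. This is a minimal illustration on made-up numbers, not the abalone solution: the data frame `toy` and its split are hypothetical, and only base R is used. The point it shows is that the centering and scaling constants are estimated on the training rows and then reused unchanged on the test rows.

```r
# Toy data standing in for continuous predictors (hypothetical values).
set.seed(1)
toy <- data.frame(V2 = runif(10), V3 = runif(10))
train <- toy[1:7, ]
test  <- toy[-(1:7), ]

# Means and standard deviations are estimated on the training rows only.
mu  <- colMeans(train)
sdv <- apply(train, 2, sd)

# scale() subtracts `center` and divides by `scale`, column by column.
train_std <- scale(train, center = mu, scale = sdv)
test_std  <- scale(test,  center = mu, scale = sdv)

colMeans(train_std)        # numerically zero by construction
apply(train_std, 2, sd)    # one by construction
```

The standardized test columns will generally not have mean 0 and sd 1 exactly, because they are scaled with the training constants.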

(1b) Fit a LASSO logistic regression (i.e., logistic regression with a LASSO penalty) model using glmnet. Use 10-fold cross-validation to choose the optimal value of the regularizer, show your R code, and print the optimal λ obtained from the cross-validation. Predict on both the training and testing data sets, then print the confusion matrix and report the mean error rate (fraction of incorrect labels) for each.

# Training the model on the standardized training set
# alpha=0 for ridge penalty; alpha=1 for the LASSO penalty
library(glmnet)

# .....
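As a rough sketch of the workflow in (1b), the block below runs `cv.glmnet` on simulated data. The matrix `x`, the response `y`, and the seed are hypothetical stand-ins, so the numbers it prints say nothing about the abalone data. Note that glmnet expects a numeric matrix, so a factor such as V1 would first need dummy coding (for example with `model.matrix`).

```r
library(glmnet)

# Simulated stand-in for a standardized training set (hypothetical values).
set.seed(303)
n <- 400
x <- matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, paste0("x", 1:5)))
y <- factor(rbinom(n, 1, plogis(1.5 * x[, 1] - x[, 2])))

# 10-fold cross-validation over the lambda path; alpha = 1 is the LASSO penalty.
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 10)
cvfit$lambda.min   # lambda with the smallest cross-validated deviance

# Class predictions at the selected lambda, then confusion matrix and error rate.
pred <- predict(cvfit, newx = x, s = "lambda.min", type = "class")
conf <- table(truth = y, predicted = pred[, 1])
err  <- mean(pred[, 1] != y)
conf
err
```

The same calls with `newx` set to the standardized test matrix give the test-set confusion matrix and error rate.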

(1c) Plot the receiver operating characteristic (ROC) curve on the test data. Use package ROCR to get the ROC curve and use ggplot2 to plot the ROC curve. Report the area under the ROC curve (AUC).
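One possible way to combine ROCR and ggplot2 is sketched below. The vectors `truth` and `scores` are simulated, hypothetical inputs; with a glmnet fit the scores would instead come from `predict(..., type = "response")` on the test matrix.

```r
library(ROCR)
library(ggplot2)

# Hypothetical true labels and predicted probabilities for a test set.
set.seed(303)
truth  <- rbinom(200, 1, 0.5)
scores <- plogis(2 * truth - 1 + rnorm(200))

pred <- prediction(scores, truth)            # ROCR prediction object
perf <- performance(pred, "tpr", "fpr")      # ROC coordinates
auc  <- performance(pred, "auc")@y.values[[1]]

# ROCR stores the curve in S4 slots; copy it into a data frame for ggplot2.
roc_df <- data.frame(fpr = perf@x.values[[1]], tpr = perf@y.values[[1]])
p <- ggplot(roc_df, aes(x = fpr, y = tpr)) +
  geom_line() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "False positive rate", y = "True positive rate",
       title = sprintf("ROC curve (AUC = %.3f)", auc))
print(p)
auc
```

The dashed diagonal marks a classifier with no discriminating ability (AUC of 0.5), which makes the curve easier to read.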

(1d) Plot the receiver operating characteristic (ROC) curve on the test data using the ridge penalty. Also, report the area under the ROC curve (AUC).

Q2. Analysis of ships data.

(2a) Make a histogram of the variable ACC. Comment on its form.

ships=read.table("ships.csv",header=T,sep=",")

# ...

Comments:
. . .
(2b) Estimate the Poisson regression model including all explanatory variables and a constant term. Show your R code and summary output, and comment on the coefficient for the variable MONTHS: is it significant?
Be careful when fitting the Poisson model. Note that if you include all the type (TA-TE) and year (T6064-T7579) dummy variables, an error message would be generated and no estimation would be performed. To avoid this, TA is chosen as the reference category for type, and T6064 is chosen as the reference category for construction year.

ships=read.table("ships.csv",header=T,sep=",")
options(scipen=5)

#...
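The reference-category idea in (2b) can be illustrated on simulated data. Everything in the block below (the data frame `toy`, its three hypothetical types, and the coefficients used to generate it) is made up for illustration and is not the ships data; the year dummies would be handled the same way by leaving out T6064.

```r
# Simulated counts with a three-level type coded as dummies (hypothetical data).
set.seed(42)
n <- 40
type <- factor(sample(c("A", "B", "C"), n, replace = TRUE))
toy <- data.frame(TA = as.integer(type == "A"),
                  TB = as.integer(type == "B"),
                  TC = as.integer(type == "C"),
                  MONTHS = round(runif(n, 50, 3000)))
toy$ACC <- rpois(n, exp(0.2 + 0.0005 * toy$MONTHS + 0.5 * toy$TB))

# TA is left out of the formula, so type A becomes the reference category
# and its effect is absorbed into the intercept.
fit <- glm(ACC ~ TB + TC + MONTHS, family = poisson, data = toy)

# Estimate, standard error, z value and p-value for MONTHS.
summary(fit)$coefficients["MONTHS", ]
```

Each dummy coefficient is then read as a log rate ratio relative to the omitted category.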

Comments on the coefficient for the variable MONTHS:
. . .
(2c) Perform a Wald test for the joint significance of all the type dummy variables. Specify H0 and Ha, and state your conclusion.
#....
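A joint Wald test can be computed directly from the estimated coefficients and their covariance matrix. The sketch below uses base R only and a simulated, hypothetical data frame `toy` with two type dummies; the statistic is W = b' V^{-1} b, which is asymptotically chi-squared under H0 with as many degrees of freedom as coefficients tested.

```r
# Simulated counts with two type dummies (hypothetical data, not ships.csv).
set.seed(42)
n <- 40
type <- factor(sample(c("A", "B", "C"), n, replace = TRUE))
toy <- data.frame(TB = as.integer(type == "B"),
                  TC = as.integer(type == "C"),
                  MONTHS = round(runif(n, 50, 3000)))
toy$ACC <- rpois(n, exp(0.2 + 0.0005 * toy$MONTHS + 0.5 * toy$TB))
fit <- glm(ACC ~ TB + TC + MONTHS, family = poisson, data = toy)

# H0: all type coefficients are zero; Ha: at least one is nonzero.
idx <- c("TB", "TC")
b <- coef(fit)[idx]
V <- vcov(fit)[idx, idx]

# Wald statistic and its chi-squared p-value.
W <- as.numeric(t(b) %*% solve(V) %*% b)
p_value <- pchisq(W, df = length(b), lower.tail = FALSE)
c(W = W, df = length(b), p = p_value)
```

For the ships model the index vector would hold the names of the type dummies that remain in the fitted model.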

(2d) Consider a ship of category TA, constructed in the year period 65-69, with MONTHS=1000. Predict the number of accidents per month. Also, estimate (1) the probability that no accidents will occur for this ship and (2) the probability that at least two accidents will occur.

#..
# prob of (1)
#..
# prob (2)
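Once a predicted Poisson mean is available (for a fitted glm, `predict(fit, newdata, type = "response")` returns it), the two probabilities follow from the Poisson distribution. The value of `mu` below is a hypothetical placeholder, not an estimate from the ships data.

```r
mu <- 2.5   # hypothetical predicted mean, for illustration only

# (1) P(Y = 0) = exp(-mu)
p_none <- dpois(0, lambda = mu)

# (2) P(Y >= 2) = 1 - P(Y = 0) - P(Y = 1)
p_two_plus <- 1 - ppois(1, lambda = mu)

c(p_none = p_none, p_two_plus = p_two_plus)
```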

Q3. Analysis of 3-way contingency table

 

 

                        Heart disease
Gender   Cholesterol    Yes      No     Total
Male     High            16     256       272
         Low             28    2897      2925
Female   High            13     319       332
         Low             23    2565      2588
         Total           80    6037      6117

You investigate the relationship between serum cholesterol (C), gender (G) and heart disease (H), and acquire the data shown in the table above.

(3a) State the loglinear model that only expresses the main effects of the three characteristics on the expected counts. Interpret the assumption of the model, and compute the fitted values in the top left count of the table, i.e. (male, high cholesterol, with the disease) according to the model.
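As a sketch of how the main-effects model can be checked numerically, the block below enters the eight cell counts from the table and fits the model as a Poisson glm. Under mutual independence the fitted count has the closed form n_g++ * n_+c+ * n_++h / n^2, so the glm fit and the hand calculation should agree. The object names (`tab`, `fit_ind`, `closed`) are illustrative choices.

```r
# Cell counts from the table; expand.grid varies H fastest, then C, then G.
tab <- expand.grid(H = c("Yes", "No"), C = c("High", "Low"),
                   G = c("Male", "Female"))
tab$count <- c(16, 256, 28, 2897, 13, 319, 23, 2565)

# Main-effects (mutual independence) loglinear model.
fit_ind <- glm(count ~ G + C + H, family = poisson, data = tab)
fitted(fit_ind)[1]    # cell (Male, High, Yes)

# Closed form: product of the three one-way margins divided by n^2.
n <- sum(tab$count)
closed <- sum(tab$count[tab$G == "Male"]) *
          sum(tab$count[tab$C == "High"]) *
          sum(tab$count[tab$H == "Yes"]) / n^2
closed                # 3197 * 604 * 80 / 6117^2, about 4.13
```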

(3b) State the loglinear model that expresses all the main effects, and also an interaction between Cholesterol and Gender, and an interaction between Cholesterol and Heart disease. Interpret the assumption of the model, and compute the fitted values in the top left count of the table, i.e. (male, high cholesterol, with the disease) according to the model.

Of the models in (3a) and (3b), which one is better? Base your conclusion on AIC and a likelihood ratio test.
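The model with C:G and C:H interactions, and its comparison with the main-effects model, can be sketched in the same way. Under this model G and H are conditionally independent given C, so the fitted count has the closed form n_gc+ * n_+ch / n_+c+. The object names are illustrative, and the block re-enters the table so that it runs on its own.

```r
# Cell counts from the table; expand.grid varies H fastest, then C, then G.
tab <- expand.grid(H = c("Yes", "No"), C = c("High", "Low"),
                   G = c("Male", "Female"))
tab$count <- c(16, 256, 28, 2897, 13, 319, 23, 2565)

fit_ind  <- glm(count ~ G + C + H, family = poisson, data = tab)
fit_cond <- glm(count ~ G + C + H + C:G + C:H, family = poisson, data = tab)

# Fitted count for (Male, High, Yes) and its closed form.
fitted(fit_cond)[1]
closed <- 272 * (16 + 13) / (272 + 332)   # n_gc+ * n_+ch / n_+c+, about 13.06
closed

# Model comparison: AIC for both fits, then the likelihood ratio test,
# which compares the drop in deviance with a chi-squared distribution.
AIC(fit_ind, fit_cond)
anova(fit_ind, fit_cond, test = "Chisq")
```

The two models are nested, which is what makes the deviance difference a valid likelihood ratio statistic; here it has 2 degrees of freedom, one for each added interaction term.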

Verified Expert

This assignment is based entirely on R programming, and I used RStudio for it. I used R functions for drawing graphs and installed the packages that were required. The work starts by exploring the structure of the data set and producing summary statistics (mean, standard error, and count of observations) for the variables that matter for this analysis. The next step is computing aggregate values for the variables related to the assignment tasks and plotting them with functions such as histograms and ggplot.


Reviews


len1443621

3/29/2017 2:06:53 AM

Be independent. Your solutions must be written up independently (i.e., your solutions should not be the same as another student's solutions).
• Due date: Late assignments will be subject to a deduction of 5% of the total marks for the assignment for each day late. Any late assignment submitted after the day I post the solution will receive a mark of zero.
• Presentation of solutions is very important.


All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd