R Programming Assignment -
Task 1 -
Let's see the effect of altering the values of our weights $\alpha$.
We'll use the following toy data, where x takes the values 1 through 4 and t takes the values 1, 3, 2, 4.
```{r}
x <- c(1,2,3,4)
t <- c(1,3,2,4)
```
We begin by setting our vector of weights all equal. I have already fit a model using 'lm()' with the argument 'weights'. Use matrix operations to find and print the parameter estimates. They should equal the parameter estimates found via 'lm()'.
```{r}
a1 <- c(1,1,1,1)
model1 <- lm(t ~ x, weights = a1)
plot(x,t,xlim = c(0,5), ylim = c(0,5), asp=1)
abline(model1)
print(model1$coefficients)
```

Your code to make the matrices and find the parameter estimates.
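For reference, here is a minimal sketch of one way to do this with matrix operations (the names X, A, and w_hat are illustrative): the weighted least squares estimate is $\hat{\mathbf{w}} = (\mathbf{X}^\top \mathbf{A} \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{A} \mathbf{t}$, where $\mathbf{A}$ is the diagonal matrix of the weights.

```{r}
X <- cbind(1, x)    # design matrix: a constant column and x
A <- diag(a1)       # diagonal matrix of the weights
# Note: t(X) still calls the transpose function even though t is also our data vector
w_hat <- solve(t(X) %*% A %*% X) %*% t(X) %*% A %*% t
print(w_hat)        # should equal model1$coefficients
```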
Now we alter the weights. This vector puts large weights on the two inner points (x = 2, x = 3) and small weights on the outer points (x = 1, x = 4). Again, use matrix operations to find and print the parameter estimates using the provided weights. Compare them against the estimates found via 'lm()'. I have plotted the fitted line; comment on the effect of the weights.
```{r}
a2 <- c(0.1, 5, 5, 0.1) # small weights on the outer points, large on the inner points
model2 <- lm(t ~ x, weights = a2)
plot(x,t,xlim = c(0,5), ylim = c(0,5), asp=1)
abline(model2)
print(model2$coefficients)
```
We alter the weights again. This time large weights are on the two outer points (x = 1, x = 4) and small weights on the inner points (x = 2, x = 3). Again, use matrix operations to find and print the parameter estimates using the provided weights. Compare them against the estimates found via 'lm()'. Look at the fitted line and comment on the effect of the weights.
```{r}
a3 <- c(5, 0.1, 0.1, 5) # large weights on the outer points, small on the inner points
model3 <- lm(t ~ x, weights = a3)
plot(x,t,xlim = c(0,5), ylim = c(0,5), asp=1)
abline(model3)
print(model3$coefficients)
```
Try to explain Weighted Least Squares regression. What effect would putting a very large weight, say 1000, on a point have on the regression line? What effect would putting a weight of 0 on a point have on the regression line?
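As a concrete illustration of these two extremes, here is a hedged sketch using 'lm()' on the toy data above (the weight vectors are illustrative, not part of the assignment):

```{r}
# Weight of 1000 on the point (1, 1): the fitted line is pulled strongly toward it
model_big  <- lm(t ~ x, weights = c(1000, 1, 1, 1))
# Weight of 0 on the point (1, 1): it contributes nothing, as if it were dropped
model_zero <- lm(t ~ x, weights = c(0, 1, 1, 1))
print(model_big$coefficients)
print(model_zero$coefficients)
```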
Task 2 - OLS Matrix Notation
Look at the Chirot dataset which covers the 1907 Romanian Peasant Revolt.
The data covers 32 counties in Romania, and the response variable is the intensity of the rebellion.
The original paper by Daniel Chirot provides details of the analysis and variables.
A basic data dictionary can be found with 'help(Chirot)'
```{r}
library(carData)
data(Chirot)
chirot_mat <- as.matrix(Chirot)
```
We'll do an analysis with matrix operations. We start by extracting the commerce column and creating our $\mathbf{X}$ matrix.
```{r}
t <- chirot_mat[, 1, drop = FALSE] # response, keep as a matrix
x <- chirot_mat[, 2, drop = FALSE] # commerce column, keep as matrix
X <- cbind(1, x)
colnames(X) <- c('1','commerce')
head(X)
```
Use 'lm()' to fit the rebellion intensity to the matrix X (which has columns for a constant and commercialization). Since X already contains a constant column, make sure the intercept coefficient is only estimated once.
Using only matrix operations, calculate and show the coefficient estimates. Verify that they match the estimates from 'lm()'.
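A possible sketch of both steps, assuming the X and t objects created above (the '- 1' in the formula keeps 'lm()' from adding a second intercept, since X already contains a constant column):

```{r}
# lm() fit on the matrix X, suppressing the automatic intercept
fit_commerce <- lm(t ~ X - 1)
coef(fit_commerce)

# Matrix operations: w = (X^T X)^{-1} X^T t
w_commerce <- solve(t(X) %*% X) %*% t(X) %*% t
w_commerce
```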
Create another matrix (call it X_ct) with three columns: a constant, the commerce variable, and the tradition variable. Print the head of this matrix.
Using matrix operations, calculate and show the coefficient estimates of the model with the variables commerce and tradition.
Create another matrix (call it X_all) with all of the x variables (plus a constant). Print the head of this matrix. Using matrix operations, calculate and show the coefficient estimates of the model with all the variables.
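A hedged sketch of how these matrices could be built (column names are taken from the data dictionary in 'help(Chirot)'; adjust if yours differ):

```{r}
# Constant plus commerce and tradition
X_ct <- cbind(1, chirot_mat[, c("commerce", "tradition")])
colnames(X_ct)[1] <- "1"
head(X_ct)
w_ct <- solve(t(X_ct) %*% X_ct) %*% t(X_ct) %*% t
w_ct

# Constant plus every x variable (drop the response in column 1)
X_all <- cbind(1, chirot_mat[, -1])
colnames(X_all)[1] <- "1"
head(X_all)
w_all <- solve(t(X_all) %*% X_all) %*% t(X_all) %*% t
w_all
```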
Using matrix operations, calculate the fitted values for all three models. (No need to print out the fitted values.) Create plots of fitted value vs actual value for each model (there will be three plots). Be sure to provide a title for each plot.
Now that you have calculated the columns of fitted values, find the Residual Sum of Squares and the R-squared values of the three models. Which model has the smallest RSS? (RSS is the sum of (actual - fitted)^2. R-sq is the correlation between actual and fitted, squared.)
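A sketch for the commerce-only model (it assumes X and w_commerce from the sketch above); the same pattern applies to X_ct and X_all:

```{r}
fitted1 <- X %*% w_commerce                    # fitted values via matrix multiplication
plot(as.vector(fitted1), as.vector(t),
     main = "Commerce model: fitted vs actual",
     xlab = "Fitted intensity", ylab = "Actual intensity")
rss1 <- sum((t - fitted1)^2)                   # residual sum of squares
rsq1 <- as.numeric(cor(t, fitted1))^2          # squared correlation of actual and fitted
c(RSS = rss1, R_squared = rsq1)
```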
Task 3 - Cross-Validation
We can use cross-validation to evaluate the predictive performance of several competing models.
I will have you manually implement leave-one-out cross-validation from scratch first, and then use the built-in function in R.
We will use the dataset 'ironslag' from the package 'DAAG' (a companion library for the textbook Data Analysis and Graphics in R).
The description of the data is as follows: The iron content of crushed blast-furnace slag can be determined by a chemical test at a laboratory or estimated by a cheaper, quicker magnetic test. These data were collected to investigate the extent to which the results of a chemical test of iron content can be predicted from a magnetic test of iron content, and the nature of the relationship between these quantities. [Hand, D.J., Daly, F., et al. (1993) __A Handbook of Small Data Sets__]
The 'ironslag' data has 53 observations, each with two values - the measurement using the chemical test and the measurement from the magnetic test.
We can start by fitting a linear regression model $y_n = w_0 + w_1 x_n + \epsilon_n$. A quick look at the scatterplot seems to indicate that the data may not be linear.
```{r linear_model}
# install.packages("DAAG") # if necessary
library(DAAG)
x <- seq(10,40, .1) # a sequence used to plot lines
L1 <- lm(magnetic ~ chemical, data = ironslag)
plot(ironslag$chemical, ironslag$magnetic, main = "Linear fit", pch = 16)
yhat1 <- L1$coef[1] + L1$coef[2] * x
lines(x, yhat1, lwd = 2, col = "blue")
```
In addition to the linear model, fit the following models that predict the magnetic measurement (Y) from the chemical measurement (X).
Task 3A - Your job is to create the plots with fitted lines.
When plotting the fitted line for the model that is fit on the log scale, compute the estimates of log(y-hat) linearly and then transform them back to the original scale.
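For example, if one of the additional models is an exponential model fit on the log scale (an assumption here; adapt this to whichever models are listed), the fitted line could be plotted like this:

```{r}
# Assumed exponential model: log(magnetic) = w0 + w1 * chemical + error
L3 <- lm(log(magnetic) ~ chemical, data = ironslag)
plot(ironslag$chemical, ironslag$magnetic, main = "Exponential fit", pch = 16)
logyhat3 <- L3$coef[1] + L3$coef[2] * x   # estimate log(y-hat) linearly
yhat3 <- exp(logyhat3)                    # back-transform to the original scale
lines(x, yhat3, lwd = 2, col = "blue")
```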
Task 3B: Leave-one-out Cross validation
You will now code leave-one-out cross validation. In LOOCV, we remove one data point from our data set. We fit the model to the remaining 52 data points. With this model, we make a prediction for the left-out point. We then compare that prediction to the actual value to calculate the squared error. Once we find the squared error for all 53 points, we can take the mean to get a cross-validation error estimate of that model.
To test out our four models, we will build a loop that will remove one point of data at a time. Thus, we will make a 'for(i in 1:53)' loop. For each iteration of the loop, we will fit the four models on the remaining 52 data points, and make a prediction for the remaining point.
Write a line to select the ith row of the data.
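A minimal sketch of the loop for the linear model only (the object names are illustrative; extend the same pattern to all four models inside the one loop):

```{r}
n  <- nrow(ironslag)
e1 <- numeric(n)                          # squared errors for the linear model
for (i in 1:n) {
  test  <- ironslag[i, ]                  # the ith row: the held-out observation
  train <- ironslag[-i, ]                 # the remaining 52 rows
  fit1  <- lm(magnetic ~ chemical, data = train)
  pred1 <- predict(fit1, newdata = test)  # prediction for the left-out point
  e1[i] <- (test$magnetic - pred1)^2      # squared prediction error
}
mean(e1)                                  # LOOCV error estimate for the linear model
```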
Compare the sizes of the cross-validation errors to help you decide which model does the best job of predicting the held-out cases.
Task 3C: Cross-validation with R
Now that you have written your cross-validation script from scratch, we can use the built-in functions in R. The 'boot' library has the function 'cv.glm()', which can be used to estimate cross-validation error.
To make use of 'cv.glm()' on the linear models, we must first use 'glm()' to fit a generalized linear model to our data. If you do not change the argument 'family' in the function 'glm()', it will fit a linear model.
```{r cvwithR, error = TRUE}
library(boot)
gL1 <- glm(magnetic ~ chemical, data = ironslag) # equivalent to lm(magnetic ~ chemical)
```
Find the LOOCV error estimates for all of the models.
Your LOOCV estimates from 'cv.glm()' should match your estimates when you coded your algorithm from scratch.
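For example, for the linear model fit above (the other models follow the same pattern once they are refit with 'glm()'):

```{r}
cv1 <- cv.glm(ironslag, gL1)$delta[1]   # raw LOOCV estimate of prediction error
cv1
```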
Based on your Cross Validation scores, which model seems to be the best?
Task 3D: Revisit Chirot
Use 'glm()' to fit the three models to the Chirot data.
model1: intensity ~ commerce
model2: intensity ~ commerce + tradition
model3: intensity ~ all other variables
Use 'cv.glm()' to calculate the LOOCV scores to compare models. Which model would you select?
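A hedged sketch, assuming the Chirot data frame from Task 2 is still loaded (the model names are illustrative):

```{r}
library(boot)
cm1 <- glm(intensity ~ commerce, data = Chirot)
cm2 <- glm(intensity ~ commerce + tradition, data = Chirot)
cm3 <- glm(intensity ~ ., data = Chirot)        # all other variables as predictors
c(cv.glm(Chirot, cm1)$delta[1],
  cv.glm(Chirot, cm2)$delta[1],
  cv.glm(Chirot, cm3)$delta[1])                 # compare the LOOCV scores
```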
Task 4: Using R's 'optim()' function
The 'optim()' function in R is quite versatile and can be used to minimize a wide range of functions using a variety of methods. The default method, the Nelder-Mead simplex algorithm, is a numeric method that works quite well for multidimensional functions. Other methods, like BFGS which uses a modified version of gradient descent, are included and are also quite powerful.
We will do an exercise using 'optim()' to minimize the loss function in linear regression.
We'll use a synthetic data set.
Use 'optim()' to minimize the sum of squared residuals in linear regression
For the first task, we define the loss function as the sum of squared residuals.
Write your loss function as a function of the vector 'par', which will have a length of 2. The vector 'par' will contain the current estimates of the intercept and slope. Treat the data ('x' and 't') as fixed quantities that are available in the global environment.
The function will return a single value: the sum of squared residuals.
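A minimal sketch of such a loss function (it assumes the vectors 'x' and 't' exist in the global environment, as described above):

```{r}
loss <- function(par) {
  w0 <- par[1]                 # current estimate of the intercept
  w1 <- par[2]                 # current estimate of the slope
  sum((t - (w0 + w1 * x))^2)   # sum of squared residuals
}
```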
The following chunk runs 'optim()' and returns the results. The arguments it takes are the initial values of 'par' and the function that it is trying to minimize. I also specify the method 'BFGS', which uses a version of gradient descent as the numeric algorithm to optimize the function.
```{r, error = TRUE}
results <- optim(par = c(0,1), fn = loss, method = 'BFGS')
results$par
```
When we look at the results and the values of the parameters that minimize the loss function, they should match the coefficients estimated by 'lm()'.