Reference no: EM132221222
Exercise 1: Data for this exercise originated from the General Social Survey (GSS). The GSS gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. Hundreds of trends have been tracked since 1972. In addition, since the GSS adopted questions from earlier surveys, trends can be followed for up to 70 years. This dataset, GSS2016.csv, includes only data from the 2016 GSS.
For this exercise, you will need to load and activate the ggplot2 package. As a data science intern with newly learned knowledge and skills in statistical correlation, regression and R programming, you are interested in looking at the GSS 2016 survey data. Specifically, the Siblings and Childs variables have piqued your interest. A codebook for the GSS is available here: GSS_Codebook_Index.pdf. It contains all of the GSS variables and their descriptions.
The first question you are interested in answering is: "Is there a significant relationship between the number of siblings a survey respondent has and number of his or her children?"
Part 1 -
A. Construct a scatterplot of these two variables in RStudio and place the best-fit linear regression line on the scatterplot. Describe the relationship between the number of siblings a respondent has (SIBS) and the number of his or her children (CHILDS).
B. Use R to calculate the covariance of the two variables and provide an explanation of why you would use this calculation and what the results indicate.
C. Choose the type of correlation test to perform, explain why you chose this test, and predict whether the test will yield a positive or negative correlation.
D. Perform a correlation analysis of the two variables and describe what the calculations in the correlation matrix suggest about the relationship between the variables. Be specific with your explanation.
E. Calculate the correlation coefficient and the coefficient of determination, and describe what you conclude from the results.
F. Based on your analysis, what can you say about the relationship between a respondent's number of siblings and his or her number of children?
G. Produce an appropriate graph for the variables. Report, critique and discuss the skewness and any significant scores found.
H. Expand your analysis to include a third variable - Sex. Perform a partial correlation, "controlling" the Sex variable. Explain how this changes your interpretation and explanation of the results.
Report and discuss all of your calculations and critiques using R Markdown.
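The calculations asked for in Part 1 (covariance, a correlation test, the correlation matrix, the coefficient of determination, and a partial correlation) can be sketched in base R as follows. This is only an illustrative outline: the data are simulated stand-ins for GSS2016.csv, the 0/1 coding of SEX is an assumption, and Pearson's test is used purely as an example (method = "spearman" or "kendall" may suit skewed count data better). The partial correlation is computed by correlating residuals, which avoids relying on any add-on package.

```r
# Hypothetical stand-in data; a real analysis would read GSS2016.csv instead.
set.seed(42)
n      <- 500
sex    <- rbinom(n, 1, 0.5)              # assumed 0/1 coding, for illustration only
sibs   <- rpois(n, 3)
childs <- rpois(n, 1 + 0.2 * sibs)
gss    <- data.frame(SIBS = sibs, CHILDS = childs, SEX = sex)

# Covariance: sign gives the direction of the relationship, but its size
# depends on the units of both variables.
cov_sc <- cov(gss$SIBS, gss$CHILDS, use = "complete.obs")

# Correlation test (Pearson chosen here as an example, not a recommendation).
ct <- cor.test(gss$SIBS, gss$CHILDS, method = "pearson")
r  <- unname(ct$estimate)                # correlation coefficient
r2 <- r^2                                # coefficient of determination

# Correlation matrix for all three variables.
cor_mat <- cor(gss[, c("SIBS", "CHILDS", "SEX")], use = "complete.obs")

# Partial correlation of SIBS and CHILDS controlling for SEX:
# remove the linear effect of SEX from each variable, then correlate the residuals.
res_sibs   <- resid(lm(SIBS ~ SEX, data = gss))
res_childs <- resid(lm(CHILDS ~ SEX, data = gss))
partial_r  <- cor(res_sibs, res_childs)

print(round(c(covariance = cov_sc, r = r, r_squared = r2, partial_r = partial_r), 4))
print(round(cor_mat, 3))
```

Comparing r with partial_r shows how much of the SIBS and CHILDS association remains once SEX is held constant.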
Part 2 -
A. Run a regression analysis where SIBS predicts CHILDS.
B. What are the intercept and the slope? What are the coefficient of determination and the correlation coefficient?
C. For this model, how do you explain the variation in the number of children someone has? What is the amount of variation not explained by the number of siblings?
D. Based on the calculated F-Ratio, does this regression model result in a better prediction of the number of children than if you had chosen to use the mean value of siblings?
E. Use the model to make a prediction: What is the predicted number of children for someone with three siblings?
F. Use the model to make a prediction: What is the predicted number of children for someone without any siblings?
Report and discuss all of your calculations and critiques using R Markdown.
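A minimal sketch of the Part 2 workflow, again on simulated stand-in data rather than the real GSS2016.csv, might look like the following. It fits the simple regression with lm(), pulls out the intercept, slope, R-squared and F-ratio from summary(), and uses predict() for the two requested predictions.

```r
# Hypothetical stand-in data; replace with the SIBS and CHILDS columns of GSS2016.csv.
set.seed(42)
sibs   <- rpois(500, 3)
childs <- rpois(500, 1 + 0.2 * sibs)
gss    <- data.frame(SIBS = sibs, CHILDS = childs)

# Simple linear regression: SIBS predicts CHILDS.
fit   <- lm(CHILDS ~ SIBS, data = gss)
coefs <- coef(fit)                       # "(Intercept)" and "SIBS" (the slope)
s     <- summary(fit)

r_squared   <- s$r.squared               # proportion of variation explained
unexplained <- 1 - r_squared             # proportion left unexplained
# In simple regression, r is the square root of R-squared with the sign of the slope.
r <- unname(sign(coefs["SIBS"]) * sqrt(r_squared))

# F-ratio and its p-value: compares the model against predicting with a mean alone.
f_stat  <- s$fstatistic                  # named vector: value, numdf, dendf
p_value <- unname(pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
                     lower.tail = FALSE))

# Predicted number of children for three siblings and for no siblings.
pred <- predict(fit, newdata = data.frame(SIBS = c(3, 0)))

print(round(coefs, 4))
print(round(c(r = r, r_squared = r_squared, unexplained = unexplained), 4))
print(round(pred, 4))
```

The prediction for zero siblings is simply the intercept, and the prediction for three siblings is the intercept plus three times the slope.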
Exercise 2:
Part 1 -
Data for this exercise is focused on real estate transactions recorded from 1964 to 2016 and can be found in Housing.xlsx. Using your skills in statistical correlation, regression and R programming, you are interested in the following variables: Sale Price and Square Footage of Lot.
A. Examine the data set visually and numerically. Are there missing data? Does the data contain outliers? Explain how you will handle missing data and outliers so you have a clean data set going forward.
B. Construct scatterplots for the variables in RStudio and place the best-fit linear regression line on each scatterplot. Describe the relationship between the variables in each plot.
C. Use R to calculate the covariance of the two variables and provide an explanation of why you would use this calculation and what the results indicate.
D. Choose the type of correlation test to perform, explain why you chose this test, and predict whether the test will yield a positive or negative correlation.
E. Perform a correlation analysis of the variables and describe what the calculations in the correlation matrix suggest about the relationship between the variables. Be specific with your explanation.
F. Calculate the correlation coefficient and the coefficient of determination, and describe what you conclude from the results.
G. Based on your analysis, what can you say about the relationship between the variables?
H. Produce an appropriate graph for the variables. Use R Markdown to report, critique and discuss the skewness and any significant scores found.
I. Perform a partial correlation, "controlling" for at least two variables. Explain how this changes your interpretation and explanation of the results.
J. Choose Square Footage of Lot as the Predictor and Sale Price as the Outcome and perform a regression analysis.
K. What are the intercept and the slope? What are the coefficient of determination and the correlation coefficient?
L. For this model, what variation exists? Be specific in your response.
M. Based on the calculated F-Ratio, does this regression model result in a better prediction of the sale price than if you had chosen to use the mean value of square footage of lot?
N. Use the model to make a prediction of your choice. Explain the values you use in the model and the resulting prediction as well as how someone might benefit from using this model.
Report and discuss all of your calculations and critiques using R Markdown.
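For the data-cleaning step in item A, one possible approach is sketched below. Everything here is an assumption made for illustration: the data frame is simulated rather than read from Housing.xlsx, the column names sale_price and sq_ft_lot are hypothetical, and the 1.5 x IQR fence is just one common rule of thumb for flagging outliers. Dropping incomplete rows is likewise only one option; the assignment asks you to justify whichever strategy you choose.

```r
# Hypothetical stand-in data with two missing values and one extreme sale price.
set.seed(1)
housing <- data.frame(
  sale_price = c(round(rlnorm(200, meanlog = 13, sdlog = 0.4)), NA, 25000000, 400000),
  sq_ft_lot  = c(round(rlnorm(200, meanlog = 9,  sdlog = 0.5)), 8000, 9000, NA)
)

# Numerical check for missing data: count of NA values per column.
na_counts <- colSums(is.na(housing))

# One simple strategy (an assumption, not the only option): keep complete rows only.
complete <- housing[complete.cases(housing), ]

# Flag values outside the 1.5 * IQR fences, the same rule boxplots use.
is_outlier <- function(x) {
  q     <- unname(quantile(x, c(0.25, 0.75)))
  fence <- 1.5 * IQR(x)
  x < q[1] - fence | x > q[2] + fence
}

flagged <- is_outlier(complete$sale_price) | is_outlier(complete$sq_ft_lot)
clean   <- complete[!flagged, ]

print(na_counts)
print(c(original = nrow(housing), complete = nrow(complete), clean = nrow(clean)))
```

Because both variables are right-skewed in this simulation, the fence flags more than the single planted value, which is a reminder to inspect flagged rows before removing them.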
Part 2 -
Data for this assignment is focused on real estate transactions recorded from 1964 to 2016 and can be found in w7Housing.xlsx. Using statistical correlation, multiple regression and R programming, you are interested in the following variables: Sale Price and several other possible predictors.
Using your clean dataset from above, complete the following:
A. Explain why you chose to remove data points from your 'clean' dataset.
B. Create two variables: one that will contain the variables Sale Price and Square Footage of Lot (the same variables used in the previous assignment on simple regression) and one that will contain Sale Price and several additional predictors of your choice. Explain the basis for your additional predictor selections.
C. Execute a summary() function on the two variables defined in the previous step to compare the model results. What are the R² and Adjusted R² statistics? Explain what these results tell you about the overall model. Did the inclusion of the additional predictors help explain any large variations found in Sale Price?
D. Considering the parameters of the multiple regression model you have created, what are the standardized betas for each parameter and what do the values indicate?
E. Calculate the confidence intervals for the parameters in your model and explain what the results indicate.
F. Assess the improvement of the new model compared to your original model (simple regression model) by testing whether this change is significant by performing an analysis of variance.
G. Perform casewise diagnostics to identify outliers and/or influential cases, storing each function's output in a dataframe assigned to a unique variable name.
H. Calculate the standardized residuals using the appropriate command, specifying those that fall outside ±2, and store the large residuals in a variable you create.
I. Use the appropriate function to show the sum of large residuals.
J. Which specific variables have large residuals (only cases that evaluate as TRUE)?
K. Investigate further by calculating the leverage, Cook's distance, and covariance ratios. Comment on all cases that are problematic.
L. Perform the necessary calculations to assess the assumption of independence and state if the condition is met or not.
M. Perform the necessary calculations to assess the assumption of no multicollinearity and state if the condition is met or not.
N. Visually check the assumptions related to the residuals using the plot() and hist() functions. Summarize what each graph is informing you of and if any anomalies are present.
O. Overall, is this regression model unbiased? If the regression model is unbiased, what does this tell us about the sample vs. the entire population model?
Report and discuss all of your calculations and critiques using R Markdown.
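The model comparison and diagnostic steps listed above can be outlined in base R as follows. This is a sketch under stated assumptions: the data are simulated rather than read from w7Housing.xlsx, the column names and the choice of sq_ft_living and bedrooms as extra predictors are hypothetical, and the Durbin-Watson statistic and VIFs are computed by hand so that no add-on package is needed.

```r
# Hypothetical stand-in data; column names are assumed for illustration.
set.seed(7)
n            <- 300
sq_ft_lot    <- round(rlnorm(n, 9, 0.4))
sq_ft_living <- round(rnorm(n, 2200, 500))
bedrooms     <- sample(2:5, n, replace = TRUE)
sale_price   <- 50000 + 5 * sq_ft_lot + 150 * sq_ft_living +
                10000 * bedrooms + rnorm(n, 0, 40000)
housing <- data.frame(sale_price, sq_ft_lot, sq_ft_living, bedrooms)

# Simple and multiple regression models.
simple_fit <- lm(sale_price ~ sq_ft_lot, data = housing)
multi_fit  <- lm(sale_price ~ sq_ft_lot + sq_ft_living + bedrooms, data = housing)
r2     <- c(simple = summary(simple_fit)$r.squared,
            multi  = summary(multi_fit)$r.squared)
adj_r2 <- c(simple = summary(simple_fit)$adj.r.squared,
            multi  = summary(multi_fit)$adj.r.squared)

# Standardized betas: refit the model on z-scored variables.
z         <- as.data.frame(scale(housing))
std_betas <- coef(lm(sale_price ~ ., data = z))[-1]

# 95% confidence intervals and an F-test comparing the nested models.
ci               <- confint(multi_fit)
model_comparison <- anova(simple_fit, multi_fit)

# Casewise diagnostics stored in one data frame.
diagnostics <- data.frame(
  std_resid = rstandard(multi_fit),
  leverage  = hatvalues(multi_fit),
  cooks_d   = cooks.distance(multi_fit),
  cov_ratio = covratio(multi_fit)
)
large_resid <- abs(diagnostics$std_resid) > 2   # TRUE for residuals beyond +/- 2
n_large     <- sum(large_resid)
large_cases <- diagnostics[large_resid, ]

# Independence: Durbin-Watson statistic (values near 2 suggest uncorrelated errors).
e  <- resid(multi_fit)
dw <- sum(diff(e)^2) / sum(e^2)

# Multicollinearity: VIFs are the diagonal of the inverse predictor correlation matrix.
vif <- diag(solve(cor(housing[, c("sq_ft_lot", "sq_ft_living", "bedrooms")])))

print(round(rbind(r2, adj_r2), 4))
print(round(std_betas, 4))
print(model_comparison)
print(round(c(n_large = n_large, durbin_watson = dw), 3))
print(round(vif, 3))
```

The large_cases data frame is the natural starting point for item K, since it lists leverage, Cook's distance and covariance ratio side by side for each case with a large residual.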
Note - There are two exercises, each with two major parts, where I need help with RStudio programming. I need a detailed R Markdown report in return.
Attachment: Assignment Files.rar