ICT513 Data Analytics - Murdoch University
Question 1.
(a) Explore the data and present appropriate descriptive statistics and graphical displays for each variable (excluding the mother identifier). Also explore the relationship between milk production and infant gender. Comment on your explorations, including on data quality (presence of extreme values, missing values, etc.).
(b) Using simple linear regression, assess whether infant gender is associated with daily milk production. Estimate the difference in daily milk production between male and female infants. Support your response to the research question 'Is infant gender associated with daily milk production?' with an appropriate confidence interval.
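As a notational reminder for part (b), the following is a minimal sketch of a simple linear regression with a binary predictor. The symbols and the 0/1 coding of gender (1 for male, 0 for female) are assumptions made for illustration; the coding in the supplied data may differ.

```latex
% Assumed coding: x_i = 1 for a male infant, x_i = 0 for a female infant.
\[
  y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
  \qquad \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \dots, n,
\]
% y_i is the daily milk production for mother i.
% Under this coding, \beta_1 is the difference in mean daily milk
% production (male minus female), with 95% confidence interval
\[
  \hat{\beta}_1 \pm t_{n-2,\,0.975} \, \mathrm{SE}(\hat{\beta}_1).
\]
```

Under these assumptions, an interval that excludes zero indicates an association between infant gender and daily milk production at the 5% level.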
(c) Propose an appropriate multiple linear regression model to assess the relationship between daily milk production and infant gender that allows for possible confounders. Clearly explain why you chose each of the explanatory variables included in your multiple linear regression model. Note: fitting is carried out in part (e); here, present only the model notation.
(d) Assess the explanatory variables you chose in part (c) for collinearity. (Note: You will need to search for an appropriate R function to assess collinearity based on the measure discussed in lectures. You may find that several options are available, some of which are much easier to use than others. You should report the function you have used as well as the R package in which it is found.) Should any of your proposed explanatory variables be removed?
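The brief does not name the collinearity measure. Assuming it is the variance inflation factor (VIF), which is a common choice, the definition can be sketched as follows. The symbols are illustrative.

```latex
% R_j^2 is the coefficient of determination obtained by regressing the
% j-th explanatory variable on all the other explanatory variables.
\[
  \mathrm{VIF}_j = \frac{1}{1 - R_j^2}
\]
```

A value of 1 means the variable is uncorrelated with the other explanatory variables. Rules of thumb that flag values above 5 or 10 are conventions rather than strict tests.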
(e) Based on parts (c) and (d), fit an appropriate multiple linear regression model to assess the relationship between milk production and infant gender. Provide an interpretation of the model supported with confidence intervals.
Question 2.
Consider the task of attempting to predict daily milk production.
(a) Consider candidate linear models where daily milk production is the response variable and predictors are given by:
• Baby gender
• Birth weight
• Maternal body mass index
• Maternal health
Using 100 bootstrap replicates and 100 repetitions of ten-fold cross-validation, produce bootstrap .632+ and cross-validation estimators of MSE and RMSE for the four candidate models given by the simple linear regressions of milk production on each of the identified predictors, and present a table of these estimators side by side.
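The repeated ten-fold cross-validation described above can be sketched in base R as follows. This is only an illustrative outline under stated assumptions: the function name `repeated_cv_mse`, the data frame `sim`, and its columns `milk` and `bmi` are hypothetical, and the data are simulated rather than taken from the assignment dataset. RMSE is taken here as the square root of the averaged MSE, which is one of several possible conventions.

```r
# Repeated k-fold cross-validation estimate of MSE for a linear model.
# All names below are hypothetical and the data are simulated.
repeated_cv_mse <- function(formula, data, k = 10, reps = 100) {
  n <- nrow(data)
  response <- all.vars(formula)[1]  # name of the response variable
  rep_mse <- numeric(reps)

  for (r in seq_len(reps)) {
    # Randomly assign each row to one of k roughly equal-sized folds
    folds <- sample(rep(seq_len(k), length.out = n))
    sq_err <- numeric(n)

    for (j in seq_len(k)) {
      test <- folds == j
      # Fit on the other k - 1 folds, then predict the held-out fold
      fit <- lm(formula, data = data[!test, , drop = FALSE])
      pred <- predict(fit, newdata = data[test, , drop = FALSE])
      sq_err[test] <- (data[[response]][test] - pred)^2
    }

    rep_mse[r] <- mean(sq_err)  # MSE for this repetition
  }

  mean(rep_mse)  # average over all repetitions
}

# Simulated example data (not the assignment data)
set.seed(1)
sim <- data.frame(bmi = rnorm(80, mean = 25, sd = 4))
sim$milk <- 700 + 5 * sim$bmi + rnorm(80, sd = 50)

cv_mse <- repeated_cv_mse(milk ~ bmi, sim, k = 10, reps = 100)
cv_rmse <- sqrt(cv_mse)
```

Calling the same function once per candidate formula gives the cross-validation column of a comparison table. With a categorical predictor, a fold whose training rows lack one of the levels would need separate handling.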
(b) Using the most important predictor identified in the previous part, consider all possible models that include this predictor as well as some combination of the other three predictors. (In other words, every candidate model considered for this part includes the best predictor together with one or more of the other three predictors.) Of these seven candidate models and the best model identified in the previous part, which is the best for prediction purposes based on cross-validation? (For this part, use 100 repetitions of ten-fold cross-validation. You do not need to consider bootstrap prediction.) (5 marks)
(c) Describe some enhancements to this analysis (for example, you could suggest further data to collect, other suitable research questions, or techniques that could suitably be applied in this area of research). (4 marks)
Question 3.
In your own words describe the .632+ method and an advantage of using this method.
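For reference, the .632+ estimator of prediction error, as introduced by Efron and Tibshirani (1997), can be written as below. The notation is an assumption chosen for illustration and may differ from the notation used in lectures.

```latex
% \overline{\mathrm{err}}: apparent (resubstitution) error on the full sample
% \widehat{\mathrm{Err}}^{(1)}: leave-one-out bootstrap error
% \hat{\gamma}: no-information error rate, with loss function L
\[
  \hat{\gamma} = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    L\bigl(y_i, \hat{f}(x_j)\bigr),
  \qquad
  \hat{R} = \frac{\widehat{\mathrm{Err}}^{(1)} - \overline{\mathrm{err}}}
                 {\hat{\gamma} - \overline{\mathrm{err}}},
\]
\[
  \hat{w} = \frac{0.632}{1 - 0.368\,\hat{R}},
  \qquad
  \widehat{\mathrm{Err}}^{(.632+)}
    = (1 - \hat{w})\,\overline{\mathrm{err}}
    + \hat{w}\,\widehat{\mathrm{Err}}^{(1)}.
\]
```

The relative overfitting rate \hat{R} is restricted to the interval [0, 1] in practice, so the weight \hat{w} ranges from 0.632 (no overfitting) to 1 (severe overfitting).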
Question 4.
Report presentation marks
These marks are allocated based on:
• structure, clarity, and tidiness of presented solutions/answers,
• correctness in spelling and grammar.
Question 5.
Coding marks
When submitted, your R script file should be named Assignment 2 SURNAME.R, where SURNAME is replaced by your surname. Your R script will be marked based on:
Readability of code: This includes the use of informative commenting to make it clear what blocks of code are meant to do, descriptive variable names, and appropriate use of spacing to separate blocks of code meant to perform different functions.
Accuracy of code: This includes the correct specification of functions to produce the results reported in your assignment and whether I am able to run your entire script file without producing any errors. It is important that you verify that your code runs error-free from start to finish before submitting.
Efficiency: This includes writing a script that uses minimal lines of code, is easily adapted to new datasets or slight modifications to the existing dataset, and runs quickly.
Attachment: Data Analytics.rar