Draw a scatterplot of BMD versus age

Reference no: EM132359550

Assignment - Linear Regression Modelling

LMR MODULE CASE STUDY EXERCISES - Most of the exercises in this case study are based on the two simple research questions introduced in the Case Study accompanying Module 1. Some exercises involve derivation of results from the notes using straightforward algebra. Other questions are computational or graphical, with interpretation of results "for the clinician". As for the Module 1 Case Study, I strongly recommend that you create a Stata do-file.

Part A - Model checking with a continuous covariate

In Question A of the case study for Module 1 we fitted a linear regression model to assess the association between age and BMD. We now want to assess whether the assumptions required for inference in the regression model appear reasonable.

1. RESIDUALS

(a) Obtain the fitted (predicted) values of the mean response using explicit calculation from the regression equation (in Stata, use the "generate" command) and verify the results against those produced by Stata's predict command. Calculate the residuals from the regression line similarly, and again verify against those produced by predict. Check that these residuals sum to zero.
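One way the checks in (a) might be laid out in a do-file is sketched below. It assumes the Module 1 dataset is already in memory and that the outcome and covariate are named bmd and age; both names are hypothetical and should be replaced by the names in your data.

```stata
* Assumes the Module 1 data are in memory; bmd and age are hypothetical names
regress bmd age

* Fitted values: explicit calculation versus predict
generate double fit_manual = _b[_cons] + _b[age]*age
predict double fit_predict, xb

* Residuals: explicit calculation versus predict
generate double res_manual = bmd - fit_manual
predict double res_predict, residuals

* The two versions should agree up to rounding error
summarize fit_manual fit_predict res_manual res_predict
assert reldif(fit_manual, fit_predict) < 1e-8
assert reldif(res_manual, res_predict) < 1e-8

* Sum of the residuals: expect zero up to rounding error
quietly summarize res_manual
display "Sum of residuals = " r(sum)
```

Storing the new variables as double keeps the comparison from being clouded by float rounding.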

(b) Now show algebraically that the residuals must sum to zero. [Hint: use $\hat{Y}_i = \bar{Y} + b_1(X_i - \bar{X})$.]
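A possible outline of the argument, reading the hint as $\hat{Y}_i = \bar{Y} + b_1(X_i - \bar{X})$ and writing $e_i$ for the $i$-th residual (the symbol $e_i$ is introduced here and may differ from the notation in the notes):

```latex
\begin{align*}
\sum_{i=1}^{n} e_i
  &= \sum_{i=1}^{n} \bigl(Y_i - \hat{Y}_i\bigr) \\
  &= \sum_{i=1}^{n} \bigl(Y_i - \bar{Y}\bigr) - b_1 \sum_{i=1}^{n} \bigl(X_i - \bar{X}\bigr) \\
  &= 0 - b_1 \cdot 0 = 0 .
\end{align*}
```

Each of the two sums vanishes because deviations from a sample mean always add to zero.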

(c) Using a boxplot or histogram, determine if there are any obvious outlying observations.
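A minimal sketch of the two plots in (c), again assuming hypothetical variable names bmd and age. The name res_c is likewise just an illustrative choice.

```stata
* Assumes the Module 1 data are in memory; bmd and age are hypothetical names
quietly regress bmd age
capture drop res_c
predict double res_c, residuals

* Boxplot: points beyond the whiskers are candidate outliers
graph box res_c, ytitle("Residual")

* Histogram with a normal density overlaid for reference
histogram res_c, normal xtitle("Residual")
```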

(d) Calculate the standardised residuals. Plot the standardised residuals versus the predicted values and produce a normal probability plot of the residuals: does the latter indicate any substantial departures from normality?
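The steps in (d) could be sketched as follows, under the same assumption of hypothetical variable names bmd and age. The sketch uses qnorm (a quantile-quantile plot); Stata also offers pnorm (a standardised probability plot), and the module notes may prefer one over the other.

```stata
* Assumes the Module 1 data are in memory; bmd and age are hypothetical names
quietly regress bmd age
capture drop fit_d res_d rstd_d
predict double fit_d, xb
predict double res_d, residuals
predict double rstd_d, rstandard      // standardised residuals

* Standardised residuals against predicted values
scatter rstd_d fit_d, yline(0) ytitle("Standardised residual") xtitle("Predicted BMD")

* Normal quantile plot of the residuals
qnorm res_d
```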

2. NORMAL PROBABILITY PLOTS

In this question we ask you to perform some simulations to help you to develop a better feeling for random variation from both normal and non-normal distributions and for how this manifests itself in the normal probability plot.

NOTE: This is an exercise in how to run a small simulation study and on what to expect in terms of "normal" variation (from truly normal distributions and otherwise); it is important to remember, as we emphasise in the module notes, that normality of the error terms in a regression is not usually an important issue for the validity of linear model inferences (except in very small samples and with extreme departures from normality).

First, obtain 66 simulated normally distributed residuals (with variance 1) by generating 66 values from a standard normal distribution as follows:

  • set seed 7897345
  • clear
  • set obs 66
  • gen varname = rnormal()

where varname is the assigned variable name.

Some explanation of these commands:

The first command chooses a start point for the random number generator. You can choose any string of numbers up to 15 digits for the seed to initialise the generator; otherwise Stata always starts with 123456789! Note that a computer-generated "random number" cannot actually be truly random, since it is produced by a defined set of programming steps, but the programs used for this purpose generate a sequence of numbers that is close enough to random for practical purposes. Each "pseudo-"random number is produced by a calculation that uses the previous value, or the seed in the case of the first one within a particular session. Setting the seed is a good practice for enabling the user to reproduce the same sequence if they wish to. The "rnormal()" function produces a (pseudo-)random draw from the N(0,1) distribution.

(a) Produce a normal probability plot for these simulated residuals, and comment on its appearance. Do the points lie close to a straight line? Also compute the Shapiro-Wilk test for normality. Does the resulting P-value indicate any evidence against the null hypothesis of normality? Repeat this 5 times using different seeds.
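The repetition asked for in (a) might be organised as a loop, as in the sketch below. The first seed is the one given above; the other five are arbitrary illustrative choices, and z is a hypothetical variable name. Note that the loop clears whatever data are in memory.

```stata
* WARNING: clears the data in memory on each pass
* Seeds after the first are arbitrary illustrative values
foreach s in 7897345 1111 2222 3333 4444 5555 {
    clear
    set seed `s'
    set obs 66
    generate z = rnormal()

    * One named graph per seed so the plots can be compared side by side
    qnorm z, title("Seed `s'") name(qn`s', replace)

    * Shapiro-Wilk test; the P-value is also left behind in r(p)
    swilk z
}
```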

(b) Generate residuals from a right skewed distribution (Chi-squared with 1 df) by squaring your residuals calculated above. Produce a normal probability plot, perform the Shapiro-Wilk normality test, and describe the results.
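A short sketch of (b), regenerating the normal draws first so that the block runs on its own. The variable names z and chisq1 are hypothetical.

```stata
* WARNING: clears the data in memory
clear
set seed 7897345
set obs 66
generate z = rnormal()

* The square of a standard normal draw follows a chi-squared distribution with 1 df
generate chisq1 = z^2

qnorm chisq1
swilk chisq1
```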

(c) Now generate residuals from a short-tailed distribution (e.g. Uniform(-2,2), by using -2+4*U[0,1]) and comment again.
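For (c), the transformation -2+4*U[0,1] could be coded as below. The variable name unif is a hypothetical choice.

```stata
* WARNING: clears the data in memory
clear
set seed 7897345
set obs 66

* runiform() draws from Uniform(0,1); rescale to Uniform(-2,2)
generate unif = -2 + 4*runiform()

qnorm unif
swilk unif
```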

(d) Finally, generate residuals from a heavy-tailed distribution. This can be achieved using a mixture of two normal distributions, say N(0,1) with probability 0.90 and N(0,9) with probability 0.10. To do this first generate a uniform random variable and then select the normal distribution to sample from according to its value:

  • gen u = runiform()
  • gen varname = (u>=0.1)*rnormal()+(u<0.1)*3*rnormal()

Again, comment on the results.
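An equivalent way of writing the mixture uses cond(), which some readers may find easier to follow. This is only a sketch, and the variable name mix is hypothetical. As a sanity check on the output, the mixture has mean zero, so its theoretical variance is 0.90 x 1 + 0.10 x 9 = 1.8.

```stata
* WARNING: clears the data in memory
clear
set seed 7897345
set obs 66
generate u = runiform()

* With probability 0.10 draw from N(0,9), i.e. 3 times a standard normal;
* otherwise draw from N(0,1)
generate mix = cond(u < 0.1, 3*rnormal(), rnormal())

qnorm mix
swilk mix

* Sample variance and kurtosis for comparison with a normal sample
summarize mix, detail
```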

(e) FOR DISCUSSION

Summarise your findings from the above simulations.

(f) FOR DISCUSSION

Return to your normal probability plot in Question 1(d). Does your interpretation change at all after reviewing the normal probability plots generated in Question 2(a)?

3. INFLUENTIAL OBSERVATIONS

The clinician now asks you whether there are any individual observations exerting unduly large influence on the slope of the fitted regression line.

(a) Draw a scatterplot of BMD versus age and add a fitted regression line. Try to identify points on the scatterplot that exert influence on the slope of the fitted line.
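The plot in (a) can be produced by overlaying two twoway plot types, as in this sketch. The variable names bmd and age are hypothetical.

```stata
* Assumes the Module 1 data are in memory; bmd and age are hypothetical names
twoway (scatter bmd age) (lfit bmd age), ///
    ytitle("BMD") xtitle("Age") legend(off)
```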

(b) Compute the DFBETA statistic for each observation. Identify any particularly large values. Are these large enough to alter the substantive conclusions about the relationship between BMD and age? [You may want to draw a scatterplot of the DFBETAs against age to help you.]
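A sketch of (b) follows, with hypothetical variable names bmd and age. The cutoff of 2/sqrt(n) used to flag observations is a widely quoted rule of thumb, not something taken from this case study, so treat it as an assumption and check what the module notes recommend.

```stata
* Assumes the Module 1 data are in memory; bmd and age are hypothetical names
quietly regress bmd age
capture drop dfb_age flag_dfb
predict double dfb_age, dfbeta(age)
summarize dfb_age, detail

* Rule-of-thumb flag: |DFBETA| > 2/sqrt(n) (a convention, not from the text)
generate byte flag_dfb = abs(dfb_age) > 2/sqrt(e(N)) if !missing(dfb_age)
list age bmd dfb_age if flag_dfb == 1

* DFBETAs against age
scatter dfb_age age, yline(0) ytitle("DFBETA for age") xtitle("Age")
```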

(c) Redraw the scatterplot of BMD vs age from part (a) with "bubbles" proportional to the size of the DFBETAs. Does this assist your interpretation?
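One way to draw the bubbles in (c) is to pass the absolute DFBETA as an analytic weight to scatter, which scales the marker size. The sketch below assumes hypothetical variable names bmd and age.

```stata
* Assumes the Module 1 data are in memory; bmd and age are hypothetical names
quietly regress bmd age
capture drop dfb_k absdfb_k
predict double dfb_k, dfbeta(age)

* Weights must be non-negative, so use the absolute value
generate double absdfb_k = abs(dfb_k)

* Hollow circles keep overlapping bubbles readable
twoway (scatter bmd age [aweight = absdfb_k], msymbol(Oh)) ///
       (lfit bmd age), ytitle("BMD") xtitle("Age") legend(off)
```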

(d) Compute the leverage for each observation using Stata, and plot the leverages versus squared residuals using the lvr2plot command. Comment on whether any of the values have a particularly high leverage. Graph BMD versus age with "bubbles" proportional in size to the leverages, with the fitted regression line superimposed.
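The steps in (d) might look like the sketch below, with hypothetical variable names bmd and age. A useful reference point: in a model with an intercept and one covariate the leverages sum to 2, so the average leverage is 2/n.

```stata
* Assumes the Module 1 data are in memory; bmd and age are hypothetical names
regress bmd age
capture drop lev
predict double lev, leverage

* Leverage versus normalised squared residuals
lvr2plot

* Compare individual leverages with the average value 2/n
summarize lev, detail
display "Average leverage 2/n = " 2/e(N)

* Bubble plot with marker size scaled by leverage
twoway (scatter bmd age [aweight = lev], msymbol(Oh)) ///
       (lfit bmd age), ytitle("BMD") xtitle("Age") legend(off)
```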

(e) FOR DISCUSSION

Comment on the pattern seen in the scatterplots from questions (c) and (d). In particular, are there any points with different sized bubbles in the two plots?

4. FOR ASSESSMENT

In terms of the assumptions made, does it seem appropriate (from the evidence available in the data) to fit a linear regression model to represent the relationship between BMD and age? Summarise the important findings from the model checking carried out in questions 1 to 3. This summary should include discussion of the validity of the main assumptions of linear regression, the residuals, outliers and influential observations. Your summary should include, in addition, up to three graphs to support your discussion on model checking. Finally, include a separate standalone summary that would be suitable for a clinician collaborator who had only a limited background in statistics. This collaborator would understand the concepts of hypothesis testing, P-values, confidence intervals, and the basic premise of a linear regression (as a line of best fit through some data), but would be unfamiliar with the assumptions of linear regression and the technical details of such an analysis. This clinical summary should include an interpretation of the key parameter(s) of the regression results, relevant confidence interval(s), relevant P-values, an interpretation of the strength of the linear relationship (if any), and a brief summary of the model checking and validity of the regression analysis.

Part B - Model checking with a binary covariate

In Part B of the case study for Module 1 we looked at the effect of ever having steroids on BMD. We now want to assess the assumptions underlying regression-based inference for this question. Begin by refitting the simple linear regression as in Module 1.

(a) To remind yourself of the question, draw a scatterplot of BMD versus steroid use. Calculate the ordinary residuals.
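A sketch of (a) is given below. It assumes the steroid indicator is a 0/1 variable named steroid with 1 meaning "ever used"; both the name and the coding are hypothetical and should be checked against the dataset.

```stata
* Assumes bmd and a 0/1 indicator steroid (hypothetical names and coding)
scatter bmd steroid, xlabel(0 "Never" 1 "Ever") xscale(range(-0.5 1.5)) ///
    ytitle("BMD") xtitle("Steroid use")

* Refit the Module 1 model and store the ordinary residuals
regress bmd steroid
capture drop res_b
predict double res_b, residuals
summarize res_b
```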

(b) FOR DISCUSSION

Without constructing the scatterplot of residuals versus predicted values in Stata, what do you think the plot will look like? Be as specific as you can. [Hint: Look at the scatterplot from (a) and consider the nature of the predicted values from the regression model.] Now produce the residual versus predicted value plot (rvfplot) and update your answer if necessary.
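Once you have written down your prediction, the plot itself takes one command after the regression. The variable names bmd and steroid are hypothetical.

```stata
* Assumes bmd and a 0/1 indicator steroid (hypothetical names)
quietly regress bmd steroid

* Residuals versus fitted values, with a reference line at zero
rvfplot, yline(0)
```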

(c) Compute the standardised residuals and assess them for evidence of outliers and non-normality.

(d) FOR ASSESSMENT

Using the formula in the Module 2 notes, derive the general formula for leverage in a regression model using a single binary covariate with $n_1$ 1's and $n_0$ 0's.

Hint: how many unique values of leverage will there be for this regression analysis?
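The Module 2 notes are not reproduced here. For reference, the standard expression for the leverage of observation $i$ in a simple linear regression with an intercept, which is presumably the formula intended, is:

```latex
h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2},
\qquad i = 1, \dots, n .
```

For a binary covariate, $X_i$ takes only the values 0 and 1, and $\bar{X}$ is the proportion of 1's, which is the natural starting point for the derivation.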

(e) FOR ASSESSMENT

Compute the leverage values for this dataset. Comment on when a binary covariate would produce highly unequal values of leverage. In what situation would this be most extreme?

Hint: Is it possible for the leverage to be equal to 1? In what circumstance? What would that mean?

[Note: if you are struggling with the algebra in part (d), you should still be able to go a fair way with this question, for example by playing around with the data (hint: leverage is about the X values!) in various ways and seeing what happens to the leverage values...]
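The experimentation suggested in the note could be set up as in the sketch below. The first part assumes hypothetical variable names bmd and steroid. The second part uses entirely made-up data with a deliberately unbalanced covariate (3 ones out of 66); the group sizes are arbitrary and can be varied to see how the leverages respond.

```stata
* Part 1: leverage in the real data (bmd and steroid are hypothetical names)
quietly regress bmd steroid
capture drop lev_b
predict double lev_b, leverage
tabulate steroid, summarize(lev_b)

* Part 2: made-up data with a very unbalanced binary covariate
preserve                              // keep the real data safe
clear
set seed 7897345
set obs 66
generate byte x = (_n <= 3)           // only 3 ones: an arbitrary choice
generate double y = rnormal()         // leverage does not depend on y
quietly regress y x
predict double lev_sim, leverage
tabulate x, summarize(lev_sim)
restore                               // bring the real data back
```

Changing the 3 in the definition of x and rerunning Part 2 is a quick way to explore the extreme cases raised in the hint.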

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd