Draw a scatterplot of BMD versus age

Assignment Help Applied Statistics
Reference no: EM132359550

Assignment - Linear Regression Modelling

LMR MODULE CASE STUDY EXERCISES - Most of the exercises in this case study are based on the two simple research questions introduced in the Case Study accompanying Module 1. Some exercises involve derivation of results from the notes using straightforward algebra. Other questions are computational or graphical with interpretation of results "for the clinician". As with the Module 1 Case Study, I strongly recommend that you create a Stata do file.

Part A - Model checking with a continuous covariate

In Question A of the case study for Module 1 we fitted a linear regression model to assess the association between age and BMD. We now want to assess whether the assumptions required for inference in the regression model appear reasonable.

1. RESIDUALS

(a) Obtain the fitted (predicted) values of the mean response using explicit calculation from the regression equation (in Stata, use the "generate" command) and verify the results against those produced by Stata's predict command. Calculate the residuals from the regression line similarly, and again verify against those produced by predict. Check that these residuals sum to zero.
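A minimal do-file sketch of the check in part (a). The Module 1 dataset is not reproduced here, so the sketch simulates a stand-in; the variable names bmd and age, the seed, and the simulated coefficients are all assumptions for illustration only and are not the real data.

```stata
* Stand-in data (assumed names and arbitrary values); replace with the Module 1 dataset
clear
set seed 2019
set obs 66
generate age = 20 + 50*runiform()
generate bmd = 1.2 - 0.004*age + rnormal(0, 0.1)

regress bmd age

* Fitted values and residuals by explicit calculation from the regression equation
generate double fit_manual = _b[_cons] + _b[age]*age
generate double res_manual = bmd - fit_manual

* The same quantities from predict
predict double fit_pred, xb
predict double res_pred, residuals

* Absolute discrepancies should be zero up to rounding error
generate double d_fit = abs(fit_manual - fit_pred)
generate double d_res = abs(res_manual - res_pred)
summarize d_fit d_res

* The residuals should sum to zero up to rounding error
quietly summarize res_manual
display "Sum of residuals = " r(sum)
```

Storing the new variables as double keeps rounding error from masking or mimicking a real discrepancy.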

(b) Now show algebraically that the residuals must sum to zero. [Hint: use $\hat{Y}_i = \bar{Y} + b_1(X_i - \bar{X})$.]
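One way the argument in part (b) can be laid out, offered as a sketch and not as the module's model answer. It assumes the usual least-squares intercept $b_0 = \bar{Y} - b_1\bar{X}$, which gives the fitted value $\hat{Y}_i = \bar{Y} + b_1(X_i - \bar{X})$:

```latex
\sum_{i=1}^{n} (Y_i - \hat{Y}_i)
  = \sum_{i=1}^{n} \left[ (Y_i - \bar{Y}) - b_1 (X_i - \bar{X}) \right]
  = \sum_{i=1}^{n} (Y_i - \bar{Y}) - b_1 \sum_{i=1}^{n} (X_i - \bar{X})
  = 0 - b_1 \cdot 0
  = 0
```

Both sums vanish because deviations from a sample mean always sum to zero.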

(c) Using a boxplot or histogram, determine if there are any obvious outlying observations.

(d) Calculate the standardised residuals. Plot the standardised residuals versus the predicted values and produce a normal probability plot of the residuals: does the latter indicate any substantial departures from normality?
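A hedged sketch of the commands that parts (c) and (d) point towards. As before, the data are a simulated stand-in with assumed variable names (bmd, age); the graph names are arbitrary.

```stata
* Stand-in data (assumed names and arbitrary values); replace with the Module 1 dataset
clear
set seed 2019
set obs 66
generate age = 20 + 50*runiform()
generate bmd = 1.2 - 0.004*age + rnormal(0, 0.1)

regress bmd age
predict double fitted, xb
predict double rstd, rstandard

* (c) Boxplot and histogram of the standardised residuals to look for outliers
graph box rstd, name(g_box, replace)
histogram rstd, normal name(g_hist, replace)

* (d) Standardised residuals against fitted values
scatter rstd fitted, yline(0) name(g_rvf, replace)

* (d) Normal quantile plot of the standardised residuals
qnorm rstd, name(g_qnorm, replace)
```

Stata also offers pnorm, which draws a standardised normal probability plot; which of the two the module notes intend is not stated here.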

2. NORMAL PROBABILITY PLOTS

In this question we ask you to perform some simulations to help you to develop a better feeling for random variation from both normal and non-normal distributions and for how this manifests itself in the normal probability plot.

NOTE: This is an exercise in how to run a small simulation study and on what to expect in terms of "normal" variation (from truly normal distributions and otherwise); it is important to remember, as we emphasise in the module notes, that normality of the error terms in a regression is not usually an important issue for the validity of linear model inferences (except in very small samples and with extreme departures from normality).

First, obtain 66 simulated normally distributed residuals (with variance 1) by generating 66 values from a standard normal distribution as follows:

  • set seed 7897345
  • clear
  • set obs 66
  • gen varname = rnormal()

where varname is the assigned variable name.

Some explanation of these commands:

The first command chooses a start point for the random number generator. You can choose any string of numbers up to 15 digits for the seed to initialise the generator; otherwise Stata always starts with 123456789! Note that a computer-generated "random number" cannot actually be truly random, since it is produced by a defined set of programming steps, but the programs used for this purpose generate a sequence of numbers that is close enough to random for practical purposes. Each "pseudo-"random number is produced by a calculation that uses the previous value, or the seed in the case of the first one within a particular session. Setting the seed is a good practice for enabling the user to reproduce the same sequence if they wish to. The "rnormal()" function produces a (pseudo-)random draw from the N(0,1) distribution.

(a) Produce a normal probability plot for these simulated residuals, and comment on its appearance. Do the points lie close to a straight line? Also compute the Shapiro-Wilk test for normality. Does the resulting P-value indicate any evidence against the null hypothesis of normality? Repeat this 5 times using different seeds.
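The repetition in part (a) can be sketched as a loop over seeds. The first seed is the one given above; the other five are arbitrary choices, and the variable name z and the graph names are assumptions.

```stata
* One pass per seed: simulate 66 N(0,1) draws, plot them, and test for normality
foreach s in 7897345 101 202 303 404 505 {
    clear
    set seed `s'
    set obs 66
    generate z = rnormal()
    qnorm z, name(qn`s', replace) title("Seed `s'")
    swilk z
    display "Seed `s': Shapiro-Wilk P-value = " %5.3f r(p)
}
```

Naming each graph keeps all six plots in memory so that they can be compared side by side.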

(b) Generate residuals from a right skewed distribution (Chi-squared with 1 df) by squaring your residuals calculated above. Produce a normal probability plot, perform the Shapiro-Wilk normality test, and describe the results.

(c) Now generate residuals from a short-tailed distribution (e.g. Uniform(-2,2), by using -2+4*U[0,1]) and comment again.
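A short sketch covering parts (b) and (c), assuming the simulated normal values are stored in a variable called z (a hypothetical name, as are chi1, unif, and the graph names).

```stata
clear
set seed 7897345
set obs 66
generate z = rnormal()

* (b) Right-skewed: the square of a standard normal is chi-squared with 1 df
generate chi1 = z^2
qnorm chi1, name(qn_chi1, replace)
swilk chi1

* (c) Short-tailed: Uniform(-2,2) obtained as -2 + 4*U(0,1)
generate unif = -2 + 4*runiform()
qnorm unif, name(qn_unif, replace)
swilk unif
```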

(d) Finally, generate residuals from a heavy-tailed distribution. This can be achieved using a mixture of two normal distributions, say N(0,1) with probability 0.90 and N(0,9) with probability 0.10. To do this first generate a uniform random variable and then select the normal distribution to sample from according to its value:

  • gen u = runiform()
  • gen varname = (u>0.1)*rnormal()+(u<0.1)*3*rnormal()

Again, comment on the results.
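The mixture in part (d) can also be written with cond(), which is an alternative to the indicator-multiplication form shown above and not a replacement prescribed by the module. Multiplying a standard normal draw by 3 gives standard deviation 3, that is, variance 9. The names heavy and wide are assumptions.

```stata
clear
set seed 7897345
set obs 66
generate u = runiform()

* Draw from N(0,9) when u < 0.1 and from N(0,1) otherwise
generate heavy = cond(u < 0.1, 3*rnormal(), rnormal())

* How many observations came from the wide component?
generate byte wide = u < 0.1
tabulate wide

qnorm heavy, name(qn_heavy, replace)
swilk heavy
```

With only 66 observations and a mixing probability of 0.10, the number of draws from the wide component varies noticeably from seed to seed, so the tabulate line is worth checking before interpreting the plot.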

(e) FOR DISCUSSION

Summarise your findings from the above simulations.

(f) FOR DISCUSSION

Return to your normal probability plot in question 1(d). Does your interpretation change at all after reviewing the normal probability plots generated in question 2(a)?

3. INFLUENTIAL OBSERVATIONS

The clinician now asks you whether there are any individual observations exerting unduly large influence on the slope of the fitted regression line.

(a) Draw a scatterplot of BMD versus age and add a fitted regression line. Try to identify points on the scatterplot that exert influence on the slope of the fitted line.

(b) Compute the DFBETA statistic for each observation. Identify any particularly large values. Are these large enough to alter the substantive conclusions about the relationship between BMD and age? [You may want to draw a scatterplot of the DFBETAs against age to help you.]

(c) Redraw the scatterplot of BMD vs age from part (a) with "bubbles" proportional to the size of the DFBETAs. Does this assist your interpretation?
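A sketch of parts (a) to (c) on simulated stand-in data; the variable names bmd and age, the simulated values, and the graph names are assumptions. Weighting the scatter by the absolute DFBETA is one way to scale the markers, since analytic weights in scatter control marker size.

```stata
* Stand-in data (assumed names and arbitrary values); replace with the Module 1 dataset
clear
set seed 2019
set obs 66
generate age = 20 + 50*runiform()
generate bmd = 1.2 - 0.004*age + rnormal(0, 0.1)

regress bmd age

* (a) Scatterplot with the fitted regression line superimposed
twoway (scatter bmd age) (lfit bmd age), name(g_fit, replace)

* (b) DFBETA for the age coefficient, plotted against age
predict double dfb_age, dfbeta(age)
scatter dfb_age age, yline(0) name(g_dfb, replace)

* (c) Bubbles sized by the absolute DFBETA (weights must be non-negative)
generate double abs_dfb = abs(dfb_age)
twoway (scatter bmd age [aweight = abs_dfb], msymbol(Oh)) ///
       (lfit bmd age), name(g_bubble_dfb, replace)
```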

(d) Compute the leverage for each observation using Stata, and plot the leverages versus squared residuals using the lvr2plot command. Comment on whether any of the values have a particularly high leverage. Graph BMD versus age with "bubbles" proportional in size to the leverages, with the fitted regression line superimposed.
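A sketch of part (d), again on simulated stand-in data with assumed variable names (bmd, age) and arbitrary graph names.

```stata
* Stand-in data (assumed names and arbitrary values); replace with the Module 1 dataset
clear
set seed 2019
set obs 66
generate age = 20 + 50*runiform()
generate bmd = 1.2 - 0.004*age + rnormal(0, 0.1)

regress bmd age

* Leverage (diagonal of the hat matrix) for each observation
predict double lev, leverage
summarize lev

* Leverage versus normalised squared residuals
lvr2plot, name(g_lvr2, replace)

* BMD versus age with bubbles sized by leverage and the fitted line superimposed
twoway (scatter bmd age [aweight = lev], msymbol(Oh)) ///
       (lfit bmd age), name(g_bubble_lev, replace)
```

As a sanity check, the leverages from a simple linear regression with an intercept sum to 2, the number of estimated coefficients.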

(e) FOR DISCUSSION

Comment on the pattern seen in the scatterplots from questions (c) and (d). In particular, are there any points with different sized bubbles in the two plots?

4. FOR ASSESSMENT

In terms of the assumptions made, does it seem appropriate (from the evidence available in the data) to fit a linear regression model to represent the relationship between BMD and age? Summarise the important findings from the model checking carried out in questions 1 to 3. This summary should include discussion of the validity of the main assumptions of linear regression, the residuals, outliers and influential observations. Your summary should include, in addition, up to three graphs to support your discussion on model checking. Finally, include a separate standalone summary that would be suitable for a clinician collaborator who had only a limited background in statistics. This collaborator would understand the concepts of hypothesis testing, P-values, confidence intervals, and the basic premise of a linear regression (as a line of best fit through some data), but would be unfamiliar with the assumptions of linear regression and the technical details of such an analysis. This clinical summary should include an interpretation of the key parameter(s) of the regression results, relevant confidence interval(s), relevant P-values, an interpretation of the strength of the linear relationship (if any), and a brief summary of the model checking and validity of the regression analysis.

Part B - Model checking with a binary covariate

In Part B of the case study for Module 1 we looked at the effect of ever having used steroids on BMD. We now want to assess the assumptions underlying regression-based inference for this question. Begin by refitting the simple linear regression as in Module 1.

(a) To remind yourself of the question, draw a scatterplot of BMD versus steroid use. Calculate the ordinary residuals.

(b) FOR DISCUSSION

Without constructing the scatterplot of residuals versus predicted values in Stata, what do you think the plot will look like? Be as specific as you can. [Hint: Look at the scatterplot from (a) and consider the nature of the predicted values from the regression model.] Now produce the residual versus predicted value plot (rvfplot) and update your answer if necessary.

(c) Compute the standardised residuals and assess them for evidence of outliers and non-normality.
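A sketch of the commands behind parts (a) to (c). The binary variable name steroid, the outcome name bmd, and the simulated values are assumptions standing in for the Module 1 data.

```stata
* Stand-in data (assumed names and arbitrary values); replace with the Module 1 dataset
clear
set seed 2019
set obs 66
generate byte steroid = runiform() < 0.3
generate bmd = 1.0 - 0.1*steroid + rnormal(0, 0.1)

regress bmd steroid

* (a) Scatterplot of BMD versus steroid use, and the ordinary residuals
scatter bmd steroid, xlabel(0 1) name(g_scatter, replace)
predict double res, residuals

* (b) Residual versus fitted value plot
rvfplot, yline(0) name(g_rvf, replace)

* (c) Standardised residuals, inspected by group and against the normal
predict double rstd, rstandard
graph box rstd, over(steroid) name(g_box, replace)
qnorm rstd, name(g_qnorm, replace)
```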

(d) FOR ASSESSMENT

Using the formula in the Module 2 notes, derive the general formula for leverage in a regression model using a single binary covariate with $n_1$ 1's and $n_0$ 0's.

Hint: how many unique values of leverage will there be for this regression analysis?

(e) FOR ASSESSMENT

Compute the leverage values for this dataset. Comment on when a binary covariate would produce highly unequal values of leverage. In what situation would this be most extreme?

Hint: Is it possible for the leverage to be equal to 1? In what circumstance? What would that mean?

[Note: if you are struggling with the algebra in part (d), you should still be able to go a fair way with this question, for example by playing around with the data (hint: leverage is about the X values!) in various ways and seeing what happens to the leverage values...]
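The "playing around with the data" suggested in the note can be sketched as follows. Everything here is hypothetical: the sample size of 20, the split of three 1's and seventeen 0's, and the variable names x, y, and lev. The outcome is pure noise, because leverage depends only on the X values.

```stata
clear
set seed 2019
set obs 20

* Three 1's and seventeen 0's; change the 3 to try other splits
generate byte x = _n <= 3

* Arbitrary outcome: the leverage values do not depend on it
generate y = rnormal()

regress y x
predict double lev, leverage

* Leverage summarised within each level of the binary covariate
tabulate x, summarize(lev)
```

Rerunning the sketch with different splits, including very lopsided ones, shows how the leverage in each group responds.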

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd