Framingham Heart Study: Data Preparation, Industry Applied Activity
Introduction
This activity uses the HEART dataset in the SASHELP library. To access the SASHELP library in SAS, select View > Explorer. In the Explorer window, select Libraries > Sashelp. The data came from the landmark Framingham Heart Study. The purpose of the Framingham Heart Study was to identify characteristics contributing to cardiovascular disease. Important links between cardiovascular disease and high blood pressure, high cholesterol levels, cigarette smoking, and many other health factors were first established using its data.
The original cohort of the Framingham Heart Study consisted of 5,209 men and women between the ages of 28 and 62 living in Framingham, Massachusetts. The first data-collection visit for participants in this cohort occurred between 1948 and 1953, and participants were assessed every two years thereafter through April 2014, almost seven decades of follow-up!
The complete Framingham Heart Study data consist of hundreds of datasets collected over time at 32 biennial exams and have led to over 3,000 published journal articles. To simplify analyses for illustrative purposes, the SASHELP.HEART dataset includes a snapshot of selected primary study variables taken at one of the biennial exams.
Framingham Heart Study: Data Preparation Activity
This activity comprises two parts. Part one outlines how to explore the data to understand the variables for analysis. Part two outlines how to prepare the data for future analyses by creating new variables and subsetting the data.
Part 1: Understanding the Variables
Deciding on an appropriate path for analysis often requires many steps. An important first step is exploring and examining the data. An initial exploratory data analysis builds understanding of the meaning of the study variables and can provide crucial clues about the data preparation needed before analysis.
Open and examine the SASHELP.HEART dataset and its variables. Familiarize yourself with the context and meanings behind the variables and their values.
How many observations are in the dataset?
How many variables are in the dataset? How many are numeric? How many are character?
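One way to answer these questions is with PROC CONTENTS, which reports the number of observations along with each variable's name and type. A minimal sketch:

proc contents data=sashelp.heart;  /* lists observation count, variable names, and types */
run;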
Exploring the assigned values of character variables can reveal patterns and inherent orderings. The default ordering of levels in SAS is alphabetical. The levels of many character variables have an inherent ordering of magnitude. For example, non-smokers smoke less than light smokers, who smoke less than moderate smokers.
Question
Tabulate the levels of the character variables in the SASHELP.HEART dataset. For each of the character variables:
What data values or levels are observed for each?
Which variables have an inherent ordering of magnitude? Does alphabetical order of the levels correspond to ordering levels by magnitude for any of these character variables?
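One possible approach uses PROC FREQ to tabulate each character variable. The sketch below assumes the character variables found with PROC CONTENTS are Status, DeathCause, Sex, Chol_Status, BP_Status, Weight_Status, and Smoking_Status; substitute whatever list you identified:

proc freq data=sashelp.heart;
   tables Status DeathCause Sex Chol_Status BP_Status Weight_Status Smoking_Status;  /* one one-way table per variable */
run;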
Examining the values of numeric variables can provide insight into their magnitude, spread, and symmetry. Variables with a symmetric distribution have roughly equal mean and median, so they can be summarized with either statistic. Variables with substantially different mean and median values have a non-symmetric distribution; such variables may be better summarized with the median. Additionally, some numeric variables may have few unique values and so could be better treated as categorical variables.
Question
Generate descriptive statistics and histograms for the numeric variables in the SASHELP.HEART dataset.
What is the minimum, maximum, median, and mean of each variable?
Do the mean and median seem substantially different for any of the variables?
Does Smoking seem better suited to analysis as a categorical variable or as a continuous variable?
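These summaries can be produced with PROC MEANS and PROC UNIVARIATE; with no VAR statement, both procedures analyze every numeric variable. A sketch:

proc means data=sashelp.heart n min max median mean;  /* one row of statistics per numeric variable */
run;

proc univariate data=sashelp.heart noprint;
   histogram;  /* one histogram per numeric variable; NOPRINT suppresses the tables */
run;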
The SASHELP.HEART dataset contains several categorical variables whose levels were originally created from values of continuous variables in the dataset. Understanding the relationships between related continuous and categorical predictors in a dataset can inform choices of predictors in later statistical analyses.
Question
Explore the variables Weight_Status, Smoking_Status, Chol_Status, and BP_Status as follows (see the code sketch after this list):
Variables Weight_Status, MRW, and Weight:
What are the ranges (minimum and maximum) of variables MRW and Weight for each level of Weight_Status?
Are the ranges of MRW for levels of Weight_Status overlapping?
Are the ranges of Weight for levels of Weight_Status overlapping?
Using your answers to the previous two questions, which variable, MRW or Weight, was used to create the levels of Weight_Status when this dataset was created?
Variables Smoking_Status and Smoking:
Which values of Smoking are categorized as Smoking_Status=Non-smoker? Light? Moderate? Heavy? Very Heavy?
Are any values of Smoking categorized into more than one level of Smoking_Status?
Variables Chol_Status and Cholesterol:
What are the ranges (minimum and maximum) of Cholesterol for each level of Chol_Status?
Are the ranges of Cholesterol for levels of Chol_Status overlapping?
Variable BP_Status:
What are the ranges (minimum and maximum) of Diastolic and Systolic for each level of BP_Status?
Are the ranges of Diastolic for levels of BP_Status overlapping?
Are the ranges of Systolic for levels of BP_Status overlapping?
Normal levels of blood pressure are usually defined as under 120 for systolic blood pressure and under 80 for diastolic blood pressure. Based on your answers to the previous questions, must one or both of systolic and diastolic blood pressure be high for an individual to be categorized as BP_Status=High?
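The per-level ranges asked for throughout this question can be generated with PROC MEANS and a CLASS statement. A sketch for the Weight_Status comparison; the other pairings (Smoking by Smoking_Status, Cholesterol by Chol_Status, Diastolic and Systolic by BP_Status) follow the same pattern:

proc means data=sashelp.heart min max;
   class Weight_Status;   /* one row of statistics per level of Weight_Status */
   var MRW Weight;
run;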
Exploring patterns of missingness in a dataset gives insight into data collection procedures for the study generating the dataset and may also indicate data entry or data collection errors.
Question
Examine missing data in the SASHELP.HEART dataset.
Which variables have no missing data?
Which variables have missing data?
For each variable with missing data, what percent of the data is missing?
Using what you currently know about the dataset, which variable(s) have patterns of missingness that could be expected, given either the definition of the variable or the values of other variables in the dataset?
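PROC MEANS reports the number of missing values for numeric variables, and the MISSING option in PROC FREQ counts missing levels of character variables. A sketch; the percent missing for each variable is NMISS divided by the total number of observations:

proc means data=sashelp.heart n nmiss;  /* N and NMISS for every numeric variable */
run;

proc freq data=sashelp.heart;
   tables Status DeathCause Sex Chol_Status BP_Status Weight_Status Smoking_Status / missing;
run;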
Question
Examine patterns of missingness on certain groups of variables as follows (see the code sketch after this list):
If MRW is non-missing, are both Height and Weight always non-missing?
If Weight_Status is non-missing, are both Height and Weight always non-missing?
If Smoking is non-missing, is Smoking_Status always non-missing, and vice versa?
If Cholesterol is non-missing, is Chol_Status always non-missing, and vice versa?
Analyze DeathCause and AgeAtDeath grouped by Status.
Are DeathCause and AgeAtDeath ever missing when Status=Dead?
Are DeathCause and AgeAtDeath ever non-missing when Status=Alive?
Analyze AgeCHDdiag grouped by DeathCause. Is AgeCHDdiag ever missing when DeathCause=Coronary Heart Disease?
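Cross-tabulations with the MISSING option and grouped statistics can reveal these patterns. A sketch for two of the checks; the remaining checks are analogous:

proc freq data=sashelp.heart;
   tables Smoking*Smoking_Status / missing list;  /* shows whether missingness in one implies missingness in the other */
run;

proc means data=sashelp.heart n nmiss;
   class Status;   /* AgeAtDeath missingness within Alive versus Dead */
   var AgeAtDeath;
run;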
Missing values can also impact later statistical analyses. SAS statistical procedures perform what is called a complete-case analysis: any observation with a missing value for any variable involved in the analysis is excluded. Such exclusions can substantially decrease the number of observations from a dataset that are used in a later statistical analysis.
Question
Tabulate the percent of observations in the SASHELP.HEART dataset that have non-missing values for all the predictor variables that you will use in later analyses: AgeAtStart, BP_Status, Chol_Status, Cholesterol, Diastolic, Height, MRW, Sex, Smoking, Smoking_Status, Systolic, Weight, and Weight_Status.
Does the SASHELP.HEART dataset seem to have a high amount of missing data for any of these predictors?
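One way to count complete cases is the CMISS function, which counts missing values across both numeric and character arguments. A sketch; the dataset name WORK.CHECKMISS is a placeholder:

data work.checkmiss;
   set sashelp.heart;
   /* complete = 1 when none of the thirteen predictors is missing */
   complete = (cmiss(AgeAtStart, BP_Status, Chol_Status, Cholesterol, Diastolic,
                     Height, MRW, Sex, Smoking, Smoking_Status, Systolic,
                     Weight, Weight_Status) = 0);
run;

proc freq data=work.checkmiss;
   tables complete;   /* the percent with complete = 1 answers the question */
run;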
Part 2: Creating New Variables and Subsetting the Data
An important next step after exploring a dataset is to create any new variables needed for later analyses. The primary outcome of the Framingham Heart study is whether a patient developed coronary heart disease. Interestingly, this variable is not included in the SASHELP.HEART dataset.
Use information in the variable AgeCHDdiag to create a variable describing whether a patient developed coronary heart disease. Specifically, if AgeCHDdiag is non-missing, then the individual had coronary heart disease, and if AgeCHDdiag is missing, the individual did not have coronary heart disease.
Create a new numeric variable named CHD.
Store this new variable in a temporary dataset named WORK.HEART1.
Code this variable so that CHD = 1 if AgeCHDdiag takes a value from 0 to 999 and CHD = 0 otherwise.
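A minimal DATA step implementing this specification; because a missing numeric value sorts below any number in SAS, a missing AgeCHDdiag falls into the ELSE branch:

data work.heart1;
   set sashelp.heart;
   if 0 <= AgeCHDdiag <= 999 then CHD = 1;  /* diagnosed with coronary heart disease */
   else CHD = 0;                            /* AgeCHDdiag missing: no diagnosis */
run;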
After creating any new variable, make sure to check your work.
Generate descriptive statistics for the variable AgeCHDdiag grouped by CHD.
Is CHD a numeric variable?
When CHD=1, is AgeCHDdiag always non-missing?
When CHD=0, is AgeCHDdiag always missing?
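PROC MEANS with CHD as a CLASS variable makes this check direct; a sketch:

proc means data=work.heart1 n nmiss min max;
   class CHD;        /* expect NMISS=0 within CHD=1 and N=0 within CHD=0 */
   var AgeCHDdiag;
run;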
Let's now turn to creating new predictor variables.
Statistical analyses can determine which variables collected in the Framingham Heart Study are predictive of the development of coronary heart disease. To facilitate comparison of levels of categorical predictors, those levels must be recoded so that the alphabetical order of the levels also corresponds to ordering the levels by magnitude. This is desirable because statistical procedures use the alphabetically last level as the reference level by default. Recoding is also useful so that levels appear in a logical order in plots.
Re-code categorical variables in the SASHELP.HEART dataset as follows (see the code sketch after these specifications):
Use WORK.HEART1 as the input dataset.
Create an output dataset named WORK.HEART2.
Create a new variable Chol_StatusNew by recoding Chol_Status as follows:
High = 1 High
Borderline = 2 Borderline
Desirable = 3 Desirable
Create a new variable Sex_New by recoding Sex as follows:
Male = 1 Male
Female = 2 Female
Create a new variable Weight_StatusNew by recoding Weight_Status as follows:
Overweight = 1 Overweight
Normal = 2 Normal
Underweight = 3 Underweight
Create a new variable Smoking_StatusNew by recoding Smoking_Status as follows:
Very Heavy (> 25) = 1 Very Heavy
Heavy (16-25) = 2 Heavy
Moderate (6-15) = 3 Moderate
Light (1-5) = 4 Light
Non-smoker = 5 Non-smoker
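A sketch of a DATA step implementing these recodes. The quoted values are taken from the specifications above (assuming they match the levels you tabulated earlier), and a blank (missing) status simply remains blank in the new variable:

data work.heart2;
   set work.heart1;
   length Chol_StatusNew Sex_New Weight_StatusNew Smoking_StatusNew $ 20;

   if Chol_Status = 'High'            then Chol_StatusNew = '1 High';
   else if Chol_Status = 'Borderline' then Chol_StatusNew = '2 Borderline';
   else if Chol_Status = 'Desirable'  then Chol_StatusNew = '3 Desirable';

   if Sex = 'Male'        then Sex_New = '1 Male';
   else if Sex = 'Female' then Sex_New = '2 Female';

   if Weight_Status = 'Overweight'       then Weight_StatusNew = '1 Overweight';
   else if Weight_Status = 'Normal'      then Weight_StatusNew = '2 Normal';
   else if Weight_Status = 'Underweight' then Weight_StatusNew = '3 Underweight';

   if Smoking_Status = 'Very Heavy (> 25)'    then Smoking_StatusNew = '1 Very Heavy';
   else if Smoking_Status = 'Heavy (16-25)'   then Smoking_StatusNew = '2 Heavy';
   else if Smoking_Status = 'Moderate (6-15)' then Smoking_StatusNew = '3 Moderate';
   else if Smoking_Status = 'Light (1-5)'     then Smoking_StatusNew = '4 Light';
   else if Smoking_Status = 'Non-smoker'      then Smoking_StatusNew = '5 Non-smoker';
run;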
Tabulate each of your new variables as follows to check your work:
i. Tabulate levels of each of the four new variables over all observations.
ii. Tabulate levels of Chol_StatusNew grouped by Chol_Status.
iii. Tabulate levels of Sex_New grouped by Sex.
iv. Tabulate levels of Weight_StatusNew grouped by Weight_Status.
v. Tabulate levels of Smoking_StatusNew grouped by Smoking_Status.
Do you see the expected ordering of levels within each variable (in part i) as well as the expected combinations of levels of re-coded and original variables (in parts ii-v)?
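PROC FREQ can produce all of these tabulations in one call; a sketch:

proc freq data=work.heart2;
   tables Chol_StatusNew Sex_New Weight_StatusNew Smoking_StatusNew;  /* part i */
   tables Chol_Status*Chol_StatusNew Sex*Sex_New                      /* parts ii-v */
          Weight_Status*Weight_StatusNew Smoking_Status*Smoking_StatusNew / list missing;
run;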
We have now finished creating new variables.
In part 1, question 7, you tabulated the amount of missing data for the set of predictor variables of interest. From this, you noticed that only a small percentage (<5%) of observations in the SASHELP.HEART dataset have missing data for any of these variables.
Ideally, statistical analyses for the SASHELP.HEART dataset should be performed only on observations with no missing data for all these predictors. This ensures that all analyses, regardless of the predictors included, use the same number of observations. Given that the amount of missing data is small, analyses can simply exclude any observation with missing data on at least one of the predictors of interest. Other strategies such as single or multiple imputation could be employed, but those are beyond the scope of this exercise.
Create a new permanent dataset that can be used for later statistical analyses.
Use WORK.HEART2 as the input dataset.
Create a library named HEARTLIB.
Create an output dataset named HEARTLIB.MYHEART that contains only those observations that have non-missing values for the following variables: AgeAtStart, BP_Status, Cholesterol, Chol_StatusNew, Diastolic, Height, MRW, Sex_New, Smoking, Smoking_StatusNew, Systolic, Weight, Weight_Status, and Weight_StatusNew (see the code sketch below).
This dataset should have 5,039 observations.
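A sketch of the library assignment and the subsetting step. The path on the LIBNAME statement is a placeholder; point it to a folder where you have write access:

libname heartlib '/home/userid/heartlib';   /* hypothetical path */

data heartlib.myheart;
   set work.heart2;
   /* keep only complete cases on the fourteen variables listed above */
   if cmiss(AgeAtStart, BP_Status, Cholesterol, Chol_StatusNew, Diastolic,
            Height, MRW, Sex_New, Smoking, Smoking_StatusNew, Systolic,
            Weight, Weight_Status, Weight_StatusNew) = 0;
run;

proc contents data=heartlib.myheart;   /* confirm the dataset has 5,039 observations */
run;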
Check your work for the dataset HEARTLIB.MYHEART by tabulating values of character variables and generating descriptive statistics for numeric variables. Do you see any missing values in any of the tabulations or statistics generated?
Congratulations, you have completed data preparation for the Framingham Heart Study dataset! A next step in exploring relationships between coronary heart disease and predictors of interest is to perform additional descriptive analyses by creating logit plots.
The related Framingham Heart Study: Descriptive Analysis, Industry Applied Activity provides practice in generating logit plots. Following this, logistic regression models can be fit to formalize the statistical relationships between coronary heart disease and predictors of interest. The related Framingham Heart Study: Statistical Analysis, Industry Applied Activity provides practice in fitting these logistic regression models. These activities can be found in the Academic Hub.