Reference no: EM132345133
Graphs Assignment -
General Instructions - There are 5 exercises. You are required to solve at least one exercise in R, and at least one in SAS.
Experimental - Again, you will be allowed to provide one solution using Python. Elaborate on the similarities and differences between function definitions in Python and those in R, IML, or the SAS macro language.
Exercise 1 -
Part a - Load the ncaa2018.csv data set and create histograms, QQ-norm and box-whisker plots for ELO. Add a title to each plot, identifying the data.
Part b - A common recommendation to address issues of non-normality is to transform data to correct for skewness. One common transformation is the log transform.
Transform ELO to log(ELO) and produce histograms, box-whisker and qqnorm plots of the transformed values. Are the transformed values more or less skewed than the original? (Note - the log transform is used to correct skewness; it is less useful for correcting kurtosis.)
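If you attempt this exercise in Python, the effect of the log transform on skewness can be checked numerically before plotting. The sketch below is only an illustration under stated assumptions: it uses a simulated right-skewed sample rather than the ELO column of ncaa2018.csv, and the helper name skewness is hypothetical.

```python
import math
import random

def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

random.seed(1)
# Simulated right-skewed stand-in data (log-normal), NOT the ELO values.
sample = [random.lognormvariate(7.0, 0.5) for _ in range(1000)]
log_sample = [math.log(v) for v in sample]

skew_raw = skewness(sample)
skew_log = skewness(log_sample)
print(f"skewness raw = {skew_raw:.3f}, log = {skew_log:.3f}")
```

For this simulated sample the log values are much closer to symmetric. Whether the same holds for ELO is exactly what Part b asks you to check; data that are not right-skewed to begin with can become more skewed after a log transform.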
Exercise 2 -
Review Exercise 4, Homework 6, where you calculated skewness and kurtosis. We will reproduce the histograms, and add qqnorm and box-whisker plots.
Part a - Use the code below from lecture to draw 1000 samples from the normal distribution.
norm.sample <- rnorm(1000, mean=0, sd=1)
Look up the corresponding r* functions in R for the Cauchy distribution (use location=0, scale=1), and the Weibull distribution (use shape = 1.5). For the double exponential, you can use the *laplace functions from the rmutil library, or you can use rexp(1000) - rexp(1000).
Draw 1000 samples from each of these distributions. Calculate skewness and kurtosis for each sample. You may use your own function, or use the moments library.
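For a Python attempt at Part a, the four samples and their moments can be produced with the standard library alone. This is a minimal sketch, not a reference solution: the helper name moments is hypothetical, the Cauchy draws use inverse-CDF sampling because the random module has no Cauchy generator, and kurtosis is reported in its non-excess form (a normal sample gives about 3).

```python
import math
import random

def moments(values):
    """Return (skewness, kurtosis) from central moments; kurtosis is m4 / m2**2."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

random.seed(2)
n = 1000
samples = {
    "normal": [random.gauss(0, 1) for _ in range(n)],
    # Standard Cauchy (location 0, scale 1) by inverse-CDF sampling.
    "cauchy": [math.tan(math.pi * (random.random() - 0.5)) for _ in range(n)],
    # weibullvariate(alpha, beta): alpha is the scale, beta is the shape.
    "weibull": [random.weibullvariate(1.0, 1.5) for _ in range(n)],
    # Difference of two Exp(1) draws gives a double exponential draw.
    "double exponential": [
        random.expovariate(1.0) - random.expovariate(1.0) for _ in range(n)
    ],
}

results = {name: moments(vals) for name, vals in samples.items()}
for name, (skew, kurt) in results.items():
    print(f"{name}: skewness = {skew:.2f}, kurtosis = {kurt:.2f}")
```

The Cauchy values will change drastically from seed to seed, since that distribution has no finite moments.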
Part b - Plot the histograms for each distribution. Use par(mfrow=c(2,2)) in your code chunk to combine the four histograms in a single plot. Add titles to the histograms indicating the distribution. Set the x-axis label to show the calculated skewness and kurtosis, i.e. skewness = ####, kurtosis = ####
par(mfrow=c(2,2))
Part c - Repeat Part b, but with QQ-norm plots.
Part d - Repeat Part b, but with box-whisker plots.
Exercise 3 -
Part a - We will create a series of graphs illustrating how the Poisson distribution approaches the normal distribution with large λ. We will iterate over a sequence of lambda values, from 2 to 64, doubling lambda each time. For each lambda, draw 1000 samples from the Poisson distribution.
Calculate the skewness of each set of samples, and produce histograms, QQ-norm and box-whisker plots. You can use par(mfrow=c(1,3)) to display all three for one lambda in one line. Add lambda=## to the title of the histogram, and skewness=## to the title of the box-whisker plot.
Part b - Remember that lambda represents the mean of a discrete (counting) variable. At what size mean is Poisson data no longer skewed, relative to normally distributed data? You might run this 2 or 3 times, with different seeds; this number varies in my experience.
par(mfrow=c(1,3))
If you do this in SAS, create a data table with data columns each representing a different µ. You can see combined histogram, box-whisker and QQ-norm plots for all columns by calling
proc univariate data=Distributions plot;
run;
At what µ is skewness of the Poisson distribution small enough to be considered normal?
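As a numerical companion to the plots, the sample skewness can be tabulated against the theoretical Poisson skewness, which is 1/sqrt(lambda). The Python sketch below assumes a hand-written sampler, because the standard library has no Poisson generator; the names rpois and skewness are hypothetical, and Knuth's multiplication method is one simple choice that is adequate for lambda up to 64.

```python
import math
import random

def rpois(lam):
    """Draw one Poisson(lam) value with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

random.seed(3)
lambdas = [2, 4, 8, 16, 32, 64]  # doubling lambda each time
skews = {}
for lam in lambdas:
    draws = [rpois(lam) for _ in range(1000)]
    skews[lam] = skewness(draws)
    theory = 1 / math.sqrt(lam)
    print(f"lambda={lam:2d}  sample skewness={skews[lam]:.3f}  theory={theory:.3f}")
```

Rerunning with a different seed shows how much the sample skewness moves around the theoretical value, which is why the answer to Part b can vary between runs.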
Exercise 4 -
Part a - Write a function that accepts a vector vec, a vector of integers, a main axis label and an x axis label. This function should
1. iterate over each element i in the vector of integers
2. produce a histogram for vec setting the number of bins in the histogram to i
3. label main and x-axis with the specified parameters
4. label the y-axis to read Frequency, bins = and the number of bins.
Hint: You can simplify this function by using the parameter ... - see ?plot or ?hist
Part b - Test your function with the hidalgo data set (see below), using bin numbers 12, 36, and 60. You should be able to call your function with something like
plot.histograms(hidalgo.dat[,1],c(12,36,60), main="1872 Hidalgo issue",xlab= "Thickness (mm)")
to plot three different histograms of the hidalgo data set.
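For comparison with the R function, here is a rough Python counterpart. It is a sketch under assumptions: to stay within the standard library it returns a description of each histogram (labels and bin counts) instead of drawing it, the names bin_counts and plot_histograms are hypothetical, and the demo values are made up rather than taken from the hidalgo data. Python's **kwargs plays a role similar to R's ... parameter, collecting any extra named arguments so they can be passed along.

```python
def bin_counts(values, bins):
    """Count values in `bins` equal-width intervals spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant data
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # maximum goes in last bin
        counts[idx] += 1
    return counts

def plot_histograms(vec, bin_numbers, main, xlab, **kwargs):
    """Build one histogram description per requested bin number."""
    plots = []
    for i in bin_numbers:
        plots.append({
            "main": main,
            "xlab": xlab,
            "ylab": f"Frequency, bins = {i}",
            "counts": bin_counts(vec, i),
            **kwargs,  # extra options pass through, like R's `...`
        })
    return plots

# Made-up thickness values for illustration only (NOT the hidalgo data).
demo = [0.060, 0.064, 0.069, 0.070, 0.071, 0.072,
        0.079, 0.080, 0.090, 0.100, 0.110, 0.131]
demo_plots = plot_histograms(demo, [2, 4], main="Demo data",
                             xlab="Thickness (mm)", col="grey")
for p in demo_plots:
    print(p["ylab"], p["counts"])
```

In an actual submission, the dictionary built inside the loop would be replaced by a call to the plotting library of your choice, with the same labels.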
If you do this in SAS, write a macro that accepts a table name, a column name, a list of integers, a main axis label and an x axis label. This macro should scan over each element in the list of integers and produce a histogram for each integer value, setting the bin count to the element in the input list, and labeling main and x-axis with the specified parameters. You should label the y-axis to read Frequency, bins = and the number of bins.
Test your macro with the hidalgo data set (see below), using bin numbers 12, 36, and 60. You should be able to call your macro with something like
%plot_histograms(hidalgo, y, 12 36 60, main="1872 Hidalgo issue", xlabel="Thickness (mm)");
to plot three different histograms of the hidalgo data set.
Hint: Assume 12 36 60 resolve to a single macro parameter and use %scan. Your macro definition can look something like
%macro plot_histograms(table_name, column_name, number_of_bins, main="Main", xlabel="X Label");
Exercise 5 -
We've been working with data from Wansink and Payne, Table 1:
Reproducing part of Wansink Table 1 (see attached file)
However, in Homework 2, we also considered the value given in the text
The resulting increase of 168.8 calories (from 268.1 calories . . . to 436.9 calories . . . ) represents a 63.0% increase . . . in calories per serving.
There is a discrepancy between two values reported for calories per serving, 2006. We will use graphs to attempt to determine which value is most consistent.
First, consider the relationship between Calories per Serving and Calories per Recipe:
Calories per Serving = Calories per Recipe / Servings per Recipe
Since Servings per Recipe is effectively constant over time (12.4-13.0), we can assume the relationship between Calories per Serving and Calories per Recipe is linear,
Calories per Serving = β0 + β1 × Calories per Recipe
with Servings per Recipe = 1/β1
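The claim that Servings per Recipe = 1/β1 can be checked on made-up numbers. The Python sketch below assumes a constant 12.7 servings per recipe (a hypothetical value inside the 12.4-13.0 range) and invented Calories per Recipe values that are not from Table 1; the helper name fit_line is also hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1

servings = 12.7  # hypothetical constant, chosen inside the 12.4-13.0 range
per_recipe = [2000.0, 2500.0, 3000.0, 3500.0]  # made-up, not from Table 1
per_serving = [r / servings for r in per_recipe]

b0, b1 = fit_line(per_recipe, per_serving)
print(f"b0 = {b0:.4f}, b1 = {b1:.5f}, 1/b1 = {1 / b1:.2f}")
```

With exactly constant servings the intercept is zero and 1/b1 recovers the servings value. With the real data, where servings vary slightly, the fit will only be approximately linear.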
We will fit a linear model, with Calories per Recipe as the independent variable against two sets of values for Calories per Serving, such that
- Assumption 1. The value in the table (384.4) is correct.
- Assumption 2. The value in the text (436.9) is correct.
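Before fitting anything, the arithmetic quoted from the text can be checked against both candidate values. The Python sketch below uses only numbers that appear in this assignment; the variable and function names are hypothetical.

```python
baseline = 268.1     # earlier calories per serving, as quoted in the text
text_value = 436.9   # 2006 value quoted in the text (Assumption 2)
table_value = 384.4  # 2006 value from the table (Assumption 1)

def percent_increase(old, new):
    """Percentage change from old to new."""
    return 100.0 * (new - old) / old

diff_text = text_value - baseline
diff_table = table_value - baseline
pct_text = percent_increase(baseline, text_value)
pct_table = percent_increase(baseline, table_value)

print(f"text value:  +{diff_text:.1f} calories, {pct_text:.1f}% increase")
print(f"table value: +{diff_table:.1f} calories, {pct_table:.1f}% increase")
```

The quoted 168.8 calories and 63.0% agree with the text value of 436.9, so the text is internally consistent. That does not by itself settle which value is correct, which is why the exercise turns to the regression.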
We use the data:
Part a - Plot the regression. Use points to plot Assumption1 vs CaloriesPerRecipe, and Assumption2 vs CaloriesPerRecipe, on the same graph. Add lines (i.e. abline) to show the fit from the regression. Use different colors for the two assumptions. Which of the two lines appears to best explain the data?
Part b - Produce diagnostic plots of the residuals from both linear models (in R, use residuals(Assumption1.lm)). qqnorm or box-whisker plots will probably be the most effective; there are too few points for a histogram. Use the code below to place two plots, side by side. You can produce more than one pair of plots, if you wish.
par(mfrow=c(1,2))
From these plots, which assumption is most likely correct? That is, which assumption produces a linear model that least violates assumptions of normality of the residual errors? Which assumption produces outliers in the residuals?
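The quantities behind a QQ-norm plot and the box-whisker outlier rule can also be computed directly, which helps when reading the plots. The Python sketch below runs on hypothetical residuals, not on residuals from either model. It assumes plotting positions (i + 0.5) / n and the "inclusive" quartile method; these are common conventions, and R's qqnorm and boxplot may use slightly different ones. The function names are hypothetical.

```python
from statistics import NormalDist, quantiles

# Hypothetical residuals for illustration only (NOT from either fitted model).
residuals = [-3.1, -1.2, -0.4, 0.3, 0.9, 1.5, 14.0]

def qq_points(values):
    """Pair each sorted value with a standard normal quantile at (i + 0.5) / n."""
    ordered = sorted(values)
    n = len(ordered)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

def boxplot_outliers(values):
    """Values more than 1.5 * IQR beyond the quartiles (usual whisker rule)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

points = qq_points(residuals)
outliers = boxplot_outliers(residuals)
for theo, obs in points:
    print(f"{theo:6.2f}  {obs:6.2f}")
print("outliers:", outliers)
```

On a QQ-norm plot, residuals that follow a normal distribution fall close to a straight line; a point such as the 14.0 here would sit well above that line and would also be drawn beyond the upper whisker.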
I've included similar data and linear models for SAS in the SAS template. If you choose SAS, you will need to modify the PROC GLM code to produce the appropriate diagnostic plots.
Attachment:- Graphs Assignment Files.rar