Create histograms, QQ-norm and box-whisker plots for ELO


Reference no: EM132345133

Graphs Assignment -

General Instructions - There are 5 exercises. You are required to solve at least one exercise in R, and at least one in SAS.

Experimental - Again, you will be allowed to provide one solution using Python. Elaborate on the similarities and differences between Python function definitions and R or IML or Macro language.

Exercise 1 -

Part a - Load the ncaa2018.csv data set and create histograms, QQ-norm and box-whisker plots for ELO. Add a title to each plot, identifying the data.

Part b - A common recommendation to address issues of non-normality is to transform data to correct for skewness. One common transformation is the log transform.

Transform ELO to log(ELO) and produce histograms, box-whisker and QQ-norm plots of the transformed values. Are the transformed values more or less skewed than the original? (Note - the log transform is used to correct skewness; it is less useful for correcting kurtosis.)
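Since one solution may be written in Python, the comparison in Part b can be sketched with the standard library alone. This is a minimal illustration, not the expected answer: the data are a hypothetical right-skewed sample generated in the code (not the ELO column), and the helper name skewness is an assumption. It uses the moment-based definition m3 / m2^1.5.

```python
import math
import random

def skewness(xs):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# Hypothetical stand-in for a positive, right-skewed variable; NOT the ELO data.
rng = random.Random(1)
values = [rng.lognormvariate(0.0, 0.75) for _ in range(2000)]

# The log transform requires strictly positive values.
logged = [math.log(v) for v in values]

skew_raw = skewness(values)
skew_log = skewness(logged)
print(f"skewness before log: {skew_raw:.3f}")
print(f"skewness after log:  {skew_log:.3f}")
```

The log compresses the right tail, so it reduces right skew. If the original values are roughly symmetric or left-skewed, the transform can make skewness worse, which is why the question asks "more or less".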

Exercise 2 -

Review Exercise 4, Homework 6, where you calculated skewness and kurtosis. We will reproduce the histograms, and add qqnorm and box-whisker plots.

Part a - Use the code below from lecture to draw 1000 samples from the normal distribution.

norm.sample <- rnorm(1000, mean=0, sd=1)

Look up the corresponding r* functions in R for the Cauchy distribution (use location=0, scale=1), and the Weibull distribution (use shape = 1.5). For the double exponential, you can use the *laplace functions from the rmutil library, or you can use rexp(1000) - rexp(1000).

Draw 1000 samples from each of these distributions. Calculate skewness and kurtosis for each sample. You may use your own function, or use the moments library.
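For a Python version of Part a, the four samples can be drawn with the standard random module. This is a sketch under stated assumptions: the Weibull scale is taken to be 1 (the assignment only fixes the shape), the Cauchy draws use the inverse-CDF formula tan(pi * (u - 0.5)), and the helper name skew_kurt is hypothetical. Kurtosis here is the non-excess moment ratio m4 / m2^2, so a normal sample should give a value near 3.

```python
import math
import random

def skew_kurt(xs):
    """Return (skewness, kurtosis) from central moments; kurtosis is not excess."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

rng = random.Random(42)
n = 1000
samples = {
    "Normal": [rng.gauss(0.0, 1.0) for _ in range(n)],
    # Inverse CDF of the Cauchy distribution with location 0, scale 1.
    "Cauchy": [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)],
    # weibullvariate(alpha, beta): alpha is the scale (assumed 1), beta the shape.
    "Weibull": [rng.weibullvariate(1.0, 1.5) for _ in range(n)],
    # Difference of two unit exponentials is a double exponential (Laplace) draw.
    "Double exponential": [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(n)],
}

stats = {name: skew_kurt(xs) for name, xs in samples.items()}
for name, (skew, kurt) in stats.items():
    print(f"{name}: skewness = {skew:.3f}, kurtosis = {kurt:.3f}")
```

The Cauchy figures change wildly from seed to seed because the distribution has no finite moments; that instability is itself worth noting in the write-up.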

Part b - Plot the histograms for each distribution. Use par(mfrow=c(2,2)) in your code chunk to combine the four histograms in a single plot. Add titles to the histograms indicating the distribution. Set the x-axis label to show the calculated skewness and kurtosis, i.e. skewness = ####, kurtosis = ####

par(mfrow=c(2,2))

Part c - Repeat Part b, but with QQ-norm plots.

Part d - Repeat Part b, but with box-whisker plots.
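To make clear what a QQ-norm plot compares, the points behind it can be computed directly. The sketch below pairs each sorted observation with a standard normal quantile, using plotting positions (i - a) / (n + 1 - 2a) with a = 3/8 for n <= 10 and a = 1/2 otherwise, which is the convention of R's ppoints. The function names qq_norm_points and correlation are hypothetical, and the correlation of the points is used only as a rough numeric stand-in for "how straight the plot looks".

```python
import math
import random
from statistics import NormalDist

def qq_norm_points(xs):
    """Pair each sorted value with the matching standard normal quantile."""
    n = len(xs)
    a = 3 / 8 if n <= 10 else 1 / 2  # plotting-position offset
    probs = [(i - a) / (n + 1 - 2 * a) for i in range(1, n + 1)]
    theoretical = [NormalDist().inv_cdf(p) for p in probs]
    return list(zip(theoretical, sorted(xs)))

def correlation(points):
    """Pearson correlation of (x, y) pairs; near 1 means a nearly straight QQ plot."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(7)
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
cauchy_sample = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(1000)]

r_normal = correlation(qq_norm_points(normal_sample))
r_cauchy = correlation(qq_norm_points(cauchy_sample))
print(f"QQ correlation, normal sample: {r_normal:.4f}")
print(f"QQ correlation, Cauchy sample: {r_cauchy:.4f}")
```

Heavy tails show up as points that peel away from the line at both ends, which pulls the correlation down.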

Exercise 3 -

Part a - We will create a series of graphs illustrating how the Poisson distribution approaches the normal distribution with large λ. We will iterate over a sequence of lambda, from 2 to 64, doubling lambda each time. For each 'lambda' draw 1000 samples from the Poisson distribution.

Calculate the skewness of each set of samples, and produce histograms, QQ-norm and box-whisker plots. You can use par(mfrow=c(1,3)) to display all three for one lambda in one line. Add lambda=## to the title of the histogram, and skewness=## to the title of the box-whisker plot.
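The loop in Part a can be sketched in Python as follows. The standard library has no Poisson generator, so this example swaps in Knuth's multiplication method (the helper rpois is hypothetical and only suitable for small lambda such as the values used here). Each sample skewness is printed next to the theoretical Poisson skewness, 1 / sqrt(lambda), as a reference point.

```python
import math
import random

def skewness(xs):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def rpois(rng, lam):
    """One Poisson draw via Knuth's method: multiply uniforms until below exp(-lam)."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

rng = random.Random(2018)

# Sequence 2, 4, 8, ..., 64: double lambda each time.
lambdas = []
lam = 2
while lam <= 64:
    lambdas.append(lam)
    lam *= 2

results = {}
for lam in lambdas:
    draws = [rpois(rng, lam) for _ in range(1000)]
    results[lam] = (skewness(draws), 1 / math.sqrt(lam))
    print(f"lambda={lam}: sample skewness={results[lam][0]:.3f}, "
          f"theoretical={results[lam][1]:.3f}")
```

Rerunning with a different seed shows how much the sample skewness moves around at 1000 draws, which is relevant when deciding where the skew becomes negligible.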

Part b - Remember that lambda represents the mean of a discrete (counting) variable. At what size mean is Poisson data no longer skewed, relative to normally distributed data? You might run this 2 or 3 times, with different seeds; this number varies in my experience.

par(mfrow=c(1,3))

If you do this in SAS, create a data table with data columns each representing a different µ. You can see combined histogram, box-whisker and QQ-norm, for all columns, by calling

proc univariate data=Distributions plot;
run;

At what µ is skewness of the Poisson distribution small enough to be considered normal?

Exercise 4 -

Part a - Write a function that accepts a vector vec, a vector of integers, a main axis label and an x axis label. This function should

1. iterate over each element i in the vector of integers
2. produce a histogram for vec, setting the number of bins in the histogram to i
3. label main and x-axis with the specified parameters
4. label the y-axis to read Frequency, bins = and the number of bins.

Hint: You can simplify this function by using the parameter ... - see ?plot or ?hist
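For the Python comparison requested in the general instructions, the closest analogue of R's ... parameter is **kwargs: both collect extra named arguments and pass them along unchanged. The sketch below covers only the iteration, binning and labelling logic. It returns a description of each histogram instead of drawing it, because the choice of graphics library is left open. The function name plot_histograms mirrors the R call in Part b, while the thickness values and the col argument are hypothetical.

```python
def plot_histograms(vec, bin_counts, main, xlab, **kwargs):
    """Build one histogram description per requested bin count.

    Extra keyword arguments are forwarded untouched, playing the role
    that `...` plays in an R function signature.
    """
    lo = min(vec)
    span = max(vec) - lo
    specs = []
    for bins in bin_counts:
        counts = [0] * bins
        for x in vec:
            # Equal-width bins; clamp so the maximum lands in the last bin.
            idx = 0 if span == 0 else min(int((x - lo) / span * bins), bins - 1)
            counts[idx] += 1
        specs.append({
            "main": main,
            "xlab": xlab,
            "ylab": f"Frequency, bins = {bins}",
            "counts": counts,
            **kwargs,
        })
    return specs

# Hypothetical thickness values for illustration only; NOT the hidalgo data.
thickness = [0.060 + 0.001 * (i % 40) + 0.0005 * (i % 7) for i in range(200)]

specs = plot_histograms(thickness, [12, 36, 60],
                        main="1872 Hidalgo issue", xlab="Thickness (mm)",
                        col="grey")
for spec in specs:
    print(spec["ylab"], "->", len(spec["counts"]), "bins")
```

One difference to keep in mind when comparing with R: a single number passed as breaks to hist is treated as a suggestion and rounded to "pretty" breakpoints, so the drawn bin count may not equal i unless explicit breakpoints are supplied.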

Part b - Test your function with the hidalgo data set (see below), using bin numbers 12, 36, and 60. You should be able to call your function with something like

plot.histograms(hidalgo.dat[,1],c(12,36,60), main="1872 Hidalgo issue",xlab= "Thickness (mm)")

to plot three different histograms of the hidalgo data set.

If you do this in SAS, write a macro that accepts a table name, a column name, a list of integers, a main axis label and an x axis label. This macro should scan over each element in the list of integers and produce a histogram for each integer value, setting the bin count to the element in the input list, and labeling main and x-axis with the specified parameters. You should label the y-axis to read Frequency, bins = and the number of bins.

Test your macro with the hidalgo data set (see below), using bin numbers 12, 36, and 60. You should be able to call your macro with something like

%plot_histograms(hidalgo, y, 12 36 60, main="1872 Hidalgo issue", xlabel="Thickness (mm)");

to plot three different histograms of the hidalgo data set.

Hint: Assume 12 36 60 resolve to a single macro parameter and use %scan. Your macro definition can look something like

%macro plot_histograms(table_name, column_name, number_of_bins, main="Main", xlabel="X Label");

Exercise 5 -

We've been working with data from Wansink and Payne, Table 1:

Reproducing part of Wansink Table 1 (see attached file)

However, in Homework 2, we also considered the value given in the text

The resulting increase of 168.8 calories (from 268.1 calories . . . to 436.9 calories . . . ) represents a 63.0% increase . . . in calories per serving.

There is a discrepancy between two values reported for calories per serving, 2006. We will use graphs to attempt to determine which value is more consistent with the rest of the data.
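The size of the discrepancy can be checked with a few lines of arithmetic. This sketch uses only the three figures quoted above (268.1, 436.9 and 384.4) and computes the increase and percentage increase implied by each candidate 2006 value; the variable names are illustrative.

```python
baseline = 268.1  # starting value quoted in the text

# The two competing values for calories per serving, 2006.
candidates = {"text": 436.9, "table": 384.4}

changes = {}
for source, value in candidates.items():
    increase = value - baseline
    percent = 100 * increase / baseline
    changes[source] = (round(increase, 1), round(percent, 1))
    print(f"{source}: +{changes[source][0]} calories, {changes[source][1]}% increase")
```

The text value reproduces the quoted 168.8 calories and 63.0%, while the table value implies an increase of 116.3 calories, or 43.4%. The quoted sentence is therefore internally consistent, and the question is whether the table or the text matches the other columns.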

First, consider the relationship between Calories per Serving and Calories per Recipe:

Calories per Serving = Calories per Recipe / Servings per Recipe

Since Servings per Recipe is effectively constant over time (12.4-13.0), we can assume the relationship between Calories per Serving and Calories per Recipe is linear,

Calories per Serving = β₀ + β₁ × Calories per Recipe

with Servings per Recipe = 1/β₁
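The claim that the slope recovers Servings per Recipe can be illustrated with a small least-squares fit. Everything numeric here is hypothetical: the calorie values are made up, and servings is fixed at 12.7 simply because it lies inside the 12.4-13.0 range mentioned above. With servings exactly constant, the fit should return an intercept of about 0 and a slope of 1/12.7.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical values for illustration only; NOT the Wansink and Payne data.
servings = 12.7
calories_per_recipe = [2100.0, 2300.0, 2500.0, 2800.0, 3000.0, 3400.0, 3800.0]
calories_per_serving = [c / servings for c in calories_per_recipe]

b0, b1 = fit_line(calories_per_recipe, calories_per_serving)
print(f"intercept = {b0:.6f}, slope = {b1:.6f}, 1/slope = {1 / b1:.3f}")
```

With real data the servings vary slightly, so the intercept will not be exactly zero and 1/slope is only an approximate servings figure. A point that sits far from the fitted line is the kind of inconsistency the two assumptions are meant to expose.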

We will fit a linear model, with Calories per Recipe as the independent variable against two sets of values for Calories per Serving, such that

Assumption 1. The value in the table (384.4) is correct.
Assumption 2. The value in the text (436.9) is correct.

We use the data:

Part a - Plot the regression. Use points to plot Assumption1 vs CaloriesPerRecipe, and Assumption2 vs CaloriesPerRecipe, on the same graph. Add lines (i.e. abline) to show the fit from the regression. Use different colors for the two assumptions. Which of the two lines appears to best explain the data?

Part b - Produce diagnostic plots of the residuals from both linear models (in R, use residuals(Assumption1.lm)). qqnorm or box-whisker plots will probably be the most effective; there are too few points for a histogram. Use the code below to place two plots, side by side. You can produce more than one pair of plots, if you wish.

par(mfrow=c(1,2))

From these plots, which assumption is most likely correct? That is, which assumption produces a linear model that least violates assumptions of normality of the residual errors? Which assumption produces outliers in the residuals?
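The outlier question can also be answered numerically with the rule a box-whisker plot usually applies: a point is flagged when it lies more than 1.5 times the interquartile range beyond the quartiles. The sketch below is an illustration under assumptions. Both residual vectors are hypothetical, the function name boxplot_outliers is made up, and the quartiles come from statistics.quantiles with the inclusive method, which can differ slightly from the hinges R's boxplot uses.

```python
from statistics import quantiles

def boxplot_outliers(values, coef=1.5):
    """Return the values lying beyond coef * IQR from the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - coef * iqr, q3 + coef * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical residuals for illustration only; NOT output from the real models.
residuals_a = [-3.1, -1.8, -0.6, 0.2, 0.9, 1.7, 2.7]
residuals_b = [-6.0, -5.2, -4.4, -3.9, -3.1, -2.4, 25.0]

outliers_a = boxplot_outliers(residuals_a)
outliers_b = boxplot_outliers(residuals_b)
print("outliers, first set: ", outliers_a)
print("outliers, second set:", outliers_b)
```

In the second set a single large residual forces all the others negative, since least-squares residuals sum to zero. That pattern is what one inconsistent data point tends to produce in a small regression.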

I've included similar data and linear models for SAS in the SAS template. If you choose SAS, you will need to modify the PROC GLM code to produce the appropriate diagnostic plots.

Attachment:- Graphs Assignment Files.rar

Reviews

len2345133

7/24/2019 9:41:15 PM

Instructions: Please read instructions in the pdf file. All codes should be in the rmd template for all the 4 exercises, no exception. Provide output/results in pdf format. General Instructions - There are 5 exercises, each is worth 10 points. You are required to solve at least one exercise in R, and at least one in SAS. You are required to provide five solutions, each solution will be worth 10 points. For this exercise, you may use whatever graphics library you desire.
