Reference no: EM132335312
Assignment - Data Manipulation
There are six exercises below. You are required to provide solutions for at least four of the six. You are required to solve at least one exercise in R and at least one in SAS. If you choose SAS for an exercise, you may use IML, DATA step operations or PROC SQL at your discretion.
Reuse - For many of these exercises, you may be able to reuse functions written in prior homework. Define those functions here.
Exercise 1 -
Part a - Repeat the table from Homework 5, Exercise 2. The table will contain 30 rows, each corresponding to a unique combination of CV from 8, 12, ..., 28 and Diff from 5, 10, ..., 25. However, for this exercise you only need to calculate one column of required replicates (α = 0.05 and β = 0.2).
Define the table. Do not print this table.
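A minimal sketch of Part a in R, assuming the replicate formula from the earlier homework was the usual normal-approximation, two-sample formula n = 2((z_(1−α/2) + z_(1−β)) · CV / Diff)², rounded up; substitute your own formula from Homework 5 if it differed:

```r
# Build the 30-row CV x Diff grid and one column of required replicates.
# The replicate formula below is an assumption standing in for Homework 5.
alpha <- 0.05
beta  <- 0.2
z <- qnorm(1 - alpha / 2) + qnorm(1 - beta)

ReplicatesTable <- expand.grid(CV = seq(8, 28, by = 4),
                               Diff = seq(5, 25, by = 5))
ReplicatesTable$Replicates <-
  ceiling(2 * (z * ReplicatesTable$CV / ReplicatesTable$Diff)^2)
```

`expand.grid` produces one row per combination, so the table has 6 × 5 = 30 rows as required.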
Part b - Create two subset tables: one that contains the combinations of CV and Diff that require the five largest numbers of replicates, and one that contains the combinations of CV and Diff that require the five smallest numbers of replicates. You can determine the subsets by ranking or sorting by required replicates. You can add a rank column to your table if you wish. Call one table LargestFive and the other SmallestFive.
Part c - Print LargestFive sorted by required replicates in descending order, and print SmallestFive in ascending order.
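One way to sketch Parts b and c in R, using `order` to sort by required replicates. The table construction here repeats an assumed replicate formula only so the snippet is self-contained:

```r
# Self-contained stand-in for the Part a table (illustrative formula).
z <- qnorm(0.975) + qnorm(0.8)
tab <- expand.grid(CV = seq(8, 28, by = 4), Diff = seq(5, 25, by = 5))
tab$Replicates <- ceiling(2 * (z * tab$CV / tab$Diff)^2)

# Sort once, then take the two ends of the ordering.
ord <- order(tab$Replicates)
SmallestFive <- tab[head(ord, 5), ]
LargestFive  <- tab[tail(ord, 5), ]

# Part c: largest in descending order, smallest in ascending order.
print(LargestFive[order(LargestFive$Replicates, decreasing = TRUE), ])
print(SmallestFive[order(SmallestFive$Replicates), ])
```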
Exercise 2 -
Part a - Go to the Statistical Reference Datasets (StRD) website and use the data file listed under 'Data File in Table Format'. You may reuse the file from Homework 6. Load the data into a table below.
Part b - Reshape or transpose this table from the wide format to the long format. Make sure the resulting table has two columns - Resistance and Instrument.
Part c - To confirm that the table was reshaped correctly, use aggregate or tapply to calculate mean Resistance grouped by Instrument from the long table, and use apply or colMeans to calculate column means from the wide table. Print and compare the results.
Note that the reshaped table should be equivalent to the file linked under 'Data File in Two-Column Format'.
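Parts b and c can be sketched in R with `stack` and `tapply`. A small made-up wide table stands in here for the StRD file, so the values below are illustrative only:

```r
# Toy wide-format stand-in for the StRD resistance data
# (columns = instruments; values are made up).
wide <- data.frame(I1 = c(85.2, 85.3, 85.1),
                   I2 = c(85.4, 85.5, 85.6),
                   I3 = c(85.0, 85.1, 85.2))

# stack() returns columns `values` and `ind`; rename to match the exercise.
long <- setNames(stack(wide), c("Resistance", "Instrument"))

# The two summaries should agree column for column.
print(tapply(long$Resistance, long$Instrument, mean))
print(colMeans(wide))
```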
Exercise 3 -
Create an ordered treatment pairs table from pumpkin.csv. In the submitted work, print the table only once, at the end of the exercise.
Part a - Read the pumpkin data and compute mean mi, standard deviation si and count ni for each level i of Class.
Part b - Create a table over all possible pairs i, j of the k Classes from these data. Let one table column be i and another column be j. Let i = 1, 2, ..., (k - 1) and j = i + 1, i + 2, ..., k. There will be (k × (k - 1))/2 rows in this table. I usually create an empty table, then fill the table using a pair of nested loops, the outer loop over i and the inner loop over j. Use a counter variable to keep track of the current row and increment the counter in each step of the inner loop.
Part c - Calculate Cohen's d for each Class pair in this table. Use a pooled standard deviation given by
s_pooled = √( ∑i (ni − 1) si² / (N − k) )
where N = ∑i ni is the total count over all classes and k is the number of classes.
You may add Class means to the table if you wish. Sort the table by d in descending order and print the table.
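The nested-loop construction and Cohen's d calculation can be sketched as below. The summary statistics (m = means, s = standard deviations, n = counts) are made up here and stand in for the pumpkin Class summaries from Part a:

```r
# Made-up per-class summaries standing in for the pumpkin data.
m <- c(10.0, 12.0, 15.0)   # class means
s <- c(2.0, 2.5, 3.0)      # class standard deviations
n <- c(8, 8, 8)            # class counts
k <- length(m)
N <- sum(n)

# Pooled standard deviation: sqrt(sum((n_i - 1) s_i^2) / (N - k)).
s.pooled <- sqrt(sum((n - 1) * s^2) / (N - k))

# Empty table with one row per pair, filled by nested loops with a counter.
pairs.tab <- data.frame(i = rep(NA, k * (k - 1) / 2), j = NA, d = NA)
row <- 1
for (i in 1:(k - 1)) {
  for (j in (i + 1):k) {
    pairs.tab[row, ] <- c(i, j, abs(m[i] - m[j]) / s.pooled)
    row <- row + 1
  }
}

# Sort by d in descending order and print once.
pairs.tab <- pairs.tab[order(pairs.tab$d, decreasing = TRUE), ]
print(pairs.tab)
```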
Exercise 4 -
Part a - Download the two files from D2L ncaa2018.csv and ncaa2019.csv, and read into data frames or tables. ncaa2018.csv comes from the same source as elo.csv from Homework 5, while ncaa2019.csv is the corresponding more recent data. These tables do not contain identical sets of columns, but we will be able to merge Finish by individual wrestlers. Do not print these tables.
Part b - The tables list the wrestlers qualifying for the NCAA 2018 and 2019 National Championships, respectively. Merge the tables into a single table that contains only those wrestlers who qualified for both tournaments. Use the columns Last and First to merge on; neither is unique for all wrestlers.
The merged table should have columns corresponding to Finish 2018 and Finish 2019 - you can leave the column names as the defaults produced by R or SAS. To check the merge, print the number of rows in the table, and determine if there are any missing values in either Finish column (sum or any are sufficient). Do not print the table.
Part c - Print a contingency table comparing Weight for 2018 and Weight for 2019. The sum of all cells in this table will be equal to the total number of wrestlers that competed in both tournaments; the sum of the main diagonal will be the number of wrestlers that competed in the same weight class for both. How many wrestlers changed weight classes?
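A sketch of the merge and contingency-table steps in R. The rosters here are made up (names, weights and finishes are not real data) so the snippet is self-contained; the real exercise reads ncaa2018.csv and ncaa2019.csv instead:

```r
# Made-up rosters standing in for ncaa2018.csv / ncaa2019.csv.
ncaa2018 <- data.frame(Last   = c("Lee", "Cox", "Hall"),
                       First  = c("S",   "M",   "Z"),
                       Weight = c(125, 133, 174),
                       Finish = c(1, 3, 2))
ncaa2019 <- data.frame(Last   = c("Lee", "Cox"),
                       First  = c("S",   "M"),
                       Weight = c(125, 141),
                       Finish = c(1, 2))

# Inner join keeps only wrestlers present in both years; merge() suffixes
# the duplicated columns as Weight.x/Weight.y and Finish.x/Finish.y.
both <- merge(ncaa2018, ncaa2019, by = c("Last", "First"))
print(nrow(both))
print(any(is.na(both$Finish.x)) | any(is.na(both$Finish.y)))

# Use common factor levels so the main diagonal means "same weight class".
lev <- sort(union(both$Weight.x, both$Weight.y))
wt <- table(factor(both$Weight.x, levels = lev),
            factor(both$Weight.y, levels = lev))
print(wt)
changed <- sum(wt) - sum(diag(wt))
print(changed)
```

Forcing both factors onto the same level set matters: with unequal level sets, `diag` would pick positional cells rather than matching weight classes.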
Exercise 5 -
Background - I'm working on software that produces a repeated measures analysis. To test my code, I use published data and compare results. For one analysis, I used data from Contemporary Statistical Models for the Plant and Soil Sciences, Oliver Schabenberger and Francis J. Pierce, 2001. These data are measurements of the diameter of individual apples from selected apple trees.
Part a - Download AppleData.csv if you choose R; the SAS data are included in the SAS template. Note the file includes comments before the data; you may need to specify a comment character on import. Do not print this table.
To simplify this exercise, create a subset of the AppleData including only trees number 3, 7 and 10.
Part b - Reshape or transpose this data from the long form to the wide form. Call this data AppleWide. This table should have one column for Tree, one column for Apple and six columns, diam.1 - diam.6. The values in the time columns come from diam in AppleData.
Part c - To confirm that you've reshaped correctly, print column means for the wide data set and use an aggregate or apply function to compute time means for the long format.
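Parts b and c can be sketched with base R's `reshape`. The long-format data below are simulated stand-ins for AppleData (same layout assumed: Tree, Apple, Time, diam), so the numbers are illustrative only:

```r
set.seed(1)
# Simulated stand-in for AppleData in long format.
AppleLong <- expand.grid(Tree = c(3, 7, 10), Apple = 1:2, Time = 1:6)
AppleLong$diam <- 2 + 0.1 * AppleLong$Time + rnorm(nrow(AppleLong), 0, 0.01)

# Long -> wide: one row per Tree/Apple, columns diam.1 ... diam.6.
AppleWide <- reshape(AppleLong, direction = "wide",
                     idvar = c("Tree", "Apple"), timevar = "Time")

# Check: column means of the wide table vs. time means of the long table.
wide.means <- colMeans(AppleWide[, grep("^diam", names(AppleWide))])
long.means <- tapply(AppleLong$diam, AppleLong$Time, mean)
print(wide.means)
print(long.means)
```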
Part d - I chose this example as a test case because it shows a case where the best repeated measures model is an autoregressive model - each measure is correlated with the preceding measure. We can estimate the degree of correlation using the following R code. You don't need to evaluate this code for this exercise; it's provided as motivation for reshaping the data.
mult.lm <- lm(cbind(diam.1, diam.2, diam.3, diam.4, diam.5, diam.6) ~ tree, data=AppleWide)
mult.manova <- manova(mult.lm)
print(cov2cor(estVar(mult.lm)))
Exercise 6 -
This is an exercise in computing the Wilcoxon Signed Rank test. We will be using an example from NIST (NATR332.DAT).
The data are provided:
NATR332.DAT <- data.frame(
Y1 = c(146,141,135,142,140,143,138,137,142,136),
Y2 = c(141,143,139,139,140,141,138,140,142,138)
)
Part a - Add a column Difference that is the difference between Y1 and Y2. For further analysis, exclude any rows where the difference is 0.
Part b - Next add the column Rank, which will be the rank of the absolute value of Difference.
Part c - Add the column SignedRank by applying the sign (+ or -) of Difference, to Rank (that is, if Difference is < 0, then SignedRank is -Rank, otherwise SignedRank is Rank).
Part d - Compute the sum of the positive ranks, and the absolute value of the sum of the negative ranks. Let W be the minimum of these two sums. Print W.
The expected mean of W is calculated by
µW = Nr ∗ (Nr + 1)/4
with a standard deviation of
σW = √(Nr(Nr + 1)(2Nr + 1)/6)
where Nr is the number of ranked values (excluding differences of 0). Calculate a z score by
zW = (W - µW)/σW
Print both µW and zW.
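Putting Parts a through d and the formulas above together in R, using the data given at the start of the exercise (ties in |Difference| receive average ranks, which is the default behavior of R's `rank`):

```r
NATR332.DAT <- data.frame(
  Y1 = c(146, 141, 135, 142, 140, 143, 138, 137, 142, 136),
  Y2 = c(141, 143, 139, 139, 140, 141, 138, 140, 142, 138)
)

Difference <- NATR332.DAT$Y1 - NATR332.DAT$Y2
Difference <- Difference[Difference != 0]   # exclude zero differences
Rank <- rank(abs(Difference))               # ties get average ranks
SignedRank <- sign(Difference) * Rank

# W is the smaller of the positive-rank sum and |negative-rank sum|.
W <- min(sum(SignedRank[SignedRank > 0]),
         abs(sum(SignedRank[SignedRank < 0])))

Nr <- length(Difference)                    # number of ranked values
muW <- Nr * (Nr + 1) / 4
sigmaW <- sqrt(Nr * (Nr + 1) * (2 * Nr + 1) / 6)
zW <- (W - muW) / sigmaW
print(c(W = W, muW = muW, zW = zW))
```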
The NIST page gives p-values based on the continuity correction. We are not computing this correction.
You can compute the P(z > zW) of your zW (using the normal distribution) and compare it to
wilcox.test(NATR332.DAT$Y1, NATR332.DAT$Y2, paired = TRUE, correct = FALSE, alternative = "less")
while the corrected p-values are given by
wilcox.test(NATR332.DAT$Y1, NATR332.DAT$Y2, paired = TRUE, correct = TRUE, alternative = "less")
wilcox.test(NATR332.DAT$Y1, NATR332.DAT$Y2, paired = TRUE, correct = TRUE, alternative = "greater")
Attachment: Data Manipulation Assignment File.rar