Reference no: EM132335312
Assignment - Data Manipulation
There are six exercises below. You are required to provide solutions for at least four of the six. You are required to solve at least one exercise in R and at least one in SAS. If you choose SAS for an exercise, you may use IML, DATA step operations or PROC SQL at your discretion.
Reuse - For many of these exercises, you may be able to reuse functions written in prior homework. Define those functions here.
Exercise 1 -
Part a - Repeat the table from Homework 5, Exercise 2. The table will contain 30 rows, each corresponding to a unique combination of CV from 8, 12, ..., 28 and Diff from 5, 10, ..., 25. However, for this exercise you only need to calculate one column of required replicates (α = 0.05 and β = 0.2).
Define the table. Do not print this table.
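A minimal sketch of Part a in R, assuming the replicate formula from the earlier homework was the usual normal-approximation, two-sample formula n = 2((z_(1−α/2) + z_(1−β)) · CV / Diff)², rounded up; substitute your own formula from Homework 5 if it differed:

```r
# Build the 30-row CV x Diff grid and one column of required replicates.
# The replicate formula below is an assumption standing in for Homework 5.
alpha <- 0.05
beta  <- 0.2
z <- qnorm(1 - alpha / 2) + qnorm(1 - beta)

ReplicatesTable <- expand.grid(CV = seq(8, 28, by = 4),
                               Diff = seq(5, 25, by = 5))
ReplicatesTable$Replicates <-
  ceiling(2 * (z * ReplicatesTable$CV / ReplicatesTable$Diff)^2)
```

`expand.grid` produces one row per combination, so the table has 6 × 5 = 30 rows as required.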
Part b - Create two subset tables: one that contains the combinations of CV and Diff that require the five largest numbers of replicates, and one that contains the combinations of CV and Diff that require the five smallest numbers of replicates. You can determine the subsets by ranking or sorting by required replicates. You can add a rank column to your table if you wish. Call one table LargestFive and the other SmallestFive.
Part c - Print LargestFive sorted by required replicates in descending order, and print SmallestFive in ascending order.
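One way to sketch Parts b and c in R, using `order` to sort by required replicates. The table construction here repeats an assumed replicate formula only so the snippet is self-contained:

```r
# Self-contained stand-in for the Part a table (illustrative formula).
z <- qnorm(0.975) + qnorm(0.8)
tab <- expand.grid(CV = seq(8, 28, by = 4), Diff = seq(5, 25, by = 5))
tab$Replicates <- ceiling(2 * (z * tab$CV / tab$Diff)^2)

# Sort once, then take the two ends of the ordering.
ord <- order(tab$Replicates)
SmallestFive <- tab[head(ord, 5), ]
LargestFive  <- tab[tail(ord, 5), ]

# Part c: largest in descending order, smallest in ascending order.
print(LargestFive[order(LargestFive$Replicates, decreasing = TRUE), ])
print(SmallestFive[order(SmallestFive$Replicates), ])
```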
Exercise 2 -
Part a - Go to the Statistical Reference Datasets (StRD) website and use the data file listed under 'Data File in Table Format'. You may reuse the file from Homework 6. Load the data into a table below.
Part b - Reshape or transpose this table from the wide format to the long format. Make sure the resulting table has two columns - Resistance and Instrument.
Part c - To confirm that the table was reshaped correctly, use aggregate or tapply to calculate mean Resistance grouped by Instrument from the long table, and use apply or colMeans to calculate column means from the wide table. Print and compare the results.
Note that the reshaped table should be equivalent to the file linked under 'Data File in Two-Column Format'.
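Parts b and c can be sketched in R with `stack` and `tapply`. A small made-up wide table stands in here for the StRD file, so the values below are illustrative only:

```r
# Toy wide-format stand-in for the StRD resistance data
# (columns = instruments; values are made up).
wide <- data.frame(I1 = c(85.2, 85.3, 85.1),
                   I2 = c(85.4, 85.5, 85.6),
                   I3 = c(85.0, 85.1, 85.2))

# stack() returns columns `values` and `ind`; rename to match the exercise.
long <- setNames(stack(wide), c("Resistance", "Instrument"))

# The two summaries should agree column for column.
print(tapply(long$Resistance, long$Instrument, mean))
print(colMeans(wide))
```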
Exercise 3 -
Create an ordered treatment pairs table from pumpkin.csv. In the submitted work, print the table only once, at the end of the exercise.
Part a - Read the pumpkin data and compute mean mi, standard deviation si and count ni for each level i of Class.
Part b - Create a table over all possible pairs i, j of the k Classes from these data. Let one table column be i and another column be j. Let i = 1, 2, ..., (k - 1) and j = i + 1, i + 2, ..., k. There will be (k × (k - 1))/2 rows in this table. I usually create an empty table, then fill the table using a pair of nested loops, the outer loop over i and the inner loop over j. Use a counter variable to keep track of the current row and increment the counter in each step of the inner loop.
Part c - Calculate Cohen's d for each Class pair in this table. Use a pooled standard deviation given by
s_pooled = √( ∑i (ni − 1) si² / (N − k) )
where N = ∑i ni is the total count over all classes and k is the number of classes.
You may add Class means to the table if you wish. Sort the table by d in descending order and print the table.
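The nested-loop construction and Cohen's d calculation can be sketched as below. The summary statistics (m = means, s = standard deviations, n = counts) are made up here and stand in for the pumpkin Class summaries from Part a:

```r
# Made-up per-class summaries standing in for the pumpkin data.
m <- c(10.0, 12.0, 15.0)   # class means
s <- c(2.0, 2.5, 3.0)      # class standard deviations
n <- c(8, 8, 8)            # class counts
k <- length(m)
N <- sum(n)

# Pooled standard deviation: sqrt(sum((n_i - 1) s_i^2) / (N - k)).
s.pooled <- sqrt(sum((n - 1) * s^2) / (N - k))

# Empty table with one row per pair, filled by nested loops with a counter.
pairs.tab <- data.frame(i = rep(NA, k * (k - 1) / 2), j = NA, d = NA)
row <- 1
for (i in 1:(k - 1)) {
  for (j in (i + 1):k) {
    pairs.tab[row, ] <- c(i, j, abs(m[i] - m[j]) / s.pooled)
    row <- row + 1
  }
}

# Sort by d in descending order and print once.
pairs.tab <- pairs.tab[order(pairs.tab$d, decreasing = TRUE), ]
print(pairs.tab)
```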
Exercise 4 -
Part a - Download the two files from D2L ncaa2018.csv and ncaa2019.csv, and read into data frames or tables. ncaa2018.csv comes from the same source as elo.csv from Homework 5, while ncaa2019.csv is the corresponding more recent data. These tables do not contain identical sets of columns, but we will be able to merge Finish by individual wrestlers. Do not print these tables.
Part b - The tables list the wrestlers qualifying for the NCAA 2018 and 2019 National Championships, respectively. Merge the tables into a single table that contains only those wrestlers who qualified for both tournaments. Use the columns Last and First to merge on; neither is unique for all wrestlers.
The merged table should have columns corresponding to Finish 2018 and Finish 2019 - you can leave the column names as the defaults produced by R or SAS. To check the merge, print the number of rows in the table, and determine if there are any missing values in either Finish column (sum or any are sufficient). Do not print the table.
Part c - Print a contingency table comparing Weight for 2018 and Weight for 2019. The sum of all cells in this table will be equal to the total number of wrestlers that competed in both tournaments; the sum of the main diagonal will be the number of wrestlers that competed in the same weight class for both. How many wrestlers changed weight classes?
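A sketch of the merge and contingency-table steps in R. The rosters here are made up (names, weights and finishes are not real data) so the snippet is self-contained; the real exercise reads ncaa2018.csv and ncaa2019.csv instead:

```r
# Made-up rosters standing in for ncaa2018.csv / ncaa2019.csv.
ncaa2018 <- data.frame(Last   = c("Lee", "Cox", "Hall"),
                       First  = c("S",   "M",   "Z"),
                       Weight = c(125, 133, 174),
                       Finish = c(1, 3, 2))
ncaa2019 <- data.frame(Last   = c("Lee", "Cox"),
                       First  = c("S",   "M"),
                       Weight = c(125, 141),
                       Finish = c(1, 2))

# Inner join keeps only wrestlers present in both years; merge() suffixes
# the duplicated columns as Weight.x/Weight.y and Finish.x/Finish.y.
both <- merge(ncaa2018, ncaa2019, by = c("Last", "First"))
print(nrow(both))
print(any(is.na(both$Finish.x)) | any(is.na(both$Finish.y)))

# Use common factor levels so the main diagonal means "same weight class".
lev <- sort(union(both$Weight.x, both$Weight.y))
wt <- table(factor(both$Weight.x, levels = lev),
            factor(both$Weight.y, levels = lev))
print(wt)
changed <- sum(wt) - sum(diag(wt))
print(changed)
```

Forcing both factors onto the same level set matters: with unequal level sets, `diag` would pick positional cells rather than matching weight classes.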
Exercise 5 -
Background - I'm working on software that produces a repeated measures analysis. To test my code, I use published data and compare results. For one analysis, I used data from Contemporary Statistical Models for the Plant and Soil Sciences, Oliver Schabenberger and Francis J. Pierce, 2001. These data are measurements of the diameter of individual apples from selected apple trees.
Part a - Download AppleData.csv if you choose R; the SAS data are included in the SAS template. Note the file includes comments before the data; you may need to specify a comment character on import. Do not print this table.
To simplify this exercise, create a subset of the AppleData including only trees number 3, 7 and 10.
Part b - Reshape or transpose this data from the long form to the wide form. Call this data AppleWide. This table should have one column for Tree, one column for Apple and six columns, diam.1 - diam.6. The values in the time columns come from diam in AppleData.
Part c - To confirm that you've reshaped correctly, print column means for the wide data set and use an aggregate or apply function to compute time means for the long format.
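Parts b and c can be sketched with base R's `reshape`. The long-format data below are simulated stand-ins for AppleData (same layout assumed: Tree, Apple, Time, diam), so the numbers are illustrative only:

```r
set.seed(1)
# Simulated stand-in for AppleData in long format.
AppleLong <- expand.grid(Tree = c(3, 7, 10), Apple = 1:2, Time = 1:6)
AppleLong$diam <- 2 + 0.1 * AppleLong$Time + rnorm(nrow(AppleLong), 0, 0.01)

# Long -> wide: one row per Tree/Apple, columns diam.1 ... diam.6.
AppleWide <- reshape(AppleLong, direction = "wide",
                     idvar = c("Tree", "Apple"), timevar = "Time")

# Check: column means of the wide table vs. time means of the long table.
wide.means <- colMeans(AppleWide[, grep("^diam", names(AppleWide))])
long.means <- tapply(AppleLong$diam, AppleLong$Time, mean)
print(wide.means)
print(long.means)
```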
Part d - I chose this example as a test case because it shows a case where the best repeated measures model is an autoregressive model - each measure is correlated with the preceding measure. We can estimate the degree of correlation using the following R code. You don't need to evaluate this code for this exercise; it's provided as motivation for reshaping the data.
mult.lm <- lm(cbind(diam.1, diam.2, diam.3, diam.4, diam.5, diam.6) ~ tree, data=AppleWide)
mult.manova <- manova(mult.lm)
print(cov2cor(estVar(mult.lm)))
Exercise 6 -
This is an exercise in computing the Wilcoxon Signed Rank test. We will be using an example from NIST (NATR332.DAT).
The data are provided:
NATR332.DAT <- data.frame(
Y1 = c(146,141,135,142,140,143,138,137,142,136),
Y2 = c(141,143,139,139,140,141,138,140,142,138)
)
Part a - Add a column Difference that is the difference between Y1 and Y2. For further analysis, exclude any rows where the difference is 0.
Part b - Next add the column Rank, which will be the rank of the absolute value of Difference.
Part c - Add the column SignedRank by applying the sign (+ or -) of Difference, to Rank (that is, if Difference is < 0, then SignedRank is -Rank, otherwise SignedRank is Rank).
Part d - Compute the sum of the positive ranks, and the absolute value of the sum of the negative ranks. Let W be the minimum of these two sums. Print W.
The expected mean of W is calculated by
µW = Nr ∗ (Nr + 1)/4
with a standard deviation of
σW = √(Nr(Nr + 1)(2Nr + 1)/6)
where Nr is the number of ranked values (excluding differences of 0). Calculate a z score by
zW = (W - µW)/σW
Print both µW and zW.
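Putting Parts a through d and the formulas above together in R, using the data given at the start of the exercise (ties in |Difference| receive average ranks, which is the default behavior of R's `rank`):

```r
NATR332.DAT <- data.frame(
  Y1 = c(146, 141, 135, 142, 140, 143, 138, 137, 142, 136),
  Y2 = c(141, 143, 139, 139, 140, 141, 138, 140, 142, 138)
)

Difference <- NATR332.DAT$Y1 - NATR332.DAT$Y2
Difference <- Difference[Difference != 0]   # exclude zero differences
Rank <- rank(abs(Difference))               # ties get average ranks
SignedRank <- sign(Difference) * Rank

# W is the smaller of the positive-rank sum and |negative-rank sum|.
W <- min(sum(SignedRank[SignedRank > 0]),
         abs(sum(SignedRank[SignedRank < 0])))

Nr <- length(Difference)                    # number of ranked values
muW <- Nr * (Nr + 1) / 4
sigmaW <- sqrt(Nr * (Nr + 1) * (2 * Nr + 1) / 6)
zW <- (W - muW) / sigmaW
print(c(W = W, muW = muW, zW = zW))
```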
The NIST page gives p-values based on the continuity correction. We are not computing this correction.
You can compute the P(z > zW) of your zW (using the normal distribution) and compare it to
wilcox.test(NATR332.DAT$Y1, NATR332.DAT$Y2, paired = TRUE, correct = FALSE, alternative = "less")
while the corrected p-values are given by
wilcox.test(NATR332.DAT$Y1, NATR332.DAT$Y2, paired = TRUE, correct = TRUE, alternative = "less")
wilcox.test(NATR332.DAT$Y1, NATR332.DAT$Y2, paired = TRUE, correct = TRUE, alternative = "greater")
Attachment: Data Manipulation Assignment File.rar