Write the r code to identify the unique years


Reference no: EM132388436

Title: "Tidy and clean the height data for NLSY '79"

Introduction - The purpose of this assignment is to tidy and clean the height data for respondents to NLSY '79 for all available years. Tidying the data is necessary because a person's height is recorded with different variables for each survey year. In this form, it is cumbersome, for example, to plot trends in height for individual teens as they aged. For this dataset, "cleaning" the data will refer to identifying values that represent missing values, or values that appear to be erroneously recorded.

Your assignment is to use the data on height that you imported from your downloads to create a tibble named 'height_data_late' with the following properties.

1. 'height_data_late' has 3 variables with the same names and modes as 'height_data_early'.

2. 'height_data_late$CASEID' contains the integer codes for the survey respondents.

3. 'height_data_late$year' contains the year of the data, as an integer.

4. 'height_data_late$height' contains the height of the respondents in inches, as an integer.

5. Any values indicating missing values of height are represented as 'NA' in 'height_data_late$height'.

6. Any values of height data that you judge to not be the true heights of the respondents (outliers) are recorded as 'NA' in 'height_data_late$height'.

7. In meeting requirement 6, give an explanation of your reasoning for declaring a value or group of values as an outlier.

Step by step instructions to carry out these steps are given below.

1) Inspect the distributions of variables in early height data

Below, perform the following tasks to gain insight into the distributions of the variables. This will help guide your work on the remaining height data.

1.1 - Write the R code to identify the unique years for which data is available in 'height_data_early'.
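A minimal sketch of step 1.1, assuming 'height_data_early' is a tibble with an integer 'year' column. The toy values below are made up so that the snippet runs on its own; they are not the NLSY '79 data.

```r
library(tibble)

# Toy stand-in for height_data_early (values invented for illustration)
height_data_early <- tibble(
  CASEID = c(1L, 2L, 1L, 2L, 1L),
  year   = c(1981L, 1981L, 1982L, 1982L, 1985L),
  height = c(66L, 70L, 67L, NA, 67L)
)

# Distinct survey years, sorted in ascending order
early_years <- sort(unique(height_data_early$year))
early_years
```

With the real tibble loaded, only the last two statements are needed.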

1.2 - Restricting data to the first year for which data is available, use the 'summary' function to report the quantiles and extreme values of height, and plot a histogram of height, to gain more insight into the distribution. (You will not be graded on your choice of binwidth.)
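One possible way to approach step 1.2 is sketched here. The tibble is a made-up stand-in, and the binwidth of 1 inch is an arbitrary choice.

```r
library(tibble)
library(dplyr)
library(ggplot2)

# Toy stand-in for height_data_early (values invented for illustration)
height_data_early <- tibble(
  CASEID = c(1L, 2L, 3L, 1L, 2L),
  year   = c(1981L, 1981L, 1981L, 1982L, 1982L),
  height = c(66L, 70L, 64L, 67L, NA)
)

# Keep only the first year for which data is available
first_year    <- min(height_data_early$year)
first_year_df <- filter(height_data_early, year == first_year)

# Quantiles and extreme values of height
summary(first_year_df$height)

# Histogram of height; na.rm = TRUE drops missing heights without a warning
p <- ggplot(first_year_df, aes(x = height)) +
  geom_histogram(binwidth = 1, na.rm = TRUE) +
  labs(x = "Height (inches)", y = "Respondents")
p
```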

Note on coding of missing values

Within the details on recording of data on the NLSY '79 web site is an explanation of how missing values are coded. For the height variable and others, negative numbers -5 to -1 code for values that are missing for one of 5 reasons. For our purposes, it is better to code all missing values as 'NA', as done in the 'height_data_early' dataset. You will do this in the next section.

2) Import the downloaded raw data

In part A of the project, you were asked to download as a .csv file the height data available for years between 2006 and 2014. You were also asked to copy from a Web page a table explaining the data contained in the variables (a data dictionary).

2.1 - Import height data into R

Import the height data using 'readr::read_csv'. This should result in a tibble called 'raw_height_df'. 'glimpse' applied to my version shows the structure reproduced in the following image. If your data download included other columns, you need to delete them, either in the .csv file (prior to importing) or using 'select' on the tibble to drop those columns. Be sure that all 11 variables listed above are present in your final tibble.

2.2 - Import height data dictionary into R

Next, import the data dictionary .csv file, saving the result as a tibble named 'height_dict'. When I performed this step, I received a tibble with the following structure. Notice that my call to 'read_csv' assigned the default column names (X1 to X4), since no column names were specified in my .csv file. Provide descriptive column names ("Variable_code", "Q_code", "Description", "year") for 'height_dict'.
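A sketch of step 2.2, assuming the dictionary file has four columns and no header row. The temporary file, the variable codes ('T_FT_2006', 'T_IN_2006') and the question codes are hypothetical placeholders, not real NLSY '79 codes.

```r
library(readr)

# Write a tiny header-less dictionary to a temporary file (made-up contents)
dict_path <- tempfile(fileext = ".csv")
writeLines(c(
  "T_FT_2006,Q_A,HEIGHT FEET,2006",
  "T_IN_2006,Q_B,HEIGHT INCHES,2006"
), dict_path)

# Supplying col_names tells read_csv that the first row is data, not a header
height_dict <- read_csv(
  dict_path,
  col_names = c("Variable_code", "Q_code", "Description", "year"),
  col_types = cols(
    Variable_code = col_character(),
    Q_code        = col_character(),
    Description   = col_character(),
    year          = col_integer()
  )
)
height_dict
```

If the dictionary has already been imported with default names, assigning to 'names(height_dict)' is another way to reach the same result.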

2.3 - Clean up coding of missing values in raw_height_df

Transform the raw height data so that every value coded as missing (the negative codes -5 to -1) is replaced with 'NA'. Store the updated result back in 'raw_height_df'.
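A minimal sketch of step 2.3, assuming the missing-value codes are exactly -5 to -1 as described in the note above. The column names other than 'R0000100' are hypothetical placeholders.

```r
library(tibble)
library(dplyr)

# Toy stand-in for raw_height_df (placeholder variable codes, invented values)
raw_height_df <- tibble(
  R0000100  = c(1, 2, 3),
  T_FT_2006 = c(5, -4, 6),
  T_IN_2006 = c(7, -4, -1)
)

# Replace the codes -5..-1 with NA in every column except the respondent ID
raw_height_df <- raw_height_df %>%
  mutate(across(-R0000100, ~ replace(.x, .x %in% -5:-1, NA)))
raw_height_df
```

Using '%in%' rather than '.x < 0' keeps the replacement limited to the documented codes, so any other negative value would remain visible for later inspection.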

3 - Tidy the data

The data dictionary indicates that for these years, the height of an individual was recorded with two variables, one for feet and one for inches. Inspection of a few random rows in 'raw_height_df' shows that the feet variables are normally 4, 5, or 6, and the inches variables are integers between 0 and 11. Also, the height in feet readings are coded with different variables for different years -- this data is untidy.

To transform the data into the form that matches the early height data we need to

* tidy the height in feet variables;

* tidy the height in inches variables;

* combine these to compute a single height variable;

* perform any remaining clean-up of outliers.

3.1 - Separate height in feet and height in inches

Tidying the data on both feet and inch readings will go most smoothly if we create separate tibbles for each class of measurements. After tidying the data we will re-combine (join) them into a single tibble.

Use the dictionary to identify the variable codes that correspond to height in feet readings. Create a tibble 'raw_ht_feet_df' that contains the data on the height in feet variables and the variable for CASEID (R0000100). Repeat this process to also create 'raw_ht_inches_df'.
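One way step 3.1 might look, assuming the 'Description' column of the dictionary contains the words "FEET" or "INCHES". That wording, and the variable codes, are assumptions made for this sketch; check them against your own dictionary.

```r
library(tibble)
library(dplyr)

# Toy dictionary and raw data (placeholder codes, invented values)
height_dict <- tibble(
  Variable_code = c("T_FT_2006", "T_IN_2006", "T_FT_2008", "T_IN_2008"),
  Description   = c("HEIGHT FEET", "HEIGHT INCHES", "HEIGHT FEET", "HEIGHT INCHES"),
  year          = c(2006L, 2006L, 2008L, 2008L)
)
raw_height_df <- tibble(
  R0000100  = c(1, 2),
  T_FT_2006 = c(5, 6), T_IN_2006 = c(7, 1),
  T_FT_2008 = c(5, 6), T_IN_2008 = c(8, 1)
)

# Look up the variable codes for each class of measurement
feet_codes   <- height_dict$Variable_code[grepl("FEET", height_dict$Description)]
inches_codes <- height_dict$Variable_code[grepl("INCHES", height_dict$Description)]

# Keep CASEID (R0000100) plus the matching columns
raw_ht_feet_df   <- select(raw_height_df, R0000100, all_of(feet_codes))
raw_ht_inches_df <- select(raw_height_df, R0000100, all_of(inches_codes))
```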

3.2 - Tidy the height in feet variables

We now have a tibble containing exclusively the data on height in feet. Use tidyverse's 'gather' to create a tibble, 'tidy_height_feet', with all values of height in feet given in the variable 'height_feet', and the variable codes in the variable 'feet_variables'.
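A short sketch of step 3.2 on made-up data with placeholder variable codes. The same call, with the names changed, would apply to the inches tibble in step 3.3.

```r
library(tibble)
library(tidyr)

# Toy stand-in for raw_ht_feet_df (placeholder codes, invented values)
raw_ht_feet_df <- tibble(
  R0000100  = c(1, 2),
  T_FT_2006 = c(5, 6),
  T_FT_2008 = c(5, NA)
)

# Stack every feet column into key/value pairs, keeping R0000100 as the ID
tidy_height_feet <- gather(
  raw_ht_feet_df,
  key   = "feet_variables",
  value = "height_feet",
  -R0000100
)
tidy_height_feet
```

'gather' is superseded by 'pivot_longer' in current tidyr, but it still works and is what the assignment asks for.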

3.3 - Tidy the height in inches data

We have a tibble containing exclusively the data on height-inches. Use tidyverse's 'gather' to create a tibble, 'tidy_height_inches', with all values of height in inches given in the variable 'height_inches', and the variable codes in the variable 'inches_variables'.

3.4 - Associate height in feet records with a year

In 'tidy_height_feet', for each individual there are five records, one corresponding to each variable. Each variable represents height in feet for a particular year. In order to associate height in feet and height in inches readings we need to identify the correct year for each. In this step, use the dictionary tibble to add a variable 'year' to 'tidy_height_feet', creating 'tidy_height_feet2' that identifies the year corresponding to each variable. Then, remove the 'feet_variables' variable from 'tidy_height_feet2'. 'tidy_height_feet2' should have the columns 'R0000100', 'height_feet' and 'year'.
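Step 3.4 can be sketched as a join against the dictionary. The data and variable codes below are invented placeholders; the column names follow the ones given in the assignment.

```r
library(tibble)
library(dplyr)

# Toy inputs (placeholder codes, invented values)
height_dict <- tibble(
  Variable_code = c("T_FT_2006", "T_FT_2008"),
  year          = c(2006L, 2008L)
)
tidy_height_feet <- tibble(
  R0000100       = c(1, 2, 1, 2),
  feet_variables = c("T_FT_2006", "T_FT_2006", "T_FT_2008", "T_FT_2008"),
  height_feet    = c(5, 6, 5, NA)
)

# Look up the year for each variable code, then drop the code column
tidy_height_feet2 <- tidy_height_feet %>%
  left_join(
    select(height_dict, Variable_code, year),
    by = c("feet_variables" = "Variable_code")
  ) %>%
  select(R0000100, height_feet, year)
tidy_height_feet2
```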

3.5 - Associate height in inches records with a year

Repeat the preceding work for 'tidy_height_inches' to produce 'tidy_height_inches2' that has each record associated with an individual and a year. 'tidy_height_inches2' should also have 3 variables, 'R0000100', 'height_inches', 'year'.

3.6 - Join feet and inch data

Now, you have tidy data frames for the height-feet readings and the height-inches readings for each individual and each survey year. Join these into a single tibble called 'tidy_height_data'. It should have the columns, 'R0000100', 'height_feet', 'height_inches', 'year'.
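A minimal sketch of step 3.6 on made-up data. It uses 'full_join' as one reasonable choice; when both tibbles contain the same respondent-year pairs, an inner or left join gives the same rows.

```r
library(tibble)
library(dplyr)

# Toy inputs (invented values)
tidy_height_feet2 <- tibble(
  R0000100    = c(1, 2),
  height_feet = c(5, 6),
  year        = c(2006L, 2006L)
)
tidy_height_inches2 <- tibble(
  R0000100      = c(1, 2),
  height_inches = c(7, 1),
  year          = c(2006L, 2006L)
)

# Match feet and inches readings on respondent and survey year
tidy_height_data <- tidy_height_feet2 %>%
  full_join(tidy_height_inches2, by = c("R0000100", "year")) %>%
  select(R0000100, height_feet, height_inches, year)
tidy_height_data
```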

4 - Identify outliers and suspect values

Take a breath and congratulate yourself. Having the data in this tidy form is a big accomplishment. From here it's largely a matter of cleaning up outliers. Also, we need to calculate height from the feet and inches components, but that's simple, if the data doesn't have outliers.

Explore the feet and inch data for unexpected values

Our search for outliers reduces to inspecting the feet and inch variables for unexpected values. We already replaced negative numbers by 'NA'. The remaining values are intended to represent real data; we need to judge if they do.

4.1 - Inspect values of the height_feet variable

For inspecting these variables we don't need to separate by year, but just look for weird values anywhere. For the feet variable, there should only be 3 or 4 distinct non-NA values.

For a detailed inspection, compute and print out all unique values of 'height_feet'.

4.2 - Inspect values of the height_inches variable

The 'height_inches' variable should have values 0 to 11. As with feet, compute and print out all unique values of the variable.
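The inspections in 4.1 and 4.2 can be sketched as follows, on invented data. 'count' is an optional extra that also shows how often each value occurs.

```r
library(tibble)
library(dplyr)

# Toy stand-in for tidy_height_data (invented values)
tidy_height_data <- tibble(
  R0000100      = c(1, 2, 3, 4, 5),
  height_feet   = c(5, 6, 0, NA, 5),
  height_inches = c(7, 1, 67, NA, 14),
  year          = rep(2006L, 5)
)

# All distinct values, with NA listed last
feet_values   <- sort(unique(tidy_height_data$height_feet), na.last = TRUE)
inches_values <- sort(unique(tidy_height_data$height_inches), na.last = TRUE)
feet_values
inches_values

# Frequency of each inches value
count(tidy_height_data, height_inches)
```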

5 - Isolation and clean-up of the suspect values

You should have discovered in the preceding sections that there are some values of these variables that don't match our expectations. For some of the unexpected values it may still be possible to compute the final height variable. Others we'll judge to be erroneously recorded, and replace with 'NA'. In order to decide that, you should inspect the full records that contain suspect values.

In this section you should:

5.1 - Define clear criteria

Define clear criteria that identify values of 'height_feet' and 'height_inches' as suspect. Give brief explanations for your decisions; no R code required.

5.2 - Apply criteria to create two new confident and suspect tibbles

Apply your criteria to filter 'tidy_height_data' into two new tibbles: 'suspect_height_data', containing the records with suspect values, and 'confident_height_data' containing the values that aren't suspect. Include in 'confident_height_data' the records for which both 'height_feet' and 'height_inches' are 'NA' (since we are confident we can compute the right value of height for these -- it's 'NA'.)
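A sketch of how step 5.2 could be carried out once criteria are chosen. The criteria used here (feet in 4 to 6, inches in 0 to 11) are only an illustrative assumption taken from the "normal" ranges mentioned earlier; your own criteria and reasoning may differ.

```r
library(tibble)
library(dplyr)

# Toy stand-in for tidy_height_data (invented values)
tidy_height_data <- tibble(
  R0000100      = c(1, 2, 3, 4, 5),
  height_feet   = c(5, NA, 0, 5, 6),
  height_inches = c(7, NA, 67, NA, 14),
  year          = rep(2006L, 5)
)

# Flag each record; %in% never returns NA, so the flag is always TRUE or FALSE
flagged <- tidy_height_data %>%
  mutate(
    confident =
      (height_feet %in% 4:6 & height_inches %in% 0:11) |   # assumed "normal" ranges
      (is.na(height_feet) & is.na(height_inches))          # both missing: height is NA
  )

confident_height_data <- flagged %>% filter(confident) %>% select(-confident)
suspect_height_data   <- flagged %>% filter(!confident) %>% select(-confident)
```

Building a single flag column and filtering on it and its negation guarantees that every record lands in exactly one of the two tibbles.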

5.3 - Define the height for confident records

For the records that you are confident are legitimate, use the existing variables to add to 'confident_height_data' a new variable, 'height', which represents the full height of a person in inches. Call the new tibble 'height_data_late_part1'. Further transform 'height_data_late_part1' so that the variables are 'CASEID', 'year' and 'height', and each is an integer vector, as in 'height_data_early'.
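Step 5.3 amounts to computing 12 * feet + inches and converting the columns to integers; for example, 5 feet 7 inches gives 12 * 5 + 7 = 67 inches. A minimal sketch on invented data:

```r
library(tibble)
library(dplyr)

# Toy stand-in for confident_height_data (invented values)
confident_height_data <- tibble(
  R0000100      = c(1, 2),
  height_feet   = c(5, NA),
  height_inches = c(7, NA),
  year          = c(2006, 2006)
)

# transmute keeps only the columns it creates, so this both renames and drops
height_data_late_part1 <- confident_height_data %>%
  transmute(
    CASEID = as.integer(R0000100),
    year   = as.integer(year),
    height = as.integer(12 * height_feet + height_inches)  # NA propagates
  )
height_data_late_part1
```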

5.4 - Make a decision on suspect records

Your selection process should have identified a moderate number of suspect records. Here, I'd like you to inspect these records, and identify any for which you believe you can confidently calculate the height. Write at least 2 criteria for identifying these records and computing their height. Records that do not satisfy these criteria are considered ERRONEOUS. For the latter, the 'height' variable will be 'NA'.

To carry out this inspection, it's helpful to view the entire 'suspect_height_data' tibble. I suggest clicking on this tibble in RStudio's Environment pane to load this into the spreadsheet-like viewer in RStudio's source pane.

Note: It is not 100% clear what the criteria should be. There is one pattern of values that some may say are suspect and others may believe are legitimate. Just state clearly what your criteria are and your reasoning behind them.

5.5 - Act on your criteria to define height for some suspect records

Your goal for this step is to create the tibble 'height_data_late_part2', which has the same variables as 'height_data_late_part1', with 'height' defined, possibly as 'NA', for each record in 'suspect_height_data'.

To carry this out, further filter 'suspect_height_data' into separate tibbles containing the records satisfying each of the criteria defined above for determining the height. Name the tibbles df1, df2, etc., to correspond with your criteria. Apply your criteria to each tibble to define height for these records. After you also select and rename the variables in the tibbles to match those desired, create a tibble 'height_data_late_part2' containing every record from the new tibbles. To carry out this last step, use the 'bind_rows' function.

[To combine tibbles 'df1' and 'df2' with 'bind_rows', execute 'bind_rows(df1, df2)'.]
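The mechanics of step 5.5 can be sketched as below. The single rule shown (feet recorded as 0 while inches holds a value between 48 and 84, read as the total height in inches) is a purely hypothetical example, not a recommended criterion, and 'df_erroneous' is a name introduced here for the records that match no rule.

```r
library(tibble)
library(dplyr)

# Toy stand-in for suspect_height_data (invented values)
suspect_height_data <- tibble(
  R0000100      = c(3, 4, 5),
  height_feet   = c(0, 5, 6),
  height_inches = c(67, NA, 14),
  year          = c(2006, 2006, 2006)
)

# Hypothetical rule 1: feet is 0 and inches looks like a total height
is_rule1 <- with(suspect_height_data,
                 height_feet %in% 0 & height_inches %in% 48:84)

# Records matching rule 1: take inches as the full height
df1 <- suspect_height_data %>%
  filter(is_rule1) %>%
  transmute(CASEID = as.integer(R0000100),
            year   = as.integer(year),
            height = as.integer(height_inches))

# Everything else is treated as erroneous: height is NA
df_erroneous <- suspect_height_data %>%
  filter(!is_rule1) %>%
  transmute(CASEID = as.integer(R0000100),
            year   = as.integer(year),
            height = NA_integer_)

height_data_late_part2 <- bind_rows(df1, df_erroneous)
height_data_late_part2
```

With two or more real criteria, each gets its own tibble (df1, df2, and so on), and all of them are passed to 'bind_rows' together with the erroneous remainder.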

5.6 - Combine the two parts of the late height data

You defined height data for two classes of records, giving 'height_data_late_part1' and 'height_data_late_part2'. Bind these different tibbles into the single tibble, 'height_data_late'. Check that 'height_data_late' has the desired variables, variable names and variable modes.
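A short sketch of step 5.6, including one possible way to check names and modes. The two input tibbles are invented stand-ins for the results of the earlier steps.

```r
library(tibble)
library(dplyr)

# Toy stand-ins for the two parts (invented values)
height_data_late_part1 <- tibble(CASEID = c(1L, 2L), year = c(2006L, 2006L), height = c(67L, NA))
height_data_late_part2 <- tibble(CASEID = 3L, year = 2006L, height = 67L)

# Stack the two parts into the final tibble
height_data_late <- bind_rows(height_data_late_part1, height_data_late_part2)

# Check variable names and that every column is an integer vector
names_ok <- identical(names(height_data_late), c("CASEID", "year", "height"))
modes_ok <- all(vapply(height_data_late, is.integer, logical(1)))
names_ok
modes_ok
```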

6.1 - Explore the final heights

The last year of the survey is 2014. Plot a histogram of the height data you computed for 2014, with binwidth of your choice. The histogram should be similar to the one you plotted for the early data.

Attachment:- Assignment Files.rar


Reviews

len2388436

10/17/2019 2:54:41 AM

Assignment needs to be done in tidyverse. Please find here attached the zip file with the instructions for the assignment (R_Assignment). The assignment requires some Rmd and csv files that are included in Project1data.zip and height_index.csv.
