Lab - Data Cleaning/Preparation and Visualization
Stats 10: Introduction to Statistical Reasoning
Objectives -
1. Understand logical statements and subsetting.
2. Reinforce knowledge of visualization techniques.
Exercise 1 - We will be working with lead and copper data obtained from the residents of Flint, Michigan from January-February, 2017. Data are reported in PPB (parts per billion, or µg/L) from each residential testing kit. Remember that "Pb" denotes lead, and "Cu" denotes copper.
a. Download the data from CCLE and read it into R. When you read in the data, name your object "flint".
b. The EPA states a water source is especially dangerous if the lead level is 15 PPB or greater. What proportion of the locations tested were found to have dangerous lead levels?
c. Report the mean copper level for only test sites in the North region.
d. Report the mean copper level for only test sites with dangerous lead levels (at least 15 PPB).
e. Report the mean lead and copper levels.
f. Create a box plot with a good title for the lead levels.
g. Based on what you see in part (f), does the mean seem to be a good measure of center for the data? Report a more useful statistic for this data.
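The logical-statement and subsetting ideas in this exercise can be sketched on a small stand-in data frame. This is only an illustration under stated assumptions: the object flint_toy, its values, and the column names Pb, Cu, and Region are invented here and may not match the real file from CCLE.

```r
# Toy stand-in for the Flint data. The values and the column names
# (Pb, Cu, Region) are assumptions made for illustration only.
flint_toy <- data.frame(
  Pb     = c(2, 0, 18, 5, 40, 1),
  Cu     = c(50, 10, 120, 30, 200, 20),
  Region = c("North", "South", "North", "East", "South", "North")
)

# A logical statement returns TRUE or FALSE for every row.
dangerous <- flint_toy$Pb >= 15

# The mean of a logical vector is the proportion of TRUE values.
prop_dangerous <- mean(dangerous)

# Subsetting: keep only the Cu values where a condition holds.
cu_north     <- mean(flint_toy$Cu[flint_toy$Region == "North"])
cu_dangerous <- mean(flint_toy$Cu[dangerous])

# Box plot with a title, plus two candidate measures of center to compare.
boxplot(flint_toy$Pb, main = "Lead levels (toy data)", ylab = "Pb (PPB)")
pb_mean   <- mean(flint_toy$Pb)
pb_median <- median(flint_toy$Pb)
```

The same pattern should carry over to the real object flint once its actual column names are checked, for example with names(flint).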
Exercise 2 - The data here represent life expectancies (Life) and per capita income (Income) in 1974 dollars for 101 countries in the early 1970s. The source of these data is: Leinhardt and Wasserman (1979), New York Times (September 28, 1975, p. E-3). They also appear in Regression Analysis by Ashish Sen and Muni Srivastava. You can access these data in R using: life <- read.table
a. Construct a scatterplot of Life against Income. Note: Income should be on the horizontal axis. How does income appear to affect life expectancy?
b. Construct the boxplot and histogram of Income. Are there any outliers?
c. Split the data set into two parts: one for which the Income is strictly below $1000, and one for which the Income is at least $1000. Come up with your own names for these two objects.
d. Use the data for which the Income is below $1000. Plot Life against Income and compute the correlation coefficient. Hint: use the function cor().
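Splitting a data frame with two complementary logical statements, then plotting and correlating one piece, might look like the sketch below. The data frame life_toy and its values are invented, and the object names low_income and high_income are placeholders. Only the column names Life and Income come from the exercise text.

```r
# Toy stand-in for the life-expectancy data; the values are invented.
life_toy <- data.frame(
  Income = c(200, 450, 800, 1000, 2500, 4000),
  Life   = c(45, 52, 60, 66, 70, 72)
)

# "Strictly below 1000" and "at least 1000" are complementary conditions,
# so every row lands in exactly one of the two pieces.
low_income  <- life_toy[life_toy$Income < 1000, ]
high_income <- life_toy[life_toy$Income >= 1000, ]

# Scatterplot with Income on the horizontal axis.
plot(low_income$Income, low_income$Life,
     xlab = "Income (1974 dollars)", ylab = "Life expectancy (years)",
     main = "Life vs. Income, Income < $1000 (toy data)")

# Correlation coefficient for the low-income subset.
r_low <- cor(low_income$Income, low_income$Life)
```

Note the trailing comma inside the square brackets: it selects whole rows, so both columns stay together in each piece.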
Exercise 3 - Use R to access the Maas river data. These data contain the concentrations of lead and zinc in ppm at 155 locations on the banks of the Maas river in the Netherlands. You can read the data in R as follows: maas <- read.table
a. Compute the summary statistics for lead and zinc using the summary() function.
b. Plot two histograms: one of lead and one of log(lead).
c. Plot log(lead) against log(zinc). What do you observe?
d. The level of risk for surface soil based on lead concentration in ppm is given in the table below:
Mean concentration (ppm) - Level of risk
Below 150 - Lead-free
Between 150 and 400 - Lead-safe
Above 400 - Significant environmental lead hazard
Use techniques similar to those from the last lab to give different colors and sizes to the points according to the lead concentration at these 155 locations. You do not need to use the maps package to create a map of the area. Just plot the points without a map.
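One possible way to assign colors and sizes by risk level uses logical statements, as sketched below. This is not necessarily the technique from the previous lab. The data frame maas_toy, its values, and the column names x, y, and lead are assumptions. The table does not say where exactly 150 and 400 belong, so placing both in the middle category is also an assumption.

```r
# Toy stand-in for the Maas river data; values and column names are invented.
maas_toy <- data.frame(
  x    = c(1, 2, 3, 4, 5),
  y    = c(2, 1, 4, 3, 5),
  lead = c(80, 150, 300, 400, 650)
)

# Start every point in the lowest category (below 150 ppm).
risk_col <- rep("green", nrow(maas_toy))
risk_cex <- rep(0.8, nrow(maas_toy))

# Logical statements for the two higher categories.
# Assumption: exactly 150 and exactly 400 both count as "between".
safe   <- maas_toy$lead >= 150 & maas_toy$lead <= 400
hazard <- maas_toy$lead > 400

# Overwrite color and size only where each condition is TRUE.
risk_col[safe]   <- "orange"
risk_cex[safe]   <- 1.4
risk_col[hazard] <- "red"
risk_cex[hazard] <- 2

# Plot the locations without a map; col and cex take one value per point.
plot(maas_toy$x, maas_toy$y, col = risk_col, cex = risk_cex, pch = 19,
     xlab = "x", ylab = "y", main = "Lead risk by location (toy data)")
legend("topleft", legend = c("Lead-free", "Lead-safe", "Hazard"),
       col = c("green", "orange", "red"), pch = 19, pt.cex = c(0.8, 1.4, 2))
```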
Exercise 4 - The data for this exercise represent the approximate centers (given by longitude and latitude) of each of the City of Los Angeles neighborhoods. See also the Los Angeles Times project on the City of Los Angeles neighborhoods. You can access these data at: LA <- read.table
a. Plot the data point locations. Use good formatting for the axes and title. Then add the outline of LA County by typing: map("county", "california", add = TRUE)
b. Do you see any relationship between income and school performance? Hint: Plot the variable Schools against the variable Income and describe what you see. Ignore the data points on the plot for which Schools = 0. Use what you learned about subsetting with logical statements to first create the objects you need for the scatter plot. Then, create the scatter plot. Alternate methods may only receive half credit.
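Dropping rows with a logical statement before plotting, as part (b) asks, could be sketched as follows. The data frame la_toy and its values are invented, and the object names keep, income_ok, and schools_ok are placeholders. Only the column names Schools and Income come from the exercise text.

```r
# Toy stand-in for the LA neighborhoods data; the values are invented.
la_toy <- data.frame(
  Income  = c(30000, 45000, 60000, 80000, 120000),
  Schools = c(650, 0, 760, 0, 900)
)

# Logical statement: TRUE for rows whose Schools value is not 0.
keep <- la_toy$Schools != 0

# Create the objects for the scatterplot first, using the same condition
# for both so the pairs stay matched.
income_ok  <- la_toy$Income[keep]
schools_ok <- la_toy$Schools[keep]

# "Schools against Income": Schools on the vertical axis.
plot(income_ok, schools_ok,
     xlab = "Income", ylab = "Schools",
     main = "Schools vs. Income, Schools = 0 removed (toy data)")
```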