Part A
Answer all of these questions using R Commander. Include all graphs and outputs from R Commander.
This assignment uses the dataset called "Anscombe", which contains data on four variables for each state in the USA.
Data in attached packages -> Anscombe
The dataset contains:
education - Per-capita education expenditure.
income - Per-capita income.
young - Number of people under 18 years old per 1000 people.
urban - Number of urban dwellers per 1000 people for each state in the USA.
Of the four variables, you will need to investigate two.
To randomize which variables each student uses, let x be the last digit of your student ID:
If x = 0 or 1, use education and income
If x = 2 or 3, use education and young
If x = 4 or 5, use education and urban
If x = 6 or 7, use income and young
If x = 8, use income and urban
If x = 9, use young and urban
- Write a brief description of the data you're working with.
- Produce a histogram of both variables, and describe both distributions.
- Using R Commander, test which of the two variables is closer to being normally distributed. Justify your choice.
- For the variable that is closer to normal, assume the data are taken from a normally distributed population and answer the following questions:
a. Find the mean and standard deviation.
b. Find the 80th percentile. Explain in words what this value means.
c. What is the probability that a randomly selected member of the population is within 17 units of the mean? Use R Commander to plot the normal distribution and use vertical lines to show the region of interest.
d. What is the z score of the point 33 units above the mean? Interpret the meaning of this value.
- Make a scatterplot showing the relationship between the two variables. Describe the relationship. Relate this back to the meaning of the data.
- Produce r, the correlation coefficient. Explain in words what this value shows.
- Find the other correlation coefficients for the relationships between the other variables in Anscombe. Compare and contrast with the correlation coefficient found for your data, and relate these back to the meaning of the data.
- Produce r² (the coefficient of determination). Explain in words what this value shows.
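As a way of sanity-checking R Commander output by hand, the calculations in questions a to d and the two correlation questions can be sketched as below. This is a minimal Python sketch and not part of R Commander. The data values and the names `x_hyp`, `y_hyp`, and `pearson_r` are hypothetical and are not taken from Anscombe. It assumes the chosen variable is modelled as normal, with the sample mean and sample standard deviation used as its parameters.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical paired observations (NOT from the Anscombe dataset).
x_hyp = [150.0, 180.0, 200.0, 210.0, 240.0, 260.0]
y_hyp = [2500.0, 2900.0, 3100.0, 3000.0, 3600.0, 3900.0]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)


# (a) Sample mean and sample standard deviation (n - 1 denominator).
m = mean(x_hyp)
s = stdev(x_hyp)

# Assumption: treat the variable as normal with these parameters.
model = NormalDist(mu=m, sigma=s)

# (b) 80th percentile: the value below which 80% of the population lies.
p80 = model.inv_cdf(0.80)

# (c) Probability of falling within 17 units of the mean.
within_17 = model.cdf(m + 17) - model.cdf(m - 17)

# (d) z score of the point 33 units above the mean.
z_33 = ((m + 33) - m) / s

# Correlation coefficient r and coefficient of determination r squared.
r = pearson_r(x_hyp, y_hyp)
r_squared = r ** 2

print(f"mean={m:.2f}, sd={s:.2f}")
print(f"80th percentile={p80:.2f}")
print(f"P(within 17 of mean)={within_17:.4f}")
print(f"z score 33 above mean={z_33:.3f}")
print(f"r={r:.4f}, r^2={r_squared:.4f}")
```

The z score in (d) depends only on the distance from the mean and the standard deviation, which is why the mean cancels out. With the real data, the figures printed here should be replaced by the R Commander output for your assigned pair of variables.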
Part B
To gather the above information, a survey was conducted in each state. Four districts were randomly selected from a list of all the districts in the state, and phone directories were used to select and survey 1000 random people from each district.
- What is the name for this kind of sampling?
- What were the sampling frames used here? Could better sampling frames have been used? If you answered "yes", suggest an example of a better sampling frame.
- What is the difference between a sample and a census? Why do you think a census was not used in this instance?
- List three possible sources of sampling bias for this sampling method. Justify each one.