Reference no: EM132665397
Nature of Data / Statistics for Data Science
Enter your answers to the assignment in this notebook.
Place the markdown text and R code for each part in cells directly after that part of the question, together with the executed results, so that we can independently verify them. Do not leave the code out.
Question 1
Greenhouse.csv contains the photosynthetic performance of ten plants in two environments in a greenhouse (shady and sunny).
a) Plot the data in an appropriate graph or graphs to get a good visualisation of the performance.
b) Calculate the mean performance in the sun and in the shade.
The hypothesis is that the average performance in the sunny environment differs from that in the shady environment.
c) Write down the null and alternative hypotheses. Then use an appropriate statistic to determine, at the p < 0.04 significance level, whether the data is consistent with the hypothesis. Can we accept the alternative hypothesis?
A new hypothesis is proposed that the performance in the sun is better.
d) Reformulate the null and alternative hypotheses, and verify again as in c).
Somebody looked at the above data analysis and said it was an inefficient way to do it (they said it was "stupid"), as important information was neglected. This person was right.
e) What is this missing information? Redo the analysis incorporating this information, using an appropriate statistic, and calculate a p-value based on this statistic.
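As a rough illustration of the kinds of tests parts c) to e) refer to, the sketch below runs a two-sided test, a one-sided test, and a paired test in R. It is not the expected answer. The data is synthetic, the vector names `shady` and `sunny` are hypothetical (the real column names in Greenhouse.csv are not given here), and the paired test assumes the same ten plants were measured in both environments, which is only one possible reading of the question.

```r
# Synthetic stand-in for Greenhouse.csv. The names `shady` and `sunny` and all
# values are assumptions made for illustration only.
set.seed(1)
shady <- rnorm(10, mean = 20, sd = 3)
sunny <- shady + rnorm(10, mean = 1.5, sd = 1)  # same plants, shifted

alpha <- 0.04  # significance level stated in part c)

# c) Two-sided Welch t-test. H0: mu_sunny = mu_shady, H1: mu_sunny != mu_shady
two_sided <- t.test(sunny, shady, alternative = "two.sided")

# d) One-sided Welch t-test. H0: mu_sunny <= mu_shady, H1: mu_sunny > mu_shady
one_sided <- t.test(sunny, shady, alternative = "greater")

# e) Paired t-test, assuming each plant appears once in each environment
paired <- t.test(sunny, shady, alternative = "greater", paired = TRUE)

# Collect the p-values and compare each one with alpha
p_values <- c(two_sided = two_sided$p.value,
              one_sided = one_sided$p.value,
              paired    = paired$p.value)
print(p_values)
print(p_values < alpha)
```

With real data, the vectors would instead come from something like `read.csv("Greenhouse.csv")`, and the choice between a t-test and a non-parametric alternative would depend on what the plots in part a) show.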
Question 2
The "height" variable in countryheight.csv is a sample from the population of country A. We want to estimate the population mean, and try to say something in general about the height distribution.
a) Calculate the sample mean and the sample median of the height variable. What does the relationship between the values of the sample mean and sample median suggest?
b) Calculate a 95% confidence interval on the population mean using bootstrapping.
c) Calculate a 95% confidence interval on the population mean using the normal approximation.
The data scientist Jane believes the population might be consistent with a normal distribution.
d) Create an appropriate plot to test Jane's hypothesis.
e) Does the data agree with her hypothesis? Why/why not?
Jane got more height data -- this time a sample from country B. The measurements are in the variable "height2".
From previous height studies, it is believed that people in country B are, on average, taller than those in country A.
f) Formulate the null hypothesis and alternative hypothesis for this belief.
g) Use a test statistic to determine if the null hypothesis can be rejected, and calculate the p-value.
h) Can we conclude that the (population) mean height in country B is statistically significantly different from that in country A? Justify your answer.
i) Suggest one improvement to this test to improve the quality of the possible conclusions, explaining why it would help.
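The two interval constructions in parts b) and c) and the comparison in part g) could be sketched in R as follows. This is a minimal sketch under stated assumptions, not a model answer. Both samples are synthetic stand-ins for `height` and `height2`, the number of bootstrap resamples `B` is an arbitrary choice, the bootstrap interval uses the simple percentile method, and Welch's t-test is only one of several reasonable test statistics.

```r
# Synthetic stand-ins for the two samples. Sizes, means, and spreads are
# assumptions made for illustration only.
set.seed(42)
height  <- rnorm(200, mean = 170, sd = 8)  # country A
height2 <- rnorm(150, mean = 172, sd = 8)  # country B

# b) Percentile bootstrap: resample with replacement, recompute the mean,
#    and take the 2.5% and 97.5% quantiles of the resampled means
B <- 5000
boot_means <- replicate(B, mean(sample(height, replace = TRUE)))
ci_boot <- quantile(boot_means, c(0.025, 0.975))

# c) Normal approximation: mean +/- z * standard error
se <- sd(height) / sqrt(length(height))
ci_norm <- mean(height) + c(-1, 1) * qnorm(0.975) * se

print(ci_boot)
print(ci_norm)

# g) One-sided Welch t-test. H0: mu_B <= mu_A, H1: mu_B > mu_A
res <- t.test(height2, height, alternative = "greater")
print(res$p.value)
```

On well-behaved data the two intervals should be close to each other. A large gap between them would itself be a hint that the normal approximation is questionable.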
Question 3
Consider the following data set of drivers who died in collision with a train, and the amount of crude oil exported from Norway to the USA, for the years 1999 to 2009.
(a) Plot the most appropriate graph to determine if the two variables are correlated.
(b) Run the best test to determine linear correlation, together with calculated 95% confidence intervals.
(c) Can you conclude the two variables are correlated? Explain why you think this is or is not the case.
(d) Perform a least-squares fit, plotting the original data points plus the appropriate line on the same graph.
(e) Looking at the line, do you think this fit is a good explanation? Please give reasons for your choice. Then, using a test given already in the course, plot a graph to demonstrate whether this is indeed a good fit.
(f) Can we conclude that the number of drivers who died as a result of a train collision affects the amount of oil exported into the USA from Norway? Explain your answer.
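One way the mechanics of parts (b), (d), and (e) might look in R is sketched below. The numbers are synthetic and do not reproduce the real series. The names `oil` and `deaths` and their units are hypothetical, the choice of `deaths` as the response variable is arbitrary, and the residual plot is just one common diagnostic, which may or may not be the test the course has in mind.

```r
# Synthetic stand-in for the eleven yearly observations (1999 to 2009).
# All values and variable names are assumptions made for illustration only.
set.seed(7)
year   <- 1999:2009
oil    <- seq(100, 50, length.out = 11) + rnorm(11, sd = 5)
deaths <- round(40 + 0.5 * oil + rnorm(11, sd = 4))

# (b) Pearson correlation test. The result includes a 95% confidence
#     interval for the correlation coefficient by default
ct <- cor.test(oil, deaths, method = "pearson")
print(ct$estimate)
print(ct$conf.int)

# (d) Least-squares fit, with the data points and fitted line on one graph
fit <- lm(deaths ~ oil)
plot(oil, deaths, xlab = "Oil exported (hypothetical units)",
     ylab = "Driver deaths")
abline(fit)

# (e) Residuals against fitted values. A patternless scatter around zero
#     is what a reasonable linear fit would show
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

None of these steps says anything about causation. A strong correlation and a clean residual plot are both compatible with two unrelated series that happen to trend in the same direction over the same years.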
Attachment: Statistics for Data Science.rar