CIS4066-N Statistical Methods for Data Analytics - Teesside University
Assignment - Statistical Methods for Data Analytics ICA
Learning outcomes
Personal and Transferable Skills
1. Use own judgement to select a valid statistical method in the context of the research project.
2. Manage complex data within the application of data analysis software in order to run tests.
Research, Knowledge and Cognitive Skills
3. Analyse complex related and unrelated data sets using a range of methods.
4. Critically analyse and interpret outcomes of statistical tests in order to identify patterns and significance levels.
5. Critically appraise the validity and reliability of the methods available.
Professional Skills
6. Demonstrate an ethical understanding of data analysis and its effect in a wider social context.
Qualitative statistics: thematic analysis
Using the Guardian website, select an article with at least 20 comments. Select a subset of comments (around 20) for the following analysis.
Question 1. Provide a link to the article you have selected. List the comments you have chosen, the key themes you identify, and the number of times each theme occurs in the comments.
Question 2. Describe the themes you identified and the key features that make each theme recognisable (words, metaphors, emotional language, etc.). Discuss your findings: what conclusions can you draw?
Probability and statistics fundamentals
Question 3. Given the event E: "The Covid-19 emergency will finish in 2022", define the following events F1 and F2, both theoretically and giving concrete examples of what they might be:
- an event F1 such that E and F1 are independent
- an event F2 such that E and F2 are dependent
Question 4. Derive with mathematical steps the final formula of Bayes' theorem.
Describe in general terms in what cases the theorem is useful, and propose a concrete scenario where it can be used.
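As a starting point for the derivation, one common route goes through the definition of conditional probability. The sketch below uses generic events A and B (symbols that are not part of the question) and assumes P(A) > 0 and P(B) > 0:

```latex
\begin{align}
P(A \mid B) &= \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0 \\
P(B \mid A) &= \frac{P(A \cap B)}{P(A)}, \qquad P(A) > 0 \\
P(A \cap B) &= P(B \mid A)\, P(A) \\
P(A \mid B) &= \frac{P(B \mid A)\, P(A)}{P(B)}
\end{align}
```

The third line rearranges the second, and substituting it into the first gives the final formula.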
Question 5. Discuss the different scales of measurements. Provide two examples for each scale.
Question 6. There are four medals (Gold, Silver, Bronze and Wood) on a table, but they are all wrapped with dark wrapping paper, such that it is impossible to distinguish them. You would like to find the gold medal.
The game starts as follows. You pick one medal without unwrapping it, and then the game host unwraps one of the remaining medals and reveals that it is a silver medal. (Assume that the host knows where the gold medal is and never unwraps it while it is still on the table, to keep the game interesting to watch until the end; among the medals the host is allowed to unwrap, each is chosen with equal probability.)
You now have three medals left to unwrap (one in your hand, two on the table). At this point, the host gives you the option to change your mind and swap your medal for one of the two left on the table.
What would you do at this point? Would you keep your medal, or swap it with one of the two medals left on the table? If you swap, which one would you choose?
Hints: Find the solution by using Bayes' theorem, calculating all the conditional probabilities involved. Start by calculating the probability of having Gold in our hands given that we know that the host unwraps Silver, P(G|Hs) = ...
Then compare with the probability of having Bronze or Wood in our hands given that we know that the host unwraps Silver, P(B|Hs) = ..., P(W|Hs) = ...
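The calculation suggested in the hints can be checked numerically. The following is a minimal Python sketch, not a model answer: it assumes the first pick is uniform over the four medals and that the host chooses uniformly among the table medals that are not gold. The variable names are illustrative only.

```python
from fractions import Fraction

# Prior: each medal is equally likely to be the one in our hand.
prior = {m: Fraction(1, 4) for m in ("G", "S", "B", "W")}

# P(Hs | medal in hand): probability that the host unwraps Silver.
likelihood_hs = {
    "G": Fraction(1, 3),  # S, B, W on the table, host may unwrap any of them
    "S": Fraction(0),     # Silver is in our hand, so it cannot be unwrapped
    "B": Fraction(1, 2),  # G, S, W on the table, host avoids Gold
    "W": Fraction(1, 2),  # G, S, B on the table, host avoids Gold
}

# Total probability of the evidence, then Bayes' theorem for each hypothesis.
p_hs = sum(prior[m] * likelihood_hs[m] for m in prior)
posterior = {m: prior[m] * likelihood_hs[m] / p_hs for m in prior}

# Keeping wins only if we already hold Gold. Otherwise Gold is one of the
# two wrapped medals on the table, each equally likely by symmetry.
p_win_keep = posterior["G"]
p_win_swap = (1 - posterior["G"]) / 2

for medal, p in posterior.items():
    print(f"P({medal}|Hs) = {p}")
print("keep:", p_win_keep, "swap to either table medal:", p_win_swap)
```

Exact fractions are used so that the posteriors can be compared directly with a hand calculation.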
Question 7. In June 2021, during the vaccine rollout for the Covid-19 emergency, it was estimated that 90% of the population over 50 years old were fully vaccinated, while only 6% were completely unvaccinated. (The remaining 4% had only one dose or had an unknown vaccination status, and therefore will not be considered here.)
A Public Health England report on cases and hospitalisation from the "delta" variant (originally sequenced in India) was published at the end of June 2021. The report showed that, between February and June 2021, among the 418 people admitted to the hospital with the "delta" variant:
- 163 were fully vaccinated
- 136 were not vaccinated
- The remaining people had only one dose or an unknown vaccination status and will not be considered here
One may therefore wrongly conclude that a fully vaccinated person is surprisingly more likely to be hospitalised than an unvaccinated person. Using Bayes' theorem to calculate the relevant probabilities from the data above, prove that this claim is wrong. Show that this data actually proves that vaccines are extremely effective at reducing the risk of hospitalisation after contracting the "delta" variant.
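One way to set up the comparison is sketched below in Python. It rests on a simplifying assumption that the question leaves implicit: the population shares (90% and 6%) and the hospital admissions describe comparable groups. The unknown overall hospitalisation probability P(H) cancels when the two conditional probabilities are divided, so only their ratio is computed. The variable names are illustrative.

```python
# Population shares among the over-50s, as given in the question.
p_vacc = 0.90
p_unvacc = 0.06

# Shares of the 418 admissions, as given in the question.
admissions = 418
p_vacc_given_hosp = 163 / admissions
p_unvacc_given_hosp = 136 / admissions

# Bayes' theorem: P(H|V) = P(V|H) * P(H) / P(V), and likewise for U.
# P(H) is unknown but appears in both expressions, so it cancels in the ratio.
risk_vacc_over_ph = p_vacc_given_hosp / p_vacc        # P(H|V) / P(H)
risk_unvacc_over_ph = p_unvacc_given_hosp / p_unvacc  # P(H|U) / P(H)

relative_risk = risk_unvacc_over_ph / risk_vacc_over_ph
print(f"P(H|U) / P(H|V) = {relative_risk:.2f}")
```

Under these assumptions the ratio comes out at roughly 12.5, meaning an unvaccinated person is far more likely to be hospitalised than a fully vaccinated one, despite the larger raw count of vaccinated admissions.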
Central tendency and variability
Question 8. The 2010 salaries of the White House staff are provided in the table "2010_White_House_Staff.xlsx"
Perform a pipeline of descriptive statistical methods in R, including central tendency and variability measures, to describe, interpret and discuss the dataset.
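The assignment asks for R, but the shape of such a pipeline can be sketched in any language. The Python example below uses a small hypothetical list of salaries as a placeholder (the real values must be read from the spreadsheet, whose column names are not given here) and computes the usual central tendency and variability measures with the standard library.

```python
from statistics import mean, median, mode, quantiles, stdev

# Hypothetical placeholder values, NOT taken from 2010_White_House_Staff.xlsx.
salaries = [42000, 50000, 50000, 65000, 80000, 120000, 170000]

# Quartiles (default "exclusive" method of statistics.quantiles).
q1, q2, q3 = quantiles(salaries, n=4)

summary = {
    "n": len(salaries),
    "mean": mean(salaries),
    "median": median(salaries),
    "mode": mode(salaries),
    "range": max(salaries) - min(salaries),
    "iqr": q3 - q1,
    "sample_sd": stdev(salaries),  # sample standard deviation (n - 1)
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

In R, functions such as mean(), median(), sd(), IQR(), quantile() and summary() cover the same measures. A mean well above the median, as in this placeholder data, is one sign of a right-skewed distribution.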
Statistical tests
Question 9. A variable X follows a normal distribution with mean 1.5 and standard deviation 2. Calculate the probability P(X < 0).
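A hand calculation here standardises the value to a z-score and looks it up in a normal table. The result can be checked with a short Python sketch using only the standard library:

```python
from statistics import NormalDist

mu, sigma = 1.5, 2.0  # parameters given in the question

# Standardise: z = (x - mu) / sigma for x = 0.
z = (0 - mu) / sigma

# P(X < 0) from the normal cumulative distribution function.
p = NormalDist(mu=mu, sigma=sigma).cdf(0)

print(f"z = {z}, P(X < 0) = {p:.4f}")
```

The same probability is given by the standard normal CDF evaluated at z, which is what a printed z-table provides.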
Question 10. You would like to test whether a herb works for the treatment of insomnia. 100 people volunteered to take part in the study.
- Design how you would carry out the experiment, what tasks the participants should perform, and define what could be a null and alternative hypothesis in this case.
- Describe what could be an error of type I or type II, how they are defined theoretically and what they represent in this case.
Question 11. A company has produced a batch of 1000 CPUs whose clock speeds follow a normal distribution centred around 2.1 GHz, with a standard deviation of 0.4 GHz. The company is trying new approaches to manufacture CPUs, therefore 20 of these CPUs were produced with an additional new experimental feature. The clock speed of these 20 experimental CPUs is as follows (in GHz):
2.6, 1.9, 2.9, 2.3, 1.5,
1.9, 1.9, 1.8, 2.5, 2.1,
2.3, 1.7, 1.8, 2.4, 2.2,
1.9, 2.9, 3.3, 1.8, 2.1.
Design and perform a statistical test to check whether the difference in clock speed for this new experimental technology is statistically significant, or whether the difference is just due to chance.
Hints: use a one-sample t-test, specifying the assumptions and finding the p-value. What additional assumption would be needed to use the z-test instead?
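The test statistics can be sketched in Python with the standard library alone. This is an illustration under stated assumptions, not a complete answer: it tests H0 "mean = 2.1 GHz" two-sided at the 5% level, and the critical value 2.093 for 19 degrees of freedom is taken from standard t-tables rather than computed. An exact t-test p-value needs a statistics package (for example t.test in R). The z-statistic is included only for comparison, and assumes the batch standard deviation of 0.4 GHz also applies to the experimental CPUs.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

speeds = [2.6, 1.9, 2.9, 2.3, 1.5,
          1.9, 1.9, 1.8, 2.5, 2.1,
          2.3, 1.7, 1.8, 2.4, 2.2,
          1.9, 2.9, 3.3, 1.8, 2.1]

mu0 = 2.1    # batch mean under the null hypothesis (GHz)
sigma = 0.4  # batch standard deviation (GHz), used only by the z-test

n = len(speeds)
xbar = mean(speeds)
s = stdev(speeds)  # sample standard deviation (n - 1 in the denominator)

# One-sample t statistic with n - 1 degrees of freedom.
t_stat = (xbar - mu0) / (s / sqrt(n))
T_CRIT_5PCT_DF19 = 2.093  # two-sided 5% critical value from standard t-tables
reject_h0_t = abs(t_stat) > T_CRIT_5PCT_DF19

# z statistic, valid only if sigma is known to apply to this sample.
z_stat = (xbar - mu0) / (sigma / sqrt(n))
p_value_z = 2 * (1 - NormalDist().cdf(abs(z_stat)))

print(f"mean = {xbar:.3f}, s = {s:.3f}, t = {t_stat:.3f}, reject H0: {reject_h0_t}")
print(f"z = {z_stat:.3f}, two-sided p (z-test) = {p_value_z:.3f}")
```

Comparing |t| with the tabulated critical value is equivalent to comparing the p-value with 0.05.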
Regression
Question 12. Using the file boxOffice.csv, perform a logistic regression analysis to find out whether the budget spent on a movie affects its chances of winning an Oscar. Discuss the general role of logistic regression, the logit scores, z-values and p-values from the summary of the model output. Can we reject the null hypothesis?
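When interpreting logit scores, it helps to remember that a logistic regression models the log-odds of the outcome as a linear function of the predictor. The Python sketch below converts log-odds into probabilities. The coefficients b0 and b1 and the budget values are hypothetical placeholders, not estimates from boxOffice.csv, and the budget unit is assumed to be millions.

```python
from math import exp


def logit_to_probability(logit: float) -> float:
    """Inverse of the logit function: map log-odds to a probability."""
    return 1.0 / (1.0 + exp(-logit))


# Hypothetical coefficients, for illustration only (not fitted to any data).
b0 = -3.0  # intercept: log-odds of winning at a budget of zero
b1 = 0.02  # change in log-odds per extra unit of budget

for budget in (10, 100, 200):
    logit = b0 + b1 * budget
    print(f"budget={budget}: logit={logit:.2f}, P(win)={logit_to_probability(logit):.3f}")

# exp(b1) is the odds ratio: the factor by which the odds of winning
# are multiplied for each additional unit of budget.
odds_ratio = exp(b1)
print(f"odds ratio per unit of budget: {odds_ratio:.4f}")
```

In a fitted model, the z-value reported for b1 is the estimate divided by its standard error, and the associated p-value tests the null hypothesis that b1 = 0.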
Question 13. What needs to change in the overall problem if one wants to use linear regression?
Question 14. What other methods could one try if linear regression does not perform well?