Data Wrangling and Visualization Assignment
Section A - Academy Award winners
Section B - Rogue waves
For Section A you may use any software of your choice. For Section B you must use SAS. There are two datasets (one for each Section), which are available on the Assessment page:
A1A Academy awards.xls
A1B Rogue waves.csv
Marks are awarded for the following components (not all questions require all three):
[A] Answer / output.
[C] Code that you used to generate output.
[E] Justification / explanation / discussion.
It is assumed that you have read Modules 1, 2, and 3 (sections 1-6), and worked through the corresponding examples and exercises.
If you need to clarify the wording of any question or if you have technical issues, you may post in the Discussion Forum on Canvas, but your final submission must consist of your own work in accordance with the Academic Integrity Policy.
Section A - Academy Award winners
The Academy Awards, or "Oscars", are international awards given for meritorious achievement in the film industry. The dataset "A1A Academy awards.xls" contains demographic data on award winners up until the year 2014 and was compiled by kaggle user fmejia21. In this section you will investigate the demographics of Oscar winners. You may use any software of your choice.
[A]
Complete the complementary Quiz.
[A|E]
Investigate the answer to ONE of the following questions using appropriate tables and/or charts. Write a summary of your findings (approximately 100-200 words).
Age - Calculate an estimate of an award winner's age from their date of birth and year of award. What is the average age and age range of winners? Are there any age differences across minority groups or award categories?
Birthplace - Categorise the place of birth for winners as born in USA or born overseas. What proportion of winners were not born in the USA? Is there any difference in the proportion of winners born in the USA across minority groups or award categories?
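As an illustration of the Age option, the following minimal Python sketch estimates age as the difference between the award year and the year of birth, then reports the average and the range. The function name and the three rows are made-up placeholders, not values or column names from the dataset. Because the exact ceremony date is not used, the estimate may be one year too high for winners whose birthday falls after the ceremony.

```python
from datetime import date
from statistics import mean


def estimate_age(date_of_birth: date, award_year: int) -> int:
    """Approximate age at the award as the difference in calendar years."""
    return award_year - date_of_birth.year


# Made-up placeholder rows (NOT taken from the dataset):
# (date of birth, year of award)
winners = [
    (date(1950, 3, 14), 1990),
    (date(1962, 11, 2), 1995),
    (date(1930, 7, 21), 2001),
]

ages = [estimate_age(dob, year) for dob, year in winners]
average_age = mean(ages)
age_range = (min(ages), max(ages))

print("ages:", ages)
print("average age:", average_age)
print("age range:", age_range)
```

The same calculation can be repeated within each award category or minority group to compare subgroups.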
Your answer will be assessed on the following elements:
Describes results/findings that answer the question being asked.
Statements are supported by relevant tables or charts as evidence from the data.
Refers to specific quantities (counts, percentages, other statistics) as part of written answer.
Communicates clearly regarding filtered/grouped data or categories when summarising data or making comparisons.
Writes with clarity and organisation using report-style language.
Section B - Rogue waves
A "rogue" or "freak" wave is an abnormally large wave relative to the conditions. Rogue waves are surface waves occurring due to gravity and are not to be confused with tsunamis, which are caused by sudden impacts or shifting of the sea floor. The existence of rogue waves was confirmed in 1995 by the measurement of the Draupner wave off the coast of Norway. There have since been few empirical studies of rogue waves. One study [1] analysed six years of time series data from the South Indian Ocean offshore from South Africa over 1998-2003 and identified over 1500 potential rogue waves, 15 of which were unexpectedly large. The authors hypothesise that these outliers may be actual wave measurements rather than errors.
Due to the growing evidence for rogue waves, the "linear" model is suspected to be insufficient for predicting the likelihood of particularly large rogue waves. The model posits a linear relationship between the maximum wave height in a wave series, hmax, and the significant wave height, hs, which is the mean height of the highest third of the waves in a series [2]. The study defines "typical" rogue waves as those whose ratio hmax/hs is greater than 2 and less than 4 and "uncommon" rogue waves as those with hmax/hs > 4.
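To make the definitions of hs and the ratio hmax/hs concrete, here is a minimal Python sketch that computes both from a single series of individual wave heights. The function name and the heights are made up for illustration. The provided dataset already contains processed hmax and hs values, so this step is not required for the assignment.

```python
def significant_wave_height(heights):
    """Mean of the highest third of the wave heights in one series."""
    ordered = sorted(heights, reverse=True)
    top_count = max(1, len(ordered) // 3)  # at least one wave
    top_third = ordered[:top_count]
    return sum(top_third) / len(top_third)


# Made-up wave heights in metres for one series (illustrative only)
series = [1.0, 1.2, 0.8, 2.0, 1.5, 3.0]

hs = significant_wave_height(series)  # mean of the two highest waves
hmax = max(series)
ratio = hmax / hs

print("hs =", hs, "hmax =", hmax, "ratio =", ratio)
```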
For Section B of this assignment, you will analyse the dataset "A1B Rogue waves.csv". The dataset contains measurements from buoys off the coast of Mooloolaba, Queensland, Australia over the period 2017-2019 and was sourced from the Queensland Government website [2]. A time series of waves is measured every half hour. This dataset contains the processed data including variables for hmax and hs.
In this section you must use SAS to clean and analyse the dataset.
[1] Liu, P.C. and MacHutchon, K.R., Are there different kinds of rogue waves?, Proceedings of OMAE2006, 25th International Conference on Offshore Mechanics and Arctic Engineering, June 4-9, 2006, Hamburg, Germany.
Are there any range errors for the numeric variables? Explain why/why not and explain how you would deal with them.
Remove any erroneous values from the variables hmax and hs. Show a table of appropriate descriptive statistics for these variables.
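The logic of a range check followed by descriptive statistics can be sketched as below. This is a Python illustration only; Section B answers must be produced in SAS. It assumes that a wave height is valid only if it is positive, and that missing readings may be coded with a negative sentinel such as -99.9. Both are assumptions to verify against the actual file. The rows are made up.

```python
from statistics import mean, median, stdev

# Made-up (hs, hmax) rows in metres; the -99.9 row imitates an assumed
# missing-value sentinel and is NOT taken from the dataset.
rows = [(1.2, 2.1), (-99.9, -99.9), (0.9, 1.6), (1.5, 2.4)]

# Range check: keep a row only when both heights are positive.
cleaned = [(hs, hmax) for hs, hmax in rows if hs > 0 and hmax > 0]


def describe(values):
    """Basic descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "std": stdev(values),
        "min": min(values),
        "max": max(values),
    }


hs_stats = describe([hs for hs, _ in cleaned])
hmax_stats = describe([hmax for _, hmax in cleaned])

print("hs:  ", hs_stats)
print("hmax:", hmax_stats)
```

Reporting n alongside the other statistics shows how many observations were removed by the range check.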
Generate a new variable for the ratio of hmax/hs. Show the histogram and describe its distribution.
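Categorise the ratio according to the classification of rogue waves defined above: not a rogue wave (hmax/hs of 2 or less), typical (hmax/hs greater than 2 and less than 4), and uncommon (hmax/hs greater than 4).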
Show the frequency table (make sure you use a custom format to label the categories of your new variable). What percentage of these observations are rogue waves?
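A frequency table of the categories, with percentages, can be sketched as follows. This is a Python illustration of the logic only; the submitted answer must use SAS with a custom format. The ratios are made up, the category labels are hypothetical, and assigning a ratio of exactly 4 to the "typical" category is an assumption, because the definitions leave that boundary unspecified.

```python
from collections import Counter


def classify(ratio):
    """Label a hmax/hs ratio using the rogue-wave cutoffs of 2 and 4."""
    if ratio > 4:
        return "uncommon rogue"
    if ratio > 2:
        return "typical rogue"  # assumption: a ratio of exactly 4 lands here
    return "not rogue"


# Made-up ratios (illustrative only, not from the dataset)
ratios = [1.5, 1.8, 2.1, 1.6, 2.4, 4.2, 1.9, 1.7]

counts = Counter(classify(r) for r in ratios)
total = len(ratios)

for label in ("not rogue", "typical rogue", "uncommon rogue"):
    n = counts[label]
    print(f"{label:15s} {n:3d} {100 * n / total:6.1f}%")

rogue_percent = 100 * (counts["typical rogue"] + counts["uncommon rogue"]) / total
print("rogue waves:", rogue_percent, "%")
```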
Create a scatterplot of hmax vs hs and show a reference line that indicates the cutoff for rogue waves (ratio of 2:1). Make sure your graph and axes have titles.
The Rayleigh distribution approximately describes the frequency distribution of wave height. For a given ratio of hmax/hs, the expected frequency is 1 in every N waves, where
According to this formula, what is the expected frequency of the wave in the Mooloolaba dataset that has the highest ratio?
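For reference, a standard form of the Rayleigh result is sketched below in Python. It assumes hs is taken as four times the standard deviation of the sea-surface elevation. The probability that a wave height exceeds h is then exp(-2 (h/hs)^2), so a wave with ratio r = hmax/hs is expected about once in every exp(2 r^2) waves. Whether this is exactly the form intended by the question is an assumption. The function name and the ratios used are illustrative.

```python
import math


def waves_per_occurrence(ratio: float) -> float:
    """Number of waves in which one wave with hmax/hs >= ratio is expected,
    under the assumed Rayleigh form exp(2 * ratio**2)."""
    return math.exp(2.0 * ratio ** 2)


# Illustrative ratios only (the rogue-wave cutoff and one larger value)
for r in (2.0, 3.0):
    print(f"ratio {r}: about 1 in {waves_per_occurrence(r):,.0f} waves")
```

Under this form, the cutoff ratio of 2 corresponds to roughly 1 in 2981 waves, and the expected frequency falls very quickly as the ratio increases.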
Create a scatterplot of hmax vs hs showing the line of best fit. What is its slope? Explain your working.
Hint: you can either estimate the no-intercept linear model (e.g., look up "PROC REG") or approximate its slope with mean(hmax)/mean(hs).
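The two approaches in the hint can be compared with a short Python sketch. It is an illustration of the arithmetic only; the submitted answer must use SAS. For a line through the origin, hmax = k * hs, the least-squares slope is k = sum(hs * hmax) / sum(hs^2). The function names and the data values are made up.

```python
def slope_through_origin(hs, hmax):
    """Least-squares slope of hmax = k * hs with no intercept."""
    numerator = sum(x * y for x, y in zip(hs, hmax))
    denominator = sum(x * x for x in hs)
    return numerator / denominator


def ratio_of_means(hs, hmax):
    """Approximate slope as mean(hmax) / mean(hs)."""
    return (sum(hmax) / len(hmax)) / (sum(hs) / len(hs))


# Made-up heights in metres (illustrative only)
hs_values = [1.0, 2.0, 3.0]
hmax_values = [2.0, 3.0, 5.0]

k_regression = slope_through_origin(hs_values, hmax_values)
k_means = ratio_of_means(hs_values, hmax_values)

print("no-intercept slope:", k_regression)
print("ratio of means:    ", k_means)
```

The two estimates are generally close but not identical, because the least-squares slope gives more weight to observations with larger hs.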