Reference no: EM133627537
Dataset:
We are going to use the up-to-date Olympic Event historical dataset (based on the 1908 to 2022 Olympics Games) from the olympedia.org
"Assignment 3 DataSet Part 2.zip" dataset is available in the course shell (under Assignment 3), the zip contains 4 files having information on athletes: gold, silver, bronze, and athlete_event_results please download it and by using Hadoop and Hive answer the below questions.
athlete_event_results (athlete_id, athlete, edition, sport, event, medal)
gold (athlete_id, athlete, edition, sport, event)
silver (athlete_id, athlete, edition, sport, event)
bronze (athlete_id, athlete, edition, sport, event)
Preparation
Store complete information of all athlete event results into a hive table.
Store data (gold, silver, bronze files) into a hive table that is partitioned on medal.
Show database and table structures
Questions
Question 5: Write the following queries, report results and execution time on both partitioned and complete data:
-- Substitute *table* with actual table names (you will answer the questions for all 4 tables)
a) Write a hive query to find the total number of records from the *table*
b) Write a hive query to find the total number of records by medals from *table*
c) Write a hive query to find the number of athletes who won the medals by sport from *table*
d) Write a hive query to find all athletes who won the medals between the years 1996 and 2016 from *table*
e) Provide the execution time of the below query for two tables (one for the one partitioned, one for athlete_event_results);
Select t.year, count(t.year) as count from (Select regexp_extract(edition, '(\\d{4})',1) as year from *table* where medal='Bronze') t group by year order by count desc limit 10;
f) Provide the execution time of the query (one for the one partitioned, one for athlete_event_results);
Select t.year, count(t.year) as count from (Select regexp_extract(edition, '(\\d{4})',1) as year from *table* where medal='Gold') t group by year order by count desc limit 20;