Reference no: EM133776509
Homework: Analyzing Massive Data from the Airline Industry
In this homework, you will work with data from the airline industry. You must download it and add it to your working directory to write SQL queries against it. You will write some SQL directly. On certain questions, you will need to integrate Python and run SQL queries using pandas that can be returned as a dataframe and analyzed with the tools and techniques from Python. Note that there are multiple tables in the database: flights, airports, planes, weather, airlines.
Using Jupyter notebook, answer each of the following questions. This is an individual homework. You must submit your own work. You will do your work in Jupyter notebook and must show the code you used to get the answer. To complete the homework you must download your Jupyter notebook with code and answers as an html file and upload the notebook in html format. All questions should be answered fully when a text answer is required and all code must be shown to indicate how you arrived at the result.
Your report submission should be formatted using Markdown and code in Jupyter notebook as follows:
The name of the homework as an H1, your name as an H2, and the date as an H3 all in the first cell block. Add a horizontal line/rule after the cell. The questions should then be answered in individual cells indicated by Q1, Q2, Q3...etc. in H4 format and separated from other questions by horizontal lines/rulers.
Note you must write out the SQL query used to generate the answer for all questions unless otherwise specified.
A. How many records are in each table in the database?
B. Did any flights depart and arrive exactly on time?
C. What is the mean arrival delay among flights?
D. How many flights had above average arrival delays?
E. What is the relationship between temperature and wind speed? Do changes in the mean temperature throughout the year (by month) move reflect changes in mean wind speed?
Print the first 5 rows of flights and airlines. Does the flights table contain the full name of the airline? If so, print them. If not, write a query to return the first 10 records in the flights data with the following columns: the origin, destination, carrier, flight number, and full name of the airline for each flight in a single table.
Your supervisor is interested in the airlines that offer the most flights in peak periods. Write a query that returns the number of flights each airline operates in December. The results should be a single table that contains five columns: carrier, month, name, and num_flights. The results should be limited to 10 and shown in descending order by number of flights.
Use pandas to connect to the database. Define a function that takes one argument--month_num--called num_flights_fun that uses pandas' `.read_sql_query()` and runs a query that returns the number of flights each airline operates in whatever month is specified in the function call. The result should be returned as a pandas dataframe that contains five columns: carrier, month, name, and num_flights.
Test your function to ensure it works. Create a new dataframe called sept that contains the results with only records from September and a data frame called jan that contains the results with only records from January. The only change you should make to get the results is adjusting the value of the input parameter you are passing in the function call in num_flights_fun. Print out the last five rows of sept and jan.
Using sept and jan, print the top five airlines with the greatest number of flights in September and in January, in descending order. Are the top five airlines the same in September and January? If so, what are the airlines? If not, which airlines differ from September to January?
A. Create a new function using pandas called num_flights_fun_all that runs the same query as above but returns the records for all months. The function should not take any arguments.
B. Use num_flights_fun_all to create a dataframe called months_df. The dataframe should have carrier, month, name, and num_flights columns and contain the number of flights that each airline made in each month. Show the first 15 lines of the dataframe to ensure the results are correct.
C. Use months_df to plot the number of flights each airline made for each month using altair, plotly, or seaborn.
D. Write a query that returns the each airlines' mean arrival delay across all of its flights. The result should have two columns: mean_delay and the full name of the airline. Which airline has the highest average delay?
E. Use pandas to query the data and return a dataframe to plot the data. Use either altair, seaborn, or plotly and plot the mean delay for each airline in a bar plot. The plot should have the airline name on the y and mean delay on the x and be ordered by mean delay. The bars should be colored by whether the airlines' flights arrive, on average, early or late. Which airline, on average, has the earliest arrivals?