Reference no: EM133482068
Case: Spark SQL is a module of the Apache Spark platform that provides support for working with structured data. It allows you to use a SQL-like language to query and manipulate data stored in various data sources, including structured data files, tables in relational databases, and data stored in Apache Hive.
In the context of big data analytics, Spark SQL can be used to clean, transform, and prepare data for analysis. It can also be used to perform various types of analyses on big data, such as aggregation, filtering, and joining data from different sources. This can help make the big data analytics process more efficient and effective, by providing a high-performance, scalable, and easy-to-use tool for working with structured data.
In this lab, you will analyze nyc-tripdata.csv using Spark SQL on the Databricks platform. You will also use the taxi zone lookup table (taxi_zone_lookup.csv), which maps each location ID to the name of the corresponding region in NYC. The nyc-tripdata dataset is a modified record of the NYC Green Taxi trips and includes information about the pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, fare amounts, payment types, and driver-reported passenger counts.
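To see how the lookup table is used, here is a minimal sketch of joining trip records to zone names. Spark is not available here, so the same SQL logic is checked locally with Python's sqlite3 on a few made-up rows (the table names, column values, and zone names below are illustrative assumptions, not the real dataset contents):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (PULocationID INT, DOLocationID INT)")
con.execute("CREATE TABLE zones (LocationID INT, Zone TEXT)")
# Hypothetical sample rows, for illustration only.
con.executemany("INSERT INTO trips VALUES (?, ?)", [(1, 2), (2, 1)])
con.executemany("INSERT INTO zones VALUES (?, ?)",
                [(1, "Newark Airport"), (2, "Jamaica Bay")])

# Join the dropoff location ID to its human-readable zone name.
query = """
SELECT t.DOLocationID, z.Zone
FROM trips t
JOIN zones z ON t.DOLocationID = z.LocationID
ORDER BY t.DOLocationID
"""
labeled = con.execute(query).fetchall()
print(labeled)
```

The same JOIN shape carries over to Spark SQL: match the trip table's location-ID column against the lookup table's LocationID.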
Data:
nyc-tripdata.zip
taxi_zone_lookup.csv
Action Items:
Question 1. Use only Firefox, Safari, or Chrome when configuring anything related to Databricks.
Question 2. Carefully follow the Databricks Setup Guide listed below. (You should have already downloaded the data needed for this question using the link provided in the Data Section)
Databricks_Setup_Guide_V1.docx
Important Notes: The cluster and tables will need to be re-created periodically. When re-creating a table, use the same name. Your SQL code notebook is saved automatically; if you cannot find it, search for it by name using the "search" button on the left panel.
Question 3. Use SQL in Spark to complete the following tasks. Take a screenshot at the end of each exercise.
1) List the top-5 most popular drop-off locations based on "DOLocationID", sorted in descending order by count. If there is a tie, the location with the lower "DOLocationID" is listed first. Note: before solving both exercises in this question, filter the data to keep only the rows where "PULocationID" and "DOLocationID" are different and "trip_distance" is strictly greater than 2.0 (> 2.0).
Output Columns: DOLocationID, number_of_dropoffs
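The first exercise can be sketched as a single SQL query: filter, group by dropoff location, and order by count with the location ID as tie-breaker. Since Spark is unavailable here, the query logic is checked locally with Python's sqlite3 on a handful of hypothetical rows (the sample data is made up, not taken from nyc-tripdata):

```python
import sqlite3

# Hypothetical trips: (PULocationID, DOLocationID, trip_distance).
rows = [
    (1, 2, 3.5), (1, 2, 4.0), (3, 2, 2.5),  # three valid dropoffs at 2
    (4, 5, 3.0), (6, 5, 2.1),               # two valid dropoffs at 5
    (7, 7, 9.0),                            # excluded: same PU and DO
    (8, 9, 1.5),                            # excluded: distance <= 2.0
    (1, 9, 2.2),                            # one valid dropoff at 9
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE trips (PULocationID INT, DOLocationID INT, trip_distance REAL)")
con.executemany("INSERT INTO trips VALUES (?, ?, ?)", rows)

# Filter, count dropoffs per location, break ties by lower DOLocationID.
query = """
SELECT DOLocationID, COUNT(*) AS number_of_dropoffs
FROM trips
WHERE PULocationID <> DOLocationID AND trip_distance > 2.0
GROUP BY DOLocationID
ORDER BY number_of_dropoffs DESC, DOLocationID ASC
LIMIT 5
"""
result = con.execute(query).fetchall()
print(result)
```

The same SELECT should run unchanged in a Spark SQL notebook cell against your trips table.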
2) List the top-3 locations with the maximum overall activity. Here, overall activity at a LocationID is simply the sum of all pick-ups and all drop-offs at that LocationID. In case of a tie, the lower LocationID is listed first. Note: if a taxi picked up 3 passengers at once, that counts as 1 pickup, not 3.
Output Columns: LocationID, number_activities
Hint: In order to get the result, you may need to perform a join operation between the two tables.
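One common way to count overall activity is to stack the pickup-ID and dropoff-ID columns with UNION ALL and then group, since each trip contributes exactly one pickup and one dropoff regardless of passenger count. A minimal sketch, checked locally with sqlite3 on made-up rows (not the real data; apply the same filter from the note above before this step):

```python
import sqlite3

# Hypothetical trips after filtering: (PULocationID, DOLocationID).
rows = [(1, 2), (1, 3), (2, 3), (3, 1), (4, 4)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (PULocationID INT, DOLocationID INT)")
con.executemany("INSERT INTO trips VALUES (?, ?)", rows)

# Stack both ID columns, then count rows per LocationID; ties go to the
# lower LocationID.
query = """
SELECT LocationID, COUNT(*) AS number_activities
FROM (
    SELECT PULocationID AS LocationID FROM trips
    UNION ALL
    SELECT DOLocationID AS LocationID FROM trips
)
GROUP BY LocationID
ORDER BY number_activities DESC, LocationID ASC
LIMIT 3
"""
result = con.execute(query).fetchall()
print(result)
```

In Spark SQL, the subquery needs an alias (e.g. `... ) AS activity`), but the UNION ALL / GROUP BY structure is the same; join the result to taxi_zone_lookup if you also want zone names.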
3) In the Notebook, generate appropriate bar charts using "+" and "Visualization" for the output of the above two questions.
Question 4. Summarize and reflect on what you learned throughout this lab assignment. Note that to earn the points, you must include specific details.
Was there anything surprising or unexpected?
Was there anything worth investigating further?