Reference no: EM133482068
Case: Spark SQL is a module of the Apache Spark platform that provides support for working with structured data. It allows you to use a SQL-like language to query and manipulate data stored in various data sources, including structured data files, tables in relational databases, and data stored in Apache Hive.
In the context of big data analytics, Spark SQL can be used to clean, transform, and prepare data for analysis. It can also be used to perform various types of analyses on big data, such as aggregation, filtering, and joining data from different sources. This can help make the big data analytics process more efficient and effective, by providing a high-performance, scalable, and easy-to-use tool for working with structured data.
In this lab, you will analyze nyc-tripdata.csv using Spark SQL on the Databricks platform. You will also use the taxi zone lookup table (taxi_zone_lookup.csv), which maps each location ID to the name of the corresponding region in NYC. The nyc-tripdata dataset is a modified record of the NYC Green Taxi trips and includes information about the pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, fare amounts, payment types, and driver-reported passenger counts.
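To see how the lookup table is used, here is a minimal sketch of joining trip records to zone names. Spark is not available here, so the same SQL logic is checked locally with Python's sqlite3 on a few made-up rows (the table names, column values, and zone names below are illustrative assumptions, not the real dataset contents):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (PULocationID INT, DOLocationID INT)")
con.execute("CREATE TABLE zones (LocationID INT, Zone TEXT)")
# Hypothetical sample rows, for illustration only.
con.executemany("INSERT INTO trips VALUES (?, ?)", [(1, 2), (2, 1)])
con.executemany("INSERT INTO zones VALUES (?, ?)",
                [(1, "Newark Airport"), (2, "Jamaica Bay")])

# Join the dropoff location ID to its human-readable zone name.
query = """
SELECT t.DOLocationID, z.Zone
FROM trips t
JOIN zones z ON t.DOLocationID = z.LocationID
ORDER BY t.DOLocationID
"""
labeled = con.execute(query).fetchall()
print(labeled)
```

The same JOIN shape carries over to Spark SQL: match the trip table's location-ID column against the lookup table's LocationID.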
Data:
nyc-tripdata.zip
taxi_zone_lookup.csv
Action Items:
Question 1. Use only Firefox, Safari, or Chrome when configuring anything related to Databricks.
Question 2. Carefully follow the Databricks Setup Guide listed below. (You should have already downloaded the data needed for this question using the link provided in the Data Section)
Databricks_Setup_Guide_V1.docx
Important Notes: The cluster and tables will need to be re-created periodically. When re-creating a table, use the same name. Your SQL code notebook is saved automatically; if you cannot find it, search for it by name using the "search" button on the left panel.
Question 3. Use SQL in Spark to complete the following tasks. Take a screenshot at the end of each exercise.
1) List the top-5 most popular drop-off locations based on "DOLocationID", sorted in descending order by count. If there is a tie, the location with the lower "DOLocationID" is listed first. Note: before solving both exercises in this question, filter the data to keep only the rows where "PULocationID" and "DOLocationID" are different and "trip_distance" is strictly greater than 2.0 (> 2.0).
Output Columns: DOLocationID, number_of_dropoffs
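The first exercise can be sketched as a single SQL query: filter, group by dropoff location, and order by count with the location ID as tie-breaker. Since Spark is unavailable here, the query logic is checked locally with Python's sqlite3 on a handful of hypothetical rows (the sample data is made up, not taken from nyc-tripdata):

```python
import sqlite3

# Hypothetical trips: (PULocationID, DOLocationID, trip_distance).
rows = [
    (1, 2, 3.5), (1, 2, 4.0), (3, 2, 2.5),  # three valid dropoffs at 2
    (4, 5, 3.0), (6, 5, 2.1),               # two valid dropoffs at 5
    (7, 7, 9.0),                            # excluded: same PU and DO
    (8, 9, 1.5),                            # excluded: distance <= 2.0
    (1, 9, 2.2),                            # one valid dropoff at 9
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE trips (PULocationID INT, DOLocationID INT, trip_distance REAL)")
con.executemany("INSERT INTO trips VALUES (?, ?, ?)", rows)

# Filter, count dropoffs per location, break ties by lower DOLocationID.
query = """
SELECT DOLocationID, COUNT(*) AS number_of_dropoffs
FROM trips
WHERE PULocationID <> DOLocationID AND trip_distance > 2.0
GROUP BY DOLocationID
ORDER BY number_of_dropoffs DESC, DOLocationID ASC
LIMIT 5
"""
result = con.execute(query).fetchall()
print(result)
```

The same SELECT should run unchanged in a Spark SQL notebook cell against your trips table.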
2) List the top-3 locations with the maximum overall activity. Here, overall activity at a LocationID is simply the sum of all pick-ups and all drop-offs at that LocationID. In case of a tie, the lower LocationID is listed first. Note: if a taxi picked up 3 passengers at once, that counts as 1 pickup, not 3.
Output Columns: LocationID, number_activities
Hint: In order to get the result, you may need to perform a join operation between the two tables.
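One common way to count overall activity is to stack the pickup-ID and dropoff-ID columns with UNION ALL and then group, since each trip contributes exactly one pickup and one dropoff regardless of passenger count. A minimal sketch, checked locally with sqlite3 on made-up rows (not the real data; apply the same filter from the note above before this step):

```python
import sqlite3

# Hypothetical trips after filtering: (PULocationID, DOLocationID).
rows = [(1, 2), (1, 3), (2, 3), (3, 1), (4, 4)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (PULocationID INT, DOLocationID INT)")
con.executemany("INSERT INTO trips VALUES (?, ?)", rows)

# Stack both ID columns, then count rows per LocationID; ties go to the
# lower LocationID.
query = """
SELECT LocationID, COUNT(*) AS number_activities
FROM (
    SELECT PULocationID AS LocationID FROM trips
    UNION ALL
    SELECT DOLocationID AS LocationID FROM trips
)
GROUP BY LocationID
ORDER BY number_activities DESC, LocationID ASC
LIMIT 3
"""
result = con.execute(query).fetchall()
print(result)
```

In Spark SQL, the subquery needs an alias (e.g. `... ) AS activity`), but the UNION ALL / GROUP BY structure is the same; join the result to taxi_zone_lookup if you also want zone names.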
3) In the Notebook, generate appropriate bar charts using "+" and "Visualization" for the output of the above two questions.
Question 4. Summarize and reflect on what you learned throughout this lab assignment. Note that to earn the points, you must include specific details.
Was there anything surprising or unexpected?
Was there anything worth investigating further?