Was there anything surprising or unexpected

Reference no: EM133482068

Case: Spark SQL is a module of the Apache Spark platform that provides support for working with structured data. It allows you to use a SQL-like language to query and manipulate data stored in various data sources, including structured data files, tables in relational databases, and data stored in Apache Hive.

In the context of big data analytics, Spark SQL can be used to clean, transform, and prepare data for analysis. It can also be used to perform various types of analyses on big data, such as aggregation, filtering, and joining data from different sources. This can help make the big data analytics process more efficient and effective, by providing a high-performance, scalable, and easy-to-use tool for working with structured data.

Analyze nyc-tripdata.csv using Spark SQL on the Databricks platform. You will also need the taxi zone lookup table, taxi_zone_lookup.csv, which maps each location ID to the actual name of the region in NYC. The nyc-tripdata dataset is a modified record of NYC Green Taxi trips and includes information about the pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, fare amounts, payment types, and driver-reported passenger counts.

Data:

nyc-tripdata.zip

taxi_zone_lookup.csv

Action Items:

Question 1. Use only Firefox, Safari or Chrome when configuring anything related to Databricks.

Question 2. Carefully follow the Databricks Setup Guide listed below. (You should have already downloaded the data needed for this question using the link provided in the Data Section)

Databricks_Setup_Guide_V1.docx

Important Notes: The cluster and tables will need to be re-created periodically. When creating a table again, use the same name. Your SQL code notebook is saved automatically. If you cannot find it, search for it by name using the "search" button on the left panel.

Question 3. Use SQL in Spark to complete the following tasks. Take a screenshot at the end of each exercise.

1) List the top 5 most popular drop-off locations based on "DOLocationID", sorted in descending order by count. If there is a tie, the one with the lower "DOLocationID" is listed first. Before solving Question 3 (both queries), filter the data to keep only the rows where "PULocationID" and "DOLocationID" are different and "trip_distance" is strictly greater than 2.0 (> 2.0).

Output Columns: DOLocationID, number_of_dropoffs
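
The filter, grouping, and tie-break rule for this task can be sketched as a single query. This is only an illustration under stated assumptions: the table name nyc_tripdata is hypothetical (use whatever name you gave the table in Databricks), the sample rows are made up, and Python's built-in sqlite3 module stands in for Spark SQL so the query can be tried locally. The SELECT statement uses only standard SQL, so it should be usable in a Databricks SQL cell as well.

```python
import sqlite3

# Query for task 1. The table name "nyc_tripdata" is an assumption.
TOP_DROPOFFS_SQL = """
SELECT DOLocationID, COUNT(*) AS number_of_dropoffs
FROM nyc_tripdata
WHERE PULocationID <> DOLocationID   -- pick-up and drop-off must differ
  AND trip_distance > 2.0            -- strictly greater than 2.0
GROUP BY DOLocationID
ORDER BY number_of_dropoffs DESC,    -- most popular first
         DOLocationID ASC            -- tie: lower ID first
LIMIT 5
"""

def top_dropoffs(rows):
    """Load (PULocationID, DOLocationID, trip_distance) rows and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE nyc_tripdata ("
        "PULocationID INTEGER, DOLocationID INTEGER, trip_distance REAL)"
    )
    conn.executemany("INSERT INTO nyc_tripdata VALUES (?, ?, ?)", rows)
    result = conn.execute(TOP_DROPOFFS_SQL).fetchall()
    conn.close()
    return result

# Made-up sample rows, not taken from the real dataset.
sample = [
    (1, 2, 3.0),
    (3, 2, 2.5),
    (1, 4, 5.0),
    (2, 4, 2.1),
    (4, 4, 9.0),  # same pick-up and drop-off: filtered out
    (1, 7, 2.0),  # distance not strictly greater than 2.0: filtered out
    (5, 6, 4.0),
]

print(top_dropoffs(sample))  # [(2, 2), (4, 2), (6, 1)]
```

In the sample, locations 2 and 4 both have two drop-offs, and the secondary sort key places location 2 first.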

2) List the top-3 locations with the maximum overall activity. Here, overall activity at a LocationID is simply the sum of all pick-ups and all drop-offs at that LocationID. In case of a tie, the lower LocationID gets listed first. Note: If a taxi picked up 3 passengers at once, we count it as 1 pickup and not 3 pickups.

Output Columns: LocationID, number_activities

Hint: In order to get the result, you may need to perform a join operation between the two tables.
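
One possible way to count overall activity is to stack pick-ups and drop-offs into a single column of location IDs, join that column to the lookup table, and count rows per location. The sketch below is an illustration only: the table names nyc_tripdata and taxi_zone_lookup and the lookup column Zone are assumptions, the sample rows are made up, and sqlite3 stands in for Spark SQL so the query can be run locally. Each trip contributes one pick-up and one drop-off regardless of passenger count.

```python
import sqlite3

# Query for task 2. Table names and the "Zone" column are assumptions.
TOP_ACTIVITY_SQL = """
WITH filtered AS (
    SELECT PULocationID, DOLocationID
    FROM nyc_tripdata
    WHERE PULocationID <> DOLocationID AND trip_distance > 2.0
),
events AS (
    -- one row per pick-up and one row per drop-off
    SELECT PULocationID AS LocationID FROM filtered
    UNION ALL
    SELECT DOLocationID AS LocationID FROM filtered
)
SELECT z.LocationID, COUNT(*) AS number_activities
FROM events e
JOIN taxi_zone_lookup z ON z.LocationID = e.LocationID
GROUP BY z.LocationID
ORDER BY number_activities DESC, z.LocationID ASC
LIMIT 3
"""

def top_activity(trips, zones):
    """Load trips and lookup rows into an in-memory database and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE nyc_tripdata ("
        "PULocationID INTEGER, DOLocationID INTEGER, trip_distance REAL)"
    )
    conn.execute("CREATE TABLE taxi_zone_lookup (LocationID INTEGER, Zone TEXT)")
    conn.executemany("INSERT INTO nyc_tripdata VALUES (?, ?, ?)", trips)
    conn.executemany("INSERT INTO taxi_zone_lookup VALUES (?, ?)", zones)
    result = conn.execute(TOP_ACTIVITY_SQL).fetchall()
    conn.close()
    return result

# Made-up sample data, not taken from the real files.
trips = [
    (1, 2, 3.0),
    (3, 2, 2.5),
    (1, 4, 5.0),
    (2, 4, 2.1),
    (4, 4, 9.0),  # filtered out: same pick-up and drop-off
    (1, 7, 2.0),  # filtered out: distance not strictly greater than 2.0
    (5, 6, 4.0),
]
zones = [(i, f"Zone {i}") for i in range(1, 8)]

print(top_activity(trips, zones))  # [(2, 3), (1, 2), (4, 2)]
```

UNION ALL is used rather than UNION because UNION would drop duplicate rows and undercount the activity at busy locations.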

3) In the Notebook, generate appropriate bar charts using "+" and "Visualization" for the output of the above two questions.

Question 4. Summarize and reflect on what you learned throughout this lab assignment. Note that to get the points, you must mention some details.

Was there anything surprising or unexpected?

Was there anything worth investigating further?


All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd