Was there anything surprising or unexpected

Reference no: EM133482068

Case: Spark SQL is a module of the Apache Spark platform that provides support for working with structured data. It allows you to use a SQL-like language to query and manipulate data stored in various data sources, including structured data files, tables in relational databases, and data stored in Apache Hive.

In the context of big data analytics, Spark SQL can be used to clean, transform, and prepare data for analysis. It can also be used to perform various types of analyses on big data, such as aggregation, filtering, and joining data from different sources. This can help make the big data analytics process more efficient and effective, by providing a high-performance, scalable, and easy-to-use tool for working with structured data.

Analyze nyc-tripdata.csv using Spark SQL on the Databricks platform. You will also need the taxi zone lookup table, taxi_zone_lookup.csv, which maps each location ID to the actual name of the region in NYC. The nyc-tripdata dataset is a modified record of NYC Green Taxi trips and includes information about pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, fare amounts, payment types, and driver-reported passenger counts.

Data:

nyc-tripdata.zip

taxi_zone_lookup.csv

Action Items:

Question 1. Use only Firefox, Safari or Chrome when configuring anything related to Databricks.

Question 2. Carefully follow the Databricks Setup Guide listed below. (You should have already downloaded the data needed for this question using the link provided in the Data Section)

Databricks_Setup_Guide_V1.docx

Important Notes: The cluster and tables will need to be re-created periodically. When creating a table again, use the same name as before. Your SQL code notebook is saved automatically. If you cannot find it, search for it by name using the "search" button on the left panel.

Question 3. Use SQL in Spark to complete the following tasks. Take a screenshot at the end of each exercise.

1) List the top-5 most popular locations for dropoff based on "DOLocationID", sorted in descending order by count. If there is a tie, the one with the lower "DOLocationID" gets listed first. Before solving either task 1) or task 2) of Question 3, filter the data to keep only the rows where "PULocationID" and "DOLocationID" are different and "trip_distance" is strictly greater than 2.0 (>2.0).

Output Columns: DOLocationID, number_of_dropoffs

2) List the top-3 locations with the maximum overall activity. Here, overall activity at a LocationID is simply the sum of all pick-ups and all drop-offs at that LocationID. In case of a tie, the lower LocationID gets listed first. Note: If a taxi picked up 3 passengers at once, we count it as 1 pickup and not 3 pickups.

Output Columns: LocationID, number_activities

Hint: In order to get the result, you may need to perform a join operation between the two tables.

3) In the Notebook, generate appropriate bar charts using "+" and "Visualization" for the output of the above two questions.
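
The two queries in tasks 1) and 2) can be sketched as follows. This is a minimal illustration, not the assignment's reference solution: it runs the SQL against an in-memory SQLite table so that it is self-contained, and the table name `trips` and all of the rows are made up for the example. The SQL itself uses only standard constructs (GROUP BY, UNION ALL, ORDER BY, LIMIT), so it should carry over to a Spark SQL cell in Databricks once `trips` is replaced by whatever table name was chosen during setup. For task 2) the sketch stacks pick-ups and drop-offs with UNION ALL rather than using the join mentioned in the hint, and it counts one activity per trip row, so passenger counts are ignored as the note requires.

```python
import sqlite3

# Made-up sample rows: (PULocationID, DOLocationID, trip_distance).
rows = [
    (1, 2, 3.0),
    (1, 2, 2.5),
    (3, 2, 4.0),
    (2, 1, 5.0),
    (4, 1, 2.1),
    (1, 3, 6.0),
    (2, 4, 3.0),
    (2, 2, 9.0),  # filtered out: same pick-up and drop-off location
    (1, 4, 2.0),  # filtered out: distance is not strictly greater than 2.0
    (3, 4, 1.0),  # filtered out: distance too short
]

conn = sqlite3.connect(":memory:")
# `trips` is a hypothetical table name standing in for the nyc-tripdata table.
conn.execute(
    "CREATE TABLE trips (PULocationID INTEGER, DOLocationID INTEGER, trip_distance REAL)"
)
conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", rows)

# Filter required before both tasks.
FILTER = "PULocationID <> DOLocationID AND trip_distance > 2.0"

# Task 1: top-5 drop-off locations; ties go to the lower DOLocationID.
top_dropoffs_sql = f"""
SELECT DOLocationID, COUNT(*) AS number_of_dropoffs
FROM trips
WHERE {FILTER}
GROUP BY DOLocationID
ORDER BY number_of_dropoffs DESC, DOLocationID ASC
LIMIT 5
"""

# Task 2: stack pick-ups and drop-offs into one column, then count per location.
top_activity_sql = f"""
WITH filtered AS (
    SELECT PULocationID, DOLocationID FROM trips WHERE {FILTER}
),
activity AS (
    SELECT PULocationID AS LocationID FROM filtered
    UNION ALL
    SELECT DOLocationID AS LocationID FROM filtered
)
SELECT LocationID, COUNT(*) AS number_activities
FROM activity
GROUP BY LocationID
ORDER BY number_activities DESC, LocationID ASC
LIMIT 3
"""

top_dropoffs = conn.execute(top_dropoffs_sql).fetchall()
top_activity = conn.execute(top_activity_sql).fetchall()
print(top_dropoffs)
print(top_activity)
```

On the sample rows, locations 1 and 2 tie on total activity and locations 3 and 4 tie on drop-offs, so the secondary sort key on the location ID decides the order in both results.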

Question 4. Summarize and reflect on what you learned throughout this lab assignment. Note that to get the points, you must mention some details.

Was there anything surprising or unexpected?

Was there anything worth investigating further?

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd