Reference no: EM133170154
Scenario: You're a data scientist at Uber, sitting in a war room on March 16, 2020, one day after California-wide COVID lockdown measures began and the day shelter-in-place measures are announced in the Bay Area. The entire data science department is on fire: all of your existing traffic models have regressed significantly. Given the sudden change in traffic patterns (i.e., no traffic at all), the company's traffic estimates are wildly incorrect.
This is a top priority for the company. Since traffic estimates are used directly for pricing strategies, this is actively costing the company millions every hour. You are tasked with fixing these models.
Takeaways: How do you "fix" models that have learned biases from pre-lockdown traffic? How do you train new ones with just 24 hours of data? What sorts of data do you examine to better understand the situation? In the midst of company-wide panic, you'll need strong inferential acumen to lead a robust data science response.
In this project, we'll walk you through a simulated war room data science effort, culminating in some strategies for fixing, online, models that are experiencing large distributional shifts in their data. For this project, we'll explore traffic data provided by the Uber Movement dataset, specifically around the start of COVID shutdowns in March 2020. Your project is structured around the following ideas:
1. Guided data cleaning: Clustering data spatially
   a. Load Uber traffic speeds dataset
   b. Map traffic speeds to Google Plus Codes (spatially uniform)
      i. Load node-to-gps-coordinates data
      ii. Map traffic speed to GPS coordinates
      iii. Convert GPS coordinates to plus code regions
      iv. Sanity check number of plus code regions in San Francisco
      v. Plot a histogram of the standard deviation in speed, per plus code region.
   c. Map traffic speeds to census tracts (spatially non-uniform)
      i. Download census tracts geojson
      ii. Map traffic speed to census tracts
      iii. Sanity check number of census tracts in San Francisco with data.
      iv. Plot a histogram of the standard deviation in speed, per census tract.
   d. What defines a "good" or "bad" spatial clustering?
2. Guided EDA: Understanding COVID lockdown impact on traffic
   a. How did lockdown affect average traffic speeds?
      i. Sort census tracts by average speed, pre-lockdown.
      ii. Sort census tracts by average speed, post-lockdown.
      iii. Sort census tracts by change in average speed, from pre to post lockdown.
      iv. Quantify the impact of lockdown on average speeds.
      v. Quantify the impact of pre-lockdown average speed on change in speed.
   b. What traffic areas were impacted by lockdown?
      i. Visualize heatmap of average traffic speed per census tract, pre-lockdown.
      ii. Visualize change in average daily speeds pre vs. post lockdown.
      iii. Quantify the impact of lockdown on daily speeds, spatially.
3. Open-Ended EDA: Understanding lockdown impact on traffic times
   a. Download Uber Movement (Travel Times) dataset
4. Guided Modeling: Predict traffic speed post-lockdown
   a. Predict daily traffic speed on pre-lockdown data
      i. Assemble dataset to predict daily traffic speed.
      ii. Train and evaluate linear model on pre-lockdown data.
   b. Understand failures on post-lockdown data
      i. Evaluate on post-lockdown data
      ii. Report model performance temporally
   c. "Fix" model on post-lockdown data
      i. Learn delta off of a moving bias
      ii. Does it "solve itself"? Does the pre-lockdown model predict, after the change point?
      iii. Naively retrain model with post-lockdown data
      iv. What if you just ignore the change point?
5. Open-Ended Modeling: Predicting travel times post-lockdown
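For steps 1.b.v and 1.c.iv, the per-region standard deviation of speed could be computed roughly as in the sketch below. The toy table and the column names `region` and `speed_mph` are assumptions for illustration only and are not taken from the Uber Movement files.

```python
import pandas as pd

# Hypothetical toy data: one speed reading per row, already labeled with
# a spatial cluster (a plus code region or a census tract).
speeds = pd.DataFrame({
    "region": ["A", "A", "A", "B", "B", "B"],
    "speed_mph": [20.0, 22.0, 24.0, 10.0, 30.0, 50.0],
})

# Sample standard deviation (ddof=1) of speed within each region.
std_per_region = speeds.groupby("region")["speed_mph"].std()
print(std_per_region)

# A histogram of these values summarizes the clustering as a whole.
# std_per_region.plot.hist(bins=20)  # requires matplotlib
```

One possible reading of step 1.d is that a clustering whose regions have small within-region standard deviations groups road segments with similar speeds, which makes a per-region average more meaningful.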
This is the final assignment for the graduate class on Data Science.
The assignment is in two parts. The first part covers questions 1 to 2 in the Jupyter notebook. The second part covers questions 3 to 5 in the Jupyter notebook.
Please refer to fa21_discussions_dev.pdf (Report Format and Submission) for details on the assignment. It describes what is expected from the final report. Please note that there are three types of assignment (AQI, COVID, and Traffic); my dataset is "Traffic".
Dataset: Traffic dataset
This is the data assigned to us. It contains traffic speeds by date for San Francisco. The objective is to understand the impact of COVID on traffic speed.
Part 1 of the assignment:
In the first part of the assignment, we first prepared the data: we created a GeoDataFrame, performed a spatial join (sjoin), and finally conducted some EDA to understand whether traffic speeds in San Francisco were impacted by COVID, comparing before and after lockdown.
Furthermore, based on the preliminary EDA, we submitted a hypothesis (design doc traffic) to test in the second part of the assignment.
Hypothesis: Even though the number of vehicles on the road decreased after the COVID lockdown, speeds did not change drastically in specific locations, because other confounding variables, such as traffic light density, affect traffic speed at any given location.
Approach for Part 2:
In the second part of the assignment, we need to understand whether the COVID lockdown itself changed vehicle speeds or whether other confounding factors, such as traffic lights, explain them. The idea is that an area with a large number of traffic lights would have lower speeds, and that this would stay constant after the lockdown as well.
Hence we downloaded the datasets for traffic lights, stop signs, and speed limits (please find attached) for the city of San Francisco.
We are also given the Daily Travel Times dataset, which has the mean travel speed by date for each geography.
The geography can be traced through the geometry column and/or the corresponding movement ID column.
Now we need to merge the traffic light data with the previous (Part 1) datasets, speeds_to_tract and times_to_tract, to understand whether the speed changed between the periods before and after COVID.
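A minimal sketch of that merge is given below, assuming a pandas workflow. The toy values, the column names `MOVEMENT_ID`, `day`, `speed_mph`, and `n_signals`, and the choice of March 16, 2020 as the change point are all assumptions for illustration; the real speeds_to_tract and traffic light tables will differ.

```python
import pandas as pd

# Assumed change point, taken from the scenario description.
LOCKDOWN = pd.Timestamp("2020-03-16")

# Hypothetical stand-in for speeds_to_tract: daily mean speed per tract.
speeds_to_tract = pd.DataFrame({
    "MOVEMENT_ID": [1, 1, 1, 1, 2, 2, 2, 2],
    "day": pd.to_datetime(
        ["2020-03-10", "2020-03-11", "2020-03-20", "2020-03-21"] * 2
    ),
    "speed_mph": [20.0, 22.0, 25.0, 27.0, 12.0, 12.0, 13.0, 13.0],
})

# Hypothetical count of traffic signals per tract.
signals = pd.DataFrame({"MOVEMENT_ID": [1, 2], "n_signals": [3, 40]})

# Label each row as pre- or post-lockdown.
is_post = speeds_to_tract["day"] >= LOCKDOWN
speeds_to_tract["period"] = is_post.map({False: "pre", True: "post"})

# Average speed per tract and period, then the post-minus-pre change.
avg = speeds_to_tract.pivot_table(
    index="MOVEMENT_ID", columns="period", values="speed_mph", aggfunc="mean"
)
avg.columns.name = None
avg["change"] = avg["post"] - avg["pre"]

# Attach the signal counts; a left join keeps tracts without signal data.
merged = avg.reset_index().merge(signals, on="MOVEMENT_ID", how="left")
print(merged)
```

The left join is a design choice: tracts with no matching signal record are kept with a missing `n_signals` value, so they can be inspected rather than silently dropped.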
Question 1:
This is an open-ended EDA. We can run EDA to test:
- whether traffic speed changed between the periods before and after COVID
- whether the change in traffic speed is related to a high number of traffic signals or stop signs, or to the speed limit
- whether the change in speed is correlated with traffic signals, stop signs, the speed limit, or the COVID lockdown
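As one way to approach the correlation checks listed above, the sketch below computes a Pearson correlation between a per-tract signal count and the per-tract change in speed. The numbers are made up for illustration and the variable names are assumptions.

```python
import numpy as np

# Hypothetical per-tract values: number of signals and the change in
# average speed (post-lockdown minus pre-lockdown), in mph.
n_signals = np.array([2, 5, 10, 20, 40], dtype=float)
speed_change = np.array([6.0, 5.0, 4.0, 2.0, 1.0])

# Pearson correlation coefficient between the two columns.
r = np.corrcoef(n_signals, speed_change)[0, 1]
print(f"Pearson r = {r:.2f}")
```

A strongly negative value on real data would be consistent with the hypothesis that signal-dense tracts gained less speed, but a correlation alone does not rule out other confounders.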
Question 2:
This is guided modeling: in this step, you'll train a model to predict traffic speed.
In this question we are given prompts; we just need to write the code as directed.
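The sketch below illustrates the general shape of this exercise on synthetic data: fit a linear model on pre-lockdown days, observe the failure on post-lockdown days, then apply a moving-bias correction. The data, the feature `x`, the window size `k`, and this particular interpretation of "learn delta off of a moving bias" are assumptions, not the notebook's prescribed solution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 30 days, with lockdown at day 20 raising speeds by 8 mph.
days = np.arange(30)
x = rng.uniform(0, 1, size=30)  # hypothetical daily feature
shift = np.where(days >= 20, 8.0, 0.0)
y = 20.0 + 5.0 * x + shift + rng.normal(0, 0.1, size=30)

pre, post = days < 20, days >= 20

# Fit y ~ w0 + w1 * x by least squares on pre-lockdown days only.
X = np.column_stack([np.ones_like(x), x])
w, *_ = np.linalg.lstsq(X[pre], y[pre], rcond=None)
pred = X @ w


def mse(actual, predicted):
    """Mean squared error between two arrays."""
    return float(np.mean((actual - predicted) ** 2))


mse_pre = mse(y[pre], pred[pre])
mse_post = mse(y[post], pred[post])

# Moving-bias correction: add the mean residual of the previous k days
# to each prediction, so the model tracks the shift with a short lag.
k = 3
resid = y - pred
corrected = pred.copy()
for t in range(k, len(days)):
    corrected[t] = pred[t] + resid[t - k:t].mean()

# Score only days whose residual window lies fully after the change point.
settled = days >= 20 + k
mse_post_fixed = mse(y[settled], corrected[settled])

print(mse_pre, mse_post, mse_post_fixed)
```

The correction needs only a few days of post-lockdown observations, which matches the scenario's constraint of very little new data; the cost is a lag of about `k` days right after the change point.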
Question 3:
Open-Ended Modeling: Predicting travel time post-lockdown.
a: Train a baseline model of your choice using any supervised learning approach we have studied; you are not limited to a linear model.
b: Improve on your baseline model. Specify the model you designed and its input features. Justify why you chose these features and their relevance to your model's predictions.
For this question, we will have to develop, train, and improve the model.
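A minimal sketch of the baseline-versus-improved comparison is shown below on synthetic data. The features `distance_km` and `n_signals`, the data-generating formula, and the train/test split are assumptions for illustration; the actual features should come from the Travel Times and traffic light datasets.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Synthetic stand-in data with two hypothetical features.
distance_km = rng.uniform(1, 10, n)
n_signals = rng.integers(0, 20, n).astype(float)
travel_time = 2.0 * distance_km + 0.5 * n_signals + rng.normal(0, 0.5, n)

# Hold out the last 50 rows for evaluation.
train, test = slice(0, 150), slice(150, n)


def fit_predict(f_train, y_train, f_test):
    """Fit least-squares linear regression with intercept, predict on f_test."""
    a_train = np.column_stack([np.ones(len(f_train)), f_train])
    w, *_ = np.linalg.lstsq(a_train, y_train, rcond=None)
    a_test = np.column_stack([np.ones(len(f_test)), f_test])
    return a_test @ w


def rmse(actual, predicted):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Baseline: distance only. Improved: distance plus signal count.
base_f = distance_km.reshape(-1, 1)
full_f = np.column_stack([distance_km, n_signals])

rmse_base = rmse(
    travel_time[test],
    fit_predict(base_f[train], travel_time[train], base_f[test]),
)
rmse_full = rmse(
    travel_time[test],
    fit_predict(full_f[train], travel_time[train], full_f[test]),
)
print(rmse_base, rmse_full)
```

Reporting both errors on the same held-out rows is what justifies the added feature: if the improved model's error is not lower, the feature is not contributing to the predictions.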
Attachment:- discussions_dev.rar