Part 1 Overview
In this MP, we are going to explore AWS RDS, ElastiCache, and S3, and see how using ElastiCache can boost RDS performance. Specifically, you will build a storage service that supports read and write queries on data stored in a relational database. The read and write APIs are built using Lambda functions. Your goal is to demonstrate the performance benefits of using caching to speed up database queries.
You have also been provided an airlines dataset consisting of 5 CSV files, and you need to analyze this data and obtain meaningful insights from it. To do so, you will first perform a simplified version of ETL (Extract, Transform, Load) on these CSV files using AWS Glue. You will then select a subset of this data based on some constraints through AWS Athena using a SQL query. This filtered dataset will then be used inside Tableau for visualization and further insight.
Requirements
You need a valid AWS account and will be working with AWS Athena, Glue, Lambda, API Gateway, RDS, ElastiCache, and S3. You also need to be familiar with one of the following programming languages for implementing the Lambda functions: Python / JavaScript / Java / Go. While we will make every attempt to help irrespective of your chosen language, we can best assist with Python 3. AWS Athena and Glue are available on Educate accounts. You will also need to install the Tableau software on your local machine, learn enough SQL to perform the filtering in Athena, and get familiar with Tableau. Please use the us-east-1 region for your deployment.
Part 3 Populate database and set up Redis cluster
3.1 Amazon S3 setup
First, we set up an S3 bucket and upload the content used to populate the database.
Step 1: Set up an Amazon S3 bucket and upload mp11input.csv into it. Reference the tutorial:
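If you prefer to script this step rather than use the console, a minimal boto3 sketch is shown below; the bucket name is a placeholder and must be replaced with a globally unique name of your own.

    import boto3

    s3 = boto3.client("s3", region_name="us-east-1")

    # Placeholder bucket name; S3 bucket names must be globally unique.
    bucket = "my-mp11-input-bucket"
    s3.create_bucket(Bucket=bucket)  # no LocationConstraint is needed in us-east-1

    # Upload the input file that will later be loaded into Aurora.
    s3.upload_file("mp11input.csv", bucket, "mp11input.csv")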
3.2 Setup and populate Amazon Aurora
Step 2: Create an IAM policy and IAM role for Amazon Aurora to access Amazon S3
Next, create an IAM policy and an IAM role for the Amazon Aurora database. Use the following tutorials for reference.
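As a rough illustration of what the policy and role look like, here is a boto3 sketch; the policy/role names, bucket name, and scope are placeholders, and the console workflow from the tutorials works just as well. The trust policy lets the RDS service assume the role so Aurora can read from your bucket.

    import json
    import boto3

    iam = boto3.client("iam")
    bucket = "my-mp11-input-bucket"  # placeholder: the S3 bucket from Step 1

    # Policy granting read access to the bucket holding mp11input.csv.
    policy_doc = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
        }],
    }
    policy = iam.create_policy(
        PolicyName="mp11-aurora-s3-read",  # placeholder name
        PolicyDocument=json.dumps(policy_doc),
    )

    # Trust policy so the RDS (Aurora) service can assume the role.
    trust_doc = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "rds.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }
    iam.create_role(RoleName="mp11-aurora-s3-role",
                    AssumeRolePolicyDocument=json.dumps(trust_doc))
    iam.attach_role_policy(RoleName="mp11-aurora-s3-role",
                           PolicyArn=policy["Policy"]["Arn"])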
Step 3: Setup Amazon Aurora database
Create an Amazon Aurora database. Reference the two tutorials listed below, and use the following settings: make sure Publicly accessible is selected; for the DB cluster parameter group and DB parameter group, choose the ones created in Step 1 and Step 2; and attach the IAM role created in Step 2 to the database just created.
Step 4: Populate database using data in Amazon S3
Next, load the data from mp11input.csv, which is stored in the S3 bucket created in Step 1, into the database. Reference the tutorial:
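The statement you end up running is Aurora MySQL's LOAD DATA FROM S3. The sketch below shows roughly what it looks like when issued through pymysql; the table name, column list, endpoint, and credentials are all placeholders for whatever schema and setup you created, and you can equally run the same SQL from any MySQL client.

    import pymysql

    # Placeholders: use your Aurora endpoint, credentials, and database name.
    conn = pymysql.connect(host="<aurora-endpoint>", user="admin",
                           password="<password>", database="mp11")

    load_sql = """
    LOAD DATA FROM S3 's3://my-mp11-input-bucket/mp11input.csv'
    INTO TABLE characters            -- placeholder table name
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES                   -- drop the header row, if the CSV has one
    (hero, name, power, color, xp);  -- assumed column layout; adjust to your schema
    """

    with conn.cursor() as cur:
        cur.execute(load_sql)
    conn.commit()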
3.3 Create a Redis Cluster
For creating a Redis cluster, you can refer to the following tutorial:
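The console walkthrough in the tutorial is the expected route; purely as an illustration, the equivalent boto3 call looks roughly like the sketch below (cluster id and node type are placeholders, and you still need a subnet group and security group that make the cluster reachable from your Lambda function).

    import boto3

    elasticache = boto3.client("elasticache", region_name="us-east-1")

    # Minimal single-node Redis cluster; names and sizes are placeholders.
    elasticache.create_cache_cluster(
        CacheClusterId="mp11-redis",
        Engine="redis",
        CacheNodeType="cache.t2.micro",
        NumCacheNodes=1,
    )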
Part 4 Create Lambda function
4.1 Setting up Lambda function
Unlike some of the previous MPs, this MP allows you to import third-party libraries. If you choose to use Python, you will need to refer to the following tutorial for importing the two modules pymysql and redis:
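Once pymysql and redis are packaged with your deployment, a minimal handler skeleton might look like the sketch below. The endpoints and credentials are placeholders, and handle_read/handle_write are hypothetical helpers sketched in Section 4.2; creating the connections at module scope lets Lambda reuse them across warm invocations.

    import pymysql
    import redis

    # Placeholders: substitute your Aurora endpoint, credentials, and the
    # ElastiCache Redis primary endpoint.
    db = pymysql.connect(host="<aurora-endpoint>", user="admin",
                         password="<password>", database="mp11",
                         cursorclass=pymysql.cursors.DictCursor)
    cache = redis.Redis(host="<redis-endpoint>", port=6379)

    def lambda_handler(event, context):
        use_cache = event["USE_CACHE"] == "True"
        if event["REQUEST"] == "read":
            result = handle_read(event["SQLS"], use_cache)    # sketched in 4.2
        else:
            result = handle_write(event["SQLS"], use_cache)   # sketched in 4.2
        return {"statusCode": 200, "body": result}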
4.2 Read/Write API
There are different strategies for using a cache. You can refer to this document for more details:
Specifically, the autograder will send read requests with the following format:
{
    "USE_CACHE": "True",
    "REQUEST": "read",
    "SQLS": [
        "25"
    ]
}
We would like you to return the corresponding rows for the ids in the SQLS list (create a list, append the corresponding rows to it, and return that list at the end of the API call). If USE_CACHE is true, check whether Redis contains the id; if not, store the row returned by RDS in Redis and then return the result. Note that the choice of key and value in Redis is entirely up to you. The autograder will only ask you for the rows corresponding to the ids.
Our lazy loading test will check whether your implementation takes advantage of Redis and boosts the performance of querying data. You can test that on your own before submitting.
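A lazy-loading (cache-aside) sketch of the read path is shown below. It assumes a table named characters with an id column, and the db/cache connections from the skeleton in 4.1; all of these names are assumptions, not requirements.

    import json

    def handle_read(ids, use_cache):
        rows = []
        for row_id in ids:
            if use_cache:
                cached = cache.get(row_id)           # cache hit: skip the database
                if cached is not None:
                    rows.append(json.loads(cached))
                    continue
            with db.cursor() as cur:
                cur.execute("SELECT * FROM characters WHERE id = %s", (row_id,))
                row = cur.fetchone()
            if use_cache and row is not None:
                cache.set(row_id, json.dumps(row))   # lazy load: populate on a miss
            rows.append(row)
        return rows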
An example write request is as follows:
{
    "USE_CACHE": "True",
    "REQUEST": "write",
    "SQLS": [
        {
            "hero": "yes",
            "name": "fireman",
            "power": "fire",
            "color": "red",
            "xp": "10"
        },
        {
            "hero": "no",
            "name": "dogman",
            "power": "bark",
            "color": "brown",
            "xp": "50"
        }
    ]
}
For a write request, you will have to insert the new characters into Aurora. If USE_CACHE is true, we would like you to use the write-through strategy. Our autograder will check whether your write-through implementation improves read performance.
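A write-through sketch under the same assumptions as the read sketch (table name, column layout, and key scheme are yours to choose): each record is inserted into Aurora and, when USE_CACHE is true, immediately written to Redis as well, so a later read of that id can be served from the cache.

    import json

    def handle_write(records, use_cache):
        for rec in records:
            with db.cursor() as cur:
                cur.execute(
                    "INSERT INTO characters (hero, name, power, color, xp) "
                    "VALUES (%s, %s, %s, %s, %s)",
                    (rec["hero"], rec["name"], rec["power"], rec["color"], rec["xp"]),
                )
                new_id = cur.lastrowid            # id assigned by Aurora
            db.commit()
            if use_cache:
                # Write-through: update the cache at the same time as the database.
                rec_with_id = dict(rec, id=new_id)
                cache.set(str(new_id), json.dumps(rec_with_id))
        return "write success"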
Return a JSON message to the API call as follows:
{
    "statusCode": 200,
    "body": result
}
where result should be the rows of data when responding to a read API call, and "write success" when responding to a write API call.
Part 5 API setup
Set up an API endpoint for your Lambda function.
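Once the endpoint is deployed, you can sanity-check it with a small client script like the one below. The URL is a placeholder for the invoke URL API Gateway gives you, and the sketch assumes a POST method that passes the JSON body straight through to the Lambda event.

    import requests

    URL = "https://<api-id>.execute-api.us-east-1.amazonaws.com/<stage>"  # placeholder

    read_payload = {"USE_CACHE": "True", "REQUEST": "read", "SQLS": ["25"]}
    resp = requests.post(URL, json=read_payload)
    print(resp.status_code, resp.json())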
Part 7 Visualization Procedure
7.1 AWS Glue Setup
You can download the dataset (flights-dataset.zip, ~50 MB) from below. Once you download this dataset, you will have to upload the CSV files to an S3 bucket in your AWS account so that Glue can access them.
In order to run your ETL job on this S3 dataset, you first need to configure AWS Glue so that it has the metadata required to read from S3 and to create and monitor your jobs. You will then use a crawler to populate the Glue Data Catalog with the metadata for the data store (the S3 bucket where the flights CSV files are stored). Create a database within the Data Catalog and run the crawler; it will create a table under that database. You should manually verify the relevant properties and the schema to ensure that the crawler extracted the correct ones. The database and table that you created through AWS Glue will also be reflected in Athena.
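The console workflow above is the expected route; purely as an illustration, the same setup can be scripted with boto3 roughly as follows (the database, crawler, role, and S3 path names are placeholders, and the role needs the Glue service policy plus read access to the flights bucket).

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    # Placeholder names and ARNs; substitute your own.
    glue.create_database(DatabaseInput={"Name": "flights_db"})
    glue.create_crawler(
        Name="flights-crawler",
        Role="arn:aws:iam::<account-id>:role/<glue-service-role>",
        DatabaseName="flights_db",
        Targets={"S3Targets": [{"Path": "s3://<your-bucket>/flights/"}]},
    )
    glue.start_crawler(Name="flights-crawler")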
7.2 Query and Filtering through AWS Athena
You should be able to view the table created through Glue inside AWS Athena, showing a total of 479230 entries (with the header trimmed). You now need to come up with a SQL query for the following:
• Select all possible trips from SFO to JFK with 1 stop-over airport in between such that:
1. 60 minutes <= stop-over time <= 180 minutes. Here, the stop-over time is the difference between the arrival time of the first flight and the departure time of the second flight. Note that we are using arrival time and departure time (not scheduled arrival time or scheduled departure time). Also, note that the timings in the dataset are integers in HHMM format (i.e., 1730 means 5:30 PM). It is not necessary to convert the times to DateTime format; instead you can apply some simple calculations over the integer format (see the sketch after this list).
2. The first flight leaving from SFO as well as the second flight leaving from the stop-over airport should have the same DAY column value.
3. Both the flights on a given trip should belong to the same airline.
You need to output the following columns for each trip: Day, Airline, Origin Airport, Stop-over Airport, Destination Airport, Origin Departure Delay, Stop-over Arrival Delay, Stop-over Departure Delay, Destination Arrival Delay. Note that there should be one row for each trip in the output. Please save the count of entries obtained from your query, as you will need to submit this number to the autograder.
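The stop-over constraint only needs integer arithmetic on the HHMM values. The small Python sketch below shows the conversion your SQL expression has to mirror; the function names are illustrative, not part of the assignment.

    def hhmm_to_minutes(hhmm: int) -> int:
        # 1730 -> 17 * 60 + 30 = 1050 minutes after midnight
        return (hhmm // 100) * 60 + (hhmm % 100)

    def valid_stopover(arrival_time: int, departure_time: int) -> bool:
        # Stop-over time = second flight's departure minus first flight's
        # arrival, both on the same DAY, expressed in minutes.
        wait = hhmm_to_minutes(departure_time) - hhmm_to_minutes(arrival_time)
        return 60 <= wait <= 180

    print(valid_stopover(1730, 1915))  # 105-minute stop-over -> True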
Save the output results from this query in a CSV file that will be used by Tableau for the tasks below.
7.3 Tableau Setup
As a student, you're eligible for a free Tableau Desktop license through Tableau for Students. In order to download and install Tableau, you need to first fill out this form, providing your .illinois.edu email address.
Next, you will receive an email on your email account with download links along with the product key. You will be using the product key during installation.
Note that Tableau Desktop can only be installed on MacOS (version >=10.13) or Windows. You may have to use virtual machines if you are using other OSes.
If you would like more info on how to use Tableau, please check out the Tableau Tutorial.You can also refer to the first 30 minutes of this video.
7.4 Tableau Visualization
Using the output CSV from Section 7.2, you need to create two visualizations.
In the Tableau window, click on the File menu and then New. Next, you need to connect to the CSV file. Once connected, create two worksheets, one for each of the tasks below, and save the results as described in each task.
1. Create a bar chart to show the count of flights between SFO and the different stop-over airports. Now save the data corresponding to this visualization as follows: right-click on the visualization, click "View Data", ensure that "Summary Data" is selected at the bottom, and finally click "Export All" to save the data into a file named "mp9-viz1.tsv" or "mp9-viz1.csv", depending on whether your file is tab or comma separated.
2. Create a scatter plot with Stop-over Departure Delay at the intermediate airport as the x-axis and Destination Arrival Delay at JFK as the y-axis. Now create two filters: one for Stop-over Departure Delay and another for Destination Arrival Delay. Set the range of the Stop-over Departure Delay filter from 200 to 260 and the Destination Arrival Delay filter from 110 to 220. Similar to the task above, save the data corresponding to this visualization as follows: right-click on the visualization, click "View Data", ensure that "Summary Data" is selected at the bottom, and finally click "Export All" to save the data into a file named "mp9-viz2.tsv" or "mp9-viz2.csv", depending on whether your file is tab or comma separated.