Reference no: EM133066256
UFCFR5-15-3 Data Management Fundamentals - University of the West of England
Learning Outcome 1: Learn to model, cleanse, normalize, shard, map, query and analyze substantial real-world big data (230MB+);
Learning Outcome 2: Understand the data cleansing, normalization and sharding processes by writing PYTHON scripts to process and convert the data, first to (cleansed) CSV and then to (normalized) SQL;
Learning Outcome 3: Design and implement a relational (MySQL) database and then write a PYTHON script to pipe (import) the SQL into the appropriate tables, ensuring all referential integrity constraints are met.
Learning Outcome 4: Construct and implement a set of SQL queries to extract data using various filters and constraints.
Map (forward engineer) the data to a NoSQL database of your choice (MongoDB, BaseX, Couchbase, ArangoDB, etc.).
Learning Outcome 5: Write a short, reflective report on the learning outcomes you have achieved. Get exposure to and learn the use of a range of data-oriented technologies (databases, Python and SQL).
Context: Measuring Air Quality
Levels of various airborne pollutants such as Nitrogen Monoxide (NO), Nitrogen Dioxide (NO2) and particulate matter (also called particle pollution) are all major contributors to the measure of overall air quality.
For instance, NO2 is measured in micrograms per cubic metre of air (µg/m³). A microgram (µg) is one millionth of a gram. A concentration of 1 µg/m³ means that one cubic metre of air contains one microgram of pollutant.
To protect our health, the UK Government sets two air quality objectives for NO2 in its Air Quality Strategy:
1. The hourly objective, which is the concentration of NO2 in the air, averaged over a period of one hour.
2. The annual objective, which is the concentration of NO2 in the air, averaged over a period of a year.
The following table shows the colour encoding and the levels for Objective 1 above (the hourly mean), as adopted in the UK.
Index | Band      | µg/m³
------+-----------+------------
  1   | Low       | 0-67
  2   | Low       | 68-134
  3   | Low       | 135-200
  4   | Moderate  | 201-267
  5   | Moderate  | 268-334
  6   | Moderate  | 335-400
  7   | High      | 401-467
  8   | High      | 468-534
  9   | High      | 535-600
 10   | Very High | 601 or more
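As a brief illustration (not one of the assignment tasks), the banding can be expressed as a simple Python lookup, with the thresholds taken directly from the table above:

def no2_index(ug_m3):
    """Return the UK index (1-10) for an hourly mean NO2 level in µg/m³."""
    bounds = [67, 134, 200, 267, 334, 400, 467, 534, 600]  # upper bounds for indices 1-9
    for index, upper in enumerate(bounds, start=1):
        if ug_m3 <= upper:
            return index
    return 10  # 601 or more

def band(index):
    """Return the colour band for an index value."""
    return ('Low' if index <= 3 else
            'Moderate' if index <= 6 else
            'High' if index <= 9 else 'Very High')

print(no2_index(250), band(no2_index(250)))  # -> 4 Moderate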
The Input Data
The following ZIP file provides data ranging from 2004 to 03 February 2021 taken from 18 monitoring stations in and around Bristol.
Monitors come and go and may suffer down times, so the data isn't complete for all stations at all times.
Note the following:
There are 18 stations (monitors):

188 => 'AURN Bristol Centre'
203 => 'Brislington Depot'
206 => 'Rupert Street'
209 => 'IKEA M32'
213 => 'Old Market'
215 => 'Parson Street School'
228 => 'Temple Meads Station'
270 => 'Wells Road'
271 => 'Trailer Portway P&R'
375 => 'Newfoundland Road Police Station'
395 => "Shiner's Garage"
452 => 'AURN St Pauls'
447 => 'Bath Road'
459 => 'Cheltenham Road \ Station Road'
463 => 'Fishponds Road'
481 => 'CREATE Centre Roof'
500 => 'Temple Way'
501 => 'Colston Avenue'
Each line represents one reading from a specific detector. Detectors take one reading every hour. If you examine the file using a programming editor (Notepad++ can handle the job), you can see that the first row gives the headers and that there are another 1,408,379 (1.4 million+) rows (lines). There are 23 data items (columns) per line.
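Before writing the main scripts, a few lines of Python will confirm those figures (the filename here is a placeholder; adjust it, and the delimiter, to match the extracted file):

import csv

with open('bristol-air-quality-data.csv', newline='') as f:  # placeholder filename
    reader = csv.reader(f)
    header = next(reader)
    print(len(header), 'columns:', header)      # expect 23
    print(sum(1 for _ in reader), 'data rows')  # expect 1408379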
Task 1: Crop, Cleanse and Refactor the Data
Design & write appropriate PYTHON scripts to carry out the following.
a. Crop the file to delete any records before 00:00 on 1 Jan 2010 (Unix timestamp 1262304000).
b. Filter for and remove any dud records where there is no value for SiteID or there is a mismatch between SiteID and Location.
(This script should print the line number and the mismatched field values for each dud record.)
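A minimal sketch of the kind of script Task 1 calls for, assuming comma-delimited data with 'Date Time', 'SiteID' and 'Location' columns and 'YYYY-MM-DD HH:MM:SS' timestamps (all assumptions to be checked against the real header before use):

import csv
from datetime import datetime, timezone

CUTOFF = 1262304000  # 00:00 on 1 Jan 2010 as a Unix timestamp

STATIONS = {188: 'AURN Bristol Centre', 203: 'Brislington Depot'}  # ...plus the other 16 from the brief

with open('raw.csv', newline='') as src, open('cleaned.csv', 'w', newline='') as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for lineno, row in enumerate(reader, start=2):  # line 1 is the header
        when = datetime.strptime(row['Date Time'], '%Y-%m-%d %H:%M:%S')
        if when.replace(tzinfo=timezone.utc).timestamp() < CUTOFF:
            continue  # Task 1a: crop records before 1 Jan 2010
        site, loc = row['SiteID'], row['Location']
        if not site.isdigit() or STATIONS.get(int(site)) != loc:
            print(f'dud record, line {lineno}: SiteID={site!r} Location={loc!r}')
            continue  # Task 1b: missing SiteID or SiteID/Location mismatch
        writer.writerow(row)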
Task 2: Create and Implement a Normalized Database.
a. Use MySQL Workbench or any other tool to create an ER model in third normal form to hold the given data.
b. Use the forward engineer feature of MySQL Workbench to generate the SQL schema and implement the database (pollution-db).
(If this does not work for you, e.g. because of MySQL Workbench configuration issues, you can use phpMyAdmin within XAMPP to create the tables by hand. You can then use the export feature to extract the SQL.)
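As a hedged sketch only, one plausible third-normal-form decomposition is a station table plus a readings table keyed on (site_id, date_time). The DDL below, piped from Python via mysql-connector-python, shows the shape; the table and column names are assumptions, not the required design:

import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(host='localhost', user='root', password='...')
cur = conn.cursor()
cur.execute('CREATE DATABASE IF NOT EXISTS `pollution-db`')
cur.execute('USE `pollution-db`')
cur.execute("""
    CREATE TABLE station (
        site_id  INT PRIMARY KEY,
        location VARCHAR(64) NOT NULL UNIQUE
    )""")
cur.execute("""
    CREATE TABLE reading (
        site_id   INT NOT NULL,
        date_time DATETIME NOT NULL,
        nox   DECIMAL(8,2),
        no2   DECIMAL(8,2),
        pm25  DECIMAL(8,2),
        vpm25 DECIMAL(8,2),
        -- ...one column per remaining measurement in the CSV...
        PRIMARY KEY (site_id, date_time),
        FOREIGN KEY (site_id) REFERENCES station (site_id)
    )""")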
Task 3: Write Python scripts to populate the database & generate SQL.
a. Design and write a PYTHON script (populate.py) that takes the cleaned CSV file as input, creates a new database instance (POLLUTION-DB2) and populates it.
b. Create a PYTHON script (INSERT-100.PY) that generates a SQL file (INSERT-100.SQL) holding the first 100 inserts to the readings table.
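A hedged sketch of populate.py under the same assumed schema: stations are inserted first so that the readings' foreign keys resolve, and the first 100 readings inserts are also echoed to INSERT-100.SQL (the brief asks for that as a separate script, INSERT-100.PY; it is folded in here only for brevity):

import csv
import mysql.connector

conn = mysql.connector.connect(host='localhost', user='root', password='...')
cur = conn.cursor()
cur.execute('CREATE DATABASE IF NOT EXISTS `pollution-db2`')
cur.execute('USE `pollution-db2`')
# ...create the station and reading tables as in the Task 2 sketch...

STATIONS = {188: 'AURN Bristol Centre', 203: 'Brislington Depot'}  # ...plus the other 16
for site_id, location in STATIONS.items():
    cur.execute('INSERT INTO station VALUES (%s, %s)', (site_id, location))

sql = ('INSERT INTO reading (site_id, date_time, nox, no2, pm25, vpm25) '
       'VALUES (%s, %s, %s, %s, %s, %s)')
with open('cleaned.csv', newline='') as f, open('INSERT-100.SQL', 'w') as out:
    for n, row in enumerate(csv.DictReader(f)):
        params = (row['SiteID'], row['Date Time'], row['NOx'],
                  row['NO2'], row['PM2.5'], row['VPM2.5'])
        cur.execute(sql, params)
        if n < 100:
            # crude literal quoting, adequate for a demonstration dump
            out.write(sql % tuple(repr(p) for p in params) + ';\n')
conn.commit()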
Task 4: Design, Write and Run SQL Queries.
Write and implement (test run) the following SQL queries:
a. Return the date/time, station name and the highest recorded value of nitrogen oxide (NOx) found in the dataset for the year 2019.
b. Return the mean values of PM2.5 (particulate matter <2.5 micron diameter) & VPM2.5 (volatile particulate matter <2.5 micron diameter) for each station for the year 2019, for readings taken at or near 08:00 hours (peak traffic intensity).
c. Extend the previous query to show these values for all stations in the years 2010 to 2019.
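By way of illustration, query (a) could be run from Python against the assumed schema like this (table and column names may differ in your design):

import mysql.connector

conn = mysql.connector.connect(host='localhost', user='root',
                               password='...', database='pollution-db2')
cur = conn.cursor()
# Query (a): the highest NOx reading of 2019, with its timestamp and station name.
cur.execute("""
    SELECT r.date_time, s.location, r.nox
    FROM reading r
    JOIN station s ON s.site_id = r.site_id
    WHERE r.date_time >= '2019-01-01' AND r.date_time < '2020-01-01'
    ORDER BY r.nox DESC
    LIMIT 1
""")
print(cur.fetchone())

Query (b) uses the same join with AVG(r.pm25) and AVG(r.vpm25), an HOUR(r.date_time) = 8 filter and GROUP BY s.location; query (c) widens the date range to 2010-2019 and adds YEAR(r.date_time) to both the SELECT list and the GROUP BY.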
Task 5: Model, implement and query a selected NoSQL database.
Model the data for a specific monitor (station) using a NoSQL data model (key-value, XML or graph), implement it in the selected database type/product, and pipe or import the data.
You can select any of the seven databases listed below, but if you wish you can select one not currently on the list (after confirmation from the tutor).
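If MongoDB were the chosen product, a minimal pymongo sketch of the import step might look like this (database, collection and field names are all assumptions, and the mongoimport command-line tool is an equally valid route):

import csv
from pymongo import MongoClient  # pip install pymongo

client = MongoClient('mongodb://localhost:27017')
readings = client['pollution']['station_501']  # e.g. Colston Avenue only

with open('cleaned.csv', newline='') as f:
    docs = [row for row in csv.DictReader(f) if row['SiteID'] == '501']
readings.insert_many(docs)  # one document per hourly reading
print(readings.count_documents({}), 'documents imported')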
Task 6: Reflective Report
A short report in Markdown format (<800 words) reflecting on the assignment, the problems encountered and the solutions found.
In addition, you should discuss and outline some of the Python tools and libraries that could be used to visualize this data. Which maps/charts, with what content?
You should also briefly outline the Learning Outcomes you have managed to achieve in undertaking this Assignment.
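As one concrete example for that discussion, pandas with matplotlib can chart a pollutant time series in a few lines (column names are assumed, as before); seaborn, plotly and folium (for mapping the station locations) would also be worth mentioning:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('cleaned.csv', parse_dates=['Date Time'])
station = df[df['SiteID'] == 501].set_index('Date Time')  # assumes SiteID parses as int
station['NO2'].resample('M').mean().plot(title='Monthly mean NO2, Colston Avenue')
plt.ylabel('µg/m³')
plt.show()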
Attachment: Data Management Fundamentals.rar