Reference no: EM133066256
UFCFR5-15-3 Data Management Fundamentals - University of the West of England
Learning Outcome 1: Learn to model, cleanse, normalize, shard, map, query and analyze substantial real-world big data (230MB+);
Learning Outcome 2: Understand the data cleansing, normalization and sharding processes by writing PYTHON scripts to process and convert the data, first to (cleansed) CSV and then to (normalized) SQL;
Learning Outcome 3: Design and implement a relational (MySQL) database and then write a PYTHON script to pipe (import) the SQL into the appropriate tables, ensuring all referential integrity constraints are met.
Learning Outcome 4: Construct and implement a set of SQL queries to extract data using various filters and constraints.
Map (forward engineer) the data to a NoSQL database of your choice (MongoDB, BaseX, Couchbase, ArangoDB, etc.).
Learning Outcome 5: Write a short, reflective report on the learning outcomes you have achieved. Get exposure to and learn the use of a range of data-oriented technologies (databases, Python and SQL).
Context: Measuring Air Quality
Levels of various airborne pollutants such as Nitrogen Monoxide (NO), Nitrogen Dioxide (NO2) and particulate matter (also called particle pollution) are all major contributors to the measure of overall air quality.
For instance, NO2 is measured in micrograms per cubic metre of air (µg/m³). A microgram (µg) is one millionth of a gram. A concentration of 1 µg/m³ means that one cubic metre of air contains one microgram of pollutant.
To protect our health, the UK Government sets two air quality objectives for NO2 in its Air Quality Strategy:
1. The hourly objective, which is the concentration of NO2 in the air, averaged over a period of one hour.
2. The annual objective, which is the concentration of NO2 in the air, averaged over a period of a year.
The following table shows the colour encoding and the levels for Objective 1 above (the hourly mean), as adopted in the UK.
Index | Band      | µg/m³
------+-----------+------------
  1   | Low       | 0-67
  2   | Low       | 68-134
  3   | Low       | 135-200
  4   | Moderate  | 201-267
  5   | Moderate  | 268-334
  6   | Moderate  | 335-400
  7   | High      | 401-467
  8   | High      | 468-534
  9   | High      | 535-600
 10   | Very High | 601 or more
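As a brief illustration (not one of the assignment tasks), the banding can be expressed as a simple Python lookup, with the thresholds taken directly from the table above:

def no2_index(ug_m3):
    """Return the UK index (1-10) for an hourly mean NO2 level in µg/m³."""
    bounds = [67, 134, 200, 267, 334, 400, 467, 534, 600]  # upper bounds for indices 1-9
    for index, upper in enumerate(bounds, start=1):
        if ug_m3 <= upper:
            return index
    return 10  # 601 or more

def band(index):
    """Return the colour band for an index value."""
    return ('Low' if index <= 3 else
            'Moderate' if index <= 6 else
            'High' if index <= 9 else 'Very High')

print(no2_index(250), band(no2_index(250)))  # -> 4 Moderate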
The Input Data
The following ZIP file provides data ranging from 2004 to 03 February 2021 taken from 18 monitoring stations in and around Bristol.
Monitors come and go and may suffer down times, so the data isn't complete for all stations at all times.
Note the following:
There are 18 stations (monitors):

188 => 'AURN Bristol Centre'
203 => 'Brislington Depot'
206 => 'Rupert Street'
209 => 'IKEA M32'
213 => 'Old Market'
215 => 'Parson Street School'
228 => 'Temple Meads Station'
270 => 'Wells Road'
271 => 'Trailer Portway P&R'
375 => 'Newfoundland Road Police Station'
395 => "Shiner's Garage"
452 => 'AURN St Pauls'
447 => 'Bath Road'
459 => 'Cheltenham Road \ Station Road'
463 => 'Fishponds Road'
481 => 'CREATE Centre Roof'
500 => 'Temple Way'
501 => 'Colston Avenue'
Each line represents one reading from a specific detector. Detectors take one reading every hour. If you examine the file using a programming editor (Notepad++ can handle the job), you can see that the first row gives the headers and that there are another 1,408,379 (1.4 million+) rows (lines). There are 23 data items (columns) per line.
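Before writing the main scripts, a few lines of Python will confirm those figures (the filename here is a placeholder; adjust it, and the delimiter, to match the extracted file):

import csv

with open('bristol-air-quality-data.csv', newline='') as f:  # placeholder filename
    reader = csv.reader(f)
    header = next(reader)
    print(len(header), 'columns:', header)      # expect 23
    print(sum(1 for _ in reader), 'data rows')  # expect 1408379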
Task 1: Crop, Cleanse and Refactor the Data
Design & write appropriate PYTHON scripts to carry out the following.
a. Crop the file to delete any records before 00:00 on 1 Jan 2010 (Unix timestamp 1262304000).
b. Filter for and remove any dud records where there is no value for SiteID or there is a mismatch between SiteID and Location.
(This script should print the line number and the mismatched field values for each dud record.)
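A minimal sketch of the kind of script Task 1 calls for, assuming comma-delimited data with 'Date Time', 'SiteID' and 'Location' columns and 'YYYY-MM-DD HH:MM:SS' timestamps (all assumptions to be checked against the real header before use):

import csv
from datetime import datetime, timezone

CUTOFF = 1262304000  # 00:00 on 1 Jan 2010 as a Unix timestamp

STATIONS = {188: 'AURN Bristol Centre', 203: 'Brislington Depot'}  # ...plus the other 16 from the brief

with open('raw.csv', newline='') as src, open('cleaned.csv', 'w', newline='') as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for lineno, row in enumerate(reader, start=2):  # line 1 is the header
        when = datetime.strptime(row['Date Time'], '%Y-%m-%d %H:%M:%S')
        if when.replace(tzinfo=timezone.utc).timestamp() < CUTOFF:
            continue  # Task 1a: crop records before 1 Jan 2010
        site, loc = row['SiteID'], row['Location']
        if not site.isdigit() or STATIONS.get(int(site)) != loc:
            print(f'dud record, line {lineno}: SiteID={site!r} Location={loc!r}')
            continue  # Task 1b: missing SiteID or SiteID/Location mismatch
        writer.writerow(row)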
Task 2: Create and Implement a Normalized Database.
a. Use MySQL Workbench or any other tool to create an ER model in third normal form to hold the given data.
b. Use the forward engineer feature of MySQL Workbench to generate the SQL schema and implement the database (pollution-db).
(If this does not work for you, e.g. because of MySQL Workbench configuration issues, you can use phpMyAdmin within XAMPP to create the tables by hand. You can then use the export feature to extract the SQL.)
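As a hedged sketch only, one plausible third-normal-form decomposition is a station table plus a readings table keyed on (site_id, date_time). The DDL below, piped from Python via mysql-connector-python, shows the shape; the table and column names are assumptions, not the required design:

import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(host='localhost', user='root', password='...')
cur = conn.cursor()
cur.execute('CREATE DATABASE IF NOT EXISTS `pollution-db`')
cur.execute('USE `pollution-db`')
cur.execute("""
    CREATE TABLE station (
        site_id  INT PRIMARY KEY,
        location VARCHAR(64) NOT NULL UNIQUE
    )""")
cur.execute("""
    CREATE TABLE reading (
        site_id   INT NOT NULL,
        date_time DATETIME NOT NULL,
        nox   DECIMAL(8,2),
        no2   DECIMAL(8,2),
        pm25  DECIMAL(8,2),
        vpm25 DECIMAL(8,2),
        -- ...one column per remaining measurement in the CSV...
        PRIMARY KEY (site_id, date_time),
        FOREIGN KEY (site_id) REFERENCES station (site_id)
    )""")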
Task 3: Write Python scripts to populate the database & generate SQL.
a. Design and write a PYTHON script (populate.py) that takes the cleaned CSV file as input, creates a new database instance (POLLUTION-DB2) and populates it.
b. Create a PYTHON script (INSERT-100.PY) that generates a SQL file (INSERT-100.SQL) holding the first 100 inserts to the readings table.
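A hedged sketch of populate.py under the same assumed schema: stations are inserted first so that the readings' foreign keys resolve, and the first 100 readings inserts are also echoed to INSERT-100.SQL (the brief asks for that as a separate script, INSERT-100.PY; it is folded in here only for brevity):

import csv
import mysql.connector

conn = mysql.connector.connect(host='localhost', user='root', password='...')
cur = conn.cursor()
cur.execute('CREATE DATABASE IF NOT EXISTS `pollution-db2`')
cur.execute('USE `pollution-db2`')
# ...create the station and reading tables as in the Task 2 sketch...

STATIONS = {188: 'AURN Bristol Centre', 203: 'Brislington Depot'}  # ...plus the other 16
for site_id, location in STATIONS.items():
    cur.execute('INSERT INTO station VALUES (%s, %s)', (site_id, location))

sql = ('INSERT INTO reading (site_id, date_time, nox, no2, pm25, vpm25) '
       'VALUES (%s, %s, %s, %s, %s, %s)')
with open('cleaned.csv', newline='') as f, open('INSERT-100.SQL', 'w') as out:
    for n, row in enumerate(csv.DictReader(f)):
        params = (row['SiteID'], row['Date Time'], row['NOx'],
                  row['NO2'], row['PM2.5'], row['VPM2.5'])
        cur.execute(sql, params)
        if n < 100:
            # crude literal quoting, adequate for a demonstration dump
            out.write(sql % tuple(repr(p) for p in params) + ';\n')
conn.commit()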
Task 4: Design, Write and Run SQL Queries.
Write and implement (test run) the following SQL queries:
a. Return the date/time, station name and the highest recorded value of nitrogen oxide (NOx) found in the dataset for the year 2019.
b. Return the mean values of PM2.5 (particulate matter <2.5 micron diameter) & VPM2.5 (volatile particulate matter <2.5 micron diameter) for each station for the year 2019, for readings taken at or near 08:00 hours (peak traffic intensity).
c. Extend the previous query to show these values for all stations in the years 2010 to 2019.
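By way of illustration, query (a) could be run from Python against the assumed schema like this (table and column names may differ in your design):

import mysql.connector

conn = mysql.connector.connect(host='localhost', user='root',
                               password='...', database='pollution-db2')
cur = conn.cursor()
# Query (a): the highest NOx reading of 2019, with its timestamp and station name.
cur.execute("""
    SELECT r.date_time, s.location, r.nox
    FROM reading r
    JOIN station s ON s.site_id = r.site_id
    WHERE r.date_time >= '2019-01-01' AND r.date_time < '2020-01-01'
    ORDER BY r.nox DESC
    LIMIT 1
""")
print(cur.fetchone())

Query (b) uses the same join with AVG(r.pm25) and AVG(r.vpm25), an HOUR(r.date_time) = 8 filter and GROUP BY s.location; query (c) widens the date range to 2010-2019 and adds YEAR(r.date_time) to both the SELECT list and the GROUP BY.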
Task 5: Model, implement and query a selected NoSQL database.
Model the data for a specific monitor (station) using a NoSQL data model (key-value, XML or graph), implement it in the selected database type/product, and pipe or import the data.
You can select any of the seven databases listed below, but if you wish you can select one not currently on the list (after confirmation from the tutor).
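If MongoDB were the chosen product, a minimal pymongo sketch of the import step might look like this (database, collection and field names are all assumptions, and the mongoimport command-line tool is an equally valid route):

import csv
from pymongo import MongoClient  # pip install pymongo

client = MongoClient('mongodb://localhost:27017')
readings = client['pollution']['station_501']  # e.g. Colston Avenue only

with open('cleaned.csv', newline='') as f:
    docs = [row for row in csv.DictReader(f) if row['SiteID'] == '501']
readings.insert_many(docs)  # one document per hourly reading
print(readings.count_documents({}), 'documents imported')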
Task 6: Reflective Report
A short report in Markdown format (<800 words) reflecting on the assignment, the problems encountered and the solutions found.
In addition, you should discuss and outline some of the Python tools and libraries that could be used to visualize this data. Which maps/charts, with what content?
You should also briefly outline the Learning Outcomes you have managed to achieve in undertaking this Assignment.
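As one concrete example for that discussion, pandas with matplotlib can chart a pollutant time series in a few lines (column names are assumed, as before); seaborn, plotly and folium (for mapping the station locations) would also be worth mentioning:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('cleaned.csv', parse_dates=['Date Time'])
station = df[df['SiteID'] == 501].set_index('Date Time')  # assumes SiteID parses as int
station['NO2'].resample('M').mean().plot(title='Monthly mean NO2, Colston Avenue')
plt.ylabel('µg/m³')
plt.show()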
Attachment: Data Management Fundamentals.rar