Reference no: EM132524700, Length: 3000 words
Learning outcome 1: Provide a broad overview of the general field of 'big data systems'
Learning outcome 2: Develop specialised knowledge in areas that demonstrate the interaction and synergy between ongoing.
Task
You will be given a dataset and a set of problem statements. Where possible (you will need to carefully explain any reasons for not supplying both solutions), you are required to implement the solution in both SQL (using either Hive or Impala) and Spark (using PySpark).
General instructions
You will follow a typical data analysis process:
1. Load / ingest the data to be analysed
2. Prepare / clean the data
3. Analyse the data
4. Visualise results / generate report
For steps 1, 2 and 3 you will use the virtual machine (and the software installed on it) that has been provided as part of this module. The data necessary for this assignment will be provided as a MySQL dump, which you will need to download onto the virtual machine and work with from there.
The virtual machine has a MySQL server running, and you will need to load the data into it. This may require an initial perusal of the dataset to eliminate any glaring issues. Once the dataset is loaded, you will be required to use Sqoop to get the data into Hadoop. Before you do any processing, you may dump equivalent CSV files to import into your PySpark version of the solution.
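As a rough illustration only (the file name and HDFS path below are assumptions, not part of the supplied dataset), one of the exported CSV files could be read into a PySpark DataFrame along these lines:

    from pyspark.sql import SparkSession

    # Start (or reuse) a Spark session for the assignment work.
    spark = SparkSession.builder.appName("medical-data-load").getOrCreate()

    # Read one of the CSV files dumped from MySQL. "patients.csv" and the
    # HDFS path are placeholders -- substitute the tables you actually exported.
    patients_df = spark.read.csv(
        "hdfs:///user/cloudera/medical/patients.csv",
        header=True,
        inferSchema=True,
    )

    patients_df.printSchema()
    patients_df.show(5)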
For the cleansing, preparation and analysis you will implement the solution twice (where possible): first in SQL using either Hive or Impala, and then in Spark using PySpark.
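Continuing from the loading sketch above, a minimal cleaning step in PySpark might look like the following; the column names used here are hypothetical placeholders, not columns from the supplied dataset:

    from pyspark.sql import functions as F

    # Hypothetical cleaning steps on the DataFrame loaded earlier.
    cleaned_df = (
        patients_df
        .dropDuplicates()                                  # remove exact duplicate rows
        .filter(F.col("age").isNotNull())                  # drop rows missing a key field
        .withColumn("age", F.col("age").cast("int"))       # enforce a numeric type
        .withColumn("admission_date",
                    F.to_date("admission_date", "yyyy-MM-dd"))
    )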
For the visualisation of the results you are free to use any tool that fulfils the requirements; this may be a tool you have learned about, such as Python's matplotlib, SAS or Qlik, or any other free, open-source tool you find suitable.
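For example, assuming the analysis results have been exported to a small CSV file (the file and column names below are placeholders), a simple chart could be produced with matplotlib roughly as follows:

    import pandas as pd
    import matplotlib.pyplot as plt

    # "results.csv", "category" and "count" are placeholders for whatever
    # result table your analysis actually produces.
    results = pd.read_csv("results.csv")
    results.plot(kind="bar", x="category", y="count", legend=False)
    plt.xlabel("Category")
    plt.ylabel("Count")
    plt.title("Example result chart")
    plt.tight_layout()
    plt.savefig("result_chart.png")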
To get more than a "Satisfactory" mark, you should implement a number of extra features. These include, but are not limited to:
Creation of a single script that executes the entire process, from loading the supplied data to exporting the result data required for visualisation.
Creation of a single notebook that executes the entire process, from loading the supplied data to exporting the result data required for visualisation. (Two separate notebooks, one for loading the data and doing the SQL analysis using either Hive or Impala, and a second for the PySpark part, would also count under this option.)
Implementing the Spark part on a real cluster, using either Databricks Community Edition, Google Cloud Platform, or any other cloud provider you find suitable.
Two separate implementations of the Spark part, one via DataFrames, the other via RDDs (note that using SQL directly in the Spark part will not count towards your mark); a sketch of this appears after this list.
Creation of extra visualisations presenting useful information based on your own exploration that is not covered by the other problem statements.
Extraction of statistical information from the data.
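As an indication of the DataFrame/RDD feature mentioned above, the sketch below performs the same aggregation twice, once with the DataFrame API and once with RDD transformations, and adds a basic descriptive statistic; the column names ("diagnosis", "age") are hypothetical and continue from the cleaning sketch earlier:

    from pyspark.sql import functions as F

    # DataFrame version: number of records per (hypothetical) diagnosis value.
    counts_df = cleaned_df.groupBy("diagnosis").agg(F.count("*").alias("n"))

    # RDD version of the same count, built from key/value pairs.
    counts_rdd = (
        cleaned_df.rdd
        .map(lambda row: (row["diagnosis"], 1))
        .reduceByKey(lambda a, b: a + b)
    )

    # One simple way to extract statistical information from a numeric column.
    cleaned_df.describe("age").show()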
Attachment: Analysis of medical data.rar