Reference no: EM132524700, Length: 3000 words
Learning outcome 1: Provide a broad overview of the general field of 'big data systems'
Learning outcome 2: Develop specialised knowledge in areas that demonstrate the interaction and synergy between ongoing.
Task
You will be given a dataset and a set of problem statements. Where possible (you will need to carefully explain any reasons for not supplying both solutions), you are required to implement the solution in both SQL (using either Hive or Impala) and Spark (using PySpark).
General instructions
You will follow a typical data analysis process:
1. Load / ingest the data to be analysed
2. Prepare / clean the data
3. Analyse the data
4. Visualise results / generate report
For steps 1, 2 and 3 you will use the virtual machine (and the software installed on it) that has been provided as part of this module. The data necessary for this assignment will be provided as a MySQL dump, which you will need to download onto the virtual machine and work with from there.
The virtual machine has a MySQL server running, and you will need to load the data into it. This may require an initial perusal of the dataset to eliminate any glaring issues. Once the dataset is loaded, you will be required to use Sqoop to get the data into Hadoop. Before you do any processing, you may dump equivalent CSV files to import into your PySpark version of the solution.
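As a rough illustration only (the file name and HDFS path below are assumptions, not part of the supplied dataset), one of the exported CSV files could be read into a PySpark DataFrame along these lines:

    from pyspark.sql import SparkSession

    # Start (or reuse) a Spark session for the assignment work.
    spark = SparkSession.builder.appName("medical-data-load").getOrCreate()

    # Read one of the CSV files dumped from MySQL. "patients.csv" and the
    # HDFS path are placeholders -- substitute the tables you actually exported.
    patients_df = spark.read.csv(
        "hdfs:///user/cloudera/medical/patients.csv",
        header=True,
        inferSchema=True,
    )

    patients_df.printSchema()
    patients_df.show(5)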
For the cleansing, preparation and analysis you will implement the solution twice (where possible): first in SQL using either Hive or Impala, and then in Spark using PySpark.
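Continuing from the loading sketch above, a minimal cleaning step in PySpark might look like the following; the column names used here are hypothetical placeholders, not columns from the supplied dataset:

    from pyspark.sql import functions as F

    # Hypothetical cleaning steps on the DataFrame loaded earlier.
    cleaned_df = (
        patients_df
        .dropDuplicates()                                  # remove exact duplicate rows
        .filter(F.col("age").isNotNull())                  # drop rows missing a key field
        .withColumn("age", F.col("age").cast("int"))       # enforce a numeric type
        .withColumn("admission_date",
                    F.to_date("admission_date", "yyyy-MM-dd"))
    )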
For the visualisation of the results you are free to use any tool that fulfils the requirements; this may be a tool you have learned about, such as Python's matplotlib, SAS or Qlik, or any other free, open-source tool you find suitable.
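For example, assuming the analysis results have been exported to a small CSV file (the file and column names below are placeholders), a simple chart could be produced with matplotlib roughly as follows:

    import pandas as pd
    import matplotlib.pyplot as plt

    # "results.csv", "category" and "count" are placeholders for whatever
    # result table your analysis actually produces.
    results = pd.read_csv("results.csv")
    results.plot(kind="bar", x="category", y="count", legend=False)
    plt.xlabel("Category")
    plt.ylabel("Count")
    plt.title("Example result chart")
    plt.tight_layout()
    plt.savefig("result_chart.png")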
To get more than a "Satisfactory" mark, you should implement a number of extra features. These include, but are not limited to:
Creation of a single script that executes the entire process, from loading the supplied data to exporting the result data required for visualisation.
Creation of a single notebook that executes the entire process, from loading the supplied data to exporting the result data required for visualisation. (Two separate notebooks, one for loading the data and doing the SQL analysis using either Hive or Impala, and a second for the PySpark part, would also count under this option.)
Implementing the Spark part on a real cluster, using either Databricks Community Edition, Google Cloud Platform, or any other cloud provider you find suitable.
Two separate implementations of the Spark part, one via DataFrames, the other via RDDs (note that using SQL directly in the Spark part will not count towards your mark); a sketch of this appears after this list.
Creation of extra visualisations presenting useful information based on your own exploration that is not covered by the other problem statements.
Extraction of statistical information from the data.
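As an indication of the DataFrame/RDD feature mentioned above, the sketch below performs the same aggregation twice, once with the DataFrame API and once with RDD transformations, and adds a basic descriptive statistic; the column names ("diagnosis", "age") are hypothetical and continue from the cleaning sketch earlier:

    from pyspark.sql import functions as F

    # DataFrame version: number of records per (hypothetical) diagnosis value.
    counts_df = cleaned_df.groupBy("diagnosis").agg(F.count("*").alias("n"))

    # RDD version of the same count, built from key/value pairs.
    counts_rdd = (
        cleaned_df.rdd
        .map(lambda row: (row["diagnosis"], 1))
        .reduceByKey(lambda a, b: a + b)
    )

    # One simple way to extract statistical information from a numeric column.
    cleaned_df.describe("age").show()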
Attachment: Analysis of medical data.rar