50194 Big Data Tools and Techniques Assignment

Reference no: EM132524700, Length: 3000 words

Learning outcome 1: Provide a broad overview of the general field of 'big data systems'.

Learning outcome 2: Developing specialised knowledge in areas that demonstrate the interaction and synergy between ongoing.

Task

You will be given a dataset and a set of problem statements. Where possible, you are required to implement each solution in both SQL (using either Hive or Impala) and Spark (using PySpark). If you do not supply both solutions, you will need to explain the reasons carefully.

General instructions

You will follow a typical data analysis process:
1. Load / ingest the data to be analysed
2. Prepare / clean the data
3. Analyse the data
4. Visualise results / generate report

For steps 1, 2 and 3 you will use the virtual machine (and the software installed on it) that has been provided as part of this module. The data necessary for this assignment will be provided in MySQL dump format, which you will need to download onto the virtual machine and work with from there.

The virtual machine has a MySQL server running and you will need to load the data into the MySQL server. This may require an initial perusal of the dataset to eliminate any glaring issues. Once the dataset is loaded, you will be required to use Sqoop to get the data into Hadoop. Before you do any processing, you may dump equivalent CSV files to import into your PySpark version of the solution.
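The load path described above (MySQL dump into the MySQL server, then Sqoop into Hadoop) could be sketched as follows. This is a minimal sketch, not the module's prescribed method: the database name `medical`, the table name `patients`, the user `student`, the dump file name and the HDFS target directory are all hypothetical placeholders. The commands are only assembled, not executed, so they can be inspected before being run on the virtual machine.

```python
import shlex


def mysql_load_command(database, dump_file, user):
    """Return a shell command that loads a MySQL dump into an existing database.

    -p makes the mysql client prompt for the password.
    """
    return "mysql -u {} -p {} < {}".format(
        shlex.quote(user), shlex.quote(database), shlex.quote(dump_file)
    )


def sqoop_import_args(database, table, user, target_dir, host="localhost"):
    """Return the argument list for a Sqoop import of one MySQL table into HDFS.

    All names passed in are placeholders chosen by the caller.
    """
    return [
        "sqoop", "import",
        "--connect", "jdbc:mysql://{}/{}".format(host, database),
        "--username", user,
        "-P",                             # prompt for the password
        "--table", table,
        "--target-dir", target_dir,       # HDFS directory for the output files
        "--fields-terminated-by", ",",    # comma-delimited text output
        "--num-mappers", "1",             # a single mapper needs no split column
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(mysql_load_command("medical", "dump.sql", "student"))
    args = sqoop_import_args("medical", "patients", "student", "/user/student/patients")
    print(" ".join(shlex.quote(a) for a in args))
```

Printing the commands first makes it easy to check names and paths; the same argument list could later be passed to `subprocess.run` from a single end-to-end script.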

For the cleansing, preparation and analysis you will implement the solution twice (where possible): first in SQL using either Hive or Impala, and then in Spark using PySpark.

For the visualisation of the results you are free to use any tool that fulfils the requirements. This can be a tool you have learned about, such as Python's matplotlib, SAS or Qlik, or any other free, open-source tool you find suitable.

To get more than a "Satisfactory" mark, a number of extra features should be implemented. Features include, but are not limited to:

Creation of a single script that executes the entire process, from loading the supplied data to exporting the result data required for visualisation.

Creation of a single notebook that executes the entire process, from loading the supplied data to exporting the result data required for visualisation. (Creating two separate notebooks, one for loading the data and doing the SQL analysis using either Hive or Impala, and a second for the PySpark part, would count under this option.)

Implementing the Spark part on a real cluster, using Databricks Community Edition, Google Cloud Platform, or any other cloud provider you find suitable.

Two separate implementations of the Spark part, one via dataframes, the other via RDDs (note that using SQL directly in the Spark part will not count towards your mark).

Creation of extra visualisations presenting useful information, based on your own exploration, that is not covered by the other problem statements.

Extraction of statistical information from the data.
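For the last feature above, a first pass at summary statistics could be taken directly from one of the dumped CSV files with Python's standard library, before the same figures are reproduced in Hive or Impala and in PySpark. This is a minimal sketch under stated assumptions: the column name `age` and the sample rows are made up for illustration, and the real column names depend on the supplied dataset.

```python
import csv
import io
import statistics


def column_summary(csv_file, column):
    """Summarise one numeric column of a CSV file that has a header row.

    Blank or non-numeric cells are skipped and counted as missing.
    At least two numeric values are needed for the standard deviation.
    """
    values = []
    missing = 0
    for row in csv.DictReader(csv_file):
        try:
            values.append(float(row[column]))
        except (TypeError, ValueError):
            missing += 1
    return {
        "count": len(values),
        "missing": missing,
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


if __name__ == "__main__":
    # Made-up rows standing in for a dumped CSV file.
    sample = io.StringIO("id,age\n1,34\n2,\n3,51\n4,n/a\n5,42\n")
    for name, value in column_summary(sample, "age").items():
        print(name, value)
```

Counting the skipped cells alongside the statistics keeps the cleaning step visible: a large `missing` figure is a signal to revisit data preparation before trusting the mean or standard deviation.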

Attachment: Analysis of medical data.rar
