50194 Big Data Tools and Techniques Assignment

Reference no: EM132524700, Length: 3000 words

Learning outcome 1: Provide a broad overview of the general field of ‘big data systems’.

Learning outcome 2: Developing specialised knowledge in areas that demonstrate the interaction and synergy between ongoing.

Task

You will be given a dataset and a set of problem statements. Where possible, you are required to implement the solution in both SQL (using either Hive or Impala) and Spark (using PySpark). If you do not supply both solutions, you will need to explain the reasons carefully.

General instructions

You will follow a typical data analysis process:
1. Load / ingest the data to be analysed
2. Prepare / clean the data
3. Analyse the data
4. Visualise results / generate report

For steps 1, 2 and 3 you will use the virtual machine (and the software installed on it) that has been provided as part of this module. The data for this assignment will be provided as a MySQL dump, which you will need to download onto the virtual machine and work with from there.

The virtual machine has a MySQL server running and you will need to load the data into the MySQL server. This may require an initial perusal of the dataset to eliminate any glaring issues. Once the dataset is loaded, you will be required to use Sqoop to get the data into Hadoop. Before you do any processing, you may dump equivalent CSV files to import into your PySpark version of the solution.
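The load and ingest steps described above could be scripted roughly as follows. This is a minimal sketch, not a prescribed solution: the database name, user, file paths and table name are hypothetical placeholders, and it assumes the `mysql` and `sqoop` command-line clients are on the virtual machine's PATH. Commands are only printed unless `dry_run=False` is passed.

```python
import shlex
import subprocess

# Hypothetical placeholders: adjust to the actual VM setup and dataset.
DB_NAME = "medical"
DB_USER = "student"
DUMP_FILE = "medical_dump.sql"
PASSWORD_FILE = "/user/student/mysql.password"  # file Sqoop reads the password from


def mysql_load_command(db_name, user):
    """Build the command that replays a dump into MySQL (the dump goes in on stdin).

    A bare --password makes the client prompt for the password.
    """
    return ["mysql", f"--user={user}", "--password", db_name]


def sqoop_import_command(db_name, user, password_file, table, hive_import=True):
    """Build a Sqoop command that copies one MySQL table into Hadoop."""
    cmd = [
        "sqoop", "import",
        "--connect", f"jdbc:mysql://localhost/{db_name}",
        "--username", user,
        "--password-file", password_file,
        "--table", table,
        "--num-mappers", "1",  # a single mapper avoids needing a split column
    ]
    if hive_import:
        cmd.append("--hive-import")  # also create and load a Hive table
    return cmd


def run(cmd, stdin_path=None, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if dry_run:
        return None
    if stdin_path is not None:
        with open(stdin_path, "rb") as dump:
            return subprocess.run(cmd, stdin=dump, check=True)
    return subprocess.run(cmd, check=True)
```

A dry run such as `run(mysql_load_command(DB_NAME, DB_USER), stdin_path=DUMP_FILE)` followed by `run(sqoop_import_command(DB_NAME, DB_USER, PASSWORD_FILE, "patients"))` prints both commands for inspection before anything is executed; `patients` is an assumed table name.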

For the cleansing, preparation and analysis you will implement the solution twice (where possible): first in SQL, using either Hive or Impala, and then in Spark, using PySpark.
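To illustrate what implementing one step twice might look like, the sketch below pairs a SQL cleaning statement with a PySpark DataFrame equivalent. The table and column names are assumptions, since the dataset schema is not described here, and the cleaning rule itself (keep distinct rows whose key is not null) is only an example.

```python
def cleaning_sql(table, columns, key_column):
    """Return a Hive/Impala statement that keeps distinct rows with a non-null key."""
    column_list = ", ".join(columns)
    return (
        f"CREATE TABLE {table}_clean AS "
        f"SELECT DISTINCT {column_list} "
        f"FROM {table} "
        f"WHERE {key_column} IS NOT NULL"
    )


def clean_dataframe(df, columns, key_column):
    """PySpark DataFrame equivalent of cleaning_sql.

    `df` is expected to be a pyspark.sql.DataFrame; only DataFrame methods
    are used, so no SQL string is passed to Spark.
    """
    return (
        df.select(*columns)
        .dropna(subset=[key_column])  # WHERE key_column IS NOT NULL
        .dropDuplicates()             # SELECT DISTINCT
    )
```

For example, `cleaning_sql("patients", ["id", "age"], "id")` produces a statement that can be run from the Hive or Impala shell, while `clean_dataframe(df, ["id", "age"], "id")` applies the same rule to a DataFrame loaded from the CSV export.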

For the visualisation of the results you are free to use any tool that fulfils the requirements. This can be a tool you have learned about, such as Python's matplotlib, SAS or Qlik, or any other free, open-source tool you find suitable.
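If matplotlib is chosen, the visualisation step could start from a result file exported by the analysis. The sketch below assumes a CSV with one label column and one numeric column; the column names, title and file names are hypothetical, and matplotlib is imported lazily so the loader can be used without it.

```python
import csv


def load_counts(csv_path, label_column, value_column):
    """Read labels and numeric values from an exported result CSV."""
    labels, values = [], []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            labels.append(row[label_column])
            values.append(float(row[value_column]))
    return labels, values


def plot_bar(labels, values, title, out_path):
    """Save a bar chart to out_path (requires matplotlib to be installed)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed on the VM
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(labels, values)
    ax.set_title(title)
    ax.set_ylabel("count")
    ax.tick_params(axis="x", labelrotation=45)
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)
```

A call such as `plot_bar(*load_counts("results.csv", "diagnosis", "total"), "Cases per diagnosis", "cases.png")` would then write the chart to disk.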

To get more than a "Satisfactory" mark, a number of extra features should be implemented. Features include, but are not limited to:

Creation of a single script that executes the entire process of loading the supplied data to exporting the result data required for visualisation.

The creation of a single notebook that executes the entire process of loading the supplied data to exporting the result data required for visualisation. (The creation of two separate notebooks, one for loading data and doing the SQL analysis using either Hive or Impala, and a second notebook for the PySpark part would count under this option).

Implementing the Spark part using a real cluster either using Databricks Community Edition or Google Cloud Platform, or any other cloud provider you may find suitable.

Two separate implementations of the Spark part, one via dataframes, the other via RDDs (note that using SQL directly in the Spark part will not count towards your mark).

Creation of extra visualisations presenting useful information based on your own exploration which is not covered by the other problem statements.

Extraction of statistical information from the data.
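For the extra feature that asks for two Spark implementations, one via dataframes and one via RDDs, the difference can be shown on a simple "count rows per value" problem. This is a sketch under assumptions: the column name is hypothetical, and for brevity the RDD is obtained from the DataFrame's `.rdd` attribute, whereas a fuller RDD solution might read and parse the file through the SparkContext instead.

```python
from operator import add


def count_by_column_df(df, column):
    """DataFrame API: group rows by a column and count each group."""
    return df.groupBy(column).count()


def count_by_column_rdd(rdd, column):
    """RDD API: emit a (key, 1) pair per row, then sum the ones per key."""
    return rdd.map(lambda row: (row[column], 1)).reduceByKey(add)


def demo(csv_path, column):
    """Run both versions on a CSV export and return their results as dicts."""
    from pyspark.sql import SparkSession  # lazy import; requires PySpark

    spark = SparkSession.builder.appName("df-vs-rdd").getOrCreate()
    df = spark.read.csv(csv_path, header=True, inferSchema=True)

    by_df = {r[column]: r["count"] for r in count_by_column_df(df, column).collect()}
    by_rdd = dict(count_by_column_rdd(df.rdd, column).collect())

    spark.stop()
    return by_df, by_rdd
```

Comparing the two dictionaries returned by `demo` is a quick check that both implementations agree before the results are exported for visualisation.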

Attachment: Analysis of medical data.rar


All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd