Overview of the general field of big data systems

Reference no: EM132524719

Learning outcome 1: Provide a broad overview of the general field of 'big data systems'.

Learning outcome 2: Developing specialised knowledge in areas that demonstrate the interaction and synergy between ongoing.

Task

You will be given a dataset and a set of problem statements. Where possible, you are required to implement each solution in both SQL (using either Hive or Impala) and Spark (using PySpark). If you do not supply both solutions for a problem, you will need to carefully explain why.

General instructions

You will follow a typical data analysis process:
1. Load / ingest the data to be analysed
2. Prepare / clean the data
3. Analyse the data
4. Visualise results / generate report

For steps 1, 2 and 3 you will use the virtual machine (and the software installed on it) that has been provided as part of this module. The data necessary for this assignment will be provided as a MySQL dump, which you will need to download onto the virtual machine and work with from there.

The virtual machine has a MySQL server running and you will need to load the data into the MySQL server. This may require an initial perusal of the dataset to eliminate any glaring issues. Once the dataset is loaded, you will be required to use Sqoop to get the data into Hadoop. Before you do any processing, you may dump equivalent CSV files to import into your PySpark version of the solution.
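The Sqoop step described above can be sketched as a small Python helper that assembles the command line. This is only an illustration under stated assumptions: the database name `medical`, the table name `patients`, the user `student` and the password file path are hypothetical placeholders, not values taken from the assignment. The options used (`--connect`, `--username`, `--password-file`, `--table`, `--hive-import`, `--num-mappers`) are standard `sqoop import` arguments.

```python
import shlex


def build_sqoop_import(database, table, username, password_file,
                       host="localhost", num_mappers=1):
    """Return the argument list for a `sqoop import` of one MySQL table into Hive.

    A single mapper is used by default so the import does not depend on the
    table having a primary key or on a --split-by column being chosen.
    """
    return [
        "sqoop", "import",
        "--connect", f"jdbc:mysql://{host}/{database}",
        "--username", username,
        "--password-file", password_file,  # file that holds the MySQL password
        "--table", table,
        "--hive-import",                   # create and load a matching Hive table
        "--num-mappers", str(num_mappers),
    ]


# All names below are hypothetical; replace them with those of your own setup.
cmd = build_sqoop_import("medical", "patients", "student",
                         "/user/student/.mysql-pass")
print(shlex.join(cmd))
```

On the virtual machine, passing the list to `subprocess.run(cmd, check=True)` would run the import. Building the command as a list rather than a single string avoids shell quoting problems.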

For the cleansing, preparation and analysis you will implement the solution twice (where possible): first in SQL using either Hive or Impala, and then in Spark using PySpark.

For the visualisation of the results you are free to use any tool that fulfils the requirements. This can be a tool you have learned about, such as Python's matplotlib, SAS or Qlik, or any other free, open-source tool you find suitable.

To get more than a "Satisfactory" mark, a number of extra features should be implemented. Features include, but are not limited to:

Creation of a single script that executes the entire process, from loading the supplied data to exporting the result data required for visualisation.

Creation of a single notebook that executes the entire process, from loading the supplied data to exporting the result data required for visualisation. (Creating two separate notebooks, one for loading the data and doing the SQL analysis using either Hive or Impala, and a second for the PySpark part, would count under this option.)

Implementing the Spark part on a real cluster, using Databricks Community Edition, Google Cloud Platform, or any other cloud provider you find suitable.

Two separate implementations of the Spark part, one via dataframes, the other via RDDs (note that using SQL directly in the Spark part will not count towards your mark).
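The difference between the two Spark styles can be sketched with one aggregation written both ways: a per-group mean. This is a minimal sketch, not a solution to any problem statement. The column names used in the usage note (`diagnosis`, `cost`) are hypothetical, and the input is assumed to be already cleaned, with no nulls in the value column. PySpark is imported inside the DataFrame function so the pure helper functions can be read and tested on their own.

```python
def seq_op(acc, value):
    """Fold one value into a (sum, count) accumulator within a partition."""
    return (acc[0] + value, acc[1] + 1)


def comb_op(a, b):
    """Merge two (sum, count) accumulators coming from different partitions."""
    return (a[0] + b[0], a[1] + b[1])


def mean_by_key_dataframe(df, key_col, value_col):
    """DataFrame version: groupBy and avg through the DataFrame API, no SQL strings."""
    from pyspark.sql import functions as F  # requires PySpark to be installed
    return df.groupBy(key_col).agg(F.avg(value_col).alias("mean_" + value_col))


def mean_by_key_rdd(pairs):
    """RDD version: `pairs` is an RDD of (key, numeric value) tuples."""
    sums_and_counts = pairs.aggregateByKey((0.0, 0), seq_op, comb_op)
    return sums_and_counts.mapValues(lambda acc: acc[0] / acc[1])
```

Given a DataFrame `df` with the hypothetical columns above, the two versions would be called as `mean_by_key_dataframe(df, "diagnosis", "cost")` and `mean_by_key_rdd(df.select("diagnosis", "cost").rdd.map(tuple))`. The RDD version carries a (sum, count) pair per key, so the mean is computed in a single pass over the data.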

Creation of extra visualisations that present useful information from your own exploration of the data and are not covered by the problem statements.

Extraction of statistical information from the data.

Attachment: Analysis of medical data.rar

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd