MMI223995 Big Data Platforms Assignment

Reference no: EM132959206

MMI223995 Big Data Platforms - Glasgow Caledonian University

DESIGN A DATA PIPELINE TO PROCESS AN OPEN DATASET

There are many open datasets freely available relating to just about any area of interest or activity, including data about government, science, health, sport, music and so on. There are valuable insights to be gained from analysis of such data. Analyses can be performed using data pipelines which are built from platforms that ingest datasets and then transform and/or combine, store and query/visualise the data as required.

In this assignment you will consider the design and implementation of such a data pipeline, using platforms of the kinds that you have learned about in this module. The focus of the assignment is on the use of platforms capable of scaling to handle "big data", so your design should be based on the use of distributed platforms.

Your assignment consists of two parts:

A. Design report

You should first research possible datasets and select the data that you want to use as the basis for your assignment. A list of resources to help you find suitable data will be made available on GCU Learn (see Reading & Links), but you may make use of suitable data from any source that you find. You may want to try to find data related to an area that you have a personal interest in and knowledge about.

You should then proceed to devise and report on a high-level design for a data pipeline that could be used to perform your proposed analysis. The pipeline should include stages as appropriate for: ingest, ETL, storage, analysis/visualisation. The pipeline should be designed for deployment on a single cloud service provider, and the platforms for each stage should be deployable or available as managed services on that provider's infrastructure. You will need to research the offerings that are available for your chosen cloud provider.

This design should consider:

Overall concept
• The original format of the data (e.g. CSV, JSON) and illustration of the data schema.
• Source of the data, e.g. file, streaming. Even if your chosen dataset is only available as file(s), your design can consider a scenario in which that data is streamed, if this makes sense for your analysis.
• Any transformation to be applied to the data as ETL (Extract, Transform, Load).
• Potential analyses and/or visualisations to be performed. Given the focus of this module, I expect that analyses will be based on relatively simple filtering, projection and aggregation, rather than on ML (Machine Learning) algorithms, although there is no specific restriction on the analyses you can include.
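
The overall-concept items above can be illustrated with a minimal plain-Python sketch. It assumes a made-up CSV sample (the column names `station`, `date` and `temp_c` are hypothetical and not taken from any real dataset), infers a simple schema, applies one cleaning transformation, and then performs a filter, a projection and an aggregation. In an actual prototype these steps would more likely be expressed as Spark DataFrame operations; the standard library is used here only so that the sketch runs anywhere.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data; column names and values are illustrative only.
RAW_CSV = """station,date,temp_c
A,2021-01-01,3.5
A,2021-01-02,
B,2021-01-01,7.0
B,2021-01-02,9.0
"""

def infer_type(values):
    """Return 'float' if every non-empty value parses as a number, else 'string'."""
    non_empty = [v for v in values if v != ""]
    try:
        for v in non_empty:
            float(v)
    except ValueError:
        return "string"
    return "float" if non_empty else "string"

rows = list(csv.DictReader(io.StringIO(RAW_CSV)))

# Illustrate the schema: column name -> inferred type.
schema = {col: infer_type([r[col] for r in rows]) for col in rows[0]}

# Transform (the "T" in ETL): drop rows with a missing reading, cast to float.
clean = [
    {**r, "temp_c": float(r["temp_c"])}
    for r in rows
    if r["temp_c"] != ""
]

# Filter: keep only readings above 5 degrees.
warm = [r for r in clean if r["temp_c"] > 5.0]

# Projection: keep only the columns needed for the analysis.
projected = [(r["station"], r["temp_c"]) for r in clean]

# Aggregation: mean temperature per station.
totals = defaultdict(lambda: [0.0, 0])
for station, temp in projected:
    totals[station][0] += temp
    totals[station][1] += 1
mean_temp = {s: total / count for s, (total, count) in totals.items()}

print(schema)
print(mean_temp)
```

The threshold of 5 degrees and the choice of a mean are arbitrary; the point is only to show how schema, transformation and simple analyses relate to one another.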

Platforms
• The key components of the pipeline: for each component you should select a suitable big data platform (e.g. specific data store, file system, analytic engine) and describe the purpose of that component within your solution.
• Interaction/integration between components, e.g. storing from analytic engine to data store.
• Software and services that would need to be installed or provisioned and the process of doing so in each case.
• Implementation details, for example: file formats in cases where file system storage will be used; query languages/mechanisms to be used, etc.
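
As one illustration of the implementation details and component interaction listed above, the following sketch writes transformed records to file storage and reads them back for a query. The JSON Lines format, the file name `readings.jsonl` and the record fields are all assumptions chosen for illustration; a real design might instead use a columnar format such as Parquet on a distributed file system.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical output of an ETL stage; field names are illustrative only.
records = [
    {"station": "A", "temp_c": 3.5},
    {"station": "B", "temp_c": 7.0},
    {"station": "B", "temp_c": 9.0},
]

def save_jsonl(path, rows):
    """Analytic engine -> data store: write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

def load_jsonl(path):
    """Data store -> analytic engine: read the records back."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

with tempfile.TemporaryDirectory() as tmp:
    store_path = Path(tmp) / "readings.jsonl"  # stand-in for a storage location
    save_jsonl(store_path, records)
    loaded = load_jsonl(store_path)

# A simple query over the stored data: maximum reading for station B.
max_b = max(r["temp_c"] for r in loaded if r["station"] == "B")
print(max_b)
```

Keeping the save and load steps as separate functions makes the direction of each transfer explicit, which is the kind of detail the report is asked to describe.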

You should base your choices on the module content and on additional research, and you should justify your choices. You should include appropriate references. Marks will be awarded on the basis of depth, completeness and relevance of the content within each of the above areas. Your report should be submitted in the form of a Word or PDF document.

B. Prototype

You should implement a prototype that illustrates the processing stages required for your solution to part A, for example ETL, query, visualisation.

You should prepare your complete prototype in the form of a Databricks notebook making use of Apache Spark for data processing, and you should make use of markdown cells to document your work. The first markdown cell should contain a descriptive title for your prototype and your name and student number. It is suggested that you use Python as the programming language for your implementation, although Scala is an option on Databricks.

Each processing stage of your pipeline should be represented by one or more executable notebook cells. Storage within your pipeline may be represented by file storage in the Databricks filesystem or by in-memory data structures. Your comments at each point should explain the purpose of the processing and where it fits into the overall data pipeline. Your prototype should make clear where it illustrates data being transferred from a storage platform to an analytic platform or vice versa.
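
One possible way to organise such a prototype is sketched below in plain Python, with each function standing for what would be one notebook cell. The `store` dictionary, the function names and the sample records are all hypothetical; they stand in for a storage platform and for Spark processing, and are not a prescribed structure.

```python
# In-memory stand-in for a storage platform (an assumption for this sketch).
store = {}

def ingest():
    """Stage 1 (ingest): produce raw records; the values here are made up."""
    return ["a,1", "b,2", "a,3"]

def etl(raw_lines):
    """Stage 2 (ETL): parse each line into a (key, int) pair."""
    parsed = []
    for line in raw_lines:
        key, value = line.split(",")
        parsed.append((key, int(value)))
    return parsed

def write_to_store(name, rows):
    """Stage 3 (storage). Transfer: analytic platform -> storage platform."""
    store[name] = list(rows)

def read_from_store(name):
    """Transfer: storage platform -> analytic platform."""
    return list(store[name])

def analyse(rows):
    """Stage 4 (analysis): sum of values per key."""
    result = {}
    for key, value in rows:
        result[key] = result.get(key, 0) + value
    return result

# Each call below corresponds to what would be one notebook cell.
raw = ingest()
write_to_store("cleaned", etl(raw))
summary = analyse(read_from_store("cleaned"))
print(summary)
```

Routing every hand-off through `write_to_store` and `read_from_store` keeps the boundary between storage and analysis visible, even when the storage is only simulated in memory.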

Attachment:- Big Data Platforms.rar
