MMI223995 Big Data Platforms Assignment

Reference no: EM132598263

MMI223995 Big Data Platforms - Glasgow Caledonian University

DESIGN A DATA PIPELINE TO PROCESS AN OPEN DATASET

There are many open datasets freely available relating to just about any area of interest or activity, including data about government, science, health, sport, music and so on. There are valuable insights to be gained from analysis of such data. Analyses can be performed using data pipelines which are built from platforms that ingest datasets and then transform and/or combine, store and query/visualise the data as required.

In this assignment you will consider the design and implementation of such a data pipeline, using platforms of the kinds that you have learned about in this module. The focus of the assignment is on the use of platforms capable of scaling to handle "big data", so your design should be based on the use of distributed platforms.

A. Design report
You should first research possible datasets and select the data that you want to use as the basis for your assignment. A list of resources to help you find suitable data will be made available on GCU Learn (see Reading & Links), but you may make use of suitable data from any source that you find. You may want to try to find data related to an area that you have a personal interest in and knowledge about.

IMPORTANT: before proceeding you MUST get approval from me for your choice of dataset. You must email me with the following information, and await approval:
• Name/nature of the dataset(s)
• URL(s) from which dataset(s) can be downloaded
• A brief description of the purpose for which you propose to use the data

If the dataset you choose is not considered to be suitable, or has been chosen already by another student, you may be asked to find an alternative. I expect that each student will use a different dataset.

You should then proceed to devise and report on a high-level design for a data pipeline that could be used to perform your proposed analysis.

This design should consider:

Overall concept
• The original format of the data (e.g. CSV, JSON) and illustration of the data schema.
• Any transformations to be applied to the data as part of ETL (Extract, Transform, Load).
• Potential analyses and/or visualisations to be performed. Given the focus of this module, I expect that analyses will be based on relatively simple filtering, projection and aggregation, rather than on ML (Machine Learning) algorithms, although there is no specific restriction on the analyses you can include.
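As a rough illustration of the ETL step and the "simple filtering, projection and aggregation" mentioned above, the following is a minimal sketch in plain Python. The column names (`station`, `year`, `value`) and the sample rows are entirely hypothetical, and the standard library stands in for a distributed analytic engine, which a real design would use instead.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows standing in for an open dataset in CSV format.
# The schema (station, year, value) is an assumption for illustration only.
RAW_CSV = """station,year,value
A,2019,10.0
A,2020,20.0
B,2019,5.5
B,2020,
C,2020,7.5
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing value and cast fields to proper types."""
    cleaned = []
    for row in rows:
        if not row["value"]:
            continue  # skip incomplete records
        cleaned.append(
            {
                "station": row["station"],
                "year": int(row["year"]),
                "value": float(row["value"]),
            }
        )
    return cleaned


def filter_rows(rows, year):
    """Filtering: keep only the rows for one year."""
    return [r for r in rows if r["year"] == year]


def project(rows, columns):
    """Projection: keep only the named columns."""
    return [{c: r[c] for c in columns} for r in rows]


def total_by(rows, key, value):
    """Aggregation: sum `value` grouped by `key`."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[key]] += r[value]
    return dict(totals)


rows = transform(extract(RAW_CSV))
rows_2020 = filter_rows(rows, 2020)
projected = project(rows_2020, ["station", "value"])
totals = total_by(rows, "station", "value")
```

The same three operations map directly onto the DataFrame or SQL interfaces of most analytic engines, so a sketch like this can help to pin down the intended analysis before choosing a platform.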

Platforms
• The key components of the pipeline: for each component you should select a suitable big data platform (e.g. specific data store, file system, analytic engine) and describe the purpose of that component within your solution

Integration and deployment
• Interaction/integration between components, e.g. storing from analytic engine to data store
• File formats in cases where file system storage will be used
• Software that would need to be installed or provisioned, including connector libraries where required
• Physical deployment of components
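The integration points listed above (an analytic stage writing to a file, and a data store loading from it) can be sketched as follows. This is only an illustration under stated assumptions: JSON Lines is used as the staging file format, an in-memory SQLite database stands in for whichever distributed data store the design selects, and the `totals` table and the sample records are hypothetical.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical aggregated output from an analytic stage.
results = [
    {"station": "A", "total": 30.0},
    {"station": "B", "total": 5.5},
]


def write_jsonl(records, path):
    """Write one JSON object per line (the staging file format assumed here)."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def load_into_store(records, conn):
    """Load records into a table; SQLite stands in for the real data store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS totals (station TEXT PRIMARY KEY, total REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO totals (station, total) VALUES (:station, :total)",
        records,
    )
    conn.commit()


# Stage 1: the analytic stage writes its output to file storage.
with tempfile.TemporaryDirectory() as tmp:
    staging_file = Path(tmp) / "totals.jsonl"
    write_jsonl(results, staging_file)
    # Stage 2: the loader reads the staged file.
    staged = read_jsonl(staging_file)

# Stage 3: the loader writes the records into the data store.
conn = sqlite3.connect(":memory:")
load_into_store(staged, conn)
stored = conn.execute("SELECT station, total FROM totals ORDER BY station").fetchall()
```

In a cluster deployment the file would typically live on a distributed file system and the load step would go through a connector library, which is why the brief asks for both to be identified in the design.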

You should base your choices on the module content and on additional research, and you should justify your choices. You should include appropriate references. Marks will be awarded on the basis of depth, completeness and relevance of the content within each of the above areas. Your report should be submitted in the form of a Word or PDF document.

B. Prototype

You should implement a prototype that illustrates the processing stages required for your solution to part A, for example ETL, query, visualisation.

You should prepare your prototype in the form of a notebook, either a Jupyter notebook which runs on the local machine or in a cloud service within Azure, or a DataBricks notebook, and you should make use of markdown cells to document your work. The first markdown cell should contain a descriptive title for your prototype, your name and your student number.

It is suggested that you use Python as the programming language for your implementation, although Scala is an option if you use DataBricks. You should explain the purpose of the processing, where it fits into the overall data pipeline, and the steps involved in the data ingest, processing and output. You may wish to implement integration of components where appropriate to illustrate your design, e.g. integration of an analytic engine and a data store.

Note that while your design should be based on the use of platforms deployed on clusters, it is sufficient for testing your prototype to run on a local standalone computer or on the limited (single-node) clusters typically available in the free tier of cloud-based services.
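One way to see why a prototype on a single node can still represent a cluster-based design is that simple aggregations can be computed per partition and then merged, giving the same answer for any number of partitions. The sketch below illustrates that idea in plain Python; the records, the round-robin partitioning and the function names are assumptions made for illustration, not a description of how any particular platform distributes data.

```python
from collections import Counter
from functools import reduce

# Hypothetical (key, value) records to be aggregated.
records = [("A", 3), ("B", 1), ("A", 2), ("C", 4), ("B", 5), ("A", 1)]


def partition(items, n):
    """Split items into n round-robin partitions, one per (imagined) worker."""
    return [items[i::n] for i in range(n)]


def partial_sum(part):
    """Aggregate a single partition independently of the others."""
    sums = Counter()
    for key, value in part:
        sums[key] += value
    return sums


def merge(a, b):
    """Combine two partial aggregates into one."""
    merged = Counter(a)
    merged.update(b)  # Counter.update adds counts rather than replacing them
    return merged


def distributed_sum(items, n_partitions):
    """Sum values by key using per-partition aggregation followed by a merge."""
    partials = [partial_sum(p) for p in partition(items, n_partitions)]
    return dict(reduce(merge, partials, Counter()))


single_node = distributed_sum(records, 1)
three_nodes = distributed_sum(records, 3)
```

Here `single_node` and `three_nodes` hold identical totals, which is the property that lets the same analysis logic be tested locally and then deployed on a larger cluster.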

Your prototype and documentation should be submitted in the form of a single Jupyter or DataBricks notebook exported as HTML or PDF, including the output from executing the code in all the code cells. The marker should be able to view the results of "running" the prototype in the exported notebook.



