Research and compare various techniques for organizing data

Reference no: EM132357664

Big Data Technologies Assignment - Data Lake Architecture

In this assignment you will explore the management of big data using Data Lake technology. This Assessment Task relates to the following Learning Outcomes:

  • Obtain a high level of technical competency in standard and advanced methods for big data technologies.
  • Understand the current status of and recognize future trends in big data technologies.
  • Develop a competency with emerging big data technologies, applications and tools.

Part 1 - Data Lake Components

In the lecture, you were introduced to the high-level concepts of the whys and whats of a Data Lake. The goal of this assignment is to take a deep dive into the architecture of a Data Lake and provide a Design Pattern for the problem of organizing a collection of datasets that holds a vast amount of data gathered from various private/open data islands. Your design should include the specification of the following components in some detail:

Data Ingestion Component:

a. You need to research and identify the different types of data (from structured to unstructured) and of data ingestion (e.g., batch, micro-batch, real-time), and briefly explain them.

b. Identify the existing Big Data Technologies and Tools for ingesting big data, e.g., Hortonworks DataFlow.
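
As a rough illustration of the three ingestion modes named in item a, the sketch below feeds the same list of records through a batch, a micro-batch, and a real-time (record-at-a-time) loader. The function names and the sample `events` list are hypothetical and are not part of the assignment; a real pipeline would read from files, queues, or a tool such as the one named in item b.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batch_ingest(records: Iterable[str]) -> List[List[str]]:
    """Batch: collect the whole input, then load it in one go."""
    return [list(records)]


def micro_batch_ingest(records: Iterable[str], size: int) -> Iterator[List[str]]:
    """Micro-batch: load small fixed-size groups as they fill up."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def real_time_ingest(records: Iterable[str]) -> Iterator[List[str]]:
    """Real-time (streaming): hand over each record as soon as it arrives."""
    for record in records:
        yield [record]


# Hypothetical sample input: five events arriving in order.
events = ["e1", "e2", "e3", "e4", "e5"]
print(batch_ingest(events))
print(list(micro_batch_ingest(events, size=2)))
print(list(real_time_ingest(events)))
```

The only difference between the three functions is how many records are grouped into one load, which is the distinction the component description asks you to explain.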

Data Organization Component:

a. You need to research and compare various techniques for organizing data, e.g., Directory Structure, Version Control and Database Management Systems.

b. Identify the existing Database Management Systems for each category, e.g. MySQL in Relational DBs and MongoDB in NoSQL document-oriented DBs.

Data Security and Governance Component:

a. You need to research and identify the requirements for governing data access rights and the rights to define and modify data.

b. Identify the existing trust, security, and privacy issues in Big Data.

Indexing and Search Component:

a. You need to research the topic "Federated Search" and identify technologies that facilitate the simultaneous search of multiple searchable resources.

b. Identify the existing Big Data Technologies and Tools for indexing and searching the big data: e.g., Elasticsearch and some research outcomes.
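
To make the idea of "simultaneous search of multiple searchable resources" concrete, here is a minimal sketch that sends one query to several sources concurrently and merges the hits. The three in-memory sources, their contents, and the function names are invented for illustration; an actual federated search layer would call the query interface of each underlying system instead of doing a substring match.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# Hypothetical searchable resources: source name -> documents.
SOURCES: Dict[str, List[str]] = {
    "relational_db": ["customer orders 2019", "graph of suppliers"],
    "document_store": ["data lake design notes", "graph search tutorial"],
    "file_archive": ["sensor logs", "graph pattern benchmarks"],
}


def search_source(name: str, query: str) -> List[Tuple[str, str]]:
    """Search one resource with a case-insensitive substring match."""
    q = query.lower()
    return [(name, doc) for doc in SOURCES[name] if q in doc.lower()]


def federated_search(query: str) -> List[Tuple[str, str]]:
    """Send the query to every source concurrently and merge the hits."""
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        per_source = pool.map(lambda name: search_source(name, query), SOURCES)
        merged = [hit for hits in per_source for hit in hits]
    # Sort so the merged result has a stable order across runs.
    return sorted(merged)


for source, doc in federated_search("graph"):
    print(f"{source}: {doc}")
```

The caller sees a single result list even though each source was searched separately, which is the defining property of a federated search.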

Analytics Component:

a. You need to research and compare the techniques for analysing the data (from structured to unstructured) and extracting insight from them.

b. Identify the existing Big Data Technologies and Tools for analysing the big data: e.g., SAS tools (such as SAS Text Analytics), Microsoft ML platform, Amazon ML platform, and Apache Mahout.

Visualization Component:

a. You need to research and identify the techniques for visualizing the data.

b. Identify the existing Big Data Technologies and Tools for visualizing the big data: e.g., SAS Visual Analytics. Other examples include D3.js and vis.js.

Part 2 - Data Lake Architecture

Design Patterns are formalized best practices that one can use to solve common problems when designing a system. Refer to the Data Lake components in Part 1, and propose a Data Lake architecture for the problem of graph search in big graph databases. Read the following papers to gain an understanding of a typical Data Lake architecture and of graph-based search:

1. A. Beheshti, B. Benatallah, R. Nouri, V. Chhieng, H. Xiong, and X. Zhao, CoreDB: a Data Lake Service. Conference on Information and Knowledge Management (CIKM) 2017.

2. G. Sun, G. Liu, Y. Wang, M. A. Orgun, and X. Zhou: Incremental Graph Pattern based Node Matching, IEEE International Conference on Data Engineering (ICDE) 2018.

Attachment:- Big Data Technologies Assignment File.rar

ITEC874 Big Data Technologies Assignment - Data Lake Architecture, Macquarie University, Australia.

Reviews

len2357664

8/14/2019 12:44:18 AM

What to Submit: A single file (Word or PDF) with the name “YourStudentNo+ITEC874A1”.

Total Mark: 100

Part 1. Data Lake Components (60 Marks): For Part 1, you will need to provide 5 tools/technologies for parts a and b of each component. You will need to provide references for the tools and papers. You will need to briefly (in no more than 1 paragraph) discuss and explain each tool/technology.

Part 2. Data Lake Architecture (40 Marks): For Part 2, you will need to draw the architecture (you can use any preferred tools) and provide details of your proposed architecture in no more than 2 pages (including the proposed architecture and the details).

Marking Guideline:

Part 1. (60 Marks, 10 Marks for Each Question)

  • [2.5 Marks] You need to list the name of each tool/technology/method (0.5 marks for each). If you provide 5 or more, you get the full mark, i.e., 2.5 marks.
  • [7.5 Marks] You need to give a comprehensive explanation of each tool/technology/method (1.5 marks for each). You need to explain what it is, how it works, and when it is used (at least three sentences). If you miss one aspect, you will lose 0.5 marks. If you provide 5 or more, you will get the full mark, i.e., 7.5 marks.

Part 2. (40 Marks)

  • [15 Marks] Draw the data lake architecture. You need to use a tool to draw the architecture, e.g., Visio or OmniGraffle.
      • [9 Marks] You need to draw all 6 data lake components discussed in Part 1 (1.5 marks for each, 9 marks in total).
      • [4 Marks] You need to draw the relations between the components.
      • [2 Marks] You need to provide a clear layout and a high-resolution figure.
  • [25 Marks] The details of your proposed architecture.
      • [18 Marks] You need to provide the details of each of the 6 components (3 marks for each, 18 marks in total). You need to explain what it is, how it works, and when it is used (at least three sentences). If you miss one aspect, you will lose 1 mark.
      • [6 Marks] You need to provide a description of the workflow of the architecture: when a query comes in, how the data lake works, including the input and output of each of the 6 components (1 mark for each, 6 marks in total).
      • [1 Mark] You need to provide a well-structured description, e.g., you could use a bulleted list to organize the structure of the paragraphs, or use bold and/or italic fonts to highlight some contents.

Late Submission: No extensions will be granted without an approved application for Special Consideration. For each 24-hour period or part thereof that the submission is late, 10% of the total available marks will be deducted from the awarded mark (10 marks for the assignment, scaled to 1 mark in your final grade). For example, a submission that is 25 hours late incurs a 20% penalty (20 marks deducted, scaled to 2 marks in your final grade). No submission will be accepted after solutions have been posted.
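
The late-penalty rule can be checked with a few lines of arithmetic. The sketch below is only an illustration of the rule as stated; the function name is hypothetical, and capping the deduction at the total available marks is an assumption that the brief does not spell out.

```python
import math


def late_penalty_marks(hours_late: float, total_marks: int = 100) -> int:
    """Marks deducted: 10% of the total per 24-hour period or part thereof."""
    if hours_late <= 0:
        return 0
    # "Or part thereof" means any started 24-hour period counts in full.
    periods = math.ceil(hours_late / 24)
    # Assumption: the deduction cannot exceed the total available marks.
    return min(total_marks, periods * total_marks // 10)


print(late_penalty_marks(25))  # 20, matching the brief's example
```

A submission that is 25 hours late has started its second 24-hour period, so two deductions of 10 marks apply.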


All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd