Provide an overview of the general field of big data systems

Assignment Help Database Management System
Reference no: EM131959212

Big data tools and techniques Assignment

The assignment must be done with VMware Workstation 14 Player, connected to the pen drive supplied by the university. Since I cannot hand you the pen drive in person, I have copied its contents into the folder. The steps below describe how to start the Cloudera software stored on the pen drive.

STEPS:

1) Open VMware Workstation 14 Player
2) Click "Open a Virtual Machine"
3) Open the folder into which you copied the pen drive contents
4) Select the cloudera-training-CAPSark-student-vm folder
5) Select cloudera-training-capspark-student-rev-dh5.4.3a
6) Click "Play virtual machine"

Learning outcomes of this assessment

The learning outcomes covered by this assignment are:

• Provide a broad overview of the general field of 'big data systems'
• Develop specialised knowledge in areas that demonstrate the interaction and synergy between ongoing research and the practical deployment of this field of study

Key skills to be assessed

This assignment aims to assess your skills in:

• The usage of common big data tools and techniques
• Your ability to implement a standard data analysis process

- Loading the data
- Cleansing the data
- Analysis
- Visualisation / Reporting

• Use of Python, SQL and Linux terminal commands

Task

You will be given a dataset and a set of problem statements. Where possible, you are required to implement the solution in both SQL (using either Hive or Impala) and Spark (using pyspark or spark-shell). If you cannot supply both solutions for a given problem, you will need to carefully explain your reasons.

General instructions

You will follow a typical data analysis process:

1. Load / ingest the data to be analysed
2. Prepare / clean the data
3. Analyse the data
4. Visualise results / generate report

For steps 1, 2 and 3 you will use the virtual machine (and the software installed on it) that has been provided as part of this module. The data necessary for this assignment will be provided in MySQL dump format, which you will need to copy onto the virtual machine and work with from there.

The virtual machine has a MySQL server running and you will need to load the data into the MySQL server. From there you will be required to use Sqoop to get the data into Hadoop.
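As a minimal sketch of this load step, the shell commands can be assembled in Python so the whole process can later be driven from a single script (one of the "extra features" suggested below). The database name "twitter", table name "tweets", dump file name and the training/training credentials are assumptions, not the actual names from the supplied dump; substitute your own.

```python
def load_commands(dump="twitter.sql", db="twitter",
                  table="tweets", user="training", pw="training"):
    """Return the shell commands that restore the dump and ingest one table.

    All names are hypothetical placeholders for the real dump's names.
    """
    return [
        # Create the target database on the VM's MySQL server.
        f"mysql -u {user} -p{pw} -e 'CREATE DATABASE IF NOT EXISTS {db}'",
        # Restore the supplied dump into it.
        f"mysql -u {user} -p{pw} {db} < {dump}",
        # Move the table into HDFS with Sqoop.
        f"sqoop import --connect jdbc:mysql://localhost/{db} "
        f"--username {user} --password {pw} --table {table} "
        f"--target-dir /user/{user}/{table} --fields-terminated-by '\\t'",
    ]

commands = load_commands()
```

Running each string through the VM's terminal (or a wrapper script) performs the ingest; repeat the Sqoop command per table.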

For the cleansing, preparation and analysis you will implement the solution twice (where possible). First in SQL using either Hive or Impala and then in Spark using either pyspark or spark-shell.

For the visualisation of the results you are free to use any tool that fulfils the requirements: tools you have learned about, such as Python's matplotlib, SAS or Qlik, or any other free, open-source tool you find suitable.
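A minimal matplotlib sketch of the visualisation step might bin tweet timestamps into 1-minute intervals and plot the counts. The toy timestamps and the `YYYY-MM-DD HH:MM:SS` format below are assumptions; use the format actually present in the dataset.

```python
from collections import Counter
from datetime import datetime

# Toy timestamps standing in for real 'during-game' tweet times.
stamps = ["2014-06-12 17:01:10", "2014-06-12 17:01:45", "2014-06-12 17:03:02"]

# Bin into 1-minute intervals by zeroing the seconds.
per_minute = Counter(
    datetime.strptime(s, "%Y-%m-%d %H:%M:%S").replace(second=0) for s in stamps
)
minutes = sorted(per_minute)
counts = [per_minute[m] for m in minutes]

try:
    import matplotlib
    matplotlib.use("Agg")  # render off-screen; no display needed on the VM
    import matplotlib.pyplot as plt
    plt.plot(minutes, counts)
    plt.xlabel("minute")
    plt.ylabel("tweets")
    plt.savefig("tweet_frequency.png")
except ImportError:
    pass  # matplotlib not installed; the binned counts are still usable
```

The same binned counts feed directly into problem statement 6 below.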

Extra features to be implemented

To get more than a "Satisfactory" mark, a number of extra features should be implemented. Features include, but are not limited to:

• Creation of a single script that executes the entire process, from loading the supplied data to exporting the result data required for visualisation.
• Implementing the Spark solution in Scala as opposed to Python.
• Usage of parametrised scripts that let you pass parameters to the queries to set data selection criteria dynamically, for instance passing datetime parameters to select tweets in that time period.
• Plotting of extra graphs visualising the discovery of useful information based on your own exploration, beyond what the other problem statements cover.
• Extraction of statistical information from the data.
• The usage of file formats other than plain text.
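The parametrised-script idea can be sketched with `argparse`. The table name `tweets`, the column `created_at` and the query itself are assumptions for illustration, not the actual schema:

```python
import argparse

def build_query(start, end):
    # Hive/Impala-flavoured SQL; table and column names are hypothetical.
    return ("SELECT game_id, COUNT(*) AS n FROM tweets "
            f"WHERE created_at BETWEEN '{start}' AND '{end}' "
            "GROUP BY game_id ORDER BY n DESC LIMIT 10")

def main(argv=None):
    parser = argparse.ArgumentParser(
        description="Select tweets in a datetime window")
    parser.add_argument("--start", required=True,
                        help="e.g. '2014-06-12 17:00:00'")
    parser.add_argument("--end", required=True,
                        help="e.g. '2014-06-12 19:00:00'")
    args = parser.parse_args(argv)
    return build_query(args.start, args.end)

if __name__ == "__main__":
    print(main())
```

The generated query string can then be piped to `hive -e` or `impala-shell -q`, so the same script serves any time window without editing the SQL.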

The data

You will be given a dataset containing simplified Twitter data pertaining to a number of football games. The dataset will be supplied in compressed format and will be made available online for download or can be supplied by USB memory stick. Further information regarding each game, including the teams playing and their official hashtags, start and end times, as well as the times of any goals, will also be provided.

Problem statements

You are a data analyst / data scientist working for an event security company that monitors real-time events to analyse the level of potential disturbance. To assess commotion at an event, the company monitors the Twitter feeds pertaining to it. They would like answers to the following questions (in all of the following, treat half time and overtime as 'during-game').

Questions / problem statements:

1. Extract and present the average number of tweets per ‘during-game' minute for the top 10 (i.e. most tweeted about during the event) games.

2. Rank the games according to the number of distinct users tweeting 'during-game' and present the information for the top 10 games, including the number of distinct users for each.

3. Find the top 3 teams that played in the most games. Rank their games in order of highest number of ‘during-game' tweets (include the frequency in your output).

4. Find the top 10 (ordered by number of tweets) games which have the highest ‘during-game' tweeting spike in the last 10 minutes of the game.

5. As well as the official hashtags, each tweet may be labelled with other hashtags. Restricting the data to 'during-game' tweets, list the top 10 most common non-official hashtags over the whole dataset, with their frequencies.

6. Draw a graph of the progress of one of the games (choose a game with a complete set of tweets for its entire duration). It may be useful to summarise the tweet frequencies in 1-minute intervals.
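Before committing to Hive or PySpark, the shape of question 1 can be prototyped in plain Python. The toy rows and the `game_minutes` lookup below are invented; the real figures come from the ingested tweets and the supplied game information.

```python
from collections import defaultdict

# Toy (game_id, tweet) pairs standing in for the real dataset.
tweets = [("g1", t) for t in range(4)] + [("g2", t) for t in range(2)]

# 'During-game' length in minutes per game (half time and overtime included).
game_minutes = {"g1": 2, "g2": 2}

# Count tweets per game -- the same map/reduceByKey shape as a paired RDD.
counts = defaultdict(int)
for game, _ in tweets:
    counts[game] += 1

# Average tweets per during-game minute, then take the top 10 games.
avg = {g: counts[g] / game_minutes[g] for g in counts}
top10 = sorted(avg.items(), key=lambda kv: kv[1], reverse=True)[:10]
```

In PySpark the same shape becomes a `map` to `(game_id, 1)`, a `reduceByKey`, a join against the game lengths, and `takeOrdered(10, ...)`; in SQL it is a `COUNT(*)` grouped by game, divided by the game length, with `ORDER BY ... LIMIT 10`.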

Report

A 4000-5000 word report that documents your solution.

Additional advice to the client will earn marks above "Satisfactory". This could include, but is not limited to:

• Other findings based on your analysis of the data
• Outline of algorithms which would extract further information from the data
• Discussion of alternative visualizations that could prove useful

Along with the report, you are expected to also fill in a self-assessment form.

Verified Expert

As per the instructions, the Python code is written using PySpark commands. The questions are answered by running the PySpark commands in a Jupyter notebook, and each answer includes screenshots; the task is therefore complete according to all specifications.


Reviews


inf1959212

10/31/2018 2:28:44 AM

Spark comes in a number of flavours for different APIs; for me, the expert needs to use PySpark (i.e. the Python API). The dataset is a compressed MySQL database with seven tables in it. The steps to perform are as follows:

1. Import the tables (probably using Sqoop) from MySQL to a directory on the Hadoop Distributed File System (HDFS). The expert will need to do this on his system, but I don't need to see it because I have already done it on mine.
2. Use PySpark commands to read the files from HDFS into RDDs.
3. Use PySpark commands to create paired RDDs from the RDDs.
4. Use PySpark commands to manipulate the paired RDDs (joins, aggregations, etc.) to address the 6 questions and produce the results specified in my previous attachment.

What I need from the expert is the PySpark commands and the results from the steps above. I don't need the actual code, because that will probably be specific to the expert's environment and would need to be tailored for mine; I just need the PySpark commands and the results they produced on the expert's system.

inf1959212

10/31/2018 2:27:55 AM

Hi, the Spark part of the solution must be written in PySpark, not Scala. I don't need the 4000-5000 word report, and I don't require the SQL part of the solution; all I require is the PySpark part. The attachment explains in more detail. Could that be delivered by Wednesday? Many thanks. 29876596_1PySpark RDD Assignment.doc

