Reference no: EM132165113

Database Management System - Big Data Tools and Techniques Assignment

The assignment must be completed using VMware Workstation 14 Player, connected to the virtual machine image supplied by the university on a pen drive. Since I cannot hand over the pen drive in person, I have copied its contents into the folder provided. The steps below describe how to start the Cloudera software stored on the pen drive.

STEPS:

1) Open VMware Workstation 14 Player

2) Click on open virtual machine

3) Open the folder where you have copied the pen drive contents

4) Select the cloudera-Training-CAPSark-student-vm folder

5) Select cloudera-training-capspark-student-rev-dh5.4.3a

6) Click on play virtual machine

Learning outcomes of this assessment - The learning outcomes covered by this assignment are:

Provide a broad overview of the general field of 'big data systems'

Develop specialised knowledge in areas that demonstrate the interaction and synergy between ongoing research and practical deployment in this field of study.

Key skills to be assessed -

This assignment aims to assess your skills in:

The usage of common big data tools and techniques

Your ability to implement a standard data analysis process

  • Loading the data
  • Cleansing the data
  • Analysis
  • Visualisation / Reporting

Use of Python, SQL and Linux terminal commands

Task - You will be given a dataset and a set of problem statements. Where possible, you are required to implement the solution in both SQL (using either Hive or Impala) and Spark (using either pyspark or spark-shell); you will need to carefully explain any reasons for not supplying both solutions.

General instructions

You will follow a typical data analysis process:

1. Load / ingest the data to be analysed

2. Prepare / clean the data

3. Analyse the data

4. Visualise results / generate report

For steps 1, 2 and 3 you will use the virtual machine (and the software installed on it) that has been provided as part of this module. The data necessary for this assignment will be provided in a MySQL dump format, which you will need to copy onto the virtual machine and work with from there.

The virtual machine has a MySQL server running and you will need to load the data into the MySQL server. From there you will be required to use Sqoop to get the data into Hadoop.
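As an illustration, the load-and-ingest step might look like the following command sketch. The database name (twitterdb), dump file name (tweets.sql) and credentials (training/training) are placeholders - substitute the names that come with the supplied dataset and VM.

```shell
# Load the supplied MySQL dump into the VM's MySQL server
# (database, file and credential names are placeholders).
mysql -u training -p -e "CREATE DATABASE twitterdb;"
mysql -u training -p twitterdb < tweets.sql

# Import every table from MySQL into HDFS with Sqoop
sqoop import-all-tables \
  --connect jdbc:mysql://localhost/twitterdb \
  --username training --password training \
  --warehouse-dir /user/training/twitterdb
```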

For the cleansing, preparation and analysis you will implement the solution twice (where possible). First in SQL using either Hive or Impala and then in Spark using either pyspark or spark-shell.
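To illustrate the SQL half of the implementation, a 'during-game' tweet count per game might look like the following HiveQL sketch. The table and column names (tweets, games, game_id, tweet_time, start_time, end_time) are assumptions, since the actual schema ships with the dataset.

```sql
-- Hypothetical schema: tweets(game_id, user_id, tweet_time, text),
-- games(game_id, start_time, end_time).
-- Count 'during-game' tweets per game and keep the 10 busiest games.
SELECT g.game_id,
       COUNT(*) AS during_game_tweets
FROM   tweets t
JOIN   games  g ON t.game_id = g.game_id
WHERE  t.tweet_time BETWEEN g.start_time AND g.end_time
GROUP  BY g.game_id
ORDER  BY during_game_tweets DESC
LIMIT  10;
```

The equivalent Spark implementation would express the same join and aggregation through pyspark's DataFrame API or spark-shell.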

For the visualisation of the results you are free to use any tool that fulfils the requirements, which can be tools you have learned about such as Python's matplotlib, SAS or Qlik, or any other free open source tool you may find suitable.

Extra features to be implemented

To get more than a "Satisfactory" mark, a number of extra features should be implemented. Features include, but are not limited to:

  • Creation of a single script that executes the entire process, from loading the supplied data to exporting the result data required for visualisation.
  • The Spark implementation is done in Scala as opposed to Python.
  • Usage of parametrised scripts which allow you to pass parameters to the queries to dynamically set data selection criteria - for instance, passing datetime parameters to select tweets in that time period.
  • Plotting of extra graphs visualising the discovery of useful information based on your own exploration which is not covered by the other problem statements.
  • Extraction of statistical information from the data.
  • The usage of file formats other than plain text.
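The parametrised-script feature can be sketched as follows. This is a minimal Python example of accepting datetime selection criteria on the command line; the --start/--end parameter names are illustrative, and the parsed values would then be passed on to a Hive query (e.g. via --hivevar) or a pyspark job to filter tweets to the window.

```python
import argparse
from datetime import datetime


def parse_args(argv=None):
    """Parse datetime selection criteria passed on the command line.

    Returns a (start, end) pair of datetime objects. The parameter
    names are illustrative, not part of the assignment brief.
    """
    parser = argparse.ArgumentParser(
        description="Select tweets in a time window")
    parser.add_argument("--start", required=True,
                        help="window start, e.g. 2018-06-14T18:00:00")
    parser.add_argument("--end", required=True,
                        help="window end, e.g. 2018-06-14T20:00:00")
    args = parser.parse_args(argv)
    fmt = "%Y-%m-%dT%H:%M:%S"
    return datetime.strptime(args.start, fmt), datetime.strptime(args.end, fmt)


# Example:
# start, end = parse_args(["--start", "2018-06-14T18:00:00",
#                          "--end", "2018-06-14T20:00:00"])
```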

The data

You will be given a dataset containing simplified Twitter data pertaining to a number of football games. The dataset will be supplied in compressed format and will be made available online for download or can be supplied by USB memory stick. Further information regarding each game, including the teams playing and their official hashtags, start and end times, as well as the times of any goals, will also be provided.

Problem statements

You are a data analyst / data scientist working for an event security company that monitors real-time events to analyse the level of potential disturbance. In order to assess commotion at an event, they monitor the Twitter feeds pertaining to the event. They would like answers to the following questions (in all of the following, you should treat half time and overtime as 'during-game').

Questions / problem statements:

1. Extract and present the average number of tweets per 'during-game' minute for the top 10 (i.e. most tweeted about during the event) games.

2. Rank the games according to the number of distinct users tweeting 'during-game' and present the information for the top 10 games, including the number of distinct users for each.

3. Find the top 3 teams that played in the most games. Rank their games in order of highest number of 'during-game' tweets (include the frequency in your output).

4. Find the top 10 (ordered by number of tweets) games which have the highest 'during-game' tweeting spike in the last 10 minutes of the game.

5. As well as the official hashtags, each tweet may be labelled with other hashtags. Restricting the data to 'during-game' tweets, list the top 10 most common non-official hashtags over the whole dataset with their frequencies.

6. Draw the graph of the progress of one of the games (the game you choose should have a complete set of tweets for the entire duration of the game). It may be useful to summarize the tweet frequencies in 1 minute intervals.
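For question 6, the 1-minute binning can be sketched in plain Python (pyspark is left out so the snippet stays self-contained; the timestamps and field layout are assumptions, as the real pipeline would pull them from the tweet table):

```python
from collections import Counter
from datetime import datetime


def tweets_per_minute(timestamps):
    """Bin tweet timestamps into 1-minute intervals and count them.

    `timestamps` is an iterable of datetime objects. Returns a Counter
    mapping each minute (a datetime truncated to the minute) to the
    number of tweets that fell in it - the series to plot for the game.
    """
    return Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)


# Illustrative data: three tweets in one minute, one in the next.
sample = [
    datetime(2018, 6, 14, 18, 0, 5),
    datetime(2018, 6, 14, 18, 0, 40),
    datetime(2018, 6, 14, 18, 0, 59),
    datetime(2018, 6, 14, 18, 1, 10),
]
counts = tweets_per_minute(sample)
```

Plotting the resulting minute-to-count series (with matplotlib, or any of the permitted tools) gives the progress graph of the chosen game.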

Report - A 4000-5000 word report that documents your solution.

Attachment: Assignment Files.rar



Reviews

len2165113

11/13/2018 1:40:19 AM

All I require is the PySpark part of the solution. I don't need the 4000-5000 word report, and I don't require the SQL part; the attachment explains this in more detail. Could that be delivered? Note - please provide a specification of each solution within 4-5 lines.

Additional advice to the client will be awarded marks above "Satisfactory". This could include, but is not limited to: other findings based on your analysis of the data, an outline of algorithms which would extract further information from the data, and a discussion of alternative visualisations that could prove useful. Along with the report, you are also expected to fill in a self-assessment form. The Spark part of the solution needs to be written in PySpark, not Scala.
