Analyze and Interpret Big Data Using PySpark

Reference no: EM132735663

CN7031 Big Data Analytics - University of East London

Academic Year

This coursework (CRWK) must be attempted in groups of 4 or 5 students. It is divided into two sections: (1) Big Data analytics on a real case study and (2) a group presentation. All group members must attend the presentation, which will be held online through Microsoft Teams. If you do not join the video call on the presentation date, you will fail the module.

The overall mark for the CRWK comes from two main activities:
1- Big Data Analytics report (around 3,000 words, with a tolerance of ±10%) in HTML format
2- Presentation

(1) Understanding Dataset: CSE-CIC-IDS2018

This dataset was originally created by the University of New Brunswick for analyzing DDoS data. You can find the full dataset and its description here. The dataset is based on logs of the university's servers, which recorded various DoS attacks during the publicly available period. It has 80 attributes in total and is 6.40GB in size. Because the PCs are restricted to 4GB of RAM, we will process only about 2.6GB of the data. Download it from here. When writing machine learning or statistical analysis for this data, note that the Label column is arguably the most important part of the data, as it determines whether the packets sent are malicious or not.
a) The features are described in the "IDS2018_Features.xlsx" file on the Moodle page.
b) The labels are as follows:

• "Benign": normal traffic
• Any other value in the Label column: attack traffic (for example, a DoS attack)
c) In this coursework, we use more than 8.2 million records with a total size of 2.6GB. As a big data specialist, you should first read and understand the features, and then apply modeling techniques. To inspect a few records of this dataset, you can use [1] Hadoop HDFS and Hive, [2] Spark SQL, or [3] RDDs.
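As an illustration of option [2], the sketch below loads one CSV file and prints a few records with Spark SQL. It is a minimal sketch, not the required solution: the function names, the view name `traffic`, and the CSV path are hypothetical, and it assumes the files have a header row.

```python
def preview_query(view_name: str, limit: int = 5) -> str:
    """Build a Spark SQL statement that returns the first `limit` rows of a view."""
    return f"SELECT * FROM {view_name} LIMIT {limit}"


def preview_records(csv_path: str, limit: int = 5) -> None:
    """Load a CSV file into Spark and print its schema and a few records."""
    # Imported inside the function so the helper above works without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ids2018-preview").getOrCreate()
    # header=True takes column names from the first row (an assumption about the files);
    # inferSchema=True makes Spark scan the data once more to guess column types.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    df.createOrReplaceTempView("traffic")  # hypothetical view name
    df.printSchema()
    spark.sql(preview_query("traffic", limit)).show(truncate=False)
    spark.stop()
```

Calling `preview_records("path/to/one_file.csv")` with a real path prints the inferred schema followed by five rows. On a 4GB machine it may help to preview a single file before loading the whole dataset.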

(2) Big Data Query & Analysis using Spark SQL
This task uses Spark SQL to convert large volumes of raw data into useful information. Each member of a group should implement 2 complex SQL queries (refer to the marking scheme). Apply appropriate visualization tools to present your findings numerically and graphically. Briefly interpret your findings.

• What do you need to put in the HTML report per student?
1. At least two Spark SQL queries.
2. A short explanation of the queries.
3. The working solution, i.e., plot or table.

• Tip: The mark for this section depends on the complexity of your queries. For instance, a simple SELECT query will not earn full marks.
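To give a sense of what goes beyond a simple SELECT, the sketch below combines a common table expression, an aggregation, and a window function: it counts flows per label and protocol, then ranks protocols within each label. The column names `Label`, `Protocol`, and `Flow Duration` are assumptions and should be checked against "IDS2018_Features.xlsx"; the function name and view name are hypothetical.

```python
def label_protocol_query(view_name: str) -> str:
    """Build a query that ranks protocols by flow count within each label."""
    # Backticks quote column names, which is needed when a name contains a space.
    return f"""
        WITH per_group AS (
            SELECT `Label` AS label,
                   `Protocol` AS protocol,
                   COUNT(*) AS flows,
                   AVG(`Flow Duration`) AS avg_duration
            FROM {view_name}
            GROUP BY `Label`, `Protocol`
        )
        SELECT label, protocol, flows, avg_duration,
               RANK() OVER (PARTITION BY label ORDER BY flows DESC) AS rank_in_label
        FROM per_group
        ORDER BY label, rank_in_label
    """


def run_label_protocol_query(spark, view_name: str = "traffic"):
    """Run the query on an existing SparkSession and return a small pandas table.

    Assumes `view_name` was registered with DataFrame.createOrReplaceTempView.
    """
    result = spark.sql(label_protocol_query(view_name))
    # The aggregated result is small, so collecting it to pandas for plotting is safe.
    return result.toPandas()
```

The pandas table returned by `run_label_protocol_query` can then be passed to a plotting library for the graphical part of the report.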

(3) Advanced Analytics using PySpark
In this section, you will conduct advanced analytics using PySpark.

3.1. Analyze and Interpret Big Data using PySpark
Every member of a group should analyze data through 3 analytical methods (e.g., advanced descriptive statistics, correlation, hypothesis testing, density estimation, etc.). You need to present your work numerically and graphically. Apply tooltip text, legend, title, X-Y labels etc. accordingly.
Note: A good or full mark requires a working solution without system or logical errors.
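Two of the example methods, descriptive statistics and correlation, can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, `df` is assumed to be a Spark DataFrame with numeric columns, and the column names you pass in must exist in your data. The pure-Python `pearson` helper only shows the formula so that a Spark result can be sanity-checked on a small sample.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def describe_and_correlate(df, col_a: str, col_b: str) -> float:
    """Print summary statistics for two columns and return their Pearson correlation.

    `df` is assumed to be a pyspark.sql.DataFrame; both columns must be numeric.
    """
    # summary() computes the requested statistics, including approximate percentiles.
    df.select(col_a, col_b).summary(
        "count", "mean", "stddev", "min", "25%", "50%", "75%", "max"
    ).show()
    # DataFrame.stat.corr computes the Pearson coefficient by default.
    return df.stat.corr(col_a, col_b)
```

A coefficient close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship. The interpretation in the report should say what that means for the two traffic features chosen.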

3.2. Design and Build a Machine Learning (ML) technique
Every member of a group should study and apply one ML technique. You can apply one of the following approaches: Classification, Regression, Clustering, Dimensionality Reduction, Feature Extraction, Frequent Pattern Mining, or Optimization. Explain and evaluate your model and its results using numerical and/or graphical representations.

Note: If there are 4 students in a group, the group should develop 4 different models. If two members submit a similar model, the mark will be zero.
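As one possible instance of the Classification approach, the sketch below builds a random forest pipeline with `pyspark.ml` and scores it on a held-out split. It is an illustration only, not the expected model: the function names, the 80/20 split, the number of trees, and the choice of F1 as the metric are assumptions, and `feature_cols` must be numeric columns from your data. The `precision_recall_f1` helper shows how the metrics follow from confusion-matrix counts.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def build_pipeline(feature_cols, label_col: str = "Label"):
    """Build an untrained pipeline: label indexing, feature assembly, random forest."""
    # Imported inside the function so the metric helper works without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    # Turn the string label into a numeric index; rows with unseen labels are skipped.
    indexer = StringIndexer(inputCol=label_col, outputCol="label_idx", handleInvalid="skip")
    # Pack the numeric feature columns into one vector; rows with nulls or NaNs are skipped.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features", handleInvalid="skip")
    forest = RandomForestClassifier(labelCol="label_idx", featuresCol="features", numTrees=20, seed=42)
    return Pipeline(stages=[indexer, assembler, forest])


def train_and_score(df, feature_cols, label_col: str = "Label") -> float:
    """Fit the pipeline on 80% of `df` and return the F1 score on the remaining 20%."""
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    train, test = df.randomSplit([0.8, 0.2], seed=42)
    model = build_pipeline(feature_cols, label_col).fit(train)
    predictions = model.transform(test)
    evaluator = MulticlassClassificationEvaluator(
        labelCol="label_idx", predictionCol="prediction", metricName="f1"
    )
    return evaluator.evaluate(predictions)
```

Because attack traffic is often much rarer than benign traffic, accuracy alone can be misleading, which is why this sketch reports F1. Another group member could swap the random forest stage for a different estimator or take a clustering approach so that the models differ.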

(4) Documentation
Your final report must follow the "The format of final submission" section. Your work must demonstrate an appropriate understanding of how to build a user-friendly, efficient, and comprehensive analytics report for a big data project, so that readers can navigate it easily and find the relevant content.

Attachment:- Big Data Analytics.rar
