Analyzing the dataset through statistical analysis methods

Reference no: EM133721873

Big Data Analytics

Big Data Analytics using HIVE

  • Providing big data queries using HIVE.
  • Using Built-in (Date, Math, Conditional, and String) Functions in HIVE.
  • Visualizing query results as graphical representations and interpreting them.

Big Data Analytics using Spark

Analyzing the dataset through statistical analysis methods.

Designing single- and multi-class classifiers, then evaluating and visualizing their accuracy/performance.

Individual assessment

Find alternative solutions for high-level languages and analytics approaches (use references), and express findings from big data analytics with the relevant theories.

Big Data Analytics using Hadoop and Spark

Task:

Understanding Dataset: UNSW-NB15

The raw network packets of the UNSW-NB15 dataset were created by the IXIA PerfectStorm tool in the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS) to generate a hybrid of real modern normal activities and synthetic contemporary attack behaviours.

The tcpdump tool was used to capture 100 GB of raw traffic (e.g., pcap files). The dataset has nine types of attacks, namely Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode and Worms. The Argus and Bro-IDS tools were used, and twelve algorithms were developed, to generate a total of 49 features with the class label.

The features are described here.

The number of attacks and their sub-categories is described here.

In this coursework, we use a total of 10 million records stored in a CSV file (download). The total size is about 600 MB, which is big enough to justify big data methodologies for analytics. As a big data specialist, you should first read and understand the features, then apply modeling techniques. To see a few records of this dataset, you can import it into Hadoop HDFS, then run a Hive query that prints the first 5-10 records.

Big Data Query & Analysis by Apache Hive

This task uses Apache Hive to convert big raw data into useful information for end users. To do so, first study the dataset carefully. Then write at least 4 Hive queries (refer to the marking scheme). Apply appropriate visualization tools to present your findings numerically and graphically. Briefly interpret your findings.

Finally, include screenshots of your outcomes (e.g., tables and plots), together with the scripts/queries, in the report.

Tip: The mark for this section depends on the complexity of your Hive queries; for instance, a simple SELECT query on its own will not earn full marks.
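
As an illustration of what a query beyond a simple SELECT might look like, the sketch below builds a HiveQL string that combines a conditional function (CASE WHEN), a string function (TRIM), a math function (ROUND) and aggregation. It is only a sketch under stated assumptions: the table name unsw_nb15 is hypothetical, the columns attack_cat, dur, sbytes and dbytes are assumed to have been defined when the table was created, and normal records are assumed to carry an empty attack category. The helper only composes the query text; it does not connect to Hive.

```python
def attack_summary_query(table: str = "unsw_nb15", top_n: int = 10) -> str:
    """Build a HiveQL string that summarizes records per attack category.

    Assumptions (not from the brief): the table and column names, and that
    an empty or NULL attack_cat marks a normal record.
    """
    # Basic validation so the table name cannot inject arbitrary SQL.
    if not table.replace("_", "").isalnum():
        raise ValueError("table name must contain only letters, digits, underscores")
    if top_n <= 0:
        raise ValueError("top_n must be positive")

    return f"""
SELECT category,
       COUNT(*) AS n_records,
       ROUND(AVG(dur), 4) AS avg_duration,
       SUM(sbytes + dbytes) AS total_bytes
FROM (
    -- Conditional + string functions: map blank categories to 'Normal'.
    SELECT CASE WHEN attack_cat IS NULL OR TRIM(attack_cat) = ''
                THEN 'Normal'
                ELSE TRIM(attack_cat)
           END AS category,
           dur, sbytes, dbytes
    FROM {table}
) t
GROUP BY category
ORDER BY n_records DESC
LIMIT {top_n}
""".strip()


if __name__ == "__main__":
    # Print the query so it can be pasted into a Hive session.
    print(attack_summary_query(top_n=5))
```

The subquery is used so that the derived category column can be grouped on by name; the resulting table of counts per category lends itself to a bar chart.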

Advanced Analytics using PySpark
In this section, you will conduct advanced analytics using PySpark.

Analyze and Interpret Big Data

We need to learn and understand the data through at least 4 analytical methods (descriptive statistics, correlation, hypothesis testing, density estimation, etc.). You need to present your work numerically and graphically. Apply tooltip text, legends, titles, X-Y labels, etc. as appropriate to help end users gain insights.
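
Two of the methods named above, descriptive statistics and correlation, can be sketched in plain Python to make clear what is being computed. This is not a PySpark implementation: on the full dataset the same quantities would be computed with a distributed engine, and the sample values below are made up purely for illustration.

```python
import math
from statistics import mean, median, pstdev


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "std": pstdev(values),  # population standard deviation
        "min": min(values),
        "max": max(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


if __name__ == "__main__":
    # Hypothetical toy values standing in for two numeric features.
    duration = [0.1, 0.4, 0.35, 0.8, 1.2]
    total_bytes = [120, 460, 300, 900, 1500]
    print(describe(duration))
    print(pearson(duration, total_bytes))
```

A coefficient near +1 or -1 suggests a strong linear relationship between the two features; a scatter plot with axis labels and a title is the natural graphical counterpart.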

Design and Build a Classifier
Design and build a binary classifier over the dataset. Explain your algorithm and its configuration. Present your findings in both numerical and graphical form. Evaluate the performance of the model and verify its accuracy and effectiveness.
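
To make the evaluation step concrete, the following plain-Python sketch derives accuracy, precision, recall and F1 from the confusion-matrix counts of a binary classifier. It assumes labels are encoded as 1 for attack and 0 for normal, which is an illustrative choice rather than something the brief specifies, and the example predictions are invented.

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion-matrix counts and common metrics for 0/1 labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # attacks caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal kept normal
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed attacks

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical labels: 1 = attack, 0 = normal.
    y_true = [1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 1, 0]
    print(binary_metrics(y_true, y_pred))
```

Reporting precision and recall alongside accuracy matters here because attack and normal traffic are rarely balanced, so accuracy alone can overstate how effective the model is.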

Apply a multi-class classifier to classify the data into ten classes (categories): one normal and nine attacks (Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode and Worms). Briefly explain your model, with supporting statements on its parameters, accuracy and effectiveness.
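
For the ten-class case, a single accuracy figure can hide poor performance on rare categories. The sketch below, again in plain Python with invented labels, computes per-class precision, recall and F1 plus the macro-averaged F1, which weights every class equally. It illustrates the metrics only and is not tied to any particular classifier.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return ({label: metrics}, macro_f1) for multi-class predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    labels = sorted(set(y_true) | set(y_pred))
    # Count each (true, predicted) pair: the cells of the confusion matrix.
    cells = Counter(zip(y_true, y_pred))

    report = {}
    for c in labels:
        tp = cells[(c, c)]
        fp = sum(n for (t, p), n in cells.items() if p == c and t != c)
        fn = sum(n for (t, p), n in cells.items() if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        report[c] = {"precision": precision, "recall": recall, "f1": f1}

    # Macro average: unweighted mean of the per-class F1 scores.
    macro_f1 = sum(r["f1"] for r in report.values()) / len(labels)
    return report, macro_f1


if __name__ == "__main__":
    # Hypothetical labels using three of the ten categories.
    y_true = ["Normal", "Normal", "DoS", "DoS", "Worms"]
    y_pred = ["Normal", "DoS", "DoS", "DoS", "Normal"]
    report, macro_f1 = per_class_report(y_true, y_pred)
    for label, metrics in report.items():
        print(label, metrics)
    print("macro F1:", macro_f1)
```

In the toy data the single Worms record is misclassified, so its F1 is zero and the macro average drops accordingly, even though most records are labeled correctly.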

Individual Assessment

Discuss (1) what alternative technologies are available for tasks 2 and 3 and how they differ (use academic references), and (2) what surprising new thinking was evoked and/or neglected on your part.

Tip: include the individual assessment of each member in the same report.

Documentation

Document all your work. Your final report must follow the 5 sections detailed in the "format of final submission" section (refer to the next page). Your work must demonstrate an appropriate understanding of academic writing and integrity.

Analyzing the dataset through statistical analysis methods : CN-7031 Big Data Analytics, East London University - Analyzing the dataset through statistical analysis methods and Designing single- and multi-class