Perform an initial principal component analysis

Reference no: EM133365146

Data Cleaning and Visualisation Assignment

Scenario

You have been provided with an export from DCE's incident response team's security information and event management (SIEM) system. The incident response team extracted alert data from their SIEM platform and has provided a .CSV file (MLData2023.csv) with 500,000 event records, of which approximately 3,000 have been 'tagged' as malicious.

The goal is to integrate machine learning into their Security Information and Event Management (SIEM) platform so that suspicious events can be investigated in real time.

Data description

Each event record is a snapshot triggered by an individual network ‘packet'. The exact triggering conditions for the snapshot are unknown. But it is known that multiple packets are exchanged in a ‘TCP conversation' between the source and the target before an event is triggered and a record created. It is also known that each event record is anomalous in some way (the SIEM logs many events that may be suspicious).

A very small proportion of the data are known to be corrupted by their source systems and some data are incomplete or incorrectly tagged. The incident response team indicated this is likely to be less than a few hundred records. A list of the relevant features in the data is given below.

Assembled Payload Size (continuous): The total size of the inbound suspicious payload. Note: This would contain the data sent by the attacker in the "TCP conversation" up until the event was triggered.

DYNRiskA Score (continuous): An untested in-built risk score assigned by a new SIEM plug-in.

IPV6 Traffic (binary): A flag indicating whether the triggering packet was using IPV6 or IPV4 protocols (True = IPV6).

Response Size (continuous): The total size of the reply data in the TCP conversation prior to the triggering packet.

Source Ping Time (ms) (continuous): The 'ping' time to the IP address which triggered the event record. This is affected by network structure, number of 'hops' and even physical distances. E.g.:

  • < 1 ms is typically local to the device
  • 1-5 ms is usually located in the local network
  • 5-50 ms is often geographically local to a country
  • ~100-250 ms is trans-continental to servers
  • 250+ ms may be trans-continental to a small network.

Note: these are estimates only and many factors can influence ping times.

Operating System (categorical): A limited 'guess' as to the operating system that generated the inbound suspicious connection. This is not accurate, but it should be somewhat consistent for each 'connection'.

Connection State (categorical): An indication of the TCP connection state at the time the packet was triggered.

Connection Rate (continuous): The number of connections per second by the inbound suspicious connection made prior to the event record creation.

Ingress Router (binary): DCE has two main network connections to the 'world'. This field indicates which connection the events arrived through.

Server Response Packet Time (ms) (continuous): An estimation of the time from when the payload was sent to when the reply packet was generated. This may indicate server processing time/load for the event.

Packet Size (continuous): The size of the triggering packet.

Packet TTL (continuous): The time-to-live of the previous inbound packet. TTL can be a measure of how many 'hops' (routers) a packet has traversed before arriving at our network.

Source IP Concurrent Connection (continuous): How many concurrent connections were open from the source IP at the time the event was triggered.

Class (binary): Indicates if the event was confirmed malicious, i.e. 0 = Non-malicious, 1 = Malicious.

The raw data for the above variables are contained in the MLData2023.csv file.

Objectives

The data were gathered over a period of time and processed by several systems in order to associate specific events with confirmed malicious activities. However, the number of confirmed malicious events was very low, with these events accounting for less than 1% of all logged network events.

Because the events associated with malicious traffic are quite rare, the rates of 'false negatives' and 'false positives' are important.

Your initial goals will be to

• Perform some basic exploratory data analysis
• Clean the file and prepare it for Machine Learning (ML)
• Perform an initial Principal Component Analysis (PCA) of the data.
• Identify features that may be useful for ML algorithms
• Create a brief report to the rest of the research team on your findings

Task
First, copy the code below to an R script. Enter your student ID into the command set.seed(.) and run the whole code. The code will create a sub-sample that is unique to you.

# The pipe (%>%) and filter() used below come from the dplyr package
library(dplyr)

# You may need to change/include the path of your working directory
dat <- read.csv("MLData2023.csv", stringsAsFactors = TRUE)

# Separate samples of non-malicious and malicious events
dat.class0 <- dat %>% filter(Class == 0) # non-malicious
dat.class1 <- dat %>% filter(Class == 1) # malicious

# Randomly select 300 samples from each class, then combine them to form a working dataset
set.seed(Enter your student ID here)
rand.class0 <- dat.class0[sample(1:nrow(dat.class0), size = 300, replace = FALSE),]
rand.class1 <- dat.class1[sample(1:nrow(dat.class1), size = 300, replace = FALSE),]

# Your sub-sample of 600 observations
mydata <- rbind(rand.class0, rand.class1)
dim(mydata) # Check the dimension of your sub-sample

Use the str(.) command to check that the data type for each feature is correctly specified. Address the issue if this is not the case.
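
As a rough illustration of this check, the sketch below runs str(.) on a small toy data frame and then converts columns to the types implied by the data description. The column names and values here are assumptions made for the example; the headers in MLData2023.csv may differ.

```r
# Toy stand-in for mydata; column names and values are assumed, not the real headers
toy <- data.frame(
  IPV6.Traffic = c("TRUE", "FALSE", "-", "FALSE"),
  Packet.TTL   = c("64", "128", "54", "63"),
  Class        = c(0, 1, 0, 1),
  stringsAsFactors = FALSE
)
str(toy)  # inspect the type R assigned to each column

# Convert each column to the type the data description implies
toy$IPV6.Traffic <- factor(toy$IPV6.Traffic)       # binary flag stored as a factor
toy$Packet.TTL   <- as.numeric(toy$Packet.TTL)     # continuous feature stored as numeric
toy$Class        <- factor(toy$Class, levels = c(0, 1),
                           labels = c("Non-malicious", "Malicious"))
str(toy)  # confirm the types after conversion
```

Converting to a factor keeps unexpected entries such as "-" as their own level, so they still show up in the Part 1 summary tables.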

You are to clean and perform basic data analysis on the relevant features in mydata, as well as principal component analysis (PCA) on the continuous variables. This is to be done using "R". You will report on your findings.

Part 1 - Exploratory Data Analysis and Data Cleaning

(i) For each of your categorical or binary variables, determine the number (%) of instances for each of their categories and summarise them in a table as follows. State all percentages to 1 decimal place.

Categorical Feature    Category      N (%)
Feature 1              Category 1    10 (10.0%)
                       Category 2    30 (30.0%)
                       Category 3    50 (50.0%)
                       Missing       10 (10.0%)
Feature 2 (Binary)     YES           75 (75.0%)
                       NO            25 (25.0%)
                       Missing       0 (0.0%)
...                    ...           ...
Feature k              Category 1    25 (25.0%)
                       Category 2    25 (25.0%)
                       Category 3    15 (15.0%)
                       Category 4    30 (30.0%)
                       Missing       5 (5.0%)
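
One possible way to build such a table in base R is sketched below. The toy data frame, its column names and the helper count_pct() are hypothetical and only illustrate the "N (%)" formatting with missing values counted as their own row.

```r
# Toy categorical data; feature names and values are assumptions for illustration
toy <- data.frame(
  Operating.System = factor(c("Windows", "Linux", "Windows", NA, "Android", "Windows")),
  Ingress.Router   = factor(c("A", "B", "A", "A", NA, "B"))
)

# Hypothetical helper: count each category (NA included) and format as "N (xx.x%)"
count_pct <- function(x) {
  n    <- table(x, useNA = "always")
  cats <- names(n)
  cats[is.na(cats)] <- "Missing"
  pct  <- 100 * as.integer(n) / length(x)
  data.frame(Category = cats,
             N_pct    = sprintf("%d (%.1f%%)", as.integer(n), pct),
             stringsAsFactors = FALSE)
}

# Apply the helper to every column and stack the results into one table
tabs <- lapply(toy, count_pct)
cat_table <- do.call(rbind, Map(function(f, t) data.frame(Feature = f, t,
                                                          stringsAsFactors = FALSE),
                                names(tabs), tabs))
rownames(cat_table) <- NULL
print(cat_table)
```

With the real sub-sample, the same helper would be applied only to the categorical and binary columns of mydata.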

(ii) Summarise each of your continuous/numeric variables in a table as follows. State all values, except N, to 2 decimal places.

Note: The tables for sub-parts (i) and (ii) should be based on the original sub-sample of 600 observations, not the cleaned version.
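
The layout of the summary table for sub-part (ii) is not reproduced here, so the sketch below simply assumes a common set of statistics (N, number missing, minimum, maximum, mean, median). The toy columns and the helper summarise_numeric() are hypothetical; adjust the statistics to whatever the template requires.

```r
# Toy numeric data; column names and values are assumptions for illustration
toy <- data.frame(
  Packet.Size   = c(1500, 1480, NA, 1510, 1495, 99999),
  Response.Size = c(200, 210, 190, 205, NA, 198)
)

# Hypothetical helper returning one row of summary statistics per feature
summarise_numeric <- function(x) {
  c(N       = sum(!is.na(x)),
    Missing = sum(is.na(x)),
    Min     = min(x, na.rm = TRUE),
    Max     = max(x, na.rm = TRUE),
    Mean    = mean(x, na.rm = TRUE),
    Median  = median(x, na.rm = TRUE))
}

# One row per feature, values rounded to 2 decimal places
num_table <- round(t(sapply(toy, summarise_numeric)), 2)
print(num_table)
```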

(iii) Examine the results in sub-parts (i) and (ii). Are there any invalid categories/values for the categorical variables? Is there any evidence of outliers for any of the continuous/numeric variables? If so, how many and what percentage are there?
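
The brief does not prescribe an outlier rule. As one possible approach, the sketch below flags values outside 1.5 times the interquartile range and reports their count and percentage. The rule and the toy values are assumptions; domain knowledge (for example, a negative size being impossible) may be a better guide for the real data.

```r
# Toy values for one continuous feature; -1 and 500 are planted as suspicious entries
x <- c(52, 55, 49, 51, 50, 53, 48, -1, 54, 500)

# 1.5 * IQR rule: flag points far below Q1 or far above Q3
q     <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
fence <- 1.5 * (q[2] - q[1])
is_outlier <- x < q[1] - fence | x > q[2] + fence

# Report how many values were flagged and what share of the sample they represent
n_out <- sum(is_outlier, na.rm = TRUE)
cat(sprintf("%d outliers (%.1f%%)\n", n_out, 100 * n_out / length(x)))
```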

Part 2 - Perform PCA and Visualise Data

(i) For all the observations that you have deemed to be invalid/outliers in Part 1 (iii), mask them by replacing them with NAs using the replace(.) command in R.
(ii) Export your "cleaned" data as follows. This file will need to be submitted along with your report.

#Write to a csv file.
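
The export code itself is not shown above. A minimal sketch of the masking and export steps, assuming toy data, illustrative masking rules and a placeholder file name, might look like this:

```r
# Toy stand-in for mydata; the invalid entries (-1 and "-") are assumptions
toy <- data.frame(
  Packet.Size      = c(1500, 1480, -1, 1510),
  Operating.System = c("Windows", "-", "Linux", "Windows"),
  stringsAsFactors = FALSE
)

# Mask values judged invalid in Part 1 (iii); these rules are illustrative only
toy$Packet.Size      <- replace(toy$Packet.Size, toy$Packet.Size < 0, NA)
toy$Operating.System <- replace(toy$Operating.System, toy$Operating.System == "-", NA)

# Write to a csv file; the file name and location are placeholders
out_file <- file.path(tempdir(), "mydata_cleaned.csv")
write.csv(toy, out_file, row.names = FALSE)
```

If a masked column is a factor, droplevels(.) can be used afterwards to remove the now-empty invalid level.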

(iii) Extract only the data for the numeric features in mydata, along with Class, and store them as a separate data frame/tibble. Then, filter the incomplete cases (i.e. any rows with NAs) and perform PCA using prcomp(.) in R, but only on the numeric features (i.e. exclude Class).
- Outline why you believe the data should or should not be scaled, i.e. standardised, when performing PCA.
- Outline the individual and cumulative proportions of variance (3 decimal places) explained by each of the first 4 components.
- Outline how many principal components (PCs) are adequate to explain at least 50% of the variability in your data.
- Outline the coefficients (or loadings) to 3 decimal places for PC1, PC2 and PC3, and describe which features (based on the loadings) are the key drivers for each of these three PCs.
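
The steps in (iii) can be sketched as follows on simulated data. The feature names and values are assumptions, and scaling is switched on here only because the toy features sit on very different scales; whether to scale the real data is for you to justify.

```r
# Simulated stand-in for the numeric features plus Class (all values are made up)
set.seed(1)
n <- 100
toy <- data.frame(
  Packet.Size   = rnorm(n, 1500, 100),
  Response.Size = rnorm(n, 200, 20),
  Packet.TTL    = rnorm(n, 64, 5),
  Class         = factor(rep(c(0, 1), each = n / 2))
)
toy$Packet.TTL[3] <- NA  # mimics a value masked during cleaning

# Keep complete cases only, then run PCA on the numeric columns (Class excluded)
pca_data <- toy[complete.cases(toy), ]
num_cols <- sapply(pca_data, is.numeric)
pca <- prcomp(pca_data[, num_cols], center = TRUE, scale. = TRUE)

# Individual and cumulative proportions of variance, 3 decimal places
print(round(summary(pca)$importance, 3))

# Loadings (coefficients) for PC1 to PC3, 3 decimal places
print(round(pca$rotation[, 1:3], 3))

# Smallest number of PCs whose cumulative proportion reaches at least 50%
cum_var <- unname(summary(pca)$importance["Cumulative Proportion", ])
n_pcs   <- which(cum_var >= 0.5)[1]
```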

(iv) Create a biplot for PC1 vs PC2 to help visualise the results of your PCA in the first two dimensions. Colour code the points with the variable Class. Write a paragraph to explain what your biplots are showing. That is, comment on the PCA plot, the loading plot individually, and then both plots combined (see Slides 28-29 of Module 3 notes) and outline and justify which (if any) of the features can help to distinguish Malicious events.
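
A base-R sketch of a colour-coded PC1 vs PC2 biplot is given below. It uses simulated data with assumed feature names, an arbitrary arrow scaling factor and a placeholder output file; packages such as ggplot2 could be used instead.

```r
# Simulated data in which Packet.Size is shifted for the malicious class (made up)
set.seed(1)
n   <- 100
cls <- factor(rep(c("Non-malicious", "Malicious"), each = n / 2))
toy <- data.frame(
  Packet.Size   = rnorm(n, 1500, 100) + 150 * (cls == "Malicious"),
  Response.Size = rnorm(n, 200, 20),
  Packet.TTL    = rnorm(n, 64, 5)
)
pca <- prcomp(toy, scale. = TRUE)

scores   <- pca$x[, 1:2]         # PCA plot: one point per observation
loadings <- pca$rotation[, 1:2]  # loading plot: one arrow per feature
arrow_scale <- 0.8 * max(abs(scores)) / max(abs(loadings))  # stretch arrows to fit the score cloud
cols <- c("Non-malicious" = "steelblue", "Malicious" = "firebrick")

# Draw the figure to a PDF so it can be placed in the report (file name is a placeholder)
fig <- file.path(tempdir(), "pca_biplot.pdf")
pdf(fig)
plot(scores, col = cols[as.character(cls)], pch = 19,
     xlab = "PC1", ylab = "PC2", main = "PC1 vs PC2 biplot")
arrows(0, 0, loadings[, 1] * arrow_scale, loadings[, 2] * arrow_scale, length = 0.1)
text(loadings * arrow_scale * 1.1, labels = rownames(loadings))
legend("topright", legend = names(cols), col = cols, pch = 19)
dev.off()
```

Arrows pointing along the direction in which the two colours separate indicate features that may help distinguish Malicious events.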

(v) Based on the results from parts (iii) to (iv), describe which dimension (you must choose one) can assist with the identification of Malicious events. (Hint: project all the points in the PCA plot onto the PC1 axis and see whether there is good separation between the points for Malicious and Non-Malicious events. Then project onto the PC2 axis and see if there is separation between Malicious and Non-Malicious events, and whether it is better than the projection onto PC1.)
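
The hint can also be checked numerically. The sketch below projects simulated points onto each PC and compares a simple standardised gap between the class means. The data, the feature names and the separation() helper are all assumptions; a visual check with boxplots of the scores by Class would serve the same purpose.

```r
# Simulated data in which two features are shifted for the malicious class (made up)
set.seed(1)
n   <- 100
cls <- factor(rep(c("Non-malicious", "Malicious"), each = n / 2))
toy <- data.frame(
  Packet.Size   = rnorm(n, 1500, 100) + 300 * (cls == "Malicious"),
  Response.Size = rnorm(n, 200, 20) + 60 * (cls == "Malicious"),
  Packet.TTL    = rnorm(n, 64, 5)
)
pca <- prcomp(toy, scale. = TRUE)

# Hypothetical helper: gap between class means along one PC, in pooled SD units
# (a larger value means the two classes are better separated on that axis)
separation <- function(score, cls) {
  m <- tapply(score, cls, mean)
  s <- tapply(score, cls, sd)
  unname(abs(diff(m)) / sqrt(mean(s^2)))
}

# The scores in pca$x are the projections of each point onto the PC axes
sep <- c(PC1 = separation(pca$x[, "PC1"], cls),
         PC2 = separation(pca$x[, "PC2"], cls))
print(round(sep, 2))
```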

1. A single report (not exceeding 5 pages, excluding the cover page, contents page and reference page, if any) containing:
a. summary tables of all the variables in the dataset;
b. a list of data issues (if any);
c. your implementation of PCA and interpretation of the results, i.e. variances explained, and the contribution of each feature for PC1, PC2 and PC3;
d. PC1 vs PC2 biplot and its interpretation;
e. your explanation of selection and contribution of the features with respect to possible identification of Malicious events.

If you use any references in your analysis or discussion outside of the notes provided in the unit, you must cite your sources.

2. The dataset containing your sub-sample of 600 observations, i.e., mydata.
3. A copy of your R code.

