Assignment - Machine Learning Modelling
Citations and important disclosures to students:
The following data and scenario are entirely fictitious. They have been created using synthetic data made to match real-world scenarios. The data were created using a range of tools, including statistical models and AI/ML analysis of real-world security breaches.
Background Context
Given recent attacks on the Healthcare sector, and some noted data breaches, FauxCura Health have engaged Quantum.LogiGuardian (Q.LG), a cyber security consultancy and analytics firm.
FauxCura believes they may have had undetected cyber security breaches within their systems. As a caring and respectable healthcare provider, they want to examine their historic network data to determine whether undetected breaches have occurred.
The security operations centre at FauxCura run a Security Information and Event Management (SIEM) platform called Splunk™. This platform collects vast quantities of log data from servers, desktop computers, routers and other network equipment and aggregates it in the form of reports and alerts that can be viewed by security personnel to identify incidents that require investigation.
During an initial investigation of FauxCura's data, Q.LG were able to trawl through historical data to produce an initial report that provides some top-level metrics on incidents based on certain triggers. Incident-specific data is also retained, but generally consists of extremely large data sets.
It is hoped that the report data contains sufficient information to be able to construct an ML model that can more accurately identify events of interest.
Data Overview
The data you are working with are records extracted from FauxCura's SIEM. The records have already been processed and reduced to a summary of individual event detections that were triggered by the SIEM.
The data have also been aggregated from multiple other sources and reports in the SIEM. This means some values may be inconsistent across systems or there may be errors in the data that need to be identified and cleaned.
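One way to surface such inconsistencies is to tabulate each categorical feature and summarise each numeric one before any modelling. The sketch below uses a small made-up data frame: the column names `AlertCategory` and `ResponseTime` are assumed spellings of the features described in this brief, and the stray label `Info` and the negative time are invented purely to illustrate the checks.

```r
# Toy data only: these values are invented and are not taken from the real file.
toy <- data.frame(
  AlertCategory = c("Informational", "Warning", "Alert", "Info", "Warning", NA),
  ResponseTime  = c(120.5, 98.2, -1, 143.0, 101.7, 110.3),
  stringsAsFactors = FALSE
)

# Count every distinct label, including missing values, to spot stray levels.
category.counts <- table(toy$AlertCategory, useNA = "ifany")
print(category.counts)

# Numeric summaries reveal impossible values such as negative times.
print(summary(toy$ResponseTime))

# Flag rows that need attention: unexpected labels or out-of-range numbers.
valid.levels <- c("Informational", "Warning", "Alert")
suspect <- !(toy$AlertCategory %in% valid.levels) | toy$ResponseTime < 0
print(toy[suspect, ])
```

Whether a flagged value should be recoded, set to missing, or removed is a judgement that depends on the real data and should be justified in the report.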
Descriptions of Features:
Below is a brief explanation of the features in the data set. It is not necessary to understand these features. It is also important to note that feature naming conventions are very subjective. Reliance on the meaning of a name may miss important data or detail.
Alert Category (Categorical):
This feature describes what type of alert was created by Splunk. It is largely subjective as the alert creators can identify their own alert levels for different types of events. The levels present in the data can be approximately summarised as:
Informational:
An event that is being logged to the system for information purposes only. It is possible these could relate to malicious activity, but this is the lowest level of alert.
Warning:
This is a higher level of alert and typically used to identify a situation that may not be typical.
Alert:
These are typically used for specific events that represent a security concern that requires action.
NetworkEventType (Categorical):
This is the type of event that the SIEM report believes has occurred. It can be used to distinguish apparently normal network traffic from events such as policy violations, threat detections and data exfiltration.
NormalOperation:
No specific anomalies occur in this logged event - there are many reasons this data may be logged.
PolicyViolation:
A security or business policy has been violated. This can range from attempts to run unauthorised software on the network, to using the wrong type of web browser to access a database.
ThreatDetected:
A specific condition has been detected that has previously been identified as a security threat. These could be normal operations mis-tagged, or they may include malicious software or techniques in use.
NetworkInteractionType (Categorical):
This is another 'computer' metric that uses an unknown third-party plugin to identify network interactions that are not typical.
Regular:
These appear to be normal network traffic requests.
Elevated:
Requests that are attempting to access resources that require specific permissions. For example, a computer trying to log in to an administrative console or a restricted device.
Suspicious:
Generally, these are elevated network events that are unexpected, have come from an unexpected source, or have unexpected patterns of usage.
Anomalous:
Network interactions that aren't typical but may not have any relation to security events.
Critical:
A network condition that should never occur. This could be an interaction that indicates an attack condition, or a severe equipment outage or malfunction.
Unknown:
The interaction status is unknown.
DataTransferVolume (out and in) (Numeric):
Quantifies the amount of data transferred over the network. Separate values are given for data transferred into the network and data transferred out of the network.
TransactionsPerSession (Integer):
The number of transactions exchanged between devices and the service they are communicating with.
NetworkAccessFrequency (Integer):
Measures how frequently network ports are accessed, with abnormal frequencies potentially signalling unauthorized access attempts or scans.
UserActivityLevel (Numeric):
A generated metric indicating how active a user is on the system they are connected to. Higher scores generally mean more activity.
SystemAccessRate (Integer):
A generated metric that indicates how frequently the company's core systems are being accessed.
SessionIntegrityCheck (Logical):
A flag that indicates whether the session has been correctly opened, communicated and closed, with all underlying network protocols and signals correctly used.
ResourceUtilizationFlag (Logical):
A flag that is raised when the resource utilisation of servers or network devices is unusually high. This could include excessive memory consumption on some devices, slow response times, or large network transfers.
SecurityRiskLevel (Numeric):
A calculated metric created by a third-party "AI" plugin that can identify security risks based on unknown parameters and conditions.
ResponseTime (milliseconds) (Numeric):
Measures the time taken to respond to network requests or events. This is the time between when a network resource or event occurs, and the corresponding reply packet is returned.
Classification (Categorical):
The final classification of the event. Where indicated, the "Normal" and "Malicious" labels can be assumed to have been identified with reasonable accuracy.
The raw data for the above variables are contained in the HealthCareData_2024.csv file.
The needle in the haystack
The data were gathered over a period of time and processed by several systems in order to associate specific events with confirmed malicious activities. However, the number of confirmed malicious events was very low, with these events accounting for approximately 4% of all logged network events.
Although malicious events are quite uncommon, identifying them is extremely important.
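The practical consequence of a 4% malicious rate can be seen with some simple arithmetic. The sketch below assumes a hypothetical total of 10,000 events (the real row count is not stated in this brief) and scores a "model" that labels every event as Normal.

```r
# Illustrative counts only: 10,000 events is an assumed total, 4% malicious.
n.total     <- 10000
n.malicious <- round(0.04 * n.total)
n.normal    <- n.total - n.malicious

actual <- factor(rep(c("Malicious", "Normal"), c(n.malicious, n.normal)),
                 levels = c("Malicious", "Normal"))

# A useless "model" that labels every single event as Normal.
predicted <- factor(rep("Normal", n.total), levels = c("Malicious", "Normal"))

# Share of all events labelled correctly.
accuracy <- mean(predicted == actual)

# Share of truly malicious events that were caught.
recall <- sum(predicted == "Malicious" & actual == "Malicious") / n.malicious

cat("Accuracy:", accuracy, "\n")
cat("Recall:  ", recall, "\n")
```

Under these assumptions the accuracy is 0.96 even though no malicious event is detected, which is why overall accuracy alone is a weak guide for this task.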
Objectives
You are the data scientist that has been hired by Q.LG to examine the data and provide insights. Your goals will be to:
Clean the data file and prepare it for Machine Learning (ML).
Recommend an ML algorithm that will provide the most accurate detection of malicious events.
Create a brief report on your findings.
Your job
Your job is to develop the detection algorithms that will provide the most accurate incident detection. You do not need to concern yourself about the specifics of the SIEM plugin or software integration, i.e., your task is to focus on accurate classification of malicious events using R.
You are to test and evaluate two machine learning algorithms (each in two scenarios) to determine which supervised learning model is best for the task as described.
Task
You are to import and clean the same HealthCareData_2024.csv file that was used in the previous assignment. Then run, tune and evaluate two supervised ML algorithms (each with two types of training data) to identify the most accurate way of classifying malicious events.
Part 1 - General data preparation and cleaning
Import the HealthCareData_2024.csv into R Studio. This version is the same as Assignment 1.
Write the appropriate code in R Studio to prepare and clean the HealthCareData_2024 dataset as follows:
Clean the whole dataset based on the feedback received for Assignment 1.
For the feature NetworkInteractionType, merge the 'Regular' and 'Unknown' categories together to form the category 'Others'. Hint: use the forcats::fct_collapse(.) function.
Select only the complete cases using the na.omit(.) function, and name the dataset dat.cleaned.
Briefly outline the preparation and cleaning process in your report and why you believe the above steps were necessary.
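The merge and complete-case steps above can be sketched on a toy data frame. This is a minimal illustration, not the full cleaning procedure: the data frame `dat`, its values and the choice of a second column are invented, and the real file has many more rows and features.

```r
library(forcats)

# Toy stand-in for the imported data (values invented for illustration).
dat <- data.frame(
  NetworkInteractionType = factor(c("Regular", "Elevated", "Unknown",
                                    "Suspicious", "Regular", "Critical")),
  ResponseTime = c(101.2, NA, 87.5, 140.9, 99.0, 250.3)
)

# Merge 'Regular' and 'Unknown' into a single 'Others' level.
dat$NetworkInteractionType <- fct_collapse(dat$NetworkInteractionType,
                                           Others = c("Regular", "Unknown"))

# Keep only rows with no missing value in any column.
dat.cleaned <- na.omit(dat)

print(table(dat.cleaned$NetworkInteractionType))
print(nrow(dat.cleaned))
```

Note that na.omit(.) drops a row if any column is missing, so invalid entries should be converted to NA before this step for them to be excluded.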
Use the code below to generate two training datasets (one unbalanced mydata.ub.train and one balanced mydata.b.train) along with the testing set (mydata.test). Make sure you enter your student ID into the command set.seed(.).
Note that in the master data set, the percentage of malicious events is approximately 4%. This distribution is roughly represented by the unbalanced data. The balanced data is generated based on up-sampling of the minority class using bootstrapping. The idea here is to ensure the trained model is not biased towards the majority class, i.e. normal events.
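To make the idea of up-sampling by bootstrapping concrete, here is a minimal base-R sketch. It is not the code supplied with the assignment: the objects `train.ub` and `train.b`, the toy values and the seed are all hypothetical and serve only to show the mechanism.

```r
set.seed(12345)  # placeholder seed; the brief asks for your student ID here

# Toy unbalanced training set: 96 Normal and 4 Malicious events (invented).
train.ub <- data.frame(
  ResponseTime   = runif(100, 50, 300),
  Classification = factor(rep(c("Normal", "Malicious"), c(96, 4)))
)

# Split the row indices by class.
idx.normal    <- which(train.ub$Classification == "Normal")
idx.malicious <- which(train.ub$Classification == "Malicious")

# Bootstrap the minority class: sample WITH replacement up to the majority size.
idx.boot <- sample(idx.malicious, size = length(idx.normal), replace = TRUE)

# Combine all majority rows with the resampled minority rows.
train.b <- train.ub[c(idx.normal, idx.boot), ]

print(table(train.ub$Classification))
print(table(train.b$Classification))
```

In this sketch the balanced set contains repeated copies of the same few malicious rows, so it adds no new information; it only changes how much weight the minority class carries during training. The testing set is left at its natural class distribution.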
Part 2 - Compare the performances of different ML algorithms
Randomly select two supervised learning modelling algorithms to test against one another by running the following code. Make sure you enter your student ID into the command set.seed(.). Your 2 ML approaches are given by myModels.
For each of your two ML modelling approaches, you will need to:
Run the ML algorithm in R on the two training sets with Classification as the outcome variable.
Perform hyperparameter tuning to optimise the model:
Outline your hyperparameter tuning/searching strategy for each of the ML modelling approaches. Report on the search range(s) for hyperparameter tuning, which k-fold CV was used, the number of repeated CVs (if applicable), and the final optimal tuning parameter values and relevant CV statistics (i.e. CV results, tables and plots), where appropriate. If you are using repeated CVs, a minimum of 2 repeats is required.
If your selected tree model is Bagging, you must tune the nbagg, cp and minsplit hyperparameters, with at least 3 values for each.
If your selected tree model is Random Forest, you must tune the num.trees and mtry hyperparameters, with at least 3 values for each.
Be sure to set the randomisation seed using your student ID.
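As an illustration of what a search strategy involves, the base-R sketch below builds a grid for the two Random Forest hyperparameters named above and generates fold assignments for repeated k-fold CV. The candidate values, k = 10, the toy row count and the seed are assumptions chosen for the example, not recommended settings. Packages such as caret can manage the resampling for you (for example via trainControl(.) with method = "repeatedcv"), so this only shows the bookkeeping behind it.

```r
set.seed(12345)  # placeholder seed; the brief asks for your student ID here

# Candidate values (assumed for illustration), three per hyperparameter.
grid <- expand.grid(num.trees = c(200, 350, 500),
                    mtry      = c(2, 4, 6))

k       <- 10    # folds per CV run (assumed)
repeats <- 2     # minimum number of repeats stated in the brief
n       <- 200   # toy number of training rows

# For each repeat, shuffle a balanced vector of fold labels 1..k.
folds <- lapply(seq_len(repeats),
                function(r) sample(rep(seq_len(k), length.out = n)))

# Total number of model fits the full search would require.
n.fits <- nrow(grid) * k * repeats

print(grid)
cat("Model fits required:", n.fits, "\n")
```

Counting the fits up front is a useful check on run time before widening the search ranges.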
Evaluate the predictive performance of your two ML models, derived from the balanced and unbalanced training sets, on the testing set. Provide the confusion matrices and report and interpret the following measures in the context of the project:
Overall Accuracy
Precision
Recall
F1-score
Make sure you define each of the above metrics in the context of the study. Hint: Use the help menu in R Studio on the confusionMatrix(.) function to see how one can obtain the precision, recall and F1-score metrics.
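For reference, the four measures can be computed by hand from a 2x2 confusion matrix, treating "Malicious" as the positive class. The counts below are invented for illustration and are not results from the assignment data. In caret, confusionMatrix(.) has `positive` and `mode` arguments that control which class is treated as positive and which statistics are reported; check its help page for the exact options.

```r
# Invented test-set counts, for illustration only.
# Rows = predicted class, columns = actual class.
cm <- matrix(c(30,  20,
               10, 940),
             nrow = 2, byrow = TRUE,
             dimnames = list(Predicted = c("Malicious", "Normal"),
                             Actual    = c("Malicious", "Normal")))

TP <- cm["Malicious", "Malicious"]  # malicious events correctly flagged
FP <- cm["Malicious", "Normal"]     # normal events wrongly flagged
FN <- cm["Normal", "Malicious"]     # malicious events missed
TN <- cm["Normal", "Normal"]        # normal events correctly passed

accuracy  <- (TP + TN) / sum(cm)        # share of all events labelled correctly
precision <- TP / (TP + FP)             # share of flagged events that are malicious
recall    <- TP / (TP + FN)             # share of malicious events that are caught
f1        <- 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(c(Accuracy = accuracy, Precision = precision,
              Recall = recall, F1 = f1), 3))
```

With these made-up counts the accuracy is high while a quarter of the malicious events are still missed, which is the kind of trade-off the interpretation should address.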
Provide a brief statement on your final recommended model and why you have chosen it. This includes explaining which metric(s) you have used in making this decision and why. Parsimony and, to a lesser extent, interpretability may be taken into account if the decision is close. You may outline your penalised model estimates in the Appendix if it helps with your argument.