Task 1: Split the dataset
Write a Python script, lab5a.py, to do the following:
1. You will be using the following packages and should add the imports first: import pandas
2. You have to use the same data set (dataset.csv) from lab 2. Read the data from the CSV file into a data frame object.
3. Obtain 30 random samples of benignware and save them in a new data frame. Use random.sample() on the row indices for this, or the data frame's own sample() method.
4. Repeat the last step for malware samples.
5. Combine the two sets obtained in steps 3 and 4.
6. Remove these samples from the original data frame (the one loaded from dataset.csv). Write the remaining data to a CSV file named "reduceddataset.csv". The file will contain all the data for the rest of the samples.
7. Write the data (only the MD5 hash and label columns) from step 5 in a csv file named "samples.csv". The file contains only md5 hash and labels for selected samples.
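The steps of Task 1 can be sketched as follows. This is a minimal illustration, not the reference solution: it uses pandas' DataFrame.sample() in place of the random-module call named in step 3, the column names "md5" and "label" are assumptions about dataset.csv, and a small synthetic data frame and a temporary directory stand in for the real file and working folder.

```python
import os
import tempfile

import pandas as pd


def split_dataset(df, n_per_class=30, seed=0):
    """Return (reduced, samples); samples holds n_per_class random rows per label."""
    benign = df[df["label"] == "benignware"].sample(n=n_per_class, random_state=seed)
    malware = df[df["label"] == "malware"].sample(n=n_per_class, random_state=seed)
    samples = pd.concat([benign, malware])
    # Drop the selected rows by index so the two sets never overlap.
    reduced = df.drop(index=samples.index)
    return reduced, samples


# Synthetic stand-in for dataset.csv (the lab would use pd.read_csv("dataset.csv")).
# The column names "md5" and "label" are assumed, not taken from the real file.
df = pd.DataFrame({
    "md5": [f"{i:032x}" for i in range(200)],
    "label": ["malware" if i % 2 else "benignware" for i in range(200)],
    "feature_a": [i % 3 for i in range(200)],
})

reduced, samples = split_dataset(df)

# A temporary directory is used here; the lab writes to the working directory.
out_dir = tempfile.mkdtemp()
reduced.to_csv(os.path.join(out_dir, "reduceddataset.csv"), index=False)
samples[["md5", "label"]].to_csv(os.path.join(out_dir, "samples.csv"), index=False)

print(len(reduced), "rows kept,", len(samples), "rows held out")
```

Fixing random_state makes the selection repeatable, which helps when the same 60 samples must be located on disk later in Task 4.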
Task 2: Feature selection using cross-validation
Write a Python script called lab5b.py to do the following:
1. You will be using the following packages and should add the imports first:
2. Read the data from "reduceddataset.csv" into a data frame.
3. Add another label called "target" to encode labels as 0 and 1.
4. Extract the features only, i.e. by dropping the MD5 hash, label and target columns.
5. Use StratifiedKFold to split the data into 10 folds.
6. Declare two lists, one to store the name of the method used and second to store the selected features.
7. Perform the following selections one by one and append the results to the lists:
a. SelectKBest chi2, code is given below
b. SelectKBest mutual info
c. SelectPercentile mutual info
8. Now we will send these lists to our classifier model. We are using SVC with kernel="linear".
9. Use a for loop to call the function cross_val_score, providing the pipeline and the kfold object. Check the following links; we are looking at 2-3 lines of code at most here:
10. Use numpy to calculate the mean and standard deviation of the F1 scores.
11. Copy the values in the table given and include in your final report.
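The loop described in steps 5 to 10 might look like the sketch below. It is an illustration under stated assumptions: the feature matrix is synthetic binary data rather than the lab's dataset, k=5 and percentile=25 are arbitrary placeholder values, and mutual_info_classif is wrapped with functools.partial so that the features are treated as discrete and the scores are reproducible.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, SelectPercentile, chi2, mutual_info_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic binary features standing in for the real dataset (assumption).
rng = np.random.default_rng(0)
y = np.array([0, 1] * 100)
X = rng.integers(0, 2, size=(200, 20))
for j in range(3):
    # Make the first three features informative: equal to y, flipped 10% of the time.
    flip = rng.random(200) < 0.1
    X[:, j] = np.where(flip, 1 - y, y)

# Treat features as discrete and fix the seed (a choice made for this sketch).
mi = partial(mutual_info_classif, discrete_features=True, random_state=0)

methods = ["SelectKBest chi2", "SelectKBest mutual info", "SelectPercentile mutual info"]
selectors = [
    SelectKBest(chi2, k=5),
    SelectKBest(mi, k=5),
    SelectPercentile(mi, percentile=25),
]

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

results = {}
for name, selector in zip(methods, selectors):
    # The selector is fitted inside each training fold, never on the test fold.
    pipeline = Pipeline([("select", selector), ("model", SVC(kernel="linear"))])
    scores = cross_val_score(pipeline, X, y, cv=kfold, scoring="f1")
    results[name] = (float(np.mean(scores)), float(np.std(scores)))
    print(f"{name}: mean={results[name][0]:.3f} std={results[name][1]:.3f}")
```

The printed mean and standard deviation per method are the values that would go into the report table.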
Task 3: Model Selection using Cross-validation
This task is almost identical to task 2, but we use different models for each case. Write a python script called lab5c.py to do the following:
1. You will be using the following packages and should add the imports first:
2. Repeat steps 2-6 from task 2.
3. Based on my results from the table in task 2, I picked the following feature selection method. You have to look into your table and pick the method and scoring function with highest mean value.
4. Create a for loop to run each model using your pipeline, calculate the scores using cross_val_score. Print the scores and use numpy to calculate the mean and std of the scores. Fill the table in your report. Create a boxplot to display and compare the scores.
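A possible shape for the loop in step 4 is sketched below, using the four models this lab compares (Decision Tree, Random Forest, SVC, BernoulliNB). The data is synthetic, and SelectKBest with chi2 and k=5 is only a placeholder for whichever feature selection method scored best in your own task 2 table.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary features standing in for the real dataset (assumption).
rng = np.random.default_rng(0)
y = np.array([0, 1] * 100)
X = rng.integers(0, 2, size=(200, 20))
for j in range(3):
    flip = rng.random(200) < 0.1
    X[:, j] = np.where(flip, 1 - y, y)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "SVC": SVC(kernel="linear"),
    "BernoulliNB": BernoulliNB(),
}

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    # Placeholder selector; substitute the method chosen from the task 2 table.
    pipeline = Pipeline([("select", SelectKBest(chi2, k=5)), ("model", model)])
    scores = cross_val_score(pipeline, X, y, cv=kfold, scoring="f1")
    results[name] = scores
    print(f"{name}: mean={np.mean(scores):.3f} std={np.std(scores):.3f}")
```

Keeping the per-fold scores in the results dictionary, rather than only the mean, leaves them available for the box plot.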
Task 4: Live classification and performance evaluation
This task is lengthier than the others, but most of the work is already done and you only have to run the scripts again. You can write one script to do all the work or break it down into two or more smaller scripts. I recommend and prefer several small scripts (easier to write and debug).
a.) Write a Python script called lab5d1.py to do the following:
1. We need to move the samples selected in task 1 (30 malware + 30 benignware) to a different folder. Use the script "scanfiles.py" from lab 2 to scan each file and generate the md5 hash. Match the hashes with your data set "samples.csv" and move those files to a different folder named "samples".
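The hash-and-move step could be sketched as below. Note the assumptions: hashlib is used directly as a stand-in for the lab's scanfiles.py (whose interface is not shown here), the "md5" column name in samples.csv is assumed, and the demo runs on three throwaway files in a temporary directory.

```python
import csv
import hashlib
import shutil
import tempfile
from pathlib import Path


def md5_of(path):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def move_selected(source_dir, samples_csv, dest_dir):
    """Move every file in source_dir whose MD5 is listed in samples_csv."""
    with open(samples_csv, newline="") as fh:
        wanted = {row["md5"] for row in csv.DictReader(fh)}  # assumed column name
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(Path(source_dir).iterdir()):
        if path.is_file() and md5_of(path) in wanted:
            shutil.move(str(path), str(dest / path.name))
            moved.append(path.name)
    return moved


# Demo on throwaway files; the real script would point at the lab's sample folder.
root = Path(tempfile.mkdtemp())
files_dir = root / "files"
files_dir.mkdir()
for name, data in [("a.bin", b"alpha"), ("b.bin", b"beta"), ("c.bin", b"gamma")]:
    (files_dir / name).write_bytes(data)

samples_csv = root / "samples.csv"
with open(samples_csv, "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["md5", "label"])
    writer.writerow([hashlib.md5(b"alpha").hexdigest(), "malware"])
    writer.writerow([hashlib.md5(b"gamma").hexdigest(), "benignware"])

moved = move_selected(files_dir, samples_csv, root / "samples")
print("moved:", moved)
```

Loading the wanted hashes into a set keeps each lookup constant-time, so the cost is dominated by hashing the files themselves.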
b.) Write a Python script called lab5d2.py to do the following:
1. You will be using the following packages and should add the imports first:
2. Load the data from "reduceddataset.csv" into a data frame object.
3. Add another label called "target" to encode labels as 0 and 1.
4. Extract the features only, i.e. by dropping the MD5 hash, label and target columns.
5. Do feature selection using SelectKBest with chi2 and k=15.
6. Train the classifier. Use SVC with linear kernel.
7. Persist the SVC model in a file called "saved_detector.pkl". Below is the code that can write a model to a file (we can save the model as an object in a file for later use).
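Steps 5 to 7 can be sketched as follows. This is one possible version and not necessarily the code distributed with the lab: the data is synthetic, and the model is written to a temporary directory rather than the working folder.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import SVC

# Synthetic binary features standing in for reduceddataset.csv (assumption).
rng = np.random.default_rng(0)
y = np.array([0, 1] * 100)
X = rng.integers(0, 2, size=(200, 40))
for j in range(3):
    flip = rng.random(200) < 0.1
    X[:, j] = np.where(flip, 1 - y, y)

# Step 5: keep the 15 features with the highest chi2 score.
selector = SelectKBest(chi2, k=15)
X_selected = selector.fit_transform(X, y)

# Step 6: train the linear SVC on the selected features only.
model = SVC(kernel="linear")
model.fit(X_selected, y)

# Step 7: persist the trained model with pickle.
model_path = os.path.join(tempfile.mkdtemp(), "saved_detector.pkl")
with open(model_path, "wb") as fh:
    pickle.dump(model, fh)

# Loading it back yields an equivalent classifier.
with open(model_path, "rb") as fh:
    restored = pickle.load(fh)

print("selected feature indices:", np.flatnonzero(selector.get_support()))
```

The saved model only accepts vectors built from the same 15 features in the same order, so the selected feature list (selector.get_support()) must be kept alongside it. Only unpickle files you created yourself, since pickle can execute arbitrary code on load.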
c.) At this point, you can run the "detector.py" script given with the lab. Run it on the "samples" folder, which holds the 60 samples you selected randomly in task 1. The purpose of this script is:
- To analyze the files through capa (same as what we did in lab 2) and check the features found against the list selected to create appropriate vector.
- Run the vector through the machine learning model and classify the samples.
- Write a JSON-formatted message to a log, containing the timestamp, the MD5 hash of the file and the classification of the file (malware/benignware).
This is called live classification: we use a saved model and run it on live files to label them as malware or benignware. Once this script has finished executing, check the detector.log file for the results.
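To make the log format concrete, a single log line might be produced as in the sketch below. The field names ("timestamp", "md5", "classification", "probability") are hypothetical; check the detector.py supplied with the lab for the names it actually writes.

```python
import json
from datetime import datetime, timezone


def format_log_entry(md5, is_malware, probability):
    """Build one JSON log line; the field names are illustrative assumptions."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "md5": md5,
        "classification": "malware" if is_malware else "benignware",
        "probability": probability,
    }
    # json.dumps emits a single line, so each entry occupies one line of the log.
    return json.dumps(entry)


line = format_log_entry("d41d8cd98f00b204e9800998ecf8427e", True, 0.93)
print(line)
```

Writing one JSON object per line is what allows the evaluation script to read the log line by line and parse each entry independently.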
d.) Write a Python script called lab5d3.py to do the following:
1. You will be using the following packages and should add the imports first:
2. Load the "samples.csv" into a data frame object.
3. Add another label called "target" to encode labels as 0 and 1.
4. Open the "detector.log" file and read all lines into a list.
5. Each item in the above list is a JSON-formatted string, which we need to convert into a Python dictionary for further processing. Use the json.loads function to do the conversion, and make sure to strip newline characters before calling it.
6. Create the true and prediction vectors
i. Read the classification from each entry of the log generated by live classification (the list created in the previous step): if it is "malware", append 1 to y_pred; otherwise append 0. y_proba holds the probability values from the same list.
ii. y_true holds the actual labels from the samples.csv file.
7. Print the classification report and the confusion matrix using y_pred and y_true values.
8. Create the plot and display it.
9. Create the ROC curve and plot it.
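Steps 4 to 6 could be sketched as below, using only the standard library. The log lines and true labels are made-up illustrations, and the field names ("md5", "classification", "probability") are assumptions about what detector.py writes.

```python
import json

# Illustrative log lines as they would come from readlines(), newline included.
# Field names and values are assumptions, not real detector output.
log_lines = [
    '{"timestamp": "2024-01-01T00:00:00+00:00", "md5": "aaa", "classification": "malware", "probability": 0.91}\n',
    '{"timestamp": "2024-01-01T00:00:01+00:00", "md5": "bbb", "classification": "benignware", "probability": 0.2}\n',
    '{"timestamp": "2024-01-01T00:00:02+00:00", "md5": "ccc", "classification": "benignware", "probability": 0.45}\n',
]

# Stand-in for samples.csv: MD5 hash mapped to the true label.
true_labels = {"aaa": "malware", "bbb": "benignware", "ccc": "malware"}

# Strip the trailing newline, skip blank lines, and parse each line into a dict.
entries = [json.loads(line.strip()) for line in log_lines if line.strip()]

y_true, y_pred, y_proba = [], [], []
for entry in entries:
    y_pred.append(1 if entry["classification"] == "malware" else 0)
    y_proba.append(entry["probability"])
    # Look up the true label by hash so the vectors stay aligned entry by entry.
    y_true.append(1 if true_labels[entry["md5"]] == "malware" else 0)

print(y_true, y_pred, y_proba)
```

Matching by MD5 hash rather than by position matters because the detector may not process the files in the same order as the rows in samples.csv.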
Model selection using cross-validation
Objective
In this lab we will use cross-validation (also known as rotation estimation or out-of-sample testing) to evaluate different models.
What to do
To complete this lab, these objectives must be completed:
- Write a report.
- Use cross-validation to evaluate different models.
Tasks
Task 0: Create a report
1. Create a Word document and write the details of your lab completion in it. This will serve as proof that the lab was satisfactorily completed.
2. Each heading should be a task (Task 1, Task 2, etc), with screenshots and descriptions that prove the task was completed satisfactorily. Fill these headings out as you complete the lab.
Task 1: Split the dataset to be used for testing
1. Randomly select 30 malware samples and 30 benign samples from the dataset. These samples will be used to test the model later in the lab.
2. Create a file with the MD5 hash and the true label (malware/benignware) of each sample.
3. Remove the randomly selected samples from the full dataset.
Task 2: Feature selection using cross-validation
1. We will use cross-validation to compare 4 different feature selection strategies. Choose a value for K (the number of features) to be used for the SelectKBest method and a value for percentile for SelectPercentile method.
2. Write a Python script to calculate the cross-validation score for each feature selection strategy using an SVC classifier.
a) Use StratifiedKFold to split the data into 10 folds.
i. What is the difference between using KFold and StratifiedKFold?
b) Create a pipeline with two steps, the feature selection, and the model.
i. Why do we need to use a pipeline to process the data when using cross-validation?
c) Calculate the cross-validation F1 score. Calculate the mean and the standard deviation of the scores.
3. Fill the table with the values you selected and the results of the cross-validation score.
Method            Scoring function      K / Percentile   F1 Mean   F1 Std.
SelectKBest       chi2
SelectKBest       mutual_info_classif
SelectPercentile  chi2
SelectPercentile  mutual_info_classif
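As a starting point for the question in step 2(a), the short experiment below counts how many positive samples land in each test fold under the two splitters. The 80/20 label vector is synthetic and deliberately sorted by class to make the effect easy to see.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic, sorted labels: 80 negatives followed by 20 positives (assumption).
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((100, 1))  # the features are irrelevant to how the folds are built


def positives_per_fold(splitter):
    """Count the positive samples in each test fold produced by the splitter."""
    return [int(y[test].sum()) for _, test in splitter.split(X, y)]


plain = positives_per_fold(KFold(n_splits=10))
stratified = positives_per_fold(StratifiedKFold(n_splits=10))

print("KFold:          ", plain)       # [0, 0, 0, 0, 0, 0, 0, 0, 10, 10]
print("StratifiedKFold:", stratified)  # [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
```

With unshuffled KFold most test folds contain no positives at all, which makes an F1 score on those folds meaningless; the stratified splitter keeps the class ratio the same in every fold.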
Task 3: Model Selection using Cross-validation.
1. We will compare the following 4 models and select the best one of them: Decision Tree, Random Forest, SVC, and BernoulliNB. For each model, choose a feature selection strategy.
2. Write a Python script that, for each classifier:
a) Splits the data into 10 folds using StratifiedKFold.
b) Creates a pipeline with the feature selection and the model.
c) Calculates the cross-validation score using F1 and then calculates the mean and standard deviation.
3. Fill the table and select the best model.
Model          Feature Selection   F1 Mean   F1 Std.
Decision Tree
Random Forest
SVC
BernoulliNB
4. Display results as a box plot.
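A box plot of the per-fold scores might be drawn as in the sketch below. The scores here are random placeholder numbers, not real results; in your script they would be the arrays returned by cross_val_score. The Agg backend and the temporary output file are choices made so the sketch runs without a display; use plt.show() to view the plot interactively.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

# Placeholder scores only, NOT real results: ten made-up F1 values per model.
rng = np.random.default_rng(0)
model_names = ["Decision Tree", "Random Forest", "SVC", "BernoulliNB"]
scores = {name: rng.uniform(0.8, 1.0, size=10) for name in model_names}

fig, ax = plt.subplots()
ax.boxplot(list(scores.values()))
# Boxes are drawn at positions 1..n, so label those tick positions explicitly.
ax.set_xticks(range(1, len(scores) + 1))
ax.set_xticklabels(list(scores.keys()))
ax.set_ylabel("F1 score")
ax.set_title("Cross-validation F1 by model")

plot_path = os.path.join(tempfile.mkdtemp(), "model_comparison.png")
fig.savefig(plot_path)
plt.close(fig)
```

Each box summarises the ten fold scores of one model, so both the median and the spread can be compared at a glance.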
Task 4: Live classification and performance evaluation.
1. For the model selected in the previous task, write a Python script that uses the complete dataset (except the samples that were excluded in Task 1) to:
a. Select the features.
b. Train the classifier.
c. Persist the classifier using pickle.
2. Run the scanfiles.py script created in lab 1 to scan files using the model saved on the monitoring (Ubuntu) server.
3. Download the files to the Windows machine from the Kali host while the script is running.
4. Create a new Python script that uses the log created by the detector and the true values of the samples to:
a. Create and print the classification report.
b. Display the confusion matrix.
c. Plot the ROC curve.
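The three evaluation outputs can be computed with scikit-learn as sketched below. The vectors are small hand-made examples, not real detector output; in the lab they would be built from detector.log and samples.csv.

```python
from sklearn.metrics import auc, classification_report, confusion_matrix, roc_curve

# Hand-made illustrative vectors (1 = malware, 0 = benignware), not real results.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
y_proba = [0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.7, 0.3]

# a. Precision, recall and F1 per class.
print(classification_report(y_true, y_pred, target_names=["benignware", "malware"]))

# b. Rows are true classes, columns are predicted classes, ordered [0, 1].
cm = confusion_matrix(y_true, y_pred)
print(cm)

# c. ROC points are computed from the probabilities, not the hard predictions.
fpr, tpr, thresholds = roc_curve(y_true, y_proba)
roc_auc = auc(fpr, tpr)
print(f"AUC = {roc_auc:.4f}")
```

The fpr and tpr arrays are the x and y coordinates of the ROC curve and can be passed directly to a line plot, with the diagonal from (0, 0) to (1, 1) as the chance baseline.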