7COM1073 Foundations of Data Science - University of Hertfordshire
Assessment - Data Classification
The programming language you should use to complete this assessment is Python (version 3 or above). You may use functions from the following packages: Numpy, Pandas, Matplotlib, Seaborn and Sklearn. All Python skills needed for this assessment have been covered in the practical sessions; practical notes are available on Canvas.
Information on the Data
Fozziwig's Software Developers have contracted you to explore the possibility of an automated software defect prediction system. They want to know whether developing such a system would be cost-effective, based on the predictive accuracy that you can achieve with a sample of their data.
Static code metrics are measurements of software features. They can be used to quantify various software properties which may potentially relate to defect-proneness, and thus to code quality. Examples of such properties and how they are often measured include: size, via lines of code (LOC) counts; readability, via operand and operator counts; and complexity, via linearly independent path counts.
The data that you have been given contains the static code metrics for each of the functions which comprise a software system. This system was developed by Fozziwig's Software Developers several years ago. As well as the metrics for each function, it has also been recorded whether or not a fault was experienced in each function. This data came from the software testers who examined the system before it was publicly released.
You have been given two labelled data files: a training data set (trainingSet.csv) and a testing data set (testingSet.csv). Each data set contains 13 features (each one a software metric). Class labels are shown in the last column of each file: a value of '+1' means 'defective' (the software module contained a defect (fault)), while a value of '-1' means 'non-defective'. Note that this is clearly a simplification of the real world, as neither fault quantity nor severity has been taken into account.
Task 1: Data pre-processing and data exploration
a) Use Pandas to load both trainingSet.csv and testingSet.csv.
b) Find the number of patterns in each class for both loaded data sets using Python.
c) Choose an attribute and generate a boxplot for the two classes in the training set.
d) Show one scatter plot, that is, one feature plotted against another. Which two features to plot is your choice. Use the training set.
e) Divide the original training set into a smaller training set (II) and a validation set. In this task, use 55% of the total training data points as the validation set. An illustrative sketch covering steps a) to e) is given below.
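A minimal sketch of one way to carry out these steps follows. It assumes the class label sits in the last column of each file and that the files have a header row (pass header=None to read_csv otherwise); the variable names (X_train2, X_val, etc.) and the particular features plotted are illustrative choices, not prescribed by the brief.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# a) Load both data sets (assumes a header row; use header=None otherwise).
train_df = pd.read_csv('trainingSet.csv')
test_df = pd.read_csv('testingSet.csv')

# The last column is assumed to hold the class label (+1 / -1).
X_train, y_train = train_df.iloc[:, :-1], train_df.iloc[:, -1]
X_test, y_test = test_df.iloc[:, :-1], test_df.iloc[:, -1]

# b) Number of patterns in each class for both data sets.
print(y_train.value_counts())
print(y_test.value_counts())

# c) Boxplot of one chosen attribute (here the first), split by class.
train_df.boxplot(column=train_df.columns[0], by=train_df.columns[-1])
plt.show()

# d) Scatter plot of one feature against another (here the first two),
#    coloured by class label.
plt.scatter(X_train.iloc[:, 0], X_train.iloc[:, 1], c=y_train)
plt.xlabel(X_train.columns[0])
plt.ylabel(X_train.columns[1])
plt.show()

# e) Keep 45% as training set (II) and use 55% as the validation set.
X_train2, X_val, y_train2, y_val = train_test_split(
    X_train, y_train, test_size=0.55, stratify=y_train, random_state=0)
```

Stratifying the split (stratify=y_train) keeps the class proportions similar in the two subsets; the brief does not require it, but it avoids a chance imbalance in the validation set.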
Task 2: Do a principal component analysis
a) Perform PCA on the original training data set.
b) Plot a scree plot to report the variance captured by each principal component.
c) Project the test set onto the same PCA space produced from the original training data set.
d) Plot two subplots in one figure: one for the training data in the PC1 and PC2 projection space, with the data labelled according to class; the other for the test data in the same PCA space, also labelled according to class. An illustrative sketch covering steps a) to d) is given below.
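A sketch of one way to do this task, reusing X_train, y_train, X_test and y_test from the Task 1 sketch. The brief does not say whether to standardise the features first, so PCA is applied to the raw features here; insert a StandardScaler before the PCA if you judge the metric scales too different.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# a) Fit PCA on the original training data only.
pca = PCA()
train_pcs = pca.fit_transform(X_train)

# b) Scree plot: variance captured by each principal component.
plt.plot(range(1, len(pca.explained_variance_) + 1),
         pca.explained_variance_, marker='o')
plt.xlabel('Principal component')
plt.ylabel('Variance captured')
plt.show()

# c) Project the test set into the same PCA space (transform only,
#    so the training-set mean and components are reused).
test_pcs = pca.transform(X_test)

# d) Training and test data in the PC1/PC2 plane, labelled by class.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
for ax, pcs, labels, title in [(axes[0], train_pcs, y_train, 'Training set'),
                               (axes[1], test_pcs, y_test, 'Test set')]:
    for cls in (-1, 1):
        mask = (labels == cls).to_numpy()
        ax.scatter(pcs[mask, 0], pcs[mask, 1], s=10, label=f'class {cls:+d}')
    ax.set_xlabel('PC1')
    ax.set_ylabel('PC2')
    ax.set_title(title)
    ax.legend()
plt.show()
```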
Task 3: Do a classification using the Naïve Bayes Classification model
Train the model using the original training set and report its performance on the test set, including the accuracy rate. An illustrative sketch is given below.
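The brief does not name a particular Naïve Bayes variant; since the static code metrics are continuous, Gaussian Naïve Bayes is a reasonable assumption. A sketch, reusing the variables from the Task 1 sketch:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# Train on the original (full) training set.
nb = GaussianNB()
nb.fit(X_train, y_train)

# Report performance on the test set.
y_pred = nb.predict(X_test)
print('Test accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

classification_report adds per-class precision and recall, which are worth reporting alongside the accuracy rate if the classes are imbalanced.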
Task 4: Investigate how the number of features in the training dataset affects the model performance on the validation set
a) Use the training set (II) to train 13 Naïve Bayes Classification models with 13 different feature sets. That is: the first model uses the 1st feature only; the second uses the 1st and 2nd features; the third uses the 1st, 2nd and 3rd features; the fourth uses the first 4 features; and so on.
Measure the accuracy rate on both the training set (II) and the validation set. Report the results in a single figure: a plot of the accuracy rate against the number of features used in each model. There should be two curves in this figure: one for the training set (II), the other for the validation set.
b) Report the best number of features to use in this work and explain why you chose it. Write this down in your Jupyter notebook.
c) Use the selected number of features to train the model and report the performance on the test set. An illustrative sketch covering parts a) to c) is given below.
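A sketch of one way to do this task, reusing X_train2, y_train2, X_val and y_val from the Task 1 sketch. Here best_k is a hypothetical placeholder to be read off the validation curve, and part c) is interpreted as retraining on the original training set, as in Task 3.

```python
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# a) Train 13 models on nested feature sets: the first feature only,
#    the first two features, ..., all 13 features.
train_acc, val_acc = [], []
for k in range(1, 14):
    model = GaussianNB().fit(X_train2.iloc[:, :k], y_train2)
    train_acc.append(accuracy_score(y_train2, model.predict(X_train2.iloc[:, :k])))
    val_acc.append(accuracy_score(y_val, model.predict(X_val.iloc[:, :k])))

# Accuracy rate against number of features: one curve per data set.
plt.plot(range(1, 14), train_acc, marker='o', label='training set (II)')
plt.plot(range(1, 14), val_acc, marker='o', label='validation set')
plt.xlabel('Number of features')
plt.ylabel('Accuracy rate')
plt.legend()
plt.show()

# c) Retrain with the chosen number of features and report test accuracy.
best_k = 5  # hypothetical placeholder: choose from the validation curve
final = GaussianNB().fit(X_train.iloc[:, :best_k], y_train)
test_pred = final.predict(X_test.iloc[:, :best_k])
print('Test accuracy with', best_k, 'features:',
      accuracy_score(y_test, test_pred))
```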
Task 5: Summarize your findings and write your conclusions, using critical thinking (no more than 100 words), in your Jupyter notebook.
Attachment: Data Classification.rar