7COM1073 Foundations of Data Science - University of Hertfordshire
Assessment - Data Classification
The programming language you should use to complete this assessment is Python (version 3 or above). You may use functions from the following packages: Numpy, Pandas, Matplotlib, Seaborn and Sklearn. All Python skills needed for this assessment have been covered in the practical sessions; practical notes are available on Canvas.
Information on the Data
Fozziwig's Software Developers have contracted you to explore the possibility of an automated software defect prediction system. They want to know whether developing such a system would be cost-effective, based on the predictive accuracy that you can achieve with a sample of their data.
Static code metrics are measurements of software features. They can be used to quantify various software properties which may potentially relate to defect-proneness, and thus to code quality. Examples of such properties and how they are often measured include: size, via lines of code (LOC) counts; readability, via operand and operator counts; and complexity, via linearly independent path counts.
The data that you have been given contains the static code metrics for each of the functions which comprise a software system. This system was developed by Fozziwig's Software Developers several years ago. As well as the metrics for each function, it has also been recorded whether or not a fault was experienced in each function. This data came from the software testers who examined the system before it was publicly released.
You have been given two labelled data files: a training data set (trainingSet.csv) and a testing data set (testingSet.csv). Each data set contains 13 features (each one a software metric). Class labels are shown in the last column of each file: a value of '+1' means 'defective' (the software module contained a defect (fault)), while a value of '-1' means 'non-defective'. Note that this is clearly a simplification of the real world, as neither fault quantity nor severity has been taken into account.
Task 1: Data pre-processing and data exploration
a) Use Pandas to load both trainingSet.csv and testingSet.csv.
b) Find the number of patterns in each class for both loaded data sets using Python.
c) Choose an attribute and generate a boxplot for the two classes in the training set.
d) Show one scatter plot, that is, one feature plotted against another. Which two features to plot is your choice. Use the training set.
e) Divide the original training set into a smaller training set (II) and a validation set. In this task, use 55% of the total training data points as the validation set. An illustrative sketch covering steps a) to e) is given below.
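A minimal sketch of one way to carry out these steps follows. It assumes the class label sits in the last column of each file and that the files have a header row (pass header=None to read_csv otherwise); the variable names (X_train2, X_val, etc.) and the particular features plotted are illustrative choices, not prescribed by the brief.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# a) Load both data sets (assumes a header row; use header=None otherwise).
train_df = pd.read_csv('trainingSet.csv')
test_df = pd.read_csv('testingSet.csv')

# The last column is assumed to hold the class label (+1 / -1).
X_train, y_train = train_df.iloc[:, :-1], train_df.iloc[:, -1]
X_test, y_test = test_df.iloc[:, :-1], test_df.iloc[:, -1]

# b) Number of patterns in each class for both data sets.
print(y_train.value_counts())
print(y_test.value_counts())

# c) Boxplot of one chosen attribute (here the first), split by class.
train_df.boxplot(column=train_df.columns[0], by=train_df.columns[-1])
plt.show()

# d) Scatter plot of one feature against another (here the first two),
#    coloured by class label.
plt.scatter(X_train.iloc[:, 0], X_train.iloc[:, 1], c=y_train)
plt.xlabel(X_train.columns[0])
plt.ylabel(X_train.columns[1])
plt.show()

# e) Keep 45% as training set (II) and use 55% as the validation set.
X_train2, X_val, y_train2, y_val = train_test_split(
    X_train, y_train, test_size=0.55, stratify=y_train, random_state=0)
```

Stratifying the split (stratify=y_train) keeps the class proportions similar in the two subsets; the brief does not require it, but it avoids a chance imbalance in the validation set.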
Task 2: Do a principal component analysis
a) Perform PCA on the original training data set.
b) Plot a scree plot to report the variance captured by each principal component.
c) Project the test set onto the same PCA space produced from the original training data set.
d) Plot two subplots in one figure: one for the training data in the PC1 and PC2 projection space, with the data labelled according to class; the other for the test data in the same PCA space, also labelled according to class. An illustrative sketch covering steps a) to d) is given below.
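A sketch of one way to do this task, reusing X_train, y_train, X_test and y_test from the Task 1 sketch. The brief does not say whether to standardise the features first, so PCA is applied to the raw features here; insert a StandardScaler before the PCA if you judge the metric scales too different.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# a) Fit PCA on the original training data only.
pca = PCA()
train_pcs = pca.fit_transform(X_train)

# b) Scree plot: variance captured by each principal component.
plt.plot(range(1, len(pca.explained_variance_) + 1),
         pca.explained_variance_, marker='o')
plt.xlabel('Principal component')
plt.ylabel('Variance captured')
plt.show()

# c) Project the test set into the same PCA space (transform only,
#    so the training-set mean and components are reused).
test_pcs = pca.transform(X_test)

# d) Training and test data in the PC1/PC2 plane, labelled by class.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
for ax, pcs, labels, title in [(axes[0], train_pcs, y_train, 'Training set'),
                               (axes[1], test_pcs, y_test, 'Test set')]:
    for cls in (-1, 1):
        mask = (labels == cls).to_numpy()
        ax.scatter(pcs[mask, 0], pcs[mask, 1], s=10, label=f'class {cls:+d}')
    ax.set_xlabel('PC1')
    ax.set_ylabel('PC2')
    ax.set_title(title)
    ax.legend()
plt.show()
```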
Task 3: Do a classification using the Naïve Bayes Classification model
Train the model using the original training set and report its performance on the test set, including the accuracy rate. An illustrative sketch is given below.
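The brief does not name a particular Naïve Bayes variant; since the static code metrics are continuous, Gaussian Naïve Bayes is a reasonable assumption. A sketch, reusing the variables from the Task 1 sketch:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# Train on the original (full) training set.
nb = GaussianNB()
nb.fit(X_train, y_train)

# Report performance on the test set.
y_pred = nb.predict(X_test)
print('Test accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

classification_report adds per-class precision and recall, which are worth reporting alongside the accuracy rate if the classes are imbalanced.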
Task 4: Investigate how the number of features in the training dataset affects the model performance on the validation set
a) Use the training set (II) to train 13 Naïve Bayes Classification models with 13 different feature sets. That is: the first model uses the 1st feature only; the second uses the 1st and 2nd features; the third uses the 1st, 2nd and 3rd features; the fourth uses the first 4 features; and so on.
Measure the accuracy rate on both the training set (II) and the validation set. Report the results in a single figure: a plot of the accuracy rate against the number of features used in each model. There should be two curves in this figure: one for the training set (II), the other for the validation set.
b) Report the best number of features to use in this work and explain why you chose it. Write this down in your Jupyter notebook.
c) Use the selected number of features to train the model and report the performance on the test set. An illustrative sketch covering parts a) to c) is given below.
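A sketch of one way to do this task, reusing X_train2, y_train2, X_val and y_val from the Task 1 sketch. Here best_k is a hypothetical placeholder to be read off the validation curve, and part c) is interpreted as retraining on the original training set, as in Task 3.

```python
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# a) Train 13 models on nested feature sets: the first feature only,
#    the first two features, ..., all 13 features.
train_acc, val_acc = [], []
for k in range(1, 14):
    model = GaussianNB().fit(X_train2.iloc[:, :k], y_train2)
    train_acc.append(accuracy_score(y_train2, model.predict(X_train2.iloc[:, :k])))
    val_acc.append(accuracy_score(y_val, model.predict(X_val.iloc[:, :k])))

# Accuracy rate against number of features: one curve per data set.
plt.plot(range(1, 14), train_acc, marker='o', label='training set (II)')
plt.plot(range(1, 14), val_acc, marker='o', label='validation set')
plt.xlabel('Number of features')
plt.ylabel('Accuracy rate')
plt.legend()
plt.show()

# c) Retrain with the chosen number of features and report test accuracy.
best_k = 5  # hypothetical placeholder: choose from the validation curve
final = GaussianNB().fit(X_train.iloc[:, :best_k], y_train)
test_pred = final.predict(X_test.iloc[:, :best_k])
print('Test accuracy with', best_k, 'features:',
      accuracy_score(y_test, test_pred))
```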
Task 5: Summarize your findings and write your conclusions, using critical thinking (no more than 100 words), in your Jupyter notebook.
Attachment: Data Classification.rar