Do classification using the Naïve Bayes classification model

Reference no: EM133144711

7COM1073 Foundations of Data Science - University of Hertfordshire

Assessment - Data Classification

The programming language for this assessment is Python (version 3 or above). You may use functions from the following packages: NumPy, pandas, Matplotlib, Seaborn and scikit-learn (sklearn). All Python skills needed for this assessment have been covered in the practical sessions; the practical notes are available on Canvas.

Information on the Data
Fozziwig's Software Developers have contracted you to explore the possibility of an automated software defect prediction system. They want to know if developing such a system would be cost-effective, based on the predictive accuracy that you can achieve with a sample of their data.

Static code metrics are measurements of software features. They can be used to quantify various software properties which may potentially relate to defect-proneness, and thus to code quality. Examples of such properties and how they are often measured include: size, via lines of code (LOC) counts; readability, via operand and operator counts; and complexity, via linearly independent path counts.

The data that you have been given contains the static code metrics for each of the functions which comprise a software system. This system was developed by Fozziwig's Software Developers several years ago. As well as the metrics for each function, it has also been recorded whether or not a fault was experienced in each function. This data came from the software testers who examined the system before it was publicly released.

You have been given two labelled data files: a training data set (trainingSet.csv) and a testing data set (testingSet.csv). Each data set contains 13 features (each one a software metric). Class labels are shown in the last column of each file: a value of '+1' means 'defective' (the software module contained a defect, i.e. a fault), while a value of '-1' means 'non-defective'. Note that this is clearly a simplification of the real world, as neither fault quantity nor severity has been taken into account.
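The file layout described above (13 metric columns followed by a label column) can be sketched as follows. This is a minimal illustration on synthetic data, not the real files: the column names metric_1 ... metric_13 and label, the class proportions and the random values are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for trainingSet.csv: 13 metric columns plus a label column.
# Column names, class proportions and values are invented for illustration.
rng = np.random.default_rng(0)
n_rows, n_features = 200, 13
train_df = pd.DataFrame(
    rng.normal(size=(n_rows, n_features)),
    columns=[f"metric_{i + 1}" for i in range(n_features)],
)
train_df["label"] = rng.choice([-1, 1], size=n_rows, p=[0.8, 0.2])

# With the real data this would be something like:
#   train_df = pd.read_csv("trainingSet.csv")
# (header=None may be needed if the files have no header row; check the files).

# The class label is the last column; everything before it is a feature.
X_train = train_df.iloc[:, :-1]
y_train = train_df.iloc[:, -1]

# Number of rows in each class.
class_counts = y_train.value_counts()
print(class_counts)
```

Selecting columns by position rather than by name keeps the sketch independent of whatever header names the real files use.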

Task 1: Data pre-processing and data exploration
a) Use Pandas to load both trainingSet.csv and testingSet.csv.
b) Find the number of patterns (data points) in each class for both loaded data sets using Python.
c) Choose an attribute and generate a boxplot for the two classes in the training set.
d) Show one scatter plot of one feature against another, using the training set. The choice of the two features is yours.
e) Divide the original training set into a smaller training set (II) and a validation set. In this task, you need to use 55% of the total training data points as the validation set.
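Steps c) to e) might look roughly like the sketch below. It runs on synthetic data with hypothetical column names (metric_1, metric_2, label). The choice of features to plot, the stratified split and the fixed random_state are assumptions for the example, not requirements stated in the task.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training set (names and values are invented).
rng = np.random.default_rng(1)
n_rows = 400
labels = rng.choice([-1, 1], size=n_rows, p=[0.8, 0.2])
metrics = rng.normal(loc=(labels[:, None] == 1) * 0.8, size=(n_rows, 13))
train_df = pd.DataFrame(metrics, columns=[f"metric_{i + 1}" for i in range(13)])
train_df["label"] = labels

fig, (ax_box, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# (c) Boxplot of one attribute, with one box per class.
groups = [
    train_df.loc[train_df["label"] == c, "metric_1"].to_numpy() for c in (-1, 1)
]
ax_box.boxplot(groups)
ax_box.set_xticklabels(["-1 (non-defective)", "+1 (defective)"])
ax_box.set_ylabel("metric_1")

# (d) Scatter plot of one feature against another, coloured by class.
for c, colour in ((-1, "tab:blue"), (1, "tab:red")):
    subset = train_df[train_df["label"] == c]
    ax_scatter.scatter(
        subset["metric_1"], subset["metric_2"], s=10, c=colour, label=f"class {c:+d}"
    )
ax_scatter.set_xlabel("metric_1")
ax_scatter.set_ylabel("metric_2")
ax_scatter.legend()

# (e) Hold out 55% of the training rows as the validation set.
# stratify=y keeps the class proportions similar in both parts (an assumption;
# the task does not say whether the split should be stratified).
X = train_df.drop(columns="label")
y = train_df["label"]
X_train2, X_val, y_train2, y_val = train_test_split(
    X, y, test_size=0.55, stratify=y, random_state=0
)
print(len(X_train2), len(X_val))
```

Because scikit-learn rounds the validation size up, the validation set can be one row larger than exactly 55% of the data.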

Task 2: Do a principal component analysis

a) Perform a PCA analysis on the original training data set.
b) Plot a scree plot to report variances captured by each principal component.
c) Project the test set on the same PCA space produced by the original training dataset.
d) Plot two subplots in one figure: one showing the training data in the PC1 and PC2 projection space, and the other showing the test data in the same PCA space. In both subplots, label the data points according to their class.
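One way to approach the four PCA steps is sketched below on synthetic data. Standardising the features with StandardScaler before PCA is an assumption (the task does not mention scaling, but code metrics often have very different ranges). The helper make_data is hypothetical and only exists to make the example runnable.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def make_data(n_rows, seed):
    """Hypothetical generator standing in for the real CSV files."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n_rows, p=[0.8, 0.2])
    X = rng.normal(loc=(y[:, None] == 1) * 0.8, size=(n_rows, 13))
    return X, y


X_train, y_train = make_data(300, seed=0)
X_test, y_test = make_data(150, seed=1)

# (a) Fit the scaler and the PCA on the training data only.
scaler = StandardScaler().fit(X_train)
pca = PCA().fit(scaler.transform(X_train))
train_pcs = pca.transform(scaler.transform(X_train))

# (c) Project the test set into the same space, reusing the fitted objects.
test_pcs = pca.transform(scaler.transform(X_test))

# (b) Scree plot: fraction of variance captured by each principal component.
fig_scree, ax_scree = plt.subplots()
components = np.arange(1, pca.n_components_ + 1)
ax_scree.plot(components, pca.explained_variance_ratio_, marker="o")
ax_scree.set_xlabel("Principal component")
ax_scree.set_ylabel("Explained variance ratio")

# (d) Training and test data in the PC1/PC2 plane, coloured by class.
fig_proj, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
panels = ((axes[0], train_pcs, y_train, "Training set"),
          (axes[1], test_pcs, y_test, "Test set"))
for ax, pcs, y, title in panels:
    for c in (-1, 1):
        mask = y == c
        ax.scatter(pcs[mask, 0], pcs[mask, 1], s=10, label=f"class {c:+d}")
    ax.set_title(title)
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    ax.legend()
```

The test set is only ever passed to transform, never to fit, so both subplots share the axes learned from the training data.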

Task 3: Do a classification using the Naïve Bayes Classification model
Train the model using the original training set and report its performance on the test set, including the accuracy rate.
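A minimal sketch of this step is shown below, assuming the Gaussian variant of Naïve Bayes (scikit-learn's GaussianNB), which is a common choice for continuous features; the task itself does not name a variant. The data are synthetic and the helper make_data is hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import GaussianNB


def make_data(n_rows, seed):
    """Hypothetical generator standing in for the real CSV files."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n_rows, p=[0.8, 0.2])
    X = rng.normal(loc=(y[:, None] == 1) * 0.8, size=(n_rows, 13))
    return X, y


X_train, y_train = make_data(300, seed=0)
X_test, y_test = make_data(150, seed=1)

# Fit on the original training set, then evaluate on the held-out test set.
model = GaussianNB().fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
# Rows are true classes, columns are predicted classes, in the order [-1, +1].
cm = confusion_matrix(y_test, y_pred, labels=[-1, 1])

print(f"Test accuracy: {accuracy:.3f}")
print(cm)
```

With imbalanced classes, accuracy alone can look good even when few defective modules are found, so the confusion matrix is a useful companion to the accuracy rate.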

Task 4: Investigate how the number of features in the training dataset affects the model performance on the validation set
a) Use the training set (II) to train 13 Naïve Bayes Classification models, with 13 different feature sets. That is: the first model uses the 1st feature only; the second uses the 1st and 2nd features; the third uses the 1st, 2nd and 3rd features; the fourth uses the first 4 features; and so on.
Measure the accuracy rate on both the training set and the validation set. Report the results by plotting them in a figure: that is, a plot of the accuracy rate against the number of features used in each model. There should be two curves in this figure: one for the training set (II); the other one for the validation set.
b) Report the number of features you consider best for this work and explain why you chose it. Write this down in your Jupyter notebook.
c) Use the selected number of features to train the model and report the performance on the test set.
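The incremental feature experiment can be sketched as follows, again on synthetic data and assuming GaussianNB. Picking the feature count with the highest validation accuracy is only one possible selection rule, and retraining the final model on training set (II) is an assumption; the helper make_data is hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def make_data(n_rows, seed):
    """Hypothetical generator standing in for the real CSV files."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n_rows, p=[0.8, 0.2])
    X = rng.normal(loc=(y[:, None] == 1) * 0.8, size=(n_rows, 13))
    return X, y


X, y = make_data(600, seed=0)
X_test, y_test = make_data(300, seed=1)
X_train2, X_val, y_train2, y_val = train_test_split(
    X, y, test_size=0.55, stratify=y, random_state=0
)

# (a) One model per feature count k, each using the first k columns only.
feature_counts = list(range(1, X.shape[1] + 1))
train_acc, val_acc = [], []
for k in feature_counts:
    model = GaussianNB().fit(X_train2[:, :k], y_train2)
    train_acc.append(model.score(X_train2[:, :k], y_train2))
    val_acc.append(model.score(X_val[:, :k], y_val))

fig, ax = plt.subplots()
ax.plot(feature_counts, train_acc, marker="o", label="Training set (II)")
ax.plot(feature_counts, val_acc, marker="s", label="Validation set")
ax.set_xlabel("Number of features used")
ax.set_ylabel("Accuracy rate")
ax.legend()

# (b) One simple selection rule: the count with the highest validation accuracy.
best_k = feature_counts[int(np.argmax(val_acc))]

# (c) Retrain with the chosen feature count and evaluate on the test set.
final_model = GaussianNB().fit(X_train2[:, :best_k], y_train2)
test_accuracy = final_model.score(X_test[:, :best_k], y_test)
print(f"best_k={best_k}, test accuracy={test_accuracy:.3f}")
```

When several feature counts give similar validation accuracy, preferring the smaller count is a reasonable alternative rule and is worth discussing in the written justification.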

Task 5: Summarize your findings and write your conclusions, using critical thinking (no more than 100 words), in your Jupyter notebook.

Attachment:- Data Classification.rar

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd