Assignment Document

Regression and Classification Interpretation with WEKA

  • "In machine learning we need to differentiate between classification and regression first. Clustering isanother aspect, which comes in unsupervised learning. In clustering, we have no clue about the dataset.So we can submit the adhoc clueless dataset to WEKA to do a clustering based on a selected clusteringalgorithm. In here, what actually happen is, via analyzing the similarities in the adhogness, WEKA`sselected algorithm will cluster your dataset into various groups. So, within a created group, data instances inside the group shows some kind of resemblance. Then ifneeded, we could apply, a classification method for this particular group for further pruning of the datawithin the group. For an example the ad-hoc complex dataset could be a mixture of students, employees, dogs and cats.So at the initial stage, if we apply clustering for this, it will separate students, employees, dogs and catsinto separate groups, as earlier it was a messed up mix.Then if we take one group created after clustering (i.e. students) we could apply classifications into it.Where, these students can be classified as good students or bad students.This denotes an important axiom. For classification, there must be a class label.A class label could be‘good ’ and ‘bad ’ for student`s example mentioned above. At the same time, if you`re going to do aprediction of a numeric value, classification will not work. Because, numeric value does not contain aclass label as ‘good ’ and‘bad ’ or etc. Which will bring us to the next important axiom, where, numerical predictions cannot be made via aclassification. It has to be a regression. Via analyzing what are the sales of previous months, if you want to predict the sales as a value,classification won`t work, as we need to proceed with a regression. Regression will try to find the bestpossible fit for the given scenario. Some of the possible regression algorithms to be used would be: -? Gaussian Process? Isotronic Regression.? Least Square method.? Linear Regression.? Simple Linear Regression.? Multilayer Perceptron (Neural Networks)Some of the algorithms given above could work as classifiers or regression models depending on thetype of the dataset provided (i.e.- If the class prediction is a numerical need, it could be a regression,otherwise, a classification.)Next important this to understand is the accuracy statistics. There are two types of accuracy statistics.First is the training accuracy stats and the testing accuracy stats.If we just use the use training sets,from the test options, it will provide you only with the training accuracy, which denotes the trainingsuccess in the model. Normally, in a classification based training, you will get a confusion matrix along with the trainingresults, as mentioned below.Above is the training confusion matrix generated. This denotes, there `re two class labels as functional(a) and non-functional (b). The columns tell you how your model classified your samples - it's what the model predicted:? The first column contains all the samples which your model thinks are "a" – 145 (130+15) of them,total? The second column contains all the samples which your model thinks are "b" - 158 of themThe rows, on the other hand, represent reality:? The first row contains all the samples which really are "a" - 138 of them, total? The second row contains all the samples which really are "b" - 165 of themKnowing the columns and rows, you can dig into the details:? 
The next important thing to understand is the accuracy statistics. There are two types: training accuracy statistics and testing accuracy statistics. If we just use the "Use training set" option from the test options, WEKA will only provide the training accuracy, which denotes the training success of the model.

Normally, in classification-based training, you will get a confusion matrix along with the training results, as in the example below.

[Screen capture of the generated training confusion matrix.]

This matrix has two class labels, functional (a) and non-functional (b). The columns tell you how your model classified the samples, i.e. what the model predicted:

- The first column contains all the samples the model thinks are "a": 145 (130 + 15) in total.
- The second column contains all the samples the model thinks are "b": 158 in total.

The rows, on the other hand, represent reality:

- The first row contains all the samples that really are "a": 138 in total.
- The second row contains all the samples that really are "b": 165 in total.

Knowing the columns and rows, you can dig into the details:

- Top left, 130: things the model thinks are "a" which really are "a" (correct).
- Bottom left, 15: things the model thinks are "a" but which are really "b" (one kind of error).
- Top right, 8: things the model thinks are "b" but which really are "a" (another kind of error).
- Bottom right, 150: things the model thinks are "b" which really are "b" (correct).

So the top-left and bottom-right cells of the matrix show what the model gets right, while the bottom-left and top-right cells show where it is confused. Overall, the model has made 130 successful "a" predictions and 150 successful "b" predictions; similarly, there are 8 failed predictions from class "a" and 15 failed predictions from class "b", which sums up to 23 wrong predictions. The total number of instances is 130 + 8 + 15 + 150 = 303, so the classification accuracy is (303 - 23) / 303 * 100 = 280 / 303 * 100, which is approximately 92%.

In a classification scenario, it is more important to look at the classification accuracy than at the other statistics mentioned above. Kappa indicates how well your classifier has done the job; the closer it is to 1, the better the performance. The remaining figures are different representations of the same error that occurred. These statistics are much more important in a regression than in a classification, as they are used to determine the best algorithm. In a classification they also matter, but less than the classification accuracy: even if these error figures are low, if the classification accuracy is poor it is not a good classifier selection. In a regression, however, you do not get a confusion matrix or a classification accuracy (instead you get the correlation coefficient), so you need to consider these error figures as well. It is enough to consider the mean absolute error; the other calculations describe the same error at different magnitudes. The smaller these errors, the better the classifier or regression model performs.

If you obtain those statistics and the confusion matrix only with the "Use training set" option, they denote the training-accuracy figures. What is more important are the test-based statistics, which can be obtained via cross-validation or a percentage split. Usually, the more data you have (as long as you do not exceed the overfitting limits), the more the accuracy will improve and the error will reduce.

As a regression example, we used cpu.arff from http://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/cpu.arff, where all attributes are numeric.

[Screen capture of the cpu.arff file.]

Here the prediction class is also of numeric type, hence classification will not function due to the absence of a class label. For this dataset, several regression algorithms were run with 10-fold cross-validation; below are the correlation coefficient and mean errors obtained.

[Screen captures of the cross-validation results for each algorithm.]

Carefully compare the correlation coefficient and mean error combinations and select the best combination.
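As a sketch of how this comparison could be scripted outside the Explorer (the local file name and the particular regression schemes chosen here are assumptions for illustration), the code below runs several regressors on cpu.arff with 10-fold cross-validation and prints the correlation coefficient and mean absolute error for each, which are exactly the figures to be compared.

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.classifiers.functions.SimpleLinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareRegressors {
    public static void main(String[] args) throws Exception {
        // Assumes cpu.arff has been downloaded to the working directory.
        Instances data = DataSource.read("cpu.arff");
        data.setClassIndex(data.numAttributes() - 1);   // numeric class -> regression

        Classifier[] schemes = {
            new SimpleLinearRegression(),
            new LinearRegression(),
            new MultilayerPerceptron()
        };

        for (Classifier scheme : schemes) {
            Evaluation eval = new Evaluation(data);
            // 10-fold cross-validation, matching the Explorer's test option.
            eval.crossValidateModel(scheme, data, 10, new Random(1));
            System.out.printf("%-25s correlation = %.4f  MAE = %.4f%n",
                    scheme.getClass().getSimpleName(),
                    eval.correlationCoefficient(),
                    eval.meanAbsoluteError());
        }
    }
}

The best combination is then the scheme whose correlation coefficient is closest to 1 together with the lowest mean absolute error.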
