Reference no: EM133527293
Assignment:
Part I: Algorithm Comparison
1. Investigate the difference in model performance using statistical significance testing. We will compare three models (decision tree J48, 5-NearestNeighbor and OneR) on two different data sets (diabetes.arff and breast-cancer.arff), and perform a pairwise comparison of the models on each data set (six pairwise comparisons in total, which you may run separately or all at once).
2. Choose 10-fold cross-validation as your experiment type and repeat it 5 times for each pair.
3. For both data sets, compare the performance of all three algorithms using a paired t-test.
For each model, describe the parameter settings and design decisions you made when collecting your data (so that your experiments are replicable).
4. You can collect accuracy estimates using the Experimenter in WEKA, dumping the results into a CSV file and picking the appropriate column of the file. You will need to implement the paired permutation test yourself.
5. Does any algorithm perform significantly differently from another algorithm on either of the two data sets? Report your findings. You should use screenshots, calculations, and analysis to support your conclusions.
6. For each pair of algorithms that you find to perform significantly differently, calculate the p-value of the paired t-test to support your finding.
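As a starting point for the paired comparison, the sketch below computes a paired t statistic and a two-sided sign-flip permutation p-value for two columns of accuracy estimates. It is only one possible way to set this up. The function names `paired_t_statistic` and `paired_permutation_test`, the number of resamples, and the accuracy values are all hypothetical placeholders and are not WEKA output; replace the lists with the columns you extract from your CSV file.

```python
import math
import random
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (divides by n - 1)
    t = mean_d / (sd_d / math.sqrt(n))
    return t, n - 1


def paired_permutation_test(a, b, n_resamples=10000, seed=0):
    """Two-sided Monte Carlo sign-flip permutation p-value for the mean paired difference."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(statistics.mean(diffs))
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.mean(flipped)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)


# Made-up accuracy estimates for two models on the same folds (placeholders only).
acc_model_a = [0.74, 0.76, 0.73, 0.75, 0.77, 0.74, 0.76, 0.75, 0.73, 0.76]
acc_model_b = [0.70, 0.72, 0.71, 0.69, 0.73, 0.70, 0.71, 0.72, 0.68, 0.71]

t_stat, dof = paired_t_statistic(acc_model_a, acc_model_b)
p_perm = paired_permutation_test(acc_model_a, acc_model_b)
print("t =", t_stat, "df =", dof, "permutation p =", p_perm)
```

The t statistic can be compared against a t distribution with the reported degrees of freedom to obtain the t-test p-value. Note that `paired_t_statistic` divides by the standard deviation of the differences, so it fails if every paired difference is identical.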
Part II: Cost Analysis
In this part you will replicate my cost-analysis demonstration in class using the dataset "breast-cancer.arff". First, generate the classifier output using J48. Next, make changes to the weights associated with certain types of errors based on the following rules and run your J48 classifier again.
a. Cost of "recurrence events" being wrongly classified as "no recurrence events" is 4
b. Cost of "no recurrence events" being wrongly classified as "recurrence events" remains 1
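To make the effect of these rules concrete, the sketch below computes a total cost by multiplying each cell of a confusion matrix by the matching cell of a cost matrix and summing. It assumes rows are the actual class and columns the predicted class, in the order [no recurrence events, recurrence events]. The counts are invented for illustration and are not results from J48; `total_cost` is a hypothetical helper name.

```python
def total_cost(confusion, cost):
    """Sum count * cost over every (actual, predicted) cell."""
    return sum(
        confusion[i][j] * cost[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )


# Rows = actual class, columns = predicted class (assumed convention),
# class order: [no recurrence events, recurrence events].
# Invented counts for illustration only, not classifier output.
confusion = [
    [90, 10],
    [30, 20],
]

# Default costs: every error costs 1, correct predictions cost 0.
uniform_cost = [
    [0, 1],
    [1, 0],
]

# Rules a and b: a missed recurrence costs 4, a false alarm still costs 1.
weighted_cost = [
    [0, 1],
    [4, 0],
]

print("total cost, uniform costs:", total_cost(confusion, uniform_cost))
print("total cost, weighted costs:", total_cost(confusion, weighted_cost))
```

This sketch applies both cost matrices to the same confusion matrix. In your experiment the confusion matrix itself is expected to change once the classifier is trained with the new weights, so use the matrix from each run.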
Questions:
1. Please show me the total cost before and after you apply the cost weights.
2. Please show me the confusion matrix before and after you apply the cost weights.
3. If you further change the cost of "recurrence events" being wrongly classified as "no recurrence events" to 10, how will the algorithm be affected? Is the algorithm still practical? Why or why not?
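One way to reason about the last question is through the decision threshold that a cost ratio implies. The sketch below assumes a classifier that predicts the class with the lowest expected cost and that correct predictions cost nothing. Under that assumption, "recurrence events" is predicted whenever its estimated probability p satisfies p * cost_fn > (1 - p) * cost_fp, that is, p > cost_fp / (cost_fp + cost_fn). The helper name `recurrence_threshold` is hypothetical, and a classifier that reweights training instances instead may behave differently.

```python
def recurrence_threshold(cost_fn, cost_fp=1.0):
    """Probability above which predicting "recurrence events" has lower expected cost.

    cost_fn: cost of classifying a recurrence as no recurrence.
    cost_fp: cost of classifying no recurrence as a recurrence.
    """
    return cost_fp / (cost_fp + cost_fn)


for cost_fn in (1, 4, 10):
    print("cost", cost_fn, "-> threshold", recurrence_threshold(cost_fn))
```

As the cost of a missed recurrence grows, the threshold falls, so more instances are labelled "recurrence events" on weaker evidence.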