Reference no: EM132402314
Assignment - Similarity Assessment, Clustering and Using Clustering to Create Background Knowledge for Classification Tasks
Learning Objectives:
1. Learn to use popular clustering algorithms, namely K-means, K-medoids/PAM and DBSCAN.
2. Learn how to summarize and interpret clustering results.
3. Learn to write R functions which operate on top of clustering algorithms and clustering results.
4. Learn how to make sense of unsupervised data mining results.
5. Learn how clustering can be used to create useful background knowledge for classification problems.
6. Learn how to create distance functions and distance matrices in R and learn about their importance for clustering.
7. Learn to devise search procedures to solve "finding a needle in a haystack" problems that are typical for data mining projects.
Assignment Tasks -
0. Transform the Pima dataset into a dataset ZPima by z-scoring its first 8 attributes and copying the 9th attribute unchanged.
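A minimal sketch of this transformation, assuming the Pima data has been read into a data frame named `Pima` with the class label in column 9 (a random stand-in is generated here only so the snippet is self-contained; replace it with the real read-in):

```r
# Stand-in data frame in place of the real Pima data (illustrative only)
set.seed(0)
Pima <- data.frame(matrix(rnorm(90), ncol = 9))
Pima[, 9] <- sample(0:1, 10, replace = TRUE)   # stand-in class attribute

ZPima <- Pima
ZPima[, 1:8] <- scale(Pima[, 1:8])   # z-score: (x - mean) / sd, per attribute
ZPima[, 9]  <- Pima[, 9]             # 9th (class) attribute copied unchanged
```

`scale()` centers and rescales each column independently, so every z-scored attribute has mean 0 and standard deviation 1.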
1. Write an R-function entropy(a, b) that computes the entropy and the percentage of outliers of a clustering result based on an a priori given set of class labels, where a gives the assignment of objects in O to clusters, and b contains the class labels of the examples in O. The entropy function H is defined as follows:
Assume we have m classes in our clustering problem; for each cluster Ci we have proportions pi = (pi1,...,pim) of examples belonging to the m different classes (for cluster numbers i = 1,...,k); the entropy of a cluster Ci is computed as follows:
H(pi) = Σj=1..m (pij*log2(1/pij)) (H is called the entropy function)
Moreover, if pij=0, pij*log2(1/pij) is defined to be 0.
The entropy of a clustering X is the size-weighted sum of the entropies on the individual clusters:
H(X) = Σr=1..k ((|Cr| / Σp=1..k |Cp|) * H(pr))
In the above formulas, "|...|" denotes the set cardinality function. Moreover, we assume that X={C1,...,Ck} is a clustering with k clusters C1,...,Ck.
You can assume that cluster 0 contains all the outliers, and clusters 1,2,...,k represent "true" clusters; therefore, you should ignore cluster 0 and its instances when computing H(X). The entropy function returns a vector: (<entropy>, <percentage_of_outliers>); e.g. if the function returns (0.11, 0.2) this would indicate that the entropy is 0.11, but 20% of the objects in dataset O have been classified as outliers. Run your function on the two test cases listed at the end of the document, and report the results in your report.
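The definitions above can be sketched as follows; this is one possible implementation, assuming a is an integer vector of cluster assignments (0 = outlier) and b is a vector of class labels:

```r
# Size-weighted entropy of a clustering, ignoring cluster 0 (the outliers);
# returns c(<entropy>, <percentage_of_outliers>)
entropy <- function(a, b) {
  outlier_pct <- mean(a == 0)     # fraction of objects flagged as outliers
  b <- b[a != 0]                  # drop cluster 0 and its instances ...
  a <- a[a != 0]                  # ... before computing H(X)
  n <- length(a)
  h_total <- 0
  for (cl in unique(a)) {
    p <- table(b[a == cl]) / sum(a == cl)  # class proportions in cluster cl
    p <- p[p > 0]                          # pij = 0 contributes 0 by definition
    h_cl <- sum(p * log2(1 / p))           # entropy H(p) of this cluster
    h_total <- h_total + (sum(a == cl) / n) * h_cl  # size-weighted sum
  }
  c(h_total, outlier_pct)
}
```

For example, two pure clusters give entropy 0, while a cluster with a 50/50 class split contributes entropy 1 weighted by its relative size.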
2. Write an R-function wabs-dist(u,v,w) that takes two vectors u and v as input and computes the distance between u and v as the sum of the absolute weighted differences, using the weights in w.
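A minimal sketch of this function; note that "-" is not a legal character in an R identifier, so a dot is used in place of the hyphen from the task name:

```r
# Weighted Manhattan (L1) distance between numeric vectors u and v
wabs.dist <- function(u, v, w) {
  sum(w * abs(u - v))
}
```

For example, wabs.dist(c(1, 2), c(3, 5), c(1, 0.5)) computes 1*2 + 0.5*3 = 3.5.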
3. Write an R-function create-dm(x,w) that returns a distance matrix for the objects in data frame x by calling wabs-dist(a,b,w) for all pairs of objects a and b belonging to x.
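One way to sketch this (again using dots instead of the hyphens, which are not legal in R names); returning a "dist" object is an assumption that happens to match what PAM accepts directly:

```r
wabs.dist <- function(u, v, w) sum(w * abs(u - v))   # task-2 helper

# Build a full n x n distance matrix over the rows of data frame x,
# then convert it to a "dist" object for use with pam()
create.dm <- function(x, w) {
  n <- nrow(x)
  dm <- matrix(0, n, n)
  for (i in 1:n)
    for (j in 1:n)
      dm[i, j] <- wabs.dist(as.numeric(x[i, ]), as.numeric(x[j, ]), w)
  as.dist(dm)
}
```

The double loop is O(n^2) distance calls, which is acceptable for the Pima dataset's size; halving the work by exploiting symmetry is a possible refinement.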
4. Run K-means on the ZPima dataset for k=6 and k=9 with nstart=20; next run PAM for k=6 with distance matrices that have been created by the function create-dm using the following weight vectors for the ZPima dataset, obtaining 3 different PAM clustering results:
a. (1,1,1,1,1,1,1,1)
b. (0.2,1,0,0,0,1,0.2,1)
c. (0,1,0,0,0,1,0,0)
For the 5 obtained clustering results report the overall entropy, the entropy of each cluster, the majority class and the centroid/medoid of each cluster. Next, visualize the clustering result of the K-means run for k=6 and the 3 PAM results in the Plasma glucose/Body mass index space (attributes 2 and 6) for the original dataset. Interpret the obtained results! Does changing the distance metric affect the PAM results? Do the results tell you anything about which attributes are important for diagnosing diabetes, and about the difficulty of diagnosing diabetes?
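A sketch of how these runs fit together, assuming the `cluster` package (shipped with R) for pam() and the helpers from tasks 2-3; a random stand-in replaces ZPima here so the snippet is self-contained:

```r
library(cluster)   # provides pam()

wabs.dist <- function(u, v, w) sum(w * abs(u - v))
create.dm <- function(x, w) {
  n <- nrow(x); dm <- matrix(0, n, n)
  for (i in 1:n) for (j in 1:n)
    dm[i, j] <- wabs.dist(as.numeric(x[i, ]), as.numeric(x[j, ]), w)
  as.dist(dm)
}

set.seed(42)
ZP <- as.data.frame(matrix(rnorm(60 * 8), ncol = 8))  # stand-in for ZPima[, 1:8]

km6 <- kmeans(ZP, centers = 6, nstart = 20)           # k = 6; repeat with k = 9

# PAM on a weighted-distance matrix, e.g. weight vector (b)
pam_b <- pam(create.dm(ZP, c(0.2, 1, 0, 0, 0, 1, 0.2, 1)), k = 6, diss = TRUE)

# km6$centers holds the centroids; pam_b$medoids the medoid (row) IDs;
# the task-1 entropy function would be applied to km6$cluster / pam_b$clustering
```

Plotting in the attribute 2/6 space would then color the points of the original dataset by cluster assignment, e.g. plot(Pima[, 2], Pima[, 6], col = km6$cluster).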
6. Run DBSCAN on the ZPima dataset; try to find values for the MinPoints and epsilon parameters such that the number of outliers is less than 15% and the number of clusters is between 2 and 12; visualize the obtained clustering result in the Plasma glucose/Body mass index attribute space on the original dataset and report its entropy. Comment on the quality of the obtained clustering result.
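A sketch of a single DBSCAN run, assuming the `dbscan` package is installed (install.packages("dbscan")); the parameter values and the random stand-in data are illustrative starting points, not the required answer:

```r
library(dbscan)

set.seed(1)
X <- matrix(rnorm(200 * 8), ncol = 8)   # stand-in for as.matrix(ZPima[, 1:8])

# kNNdistplot(X, k = 10) plots sorted k-NN distances; the "knee" suggests eps
db <- dbscan(X, eps = 3, minPts = 10)

outlier_pct <- mean(db$cluster == 0)    # cluster 0 collects the outliers
# entropy(db$cluster, ZPima[, 9])       # task-1 entropy of the result
# plot(Pima[, 2], Pima[, 6], col = db$cluster + 1)  # attribute 2/6 space
```

Because cluster 0 is the noise label, the outlier percentage and the number of non-zero cluster labels are exactly the quantities the task constrains.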
7. Run K-means with nstart=20 for k=8 and k=11 on the Complex8 dataset; visualize the results, compute their entropy, and discuss the obtained clustering results and their quality.
8. Next, run DBSCAN on the Complex8 dataset; try to find "good" parameter settings for epsilon and MinPoints such that the entropy of the obtained clustering result is minimized and the number of outliers is less than 8%. Develop a search procedure that looks for the "best" DBSCAN clustering of the Complex8 dataset by varying the epsilon/MinPoints parameters. Report the best clustering result you found, the epsilon and MinPoints parameters you used to obtain it, and its percentage of outliers and entropy. Results with lower entropy will receive higher scores; students who find the "best" result will get extra credit! Also compare the DBSCAN results with those you obtained for K-means in Task 7, and assess which clustering algorithm did a better job of clustering the Complex8 dataset!
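One way to sketch such a search procedure is a simple grid search; since entropy measures class impurity, lower entropy (within the 8% outlier budget) is treated as better here. The grid ranges, the helper name search_dbscan, and the Complex8 column layout (x, y, class) are all assumptions; the `dbscan` package and the task-1 entropy function (restated compactly) are also assumed:

```r
library(dbscan)

entropy <- function(a, b) {             # compact restatement of task 1
  out <- mean(a == 0)
  b <- b[a != 0]; a <- a[a != 0]
  h <- 0
  for (cl in unique(a)) {
    p <- table(b[a == cl]) / sum(a == cl); p <- p[p > 0]
    h <- h + (sum(a == cl) / length(a)) * sum(p * log2(1 / p))
  }
  c(h, out)
}

# Grid search over eps/MinPts: keep the lowest-entropy result among those
# with less than 8% outliers; returns NULL if no setting meets the budget
search_dbscan <- function(pts, labels, eps_grid, minpts_grid) {
  best <- NULL
  for (eps in eps_grid) {
    for (mp in minpts_grid) {
      db <- dbscan(pts, eps = eps, minPts = mp)
      out_pct <- mean(db$cluster == 0)
      if (out_pct >= 0.08) next                 # enforce the outlier budget
      h <- entropy(db$cluster, labels)[1]
      if (is.null(best) || h < best$entropy)
        best <- list(eps = eps, minPts = mp, entropy = h, outliers = out_pct)
    }
  }
  best
}

# usage (hypothetical column layout): search_dbscan(as.matrix(complex8[, 1:2]),
#   complex8[, 3], eps_grid = seq(10, 40, 5), minpts_grid = c(5, 10, 15))
```

A finer two-stage search (coarse grid first, then a refined grid around the best cell) is a natural extension if the coarse grid is too crude.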
9. Summarize to what extent K-means, PAM, and DBSCAN were able to rediscover the classes in the Complex8 and Pima/ZPima datasets! Moreover, did your experimental results reveal anything interesting about diabetes?
Note - Please use R. Submit: the R code for the tasks, the data files needed to run the R code, and the assignment report containing all the plots and results along with their interpretations.
Attachment: Assignment File.rar