Reference no: EM132402314
Assignment - Similarity Assessment, Clustering and Using Clustering to Create Background Knowledge for Classification Tasks
Learning Objectives:
1. Learn to use popular clustering algorithms, namely K-means, K-medoids/PAM and DBSCAN.
2. Learn how to summarize and interpret clustering results.
3. Learn to write R functions which operate on top of clustering algorithms and clustering results.
4. Learn how to make sense of unsupervised data mining results.
5. Learn how clustering can be used to create useful background knowledge for classification problems.
6. Learn how to create distance functions and distance matrices in R and learn about their importance for clustering.
7. Learn to devise search procedures to solve "finding a needle in a haystack" problems that are typical for data mining projects.
Assignment Tasks -
0. Transform the Pima dataset into a dataset ZPima by z-scoring its first 8 attributes and copying the 9th attribute unchanged.
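A minimal sketch of this transformation, assuming the Pima data has been read into a data frame named `Pima` with the class label in column 9 (a random stand-in is generated here only so the snippet is self-contained; replace it with the real read-in):

```r
# Stand-in data frame in place of the real Pima data (illustrative only)
set.seed(0)
Pima <- data.frame(matrix(rnorm(90), ncol = 9))
Pima[, 9] <- sample(0:1, 10, replace = TRUE)   # stand-in class attribute

ZPima <- Pima
ZPima[, 1:8] <- scale(Pima[, 1:8])   # z-score: (x - mean) / sd, per attribute
ZPima[, 9]  <- Pima[, 9]             # 9th (class) attribute copied unchanged
```

`scale()` centers and rescales each column independently, so every z-scored attribute has mean 0 and standard deviation 1.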
1. Write an R-function entropy(a, b) that computes the entropy and the percentage of outliers of a clustering result based on an a priori given set of class labels, where a gives the assignment of objects in O to clusters, and b contains the class labels of the examples in O. The entropy function H is defined as follows:
Assume we have m classes in our clustering problem; for each cluster Ci we have proportions pi = (pi1,...,pim) of examples belonging to the m different classes (for cluster numbers i = 1,...,k); the entropy of a cluster Ci is computed as follows:
H(pi) = Σj=1..m (pij*log2(1/pij)) (H is called the entropy function)
Moreover, if pij=0, pij*log2(1/pij) is defined to be 0.
The entropy of a clustering X is the size-weighted sum of the entropies on the individual clusters:
H(X) = Σr=1..k ((|Cr| / Σp=1..k |Cp|) * H(pr))
In the above formulas, "|...|" denotes the set cardinality function. Moreover, we assume that X={C1,...,Ck} is a clustering with k clusters C1,...,Ck.
You can assume that cluster 0 contains all the outliers, and clusters 1,2,...,k represent "true" clusters; therefore, you should ignore cluster 0 and its instances when computing H(X). The entropy function returns a vector: (<entropy>, <percentage_of_outliers>); e.g. if the function returns (0.11, 0.2) this would indicate that the entropy is 0.11, but 20% of the objects in dataset O have been classified as outliers. Run your function on the two test cases listed at the end of the document, and report the results in your report.
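The definitions above can be sketched as follows; this is one possible implementation, assuming a is an integer vector of cluster assignments (0 = outlier) and b is a vector of class labels:

```r
# Size-weighted entropy of a clustering, ignoring cluster 0 (the outliers);
# returns c(<entropy>, <percentage_of_outliers>)
entropy <- function(a, b) {
  outlier_pct <- mean(a == 0)     # fraction of objects flagged as outliers
  b <- b[a != 0]                  # drop cluster 0 and its instances ...
  a <- a[a != 0]                  # ... before computing H(X)
  n <- length(a)
  h_total <- 0
  for (cl in unique(a)) {
    p <- table(b[a == cl]) / sum(a == cl)  # class proportions in cluster cl
    p <- p[p > 0]                          # pij = 0 contributes 0 by definition
    h_cl <- sum(p * log2(1 / p))           # entropy H(p) of this cluster
    h_total <- h_total + (sum(a == cl) / n) * h_cl  # size-weighted sum
  }
  c(h_total, outlier_pct)
}
```

For example, two pure clusters give entropy 0, while a cluster with a 50/50 class split contributes entropy 1 weighted by its relative size.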
2. Write an R-function wabs-dist(u,v,w) that takes two vectors u and v as input and computes the distance between u and v as the sum of the absolute weighted differences, using the weights in w.
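A minimal sketch of this function; note that "-" is not a legal character in an R identifier, so a dot is used in place of the hyphen from the task name:

```r
# Weighted Manhattan (L1) distance between numeric vectors u and v
wabs.dist <- function(u, v, w) {
  sum(w * abs(u - v))
}
```

For example, wabs.dist(c(1, 2), c(3, 5), c(1, 0.5)) computes 1*2 + 0.5*3 = 3.5.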
3. Write an R-function create-dm(x,w) that returns a distance matrix for the objects in data frame x by calling wabs-dist(a,b,w) for all pairs of objects a and b belonging to x.
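One way to sketch this (again using dots instead of the hyphens, which are not legal in R names); returning a "dist" object is an assumption that happens to match what PAM accepts directly:

```r
wabs.dist <- function(u, v, w) sum(w * abs(u - v))   # task-2 helper

# Build a full n x n distance matrix over the rows of data frame x,
# then convert it to a "dist" object for use with pam()
create.dm <- function(x, w) {
  n <- nrow(x)
  dm <- matrix(0, n, n)
  for (i in 1:n)
    for (j in 1:n)
      dm[i, j] <- wabs.dist(as.numeric(x[i, ]), as.numeric(x[j, ]), w)
  as.dist(dm)
}
```

The double loop is O(n^2) distance calls, which is acceptable for the Pima dataset's size; halving the work by exploiting symmetry is a possible refinement.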
4. Run K-means on the ZPima dataset for k=6 and k=9 with nstart=20; next run PAM for k=6 with distance matrices that have been created by the function create-dm using the following weight vectors for the ZPima dataset, obtaining 3 different PAM clustering results:
a. (1,1,1,1,1,1,1,1)
b. (0.2,1,0,0,0,1,0.2,1)
c. (0,1,0,0,0,1,0,0)
For the 5 obtained clustering results report the overall entropy, the entropy of each cluster, the majority class and the centroid/medoid of each cluster. Next, visualize the clustering result of the K-means run for k=6 and the 3 PAM results in the Plasma glucose/Body mass index space (attributes 2 and 6) for the original dataset. Interpret the obtained results! Does changing the distance metric affect the PAM results? Do the results tell you anything about which attributes are important for diagnosing diabetes, and about the difficulty of diagnosing diabetes?
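A sketch of how these runs fit together, assuming the `cluster` package (shipped with R) for pam() and the helpers from tasks 2-3; a random stand-in replaces ZPima here so the snippet is self-contained:

```r
library(cluster)   # provides pam()

wabs.dist <- function(u, v, w) sum(w * abs(u - v))
create.dm <- function(x, w) {
  n <- nrow(x); dm <- matrix(0, n, n)
  for (i in 1:n) for (j in 1:n)
    dm[i, j] <- wabs.dist(as.numeric(x[i, ]), as.numeric(x[j, ]), w)
  as.dist(dm)
}

set.seed(42)
ZP <- as.data.frame(matrix(rnorm(60 * 8), ncol = 8))  # stand-in for ZPima[, 1:8]

km6 <- kmeans(ZP, centers = 6, nstart = 20)           # k = 6; repeat with k = 9

# PAM on a weighted-distance matrix, e.g. weight vector (b)
pam_b <- pam(create.dm(ZP, c(0.2, 1, 0, 0, 0, 1, 0.2, 1)), k = 6, diss = TRUE)

# km6$centers holds the centroids; pam_b$medoids the medoid (row) IDs;
# the task-1 entropy function would be applied to km6$cluster / pam_b$clustering
```

Plotting in the attribute 2/6 space would then color the points of the original dataset by cluster assignment, e.g. plot(Pima[, 2], Pima[, 6], col = km6$cluster).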
6. Run DBSCAN on the ZPima dataset; try to find values for the MinPoints and epsilon parameters such that the number of outliers is less than 15% and the number of clusters is between 2 and 12; visualize the obtained clustering result in the Plasma glucose/Body mass index attribute space on the original dataset and report its entropy. Comment on the quality of the obtained clustering result.
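A sketch of a single DBSCAN run, assuming the `dbscan` package is installed (install.packages("dbscan")); the parameter values and the random stand-in data are illustrative starting points, not the required answer:

```r
library(dbscan)

set.seed(1)
X <- matrix(rnorm(200 * 8), ncol = 8)   # stand-in for as.matrix(ZPima[, 1:8])

# kNNdistplot(X, k = 10) plots sorted k-NN distances; the "knee" suggests eps
db <- dbscan(X, eps = 3, minPts = 10)

outlier_pct <- mean(db$cluster == 0)    # cluster 0 collects the outliers
# entropy(db$cluster, ZPima[, 9])       # task-1 entropy of the result
# plot(Pima[, 2], Pima[, 6], col = db$cluster + 1)  # attribute 2/6 space
```

Because cluster 0 is the noise label, the outlier percentage and the number of non-zero cluster labels are exactly the quantities the task constrains.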
7. Run K-means with nstart=20 for k=8 and k=11 on the Complex8 dataset; visualize the results, compute their entropy, and discuss the obtained clustering results and their quality.
8. Next, run DBSCAN on the Complex8 dataset; try to find "good" parameter settings for epsilon and MinPoints such that the entropy of the obtained clustering result is minimized and the number of outliers is less than 8%. Develop a search procedure that looks for the "best" DBSCAN clustering of the Complex8 dataset by varying the epsilon/MinPoints parameters. Report the best clustering result you found, the epsilon and MinPoints parameters you used to obtain it, and its percentage of outliers and entropy. Results with lower entropy will receive higher scores; students who find the "best" result will get extra credit! Also compare the DBSCAN results with those you obtained for K-means in Task 7, and assess which clustering algorithm did a better job of clustering the Complex8 dataset!
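One way to sketch such a search procedure is a simple grid search; since entropy measures class impurity, lower entropy (within the 8% outlier budget) is treated as better here. The grid ranges, the helper name search_dbscan, and the Complex8 column layout (x, y, class) are all assumptions; the `dbscan` package and the task-1 entropy function (restated compactly) are also assumed:

```r
library(dbscan)

entropy <- function(a, b) {             # compact restatement of task 1
  out <- mean(a == 0)
  b <- b[a != 0]; a <- a[a != 0]
  h <- 0
  for (cl in unique(a)) {
    p <- table(b[a == cl]) / sum(a == cl); p <- p[p > 0]
    h <- h + (sum(a == cl) / length(a)) * sum(p * log2(1 / p))
  }
  c(h, out)
}

# Grid search over eps/MinPts: keep the lowest-entropy result among those
# with less than 8% outliers; returns NULL if no setting meets the budget
search_dbscan <- function(pts, labels, eps_grid, minpts_grid) {
  best <- NULL
  for (eps in eps_grid) {
    for (mp in minpts_grid) {
      db <- dbscan(pts, eps = eps, minPts = mp)
      out_pct <- mean(db$cluster == 0)
      if (out_pct >= 0.08) next                 # enforce the outlier budget
      h <- entropy(db$cluster, labels)[1]
      if (is.null(best) || h < best$entropy)
        best <- list(eps = eps, minPts = mp, entropy = h, outliers = out_pct)
    }
  }
  best
}

# usage (hypothetical column layout): search_dbscan(as.matrix(complex8[, 1:2]),
#   complex8[, 3], eps_grid = seq(10, 40, 5), minpts_grid = c(5, 10, 15))
```

A finer two-stage search (coarse grid first, then a refined grid around the best cell) is a natural extension if the coarse grid is too crude.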
9. Summarize to what extent K-means, PAM, and DBSCAN were able to rediscover the classes in the Complex8 and Pima/ZPima datasets! Moreover, did your experimental results reveal anything interesting about diabetes?
Note - Please use R. Submit: the R code for the tasks, the data files needed to run the R code, and the assignment report containing all the plots and results along with their interpretations.
Attachment: Assignment File.rar