Create Background Knowledge for Classification Tasks

Reference no: EM132402314

Assignment - Similarity Assessment, Clustering and Using Clustering to Create Background Knowledge for Classification Tasks

Learning Objectives:

1. Learn to use popular clustering algorithms, namely K-means, K-medoids/PAM and DBSCAN.

2. Learn how to summarize and interpret clustering results.

3. Learn to write R functions which operate on top of clustering algorithms and clustering results.

4. Learn how to make sense of unsupervised data mining results.

5. Learn how clustering can be used to create useful background knowledge for classification problems.

6. Learn how to create distance functions and distance matrices in R and learn about their importance for clustering.

7. Learn to devise search procedures to solve "finding a needle in a haystack" problems that are typical for data mining projects.

Assignment Tasks -

0. Transform the Pima dataset into a dataset ZPima by z-scoring the first 8 attributes of the dataset, and copying the 9th attribute of the dataset.
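A minimal sketch of the z-scoring step, assuming the data sits in a data frame whose first 8 columns are numeric attributes and whose 9th column is the class label. The function name make_zpima and the toy data frame are hypothetical, and the numbers are made up rather than taken from the Pima dataset.

```r
# Z-score the first 8 columns and copy the 9th (class) column unchanged.
# Assumes `df` has at least 9 columns and that the first 8 are numeric.
make_zpima <- function(df) {
  z <- as.data.frame(scale(df[, 1:8]))  # (x - mean) / sd, column by column
  z[[names(df)[9]]] <- df[[9]]          # carry the class label over as-is
  z
}

# Tiny made-up stand-in for the Pima data (not real measurements).
set.seed(1)
toy <- as.data.frame(matrix(rnorm(40, mean = 50, sd = 10), nrow = 5))
toy$class <- c(0, 1, 0, 1, 1)

ZToy <- make_zpima(toy)
round(colMeans(ZToy[, 1:8]), 10)  # attribute means are ~0 after z-scoring
```

Note that scale() uses the sample standard deviation (divisor n - 1), which is the usual convention in R.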

1. Write an R-function entropy(a, b) that computes the entropy and the percentage of outliers of a clustering result based on an a priori given set of class labels, where a gives the assignment of the objects in O to clusters, and b contains the class labels of the examples in O. The entropy function H is defined as follows:

Assume we have m classes in our clustering problem; for each cluster $C_i$ (for cluster numbers i = 1,...,k) we have proportions $p_i = (p_{i1},...,p_{im})$ of examples belonging to the m different classes; the entropy of a cluster $C_i$ is computed as follows:

$$H(p_i) = \sum_{j=1}^{m} p_{ij} \log_2(1/p_{ij})$$ (H is called the entropy function)

Moreover, if $p_{ij} = 0$, $p_{ij} \log_2(1/p_{ij})$ is defined to be 0.

The entropy of a clustering X is the size-weighted sum of the entropies of the individual clusters:

$$H(X) = \sum_{r=1}^{k} \frac{|C_r|}{\sum_{p=1}^{k} |C_p|} H(p_r)$$

In the above formulas, "|...|" represents the set cardinality function.

Moreover, we assume that $X = \{C_1,...,C_k\}$ is a clustering with k clusters $C_1,...,C_k$.

You can assume that cluster 0 contains all the outliers, and clusters 1,2,...,k represent "true" clusters; therefore, you should ignore cluster 0 and its instances when computing H(X). The entropy function returns a vector: (<entropy>, <percentage_of_outliers>); e.g. if the function returns (0.11, 0.2), this would indicate that the entropy is 0.11, but 20% of the objects in dataset O have been classified as outliers. Run your function for the two test cases listed at the end of the document, and report the results in your report.
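The definition above can be sketched in R as follows. This is one possible implementation under stated assumptions, not a reference solution: the outlier share is returned as a fraction (0.2 for 20%), NA is returned for the entropy when every object is an outlier, and the vectors a and b below are a made-up toy case rather than one of the assignment's test cases.

```r
# a: cluster assignment per object (0 = outlier); b: class label per object.
# Returns c(entropy, fraction_of_outliers).
entropy <- function(a, b) {
  stopifnot(length(a) == length(b))
  keep <- a != 0                       # cluster 0 is ignored for H(X)
  outlier_frac <- mean(!keep)
  if (!any(keep)) return(c(NA_real_, outlier_frac))  # assumption: no clusters -> NA

  tab <- unclass(table(a[keep], b[keep]))  # rows: clusters, columns: classes
  sizes <- rowSums(tab)                    # |C_r|
  p <- tab / sizes                         # class proportions within each cluster
  terms <- ifelse(p > 0, p * log2(1 / p), 0)  # 0 * log2(1/0) is defined as 0
  h <- rowSums(terms)                      # H(p_r) for each cluster
  c(sum(sizes / sum(sizes) * h), outlier_frac)
}

# Toy case: cluster 1 is pure, cluster 2 is split 50/50, two outliers.
a <- c(1, 1, 1, 1, 2, 2, 2, 2, 0, 0)
b <- c("x", "x", "x", "x", "x", "x", "y", "y", "x", "y")
entropy(a, b)  # 0.5 0.2
```

In the toy case cluster 1 contributes an entropy of 0 and cluster 2 an entropy of 1; both hold 4 of the 8 non-outlier objects, so H(X) = 0.5 * 0 + 0.5 * 1 = 0.5, and 2 of 10 objects are outliers.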

2. Write an R-function wabs-dist(u,v,w) that takes two vectors u, v as input and computes the distance between u and v as the sum of the absolute weighted differences, using the weights in w.

3. Write an R-function create-dm(x,w) that returns a distance matrix for the objects in data frame x by calling wabs-dist(a,b,w) for all pairs of objects a and b belonging to x.
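A minimal sketch of the two functions follows. R identifiers cannot contain a hyphen, so the names wabs_dist and create_dm are used here as an assumption in place of wabs-dist and create-dm. The small data frame is made up for illustration.

```r
# Weighted absolute-difference distance between two numeric vectors.
wabs_dist <- function(u, v, w) {
  sum(abs(w * (u - v)))
}

# Symmetric distance matrix over the rows of x, built from wabs_dist.
create_dm <- function(x, w) {
  m <- as.matrix(x)
  n <- nrow(m)
  dm <- matrix(0, n, n, dimnames = list(rownames(m), rownames(m)))
  if (n < 2) return(dm)                 # nothing to compare
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      # Distance is symmetric, so fill both triangles at once.
      dm[i, j] <- dm[j, i] <- wabs_dist(m[i, ], m[j, ], w)
    }
  }
  dm
}

# Made-up example: three objects, two attributes, weights 1 and 0.5.
x <- data.frame(a = c(0, 1, 3), b = c(0, 2, 2))
dm <- create_dm(x, c(1, 0.5))
dm  # d(1,2) = 2, d(1,3) = 4, d(2,3) = 2
```

The double loop is quadratic in the number of objects, which is acceptable for a dataset of a few hundred rows. The result can be converted with as.dist(dm) before it is passed to a clustering function that expects a dissimilarity object.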

4. Run K-means on the ZPima dataset for k=6 and k=9 with nstart=20; next, run PAM for k=6 with distance matrices that have been created with the function create-dm using the following weight vectors for the ZPima dataset, obtaining 3 different PAM clustering results:

a. (1,1,1,1,1,1,1,1)

b. (0.2,1,0,0,0,1,0.2,1)

c. (0,1,0,0,0,1,0,0)

For the 5 obtained clustering results, report the overall entropy, the entropy of each cluster, the majority class and the centroid/medoid of each cluster. Next, visualize the clustering result of the K-means run for k=5 and the 3 PAM results in the Plasma glucose/Body mass index space (attributes 2 and 6) for the original dataset. Interpret the obtained results! Does changing the distance metric affect the PAM results? Do the results tell you anything about which attributes are important for diagnosing diabetes, and about the difficulty of diagnosing diabetes?
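The calls involved in this task can be sketched as below, using kmeans() from base R and pam() from the cluster package. The data frame zp is a randomly generated stand-in for ZPima, so the printed values carry no meaning. As a shortcut, and as an assumption of this sketch, the weighted distance matrix is obtained by scaling each column by its weight and taking Manhattan distances, which gives the same values as the pairwise weighted sum for non-negative weights. The helper majority is hypothetical.

```r
library(cluster)  # provides pam()

set.seed(42)
# Made-up stand-in for ZPima: 8 z-scored attributes plus a 0/1 class column.
zp <- as.data.frame(scale(matrix(rnorm(60 * 8), ncol = 8)))
zp$class <- rbinom(60, 1, 0.35)

# K-means on the 8 attributes with 20 random restarts.
km <- kmeans(zp[, 1:8], centers = 6, nstart = 20)

# PAM on a precomputed dissimilarity matrix (weight vector b from the task).
w <- c(0.2, 1, 0, 0, 0, 1, 0.2, 1)
xw <- sweep(as.matrix(zp[, 1:8]), 2, w, "*")  # multiply column j by w[j]
dm <- dist(xw, method = "manhattan")          # sum_j w[j] * |u[j] - v[j]|
pm <- pam(dm, k = 6, diss = TRUE)

# Majority class of each cluster (hypothetical helper).
majority <- function(cl, y) {
  tapply(y, cl, function(v) names(which.max(table(v))))
}

km$centers                         # centroids of the K-means clusters
majority(km$cluster, zp$class)
zp[pm$id.med, 1:8]                 # medoids: actual rows picked by PAM
majority(pm$clustering, zp$class)
```

For the plots, something like plot(pima[, 2], pima[, 6], col = km$cluster) on the original (not z-scored) data would show the clusters in the glucose/BMI space, where pima is an assumed name for the original data frame.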

6. Run DBSCAN on the ZPima dataset; try to find values for the MinPoints and epsilon parameters such that the number of outliers is less than 15% and the number of clusters is between 2 and 12; visualize the obtained clustering result in the Plasma glucose/Body mass index attribute space on the original dataset and report its entropy. Comment on the quality of the obtained clustering result.

7. Run K-means with nstart=20 for k=8 and k=11 for the Complex8 dataset; visualize the results, compute their entropy, and discuss the obtained clustering results and their quality.

8. Next, run DBSCAN for the Complex8 dataset; try to find "good" parameter settings for epsilon and MinPoints such that the entropy of the obtained clustering result is minimized and the number of outliers is less than 8%. Develop a search procedure that looks for the "best" DBSCAN clustering for the Complex8 dataset by varying the epsilon/MinPoints parameters. Report the best clustering result you found, report the epsilon and MinPoints parameters you used to obtain this clustering result, and report its percentage of outliers and entropy. Results with lower entropy will receive higher scores; students who find the "best" result will get extra credit! Also compare the DBSCAN results with those you obtained for K-means in Task 7, and assess which clustering algorithm did a better job in clustering the Complex8 dataset!
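One simple form such a search procedure could take is an exhaustive grid search, sketched below. It treats lower entropy as better, since by the definition of H an entropy of 0 means every cluster contains a single class. The names search_dbscan and cluster_fn are hypothetical; the default cluster_fn assumes the dbscan package is installed and uses its dbscan() function, which labels noise points as cluster 0.

```r
# Compact entropy helper following the Task 1 definition; 0 marks outliers.
entropy <- function(a, b) {
  keep <- a != 0
  out <- mean(!keep)
  if (!any(keep)) return(c(NA_real_, out))
  tab <- unclass(table(a[keep], b[keep]))
  p <- tab / rowSums(tab)
  h <- rowSums(ifelse(p > 0, p * log2(1 / p), 0))
  c(sum(rowSums(tab) / sum(tab) * h), out)
}

# Grid search over eps / MinPts. cluster_fn(x, eps, minpts) must return an
# integer cluster vector with 0 for outliers.
search_dbscan <- function(x, labels, eps_grid, minpts_grid, max_outliers = 0.08,
                          cluster_fn = function(x, eps, minpts)
                            dbscan::dbscan(x, eps = eps, minPts = minpts)$cluster) {
  best <- NULL
  for (eps in eps_grid) {
    for (mp in minpts_grid) {
      r <- entropy(cluster_fn(x, eps, mp), labels)
      # Skip runs with no clusters or with too many outliers.
      if (is.na(r[1]) || r[2] >= max_outliers) next
      if (is.null(best) || r[1] < best$entropy) {
        best <- list(eps = eps, minpts = mp, entropy = r[1], outliers = r[2])
      }
    }
  }
  best  # NULL if no parameter pair satisfied the outlier constraint
}

# Hypothetical call; `complex8` is an assumed data frame with x, y, class columns:
# best <- search_dbscan(complex8[, 1:2], complex8[, 3],
#                       eps_grid = seq(5, 30, by = 1), minpts_grid = 2:10)
```

A caveat on the design: entropy alone rewards many tiny pure clusters, so it may be worth also recording the number of clusters for each parameter pair and discarding degenerate results. A finer grid around the best coarse result is a cheap refinement.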

9. Summarize to what extent K-means, PAM, and DBSCAN were able to rediscover the classes in the Complex8 and Pima/ZPima datasets! Moreover, did your experimental results reveal anything interesting about diabetes?

Note - Please use R and submit: the R code for the tasks, the data files needed to run the R code, and the assignment report containing all the plots and results along with their interpretations.

Attachment:- Assignment File.rar

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd