Reference no: EM132488935
Activity: Clustering genes (Part B)
Unlike the leukemia data in the first activity, which is very high-dimensional, the YeastGalactose dataset has only moderate dimensionality (20 dimensions), so density-based clustering algorithms may work in this scenario. In this activity we will experiment with the HDBSCAN* algorithm.
Important note: The HDBSCAN* implementation that we are familiar with, available in the package dbscan, states in the version current when this assignment was prepared that "Euclidean distance is required". So, although the theoretical HDBSCAN* model works with any distance, in principle we should not run HDBSCAN* directly with Pearson using this package. Other R implementations of the algorithm may exist, but here we will stick with the package dbscan and use a mathematical workaround. Specifically, it can be shown that Pearson similarity and Euclidean distance are related when the observations are normalised as unit vectors, that is, when each row of the data matrix is rescaled to have magnitude one (i.e., length = 1). Clustering the normalised data with Euclidean distance is therefore expected to provide results similar to those that would be obtained by clustering the original data with Pearson similarity.
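The relation mentioned above can be checked numerically. The following is a minimal sketch on synthetic vectors (not the YeastGalactose data); all variable names are illustrative. It relies on a standard identity: for two unit-length vectors, the squared Euclidean distance equals 2(1 - cosine similarity), and cosine similarity coincides with Pearson correlation when each vector is also mean-centred before rescaling. With rescaling alone, as in the workaround described here, the match with Pearson is therefore approximate rather than exact.

```r
set.seed(1)
# Synthetic stand-ins for two expression profiles with 20 measurements each
x <- rnorm(20)
y <- rnorm(20)

# Rescale a vector so that its magnitude (Euclidean norm) is 1
unit <- function(v) v / sqrt(sum(v^2))

# Case 1: unit vectors only -> squared distance is tied to cosine similarity
xu <- unit(x)
yu <- unit(y)
d2_unit <- sum((xu - yu)^2)
cos_sim <- sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# Case 2: centre each vector, then rescale -> tied to Pearson correlation
xc <- unit(x - mean(x))
yc <- unit(y - mean(y))
d2_centred <- sum((xc - yc)^2)
pearson <- cor(x, y)

# Both comparisons should hold up to floating-point tolerance
print(all.equal(d2_unit, 2 * (1 - cos_sim)))
print(all.equal(d2_centred, 2 * (1 - pearson)))
```

Because the squared distance is a decreasing function of the similarity, ranking pairs by Euclidean distance on the rescaled rows gives the reverse of the ranking by similarity.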
You are asked to:
4. Use the distance matrix as input to call the Single-Linkage clustering algorithm available from the base R package stats and plot the resulting dendrogram. Do not use any class labels to perform this step.
5. Use the distance matrix as input to call the Complete-Linkage clustering algorithm available from the base R package stats and plot the resulting dendrogram. Do not use any class labels to perform this step.
6. Use the distance matrix as input to call the Average-Linkage clustering algorithm available from the base R package stats and plot the resulting dendrogram. Do not use any class labels to perform this step.
7. Use the distance matrix as input to call Ward's clustering algorithm available from the base R package stats and plot the resulting dendrogram. Do not use any class labels to perform this step.
8. Compare the dendrograms plotted in Items 4 to 7. Visually, the dendrograms suggest that some clustering algorithm(s) generate clearer clusters than the others. In your opinion, which algorithm(s) may we be referring to and why? In particular, in which aspects do the results produced by this/these algorithm(s) look clearer? Perform Item 9 below only for this/those algorithm(s).
9. Redraw the dendrogram(s) for the selected algorithm(s) in Item 8, now using the class labels that you stored separately in Item 2 to label the observations (as arranged along the horizontal axis of the dendrogram). Do some prominent clusters in the dendrogram(s) correspond approximately to the classes (that is, the two subtypes of leukemia)?
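Items 4 to 9 can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, the distance matrix is assumed to be the Pearson-based dissimilarity 1 - correlation (the step that builds the real distance matrix is not shown in this section), and the names X, labels_true and trees are hypothetical. Note that hclust offers two Ward variants, "ward.D" and "ward.D2"; the sketch uses "ward.D2", and which of the two is intended is not stated here.

```r
set.seed(42)
p <- 50
# Two synthetic groups of 15 observations, each group sharing a noisy pattern
pattern_a <- rnorm(p)
pattern_b <- rnorm(p)
X <- rbind(t(replicate(15, pattern_a + rnorm(p, sd = 0.5))),
           t(replicate(15, pattern_b + rnorm(p, sd = 0.5))))
labels_true <- rep(c("A", "B"), each = 15)  # stored aside, not used to cluster

# Assumed dissimilarity: 1 - Pearson correlation between rows
# (cor() correlates columns, hence the transpose)
d <- as.dist(1 - cor(t(X)))

# One hierarchy per linkage criterion, all from the base package stats
methods <- c("single", "complete", "average", "ward.D2")
trees <- lapply(methods, function(m) hclust(d, method = m))
names(trees) <- methods

# Items 4-7: dendrograms without any class labels
op <- par(mfrow = c(2, 2))
for (m in methods) {
  plot(trees[[m]], labels = FALSE, main = m, xlab = "", sub = "")
}
par(op)

# Item 9: redraw one selected dendrogram with the class labels on the leaves
plot(trees[["average"]], labels = labels_true, main = "average", xlab = "", sub = "")
```

The choice of "average" in the last call is arbitrary; in the activity it would be whichever algorithm(s) were selected in Item 8.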
15. Rescale the 205 x 20 data frame in a row-wise fashion so that each rescaled row has magnitude 1. You can achieve this by dividing each element of a row by the magnitude of the row.
16. Run HDBSCAN* (with Euclidean distance) on the rescaled version of the data frame obtained in Item 15. You can (optionally) try different values for the parameter MinPts, but MinPts = 5 is required. Plot the resulting HDBSCAN* dendrograms with and without the class labels along the horizontal axis, just like in Items 4-9 (Activity 1) and Item 14 (Activity 2).
17. Plot a contingency table. By setting MinPts = 5, the automatic cluster extraction method provided by HDBSCAN* extracts four clusters from the resulting hierarchy. Plot a contingency table of these clusters (labelled '0', '1', '2', '3' and '4', where '0' means objects left unclustered as noise/outliers) against the ground truth class labels that you stored separately in Item 12 (a factor with levels 'cluster1', 'cluster2', 'cluster3', 'cluster4').
18. Interpret the contingency table. In particular: (a) What is the best correspondence between the four found clusters and the clusters according to the ground truth, that is, the best association between cluster labels '1', '2', '3' and '4' as named by HDBSCAN* and the four known functional categories 'cluster1', 'cluster2', 'cluster3' and 'cluster4' as named in the ground truth? (b) What is the functional category for which most genes have been labelled as noise/outliers?
19. Plot the genes grouped by their class labels (that is, functional categories 'cluster1', 'cluster2', 'cluster3' and 'cluster4'), in such a way that all the genes belonging to the same class are plotted in a separate sub-figure (four sub-figures in total, each one in a different colour). Plot each gene as a time-series with 20 data points (where each point is connected by lines to its adjacent points in the series).
20. Plot a figure analogous to the one in Item 19, but now with genes grouped in separate sub-figures according to their cluster as assigned by HDBSCAN* ('1', '2', '3' and '4'), rather than by class labels. Do not plot genes that were left unclustered as noise by HDBSCAN* (labelled '0'). Use the best class-to-cluster association, as in your answer to Item 18, in order to assign each sub-figure of a cluster the same colour used in the sub-figure of the corresponding class in Item 19. For instance, supposing that the best association of class 'clusterX' in the ground truth is with HDBSCAN* cluster 'Y', according to the contingency table in Item 18, then if the genes belonging to class 'clusterX' have been plotted in red in Item 19, then the genes belonging to HDBSCAN* cluster 'Y' should also be plotted in red.
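Items 15 to 19 might be approached along the following lines. This is a hedged sketch rather than a reference solution: the data are synthetic stand-ins (80 x 20 instead of the real 205 x 20 data frame), and the names X, classes, X_unit and res are hypothetical. The HDBSCAN* part assumes the dbscan package is installed and is skipped otherwise. The time-series plot uses the unscaled values, which is an assumption, since the item does not say which version to plot.

```r
set.seed(7)
n_per <- 20
p <- 20
# Synthetic stand-in: four groups of genes, each sharing a noisy profile
patterns <- matrix(rnorm(4 * p), nrow = 4)
X <- do.call(rbind, lapply(1:4, function(k) {
  t(replicate(n_per, patterns[k, ] + rnorm(p, sd = 0.3)))
}))
classes <- factor(rep(paste0("cluster", 1:4), each = n_per))

# Item 15: divide each row by its magnitude (Euclidean norm).
# R recycles row_norm down the columns, so row i is divided by row_norm[i].
row_norm <- sqrt(rowSums(X^2))
X_unit <- X / row_norm

# Items 16-17: HDBSCAN* on the rescaled rows (needs the dbscan package)
if (requireNamespace("dbscan", quietly = TRUE)) {
  res <- dbscan::hdbscan(X_unit, minPts = 5)
  plot(res$hc, labels = FALSE)                  # hierarchy without labels
  plot(res$hc, labels = as.character(classes))  # hierarchy with class labels
  # Contingency table of extracted clusters (0 = noise) against the classes
  print(table(cluster = res$cluster, class = classes))
}

# Item 19: one sub-figure per class, each gene drawn as a 20-point line
cols <- c("red", "blue", "darkgreen", "purple")
op <- par(mfrow = c(2, 2))
for (k in 1:4) {
  rows <- classes == levels(classes)[k]
  matplot(t(X[rows, , drop = FALSE]), type = "l", lty = 1, col = cols[k],
          main = levels(classes)[k], xlab = "time point", ylab = "expression")
}
par(op)
```

For Item 20 the same plotting loop can be reused with res$cluster in place of classes, skipping label 0 and reordering cols according to the class-to-cluster association read off the contingency table.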
Attachment: data.rar