Reference no: EM133390141
Question: Let us explore the potential of Multiple Kernel Learning (MKL) to both fuse the information from different features and to reduce their dimensionality. This will produce an embedding that we will call the output space.
1 The data used for this section comes from the trials PREDICT-AF and INDEPTH-HCM. Both trials are focused on cardiac pathology, more precisely on atrial fibrillation and hypertrophic cardiomyopathy, respectively. The data was gathered by the cardiology unit at Hospital Clínic de Barcelona. More specifically, you will be given data for controls, athletes, hypertensive patients and hypertrophic cardiomyopathy patients. This data consists of: 1) the left ventricular outflow and inflow; 2) the temporal aligning feature; 3) tissue Doppler imaging of the septal wall; 4) six regional strains of the left ventricle; and 5) left atrial strain.
2.1 The kernel matrix The data fusion capability of MKL comes from a step known as kernelization: working with a kernel matrix, instead of the raw data, for each of the features. The kernel matrix is obtained by applying a kernel function to the data. This kernel function can take very different forms, but in our case we will use the most popular one, known as the radial basis function (RBF) or Gaussian kernel. Given two points, we can measure their similarity using the RBF kernel, defined as:

K_m(x_im, x_jm) = exp( -||x_im - x_jm||^2 / (2σ^2) )

Here, x_im and x_jm stand for the data coming from two patients (patient i and patient j) for a given feature m. We define σ as the kernel bandwidth, which we set as the average of the Euclidean distances between each sample and its k-th nearest neighbour (computed on the corresponding feature). When the data points are vectors, the squared norm amounts to summing the squared difference at each position of the array; we divide that result by 2σ² and use its negation as the exponent of the exponential function.
• Apply the kernel function to the different features and visualize the kernels produced. Can you see any structure? (diagonal, regions of colder or hotter color)
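As a sketch of this step, the RBF kernel with the k-th-nearest-neighbour bandwidth described above can be computed with plain NumPy (the choice of k and the helper name are illustrative, not prescribed by the assignment):

```python
import numpy as np

def rbf_kernel_matrix(X, k=5):
    """RBF kernel matrix for one feature.

    X: (n_samples, n_dims) array holding feature m for all patients.
    The bandwidth sigma is the average distance between each sample
    and its k-th nearest neighbour, as stated in the text.
    """
    # Pairwise squared Euclidean distances via the expansion
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = np.sum(X ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative round-off
    d = np.sqrt(d2)

    # sigma: mean distance to the k-th nearest neighbour
    # (after sorting, column 0 is each point's distance to itself, 0)
    sigma = np.sort(d, axis=1)[:, k].mean()

    return np.exp(-d2 / (2.0 * sigma ** 2))

# One kernel per feature; visualize each e.g. with plt.imshow(K_m)
```

The diagonal of each kernel is exactly 1 (every patient is maximally similar to itself), which is the diagonal structure the question hints at.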
2.2 Fusing features with MKL Now that our kernel matrices are ready, we can launch the algorithm itself.
• Launch MKL and visualize the output space. Color it by patient type.
• Project onto the output space the hypertrophic cardiomyopathy patients. In which region of the space are they positioned? Is this expected?
• Color the output space using different clinical variables. Do you observe any correlation?
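The assignment does not specify the MKL solver here, so as a rough stand-in only: a minimal sketch of the fusion idea is an unweighted average of the per-feature kernels followed by kernel PCA (real MKL additionally learns a weight per kernel; this sketch fixes all weights to be equal):

```python
import numpy as np

def fuse_and_embed(kernels, n_dims=2):
    """Naive fusion sketch: equal-weight average of the per-feature
    kernels, then kernel PCA to obtain a low-dimensional output space.

    kernels: list of (n, n) kernel matrices, one per feature.
    Returns an (n, n_dims) embedding, one row per patient.
    """
    K = np.mean(kernels, axis=0)  # equal weights; MKL would learn them

    # Standard kernel-PCA centering of the fused kernel
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H

    # Keep the leading eigenvectors as the output-space coordinates
    vals, vecs = np.linalg.eigh(Kc)
    order = np.argsort(vals)[::-1][:n_dims]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))
```

Coloring the resulting 2-D scatter by patient type (or by a clinical variable) is then a plain `plt.scatter(emb[:, 0], emb[:, 1], c=...)` call.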
2.3 Clustering the output space Lastly, to find different strata of patients, we can use clustering techniques to identify subgroups of patients that might have a common phenotype, treatment response, or clinical presentation.
2.3.1 Silhouette method First, we need to estimate how many clusters to look for. We know we have 4 different types of patients (controls, athletes, hypertensive and hypertrophic cardiomyopathy), but let us agnostically estimate the optimal number of clusters. There are several techniques for this, but in our case we will use a metric known as the Silhouette method. This metric measures both intra-cluster cohesion and inter-cluster separation. It takes as input our output space and the range of cluster counts we wish to evaluate, and outputs a score for each candidate number of clusters.
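The sweep described above can be sketched with scikit-learn (the helper name and the candidate range are assumptions for illustration; higher silhouette score is better):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_sweep(embedding, k_range=range(2, 7), seed=0):
    """Silhouette score for each candidate number of clusters.

    embedding: the (n_samples, n_dims) output space.
    Returns {k: score}; pick the k with the highest score.
    """
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=seed).fit_predict(embedding)
        scores[k] = silhouette_score(embedding, labels)
    return scores
```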
2.3.2 K-means Once the number of clusters is set, we can launch our clustering algorithm, K-means. Since K-means relies on a random initialization, we will use several replicates to damp the effect of randomness. K-means needs as input our output space, the number of clusters and the number of replicates it has to compute. The result is the most frequent solution among the replicates: an array with the cluster label of each patient.
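A minimal sketch of this step with scikit-learn (note one difference from the text: scikit-learn's `n_init` keeps the replicate with the lowest within-cluster sum of squares rather than the most frequent solution, but it serves the same purpose of damping the random initialization):

```python
from sklearn.cluster import KMeans

def cluster_output_space(embedding, n_clusters, n_replicates=50, seed=0):
    """K-means with several random restarts.

    embedding: the (n_samples, n_dims) output space.
    Returns one cluster label per patient.
    """
    km = KMeans(n_clusters=n_clusters, n_init=n_replicates,
                random_state=seed)
    return km.fit_predict(embedding)
```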
2.3.3 Cluster analysis To characterize the patients in each of the clusters, try to visualize the clinical data and the features used for the learning. You can do so in different ways:
1. Create histograms of the continuous variables for each cluster.
2. Create a table that summarizes the clinical variables for each cluster.
3. Plot the features used for learning separated by cluster, using a different color for each cluster.
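Step 2 above can be sketched with a pandas group-by (the column names in the usage comment are hypothetical; substitute the actual clinical variables you are given):

```python
import pandas as pd

def summarize_clusters(clinical_df, labels):
    """Summary table of clinical variables per cluster:
    mean and standard deviation of each continuous variable.

    clinical_df: one row per patient; labels: cluster label per patient.
    """
    df = clinical_df.copy()
    df["cluster"] = labels
    return df.groupby("cluster").agg(["mean", "std"])

# Step 1, per-cluster histograms (matplotlib), e.g. for a variable "age":
# import numpy as np, matplotlib.pyplot as plt
# for c in np.unique(labels):
#     plt.hist(clinical_df.loc[labels == c, "age"],
#              alpha=0.5, label=f"cluster {c}")
# plt.legend()
```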