CSI 5810 - Information Retrieval and Knowledge Discovery
Assignment
1. In this exercise, you will work with the Census Income Data Set. Once you have downloaded the data, prepare a data visualization report along the lines of the visualization done for the Boston Housing data. Feel free to provide any additional visualizations that help in understanding the data. Write a paragraph describing the characteristics of the data that the visualizations reveal.
2. This exercise is designed to make you familiar with generating data from a multivariate normal distribution and using the generated data.
a. Generate 100 three-dimensional vectors from a normal distribution with mean vector [1 2 1]^T and 3x3 covariance matrix [5 0.8 -0.3; 0.8 3 0.6; -0.3 0.6 4].
b. Make scatter plots of x1 vs x2, x1 vs x3, and x2 vs x3. Explain whatever relationships you can gather from these plots.
c. Pick any 5 pairs of generated vectors and calculate the Euclidean and Mahalanobis distances between those pairs.
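One possible way to approach parts (a) to (c) is sketched below with NumPy. The mean vector and covariance matrix come from the exercise; the seed, the choice of pairs, and the helper names (generate_samples, euclidean, mahalanobis, plot_pairs) are assumptions made for illustration. The Mahalanobis distance here uses the known covariance matrix; using the sample covariance of the generated data instead would be an equally reasonable reading of the task.

```python
import numpy as np

# Mean vector and covariance matrix as given in part (a).
MEAN = np.array([1.0, 2.0, 1.0])
COV = np.array([[5.0, 0.8, -0.3],
                [0.8, 3.0, 0.6],
                [-0.3, 0.6, 4.0]])


def generate_samples(n=100, seed=0):
    """Draw n vectors from N(MEAN, COV). The seed is an arbitrary choice."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(MEAN, COV, size=n)


def euclidean(u, v):
    """Ordinary Euclidean distance between two vectors."""
    return float(np.linalg.norm(u - v))


def mahalanobis(u, v, cov=COV):
    """sqrt((u - v)^T cov^{-1} (u - v)), solved without forming the inverse."""
    d = u - v
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))


def plot_pairs(X):
    """Scatter plots of x1 vs x2, x1 vs x3, x2 vs x3 (requires matplotlib)."""
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    for ax, (i, j) in zip(axes, [(0, 1), (0, 2), (1, 2)]):
        ax.scatter(X[:, i], X[:, j], s=12)
        ax.set_xlabel(f"x{i + 1}")
        ax.set_ylabel(f"x{j + 1}")
    fig.tight_layout()
    plt.show()


X = generate_samples()
pairs = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]  # any 5 pairs will do
distances = [(euclidean(X[i], X[j]), mahalanobis(X[i], X[j])) for i, j in pairs]
for (i, j), (d_e, d_m) in zip(pairs, distances):
    print(f"pair ({i}, {j}): Euclidean = {d_e:.3f}, Mahalanobis = {d_m:.3f}")
```

Calling plot_pairs(X) produces the three scatter plots. Because the off-diagonal covariances are small relative to the variances, the plots should show only weak linear trends: slightly positive for x1 vs x2 and x2 vs x3, slightly negative for x1 vs x3.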
3. Consider the following five-dimensional records, each consisting of attributes 1 to 5:
Suppose we are interested in reducing the five-dimensional records to two dimensions by means of principal component analysis. List the eigenvalues and eigenvectors obtained via PCA.
Determine the reduced representation for all of the records, and plot the reduced representation in the form of a scatter plot. Reconstruct the original data and compute the reconstruction error.
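The PCA steps in this exercise (eigen-decomposition, projection to two dimensions, reconstruction, and reconstruction error) can be sketched as follows. The records are not reproduced in this text, so the array X below is a randomly generated placeholder and not the assignment data; the function name pca_reduce and the choice of mean squared error per record as the error measure are also assumptions.

```python
import numpy as np


def pca_reduce(X, k=2):
    """PCA via eigen-decomposition of the sample covariance matrix.

    Returns eigenvalues and eigenvectors (sorted by decreasing eigenvalue),
    the k-dimensional representation, the reconstruction, and the mean
    squared reconstruction error per record.
    """
    mean = X.mean(axis=0)
    Xc = X - mean                            # center each attribute
    cov = np.cov(Xc, rowvar=False)           # 5x5 sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: covariance is symmetric
    order = np.argsort(eigvals)[::-1]        # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :k]                       # top-k principal directions
    Z = Xc @ W                               # reduced representation
    X_hat = Z @ W.T + mean                   # reconstruction in original space
    mse = float(np.mean(np.sum((X - X_hat) ** 2, axis=1)))
    return eigvals, eigvecs, Z, X_hat, mse


# Placeholder data only: 8 hypothetical five-dimensional records.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))

eigvals, eigvecs, Z, X_hat, mse = pca_reduce(X, k=2)
print("eigenvalues:", np.round(eigvals, 4))
print("eigenvectors (columns):")
print(np.round(eigvecs, 4))
print("reconstruction error (mean squared, per record):", round(mse, 4))
```

A useful sanity check is that the mean squared reconstruction error equals (n - 1)/n times the sum of the discarded eigenvalues, where n is the number of records. The columns of Z can be passed directly to a scatter plot.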
4. Apply PCA to the Breast Cancer Dataset and reduce the data to two dimensions. (The class labels are not used in PCA.) List all eigenvalues and make a scatter plot of the transformed data. Show the transformed malignant and benign data points in different colors or shapes.
5. Repeat Exercise #4 using the t-SNE visualization method. Perform the visualization with two perplexity values, 10 and 50. Comment on the results obtained.
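A minimal sketch of this exercise with scikit-learn follows. It assumes that the copy of the Wisconsin diagnostic breast cancer data bundled with scikit-learn (load_breast_cancer) is an acceptable stand-in for the assignment's dataset, and that the attributes are standardized before PCA; neither is stated in the exercise. The helper name plot_pca is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
# Standardizing is an assumption: the attributes have very different scales.
X_std = StandardScaler().fit_transform(data.data)
y = data.target  # used only for coloring the plot, never passed to PCA

pca = PCA().fit(X_std)                # keep all components to list eigenvalues
eigenvalues = pca.explained_variance_ # eigenvalues of the covariance matrix
Z = pca.transform(X_std)[:, :2]       # first two principal components

print("eigenvalues:", np.round(eigenvalues, 4))


def plot_pca(Z, y, target_names):
    """Scatter plot of the 2-D projection, one color per class (matplotlib)."""
    import matplotlib.pyplot as plt

    for label, name in enumerate(target_names):
        mask = y == label
        plt.scatter(Z[mask, 0], Z[mask, 1], s=12, label=name)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.legend()
    plt.show()
```

Calling plot_pca(Z, y, data.target_names) draws the malignant and benign points in different colors.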
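The t-SNE comparison might look like the sketch below, which uses scikit-learn's TSNE. As assumptions not stated in the exercise, it loads scikit-learn's bundled breast cancer data, standardizes the attributes, initializes the embedding from PCA, and fixes the random seed so the two runs are repeatable. The helper name plot_embeddings is hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_std = StandardScaler().fit_transform(data.data)  # assumption: standardize first
y = data.target  # used only for coloring, not for fitting

# One 2-D embedding per perplexity value requested in the exercise.
embeddings = {}
for perplexity in (10, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X_std)


def plot_embeddings(embeddings, y, target_names):
    """Side-by-side scatter plots, one panel per perplexity (matplotlib)."""
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, len(embeddings), figsize=(6 * len(embeddings), 5))
    for ax, (perplexity, Z) in zip(axes, embeddings.items()):
        for label, name in enumerate(target_names):
            mask = y == label
            ax.scatter(Z[mask, 0], Z[mask, 1], s=12, label=name)
        ax.set_title(f"t-SNE, perplexity = {perplexity}")
        ax.legend()
    fig.tight_layout()
    plt.show()
```

Calling plot_embeddings(embeddings, y, data.target_names) shows both embeddings. When commenting on the results, keep in mind that a lower perplexity emphasizes local neighborhoods and tends to break the data into smaller clumps, while a higher perplexity gives more weight to global structure; distances between clusters in a t-SNE plot are not directly interpretable.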