CPSC 552 Cyber Forensics - California State University Fullerton
Problem # 1:
a) Program the Naïve Bayes classifier to determine whether a message is spam or ham (not spam). Code is provided in this document.
In this problem, you will use the SMS Spam Collection dataset. Each line of the dataset has two parts: the first indicates whether the message is ham or spam, and the second (after a tab) contains the message text. The spam-detection implementation involves splitting the dataset into training and test sets; typically, we use 80% of the data for training and the remaining 20% for testing. From the training set, we build a list of unique words (after removing punctuation characters). Using the words in a message, the Naïve Bayes algorithm computes a likelihood estimate for the spam case and for the ham case, and the message is assigned to whichever class has the higher likelihood.
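The pipeline described above can be sketched as follows. This is a minimal illustration, not the code provided with the assignment: a tiny inline list of labeled messages stands in for the SMS Spam Collection file, and Laplace smoothing with alpha = 1 is assumed.

```python
import re
import math
from collections import Counter

# Tiny inline stand-in for the SMS Spam Collection (label, message).
data = [
    ("ham", "Are we still meeting for lunch today"),
    ("ham", "Ok see you at the library later"),
    ("spam", "WINNER you have won a free prize call now"),
    ("spam", "Free entry win cash prize text WIN now"),
    ("ham", "Can you send me the notes from class"),
]

def tokenize(msg):
    # Lowercase and strip punctuation, as described above.
    return re.findall(r"[a-z0-9']+", msg.lower())

# Build per-class word counts and class priors from the "training" set.
counts = {"ham": Counter(), "spam": Counter()}
priors = Counter()
for label, msg in data:
    priors[label] += 1
    counts[label].update(tokenize(msg))

vocab = set(counts["ham"]) | set(counts["spam"])
alpha = 1.0  # Laplace smoothing constant (assumed)

def log_likelihood(msg, label):
    # log p(label) + sum of log p(word | label), with smoothing.
    total = sum(counts[label].values())
    ll = math.log(priors[label] / sum(priors.values()))
    for w in tokenize(msg):
        ll += math.log((counts[label][w] + alpha) / (total + alpha * len(vocab)))
    return ll

def classify(msg):
    # Whichever class has the higher likelihood wins.
    return max(("ham", "spam"), key=lambda c: log_likelihood(msg, c))

pred = classify("free prize win now")
```

On this toy training set, spammy wording is classified as spam and ordinary wording as ham; the real assignment applies the same steps to the 80/20 split of the full dataset.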
a) If the above program encounters a word during the test phase that is not in its vocabulary, how does it handle it?
b) The above Naïve Bayes spam-filtering program uses Laplace smoothing. Explain briefly why it is needed. What is a good value for alpha? Which value of alpha results in the highest test accuracy?
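The need for smoothing can be seen with hypothetical counts: without it, a single word that never occurred in a class's training messages drives that class's estimated probability to zero. The counts below are invented for illustration.

```python
# Hypothetical counts: a word appears 0 times among 1000 training
# words of a class, with a vocabulary of 5000 unique words.
count_w, total, vocab_size = 0, 1000, 5000

# Unsmoothed estimate: zero, which zeroes out the whole product
# p(w1|c) * p(w2|c) * ... no matter what the other words say.
p_unsmoothed = count_w / total

# Laplace-smoothed estimate with alpha = 1: small but nonzero.
alpha = 1.0
p_smoothed = (count_w + alpha) / (total + alpha * vocab_size)
```

Larger alpha flattens the word distributions toward uniform; which value maximizes test accuracy is exactly what the question asks you to measure empirically.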
c) What are the strengths and weaknesses of the Naïve Bayes algorithm?
d) If you were hired by a marketing firm to send spam emails, how would you compose the messages to fool the Naïve Bayes spam detector?
e) If there are many features (high-dimensional data), the multiplication of probabilities involved in computing p(Y|w1,w2,...,wn) can underflow to zero. Discuss how this can be overcome in Naïve Bayes classification.
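The underflow in question can be demonstrated directly; the numbers below are arbitrary but representative. Multiplying many small per-feature probabilities collapses to 0.0 in floating point, while summing their logarithms stays finite and still lets the classes be compared.

```python
import math

# 2000 hypothetical per-word likelihoods of 1e-3 each. The true
# product is 1e-6000, far below the smallest positive float64.
probs = [1e-3] * 2000

product = 1.0
for p in probs:
    product *= p          # underflows to exactly 0.0

# Working in log space avoids the underflow entirely: argmax over
# classes of the log-sum equals argmax over the product.
log_sum = sum(math.log(p) for p in probs)
```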
Problem #2:
a) Suppose the following 2-D data is given to you, which you need to divide into 2 clusters using GMM.
Suppose a data point d = (3, 8) is given to you, and you decide to use (7, 7) as the mean of the first cluster and (8, 8) as the mean of the second cluster. At the very beginning of the GMM algorithm, what will the likelihoods p(c1|d) and p(c2|d) be? Show your hand calculations (you may use a calculator).
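The hand calculation can be checked numerically. The sketch below assumes identity covariance matrices and equal mixing weights for the very first E-step, a common initialization; if the assignment specifies different initial covariances or priors, substitute those instead.

```python
import math

def gauss2d(x, mu):
    # Spherical 2-D Gaussian density with identity covariance
    # (an assumed initialization, not given in the problem).
    d2 = (x[0] - mu[0]) ** 2 + (x[1] - mu[1]) ** 2
    return math.exp(-d2 / 2.0) / (2.0 * math.pi)

d = (3.0, 8.0)
p1 = gauss2d(d, (7.0, 7.0))   # component 1 density at d
p2 = gauss2d(d, (8.0, 8.0))   # component 2 density at d

# With equal priors, the responsibilities p(c|d) are just the
# normalized component densities (Bayes' rule).
r1 = p1 / (p1 + p2)
r2 = p2 / (p1 + p2)
```

Under these assumptions the squared distances are 17 and 25, so r1 = 1 / (1 + e^(-4)), i.e. d is assigned mostly to the first component even though both means are far away.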
b) For the multivariate GMM algorithm, the initial means of the Gaussian components are selected randomly by picking data items to serve as the means. Thus, there is some probability that the initial means belong to the same class, or that one class's mean is never chosen at the start. Can the GMM still converge and create meaningful clusters of the classes involved? Set the initial means to points from the same class for the Iris dataset in your implementation and show the resulting clustering.
c) How did we initialize the mean and covariance of each cluster in Assignment 2? Search for a better strategy for selecting the initial means and covariances, and explain the strategy briefly.
d) For high-dimensional data, the GMM algorithm can be problematic, as it needs to compute the inverse and determinant of the covariance matrix. Despite the advantages of GMM, such as its probabilistic interpretation and robustness against observation noise, maximum-likelihood estimation for GMM does not perform well in high-dimensional settings. If you are given high-dimensional data (e.g., a cancer dataset with 5 types of cancer), describe how you would reduce the dimensionality of the data before applying GMM.
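One common approach, sketched below with synthetic data standing in for a real cancer dataset, is to standardize the features and apply PCA before GMM, keeping only enough components to explain a chosen fraction of the variance. The 95% threshold and the two-cluster synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for high-dimensional data (e.g. gene expression):
# 200 samples, 500 features, two well-separated groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 500)),
               rng.normal(3, 1, (100, 500))])

# Standardize, then keep enough principal components to explain
# 95% of the variance (a float n_components triggers this in sklearn).
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_red = pca.fit_transform(X_std)

# Fit the GMM in the reduced space, where covariance matrices are
# small enough to invert stably.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X_red)
labels = gmm.predict(X_red)
```

The covariance matrices GMM must invert are now of the reduced size rather than 500 x 500; alternatives such as diagonal covariances (covariance_type="diag") attack the same problem from the model side.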
e) How does GMM compare with simple K-means clustering? Briefly describe the ideas behind the K-means++ and K-medoids algorithms.
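The core idea of K-means++ (spreading the initial centers out) is compact enough to sketch. This is an illustrative implementation of the seeding step only, on synthetic data; it is not K-medoids and not the full K-means loop.

```python
import numpy as np

def kmeanspp_init(X, k, rng):
    # K-means++ seeding: the first center is uniform at random; each
    # subsequent center is drawn with probability proportional to the
    # squared distance to the nearest center chosen so far.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

# Two tight synthetic clusters far apart: the seeding almost surely
# places one initial center in each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
centers = kmeanspp_init(X, 2, rng)
```

K-medoids differs in that centers are constrained to be actual data points throughout the algorithm (not just at initialization), which makes it usable with arbitrary dissimilarity measures and more robust to outliers.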
Problem #3: For data visualization, we have discussed four algorithms in class. Principal Component Analysis (PCA) is one, where we use two principal components to visualize the data in two dimensions. In Assignment 4, you programmed PCA from scratch; PCA is also pre-programmed in the sklearn library. Create a new project called DataVisualization. Add a file called Utils.py with the following code in it.
a) The sklearn library provides the MDS algorithm, which can be called similarly to the code for pca, tsne, or umap. Add a file to the project called mds.py in which you write the code for MDS visualization, and show the resulting plot for the cancer dataset.
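A possible shape for mds.py is sketched below. The assignment's Utils.py presumably loads the cancer dataset; here sklearn's load_breast_cancer stands in for that loader, and the colormap and output filename are arbitrary choices.

```python
# mds.py -- sketch of an MDS visualization of the cancer dataset.
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer  # stand-in for Utils.py loader
from sklearn.manifold import MDS
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)

# MDS embeds the data in 2-D while trying to preserve the pairwise
# distances of the original 30-dimensional points.
mds = MDS(n_components=2, random_state=0)
X2 = mds.fit_transform(X)

plt.scatter(X2[:, 0], X2[:, 1], c=data.target, cmap="coolwarm", s=10)
plt.title("MDS visualization of the breast cancer dataset")
plt.savefig("mds_cancer.png")
```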
b) If you compare the PCA visualization with the MDS visualization in 2-D, neither appears to separate the data into its own class clusters. Why do PCA and MDS behave similarly? Verify this on the MNIST dataset as well. Note that the code for PCA, TSNE, and UMAP already contains commented-out lines for testing the visualization on the MNIST dataset; you just need to uncomment them.
c) TSNE has a perplexity parameter, and UMAP has the parameters n_neighbors and min_dist. Briefly explain the purpose of these, and then determine the best perplexity for the MNIST and cancer datasets. Similarly, determine the best UMAP parameters for the MNIST and cancer datasets, and show the plots.
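A parameter sweep of the kind the question asks for might look like the sketch below. It uses sklearn's small digits dataset (a stand-in for full MNIST, subsampled to keep runtime low) and judges runs by the fitted KL divergence; "best" is ultimately a visual judgment, so treat the numbers only as a rough guide. The perplexity values are arbitrary choices.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Small stand-in for MNIST: 8x8 digit images, subsampled for speed.
X = load_digits().data[:500]

# Lower perplexity emphasizes local structure, higher perplexity
# emphasizes global structure. Record the final KL divergence of
# each fit as a rough numeric comparison.
results = {}
for perp in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perp, random_state=0)
    emb = tsne.fit_transform(X)
    results[perp] = tsne.kl_divergence_

# UMAP analogue (requires the separate umap-learn package, where the
# distance parameter is spelled min_dist):
#   import umap
#   emb = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(X)
```

n_neighbors in UMAP plays a role analogous to perplexity (size of the local neighborhood considered), while min_dist controls how tightly points are packed in the embedding.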
d) In your opinion, which of the four algorithms performs best on the MNIST dataset, which performs worst, and why? Similarly, which of the four algorithms performs best on the cancer dataset, which performs worst, and why?
Problem #4: What is the difference between SVD and eigendecomposition? Briefly explain by describing the steps of the two algorithms.
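The two decompositions, and the standard relationship between them, can be checked numerically with NumPy; the random 4x4 matrix below is just an example input.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))

# Eigendecomposition: A = V diag(w) V^{-1}. Defined only for square
# matrices, and w, V may be complex when A is not symmetric.
w, V = np.linalg.eig(A)

# SVD: A = U diag(s) Vt. Defined for any matrix, with real,
# non-negative singular values returned in decreasing order.
U, s, Vt = np.linalg.svd(A)

# Link between the two: the singular values of A are the square
# roots of the eigenvalues of the symmetric matrix A^T A.
w_sym = np.linalg.eigvalsh(A.T @ A)   # real, ascending order
s_from_eig = np.sqrt(w_sym)[::-1]     # reversed to descending
```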
Problem #5: Write a 2-3 page summary of the given paper.
Attachment:- Cyber Forensics.rar