CPSC 552 Cyber Forensics Assignment

Reference no: EM133115360

CPSC 552 Cyber Forensics - California State University Fullerton

Problem #1:

a) Program the Naïve Bayes classifier to determine whether a message is spam or ham (not spam). Code is provided in this document.

In this problem, you will use the SMS Spam Collection dataset. Each line of the dataset has two parts: the first part indicates whether the message is ham or spam, and the second part (after the tab) holds the message.

The spam detection implementation involves splitting the dataset into training and test sets. Typically, we use 80% of the data for training and the remaining 20% for testing. From the training set, we build a list of unique words (after removing punctuation characters). From the words in a message, the Naïve Bayes algorithm estimates the likelihood of the spam case and of the not-spam case. The message is assigned to whichever class has the higher likelihood.
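The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not the code provided with the assignment: the function names (`parse_line`, `tokenize`, `train`, `classify`) are hypothetical, the choice to skip words outside the vocabulary is an assumption, and the scores are summed in log space as one common way to combine per-word probabilities.

```python
import math
import string
from collections import Counter

def parse_line(line):
    """Split one dataset line of the form 'label<TAB>message'."""
    label, message = line.rstrip("\n").split("\t", 1)
    return label, message

def tokenize(message):
    """Lowercase, remove punctuation characters, split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return message.lower().translate(table).split()

def train(rows, alpha=1.0):
    """Count words per class. rows is an iterable of (label, message)."""
    word_counts = {"ham": Counter(), "spam": Counter()}
    class_counts = Counter()
    for label, message in rows:
        class_counts[label] += 1
        word_counts[label].update(tokenize(message))
    vocab = set(word_counts["ham"]) | set(word_counts["spam"])
    return {"word_counts": word_counts, "class_counts": class_counts,
            "vocab": vocab, "alpha": alpha}

def classify(model, message):
    """Return the class with the higher log score.

    Assumes both classes appear in the training rows.
    """
    total_msgs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    scores = {}
    for label in ("ham", "spam"):
        counts = model["word_counts"][label]
        n_words = sum(counts.values())
        score = math.log(model["class_counts"][label] / total_msgs)
        for word in tokenize(message):
            if word not in model["vocab"]:
                continue  # assumption: ignore out-of-vocabulary words
            # Laplace-smoothed estimate of p(word | class)
            score += math.log((counts[word] + alpha) /
                              (n_words + alpha * vocab_size))
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training rows, made up for illustration only
rows = [
    ("ham", "Are we still meeting for lunch today"),
    ("ham", "Call me when you get home"),
    ("spam", "WIN a FREE prize now!!! Call now"),
    ("spam", "Free entry: claim your prize today"),
]
model = train(rows)
```

With these toy rows, `classify(model, "claim your free prize")` returns `"spam"` and `classify(model, "are we meeting for lunch")` returns `"ham"`. On the real dataset, the rows would come from applying `parse_line` to each line and holding out 20% of them for testing.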

a) If the above program encounters a word during the test phase that is not in its vocabulary, how does it handle it?
b) The above Naïve Bayes spam filtering program uses Laplace smoothing. Briefly explain why it is needed. What is a good value of alpha for it? Which value of alpha results in the highest test accuracy?
c) What are the strengths and weaknesses of the Naïve Bayes algorithm?
d) If you were hired by a marketing firm to send spam emails, how would you compose the messages to fool the Naïve Bayes spam detector?
e) If there are many features (high-dimensional data), the multiplication of probabilities involved in computing p(Y|w1,w2,...,wn) can cause the result to become zero. Discuss how this can be overcome in Naïve Bayes classification.
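The numerical effect mentioned in part e) is easy to reproduce. The sketch below uses made-up numbers (1000 word probabilities of 0.01 each, chosen only for illustration) and compares the direct product with a sum of logarithms, which is one widely used way to keep the score representable.

```python
import math

# Assumed illustration values: 1000 features, each with probability 0.01
probs = [0.01] * 1000

# Direct product: the true value 1e-2000 is far below the smallest
# positive double, so the result underflows to exactly 0.0.
product = math.prod(probs)

# Sum of logs: the same quantity on a log scale stays finite.
log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0
print(log_sum)   # roughly -4605.17
```

Because the logarithm is monotonic, comparing log scores between classes selects the same class as comparing the products would, had they been representable.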

Problem #2:
a) Suppose the following 2-D data is given to you, and you need to divide it into two clusters using a Gaussian mixture model (GMM).

Suppose you are given a data point d=(3,8), and you decide to use (7,7) as the mean of the first cluster and (8,8) as the mean of the second cluster. At the very beginning of the GMM algorithm, what will the likelihoods p(c1|d) and p(c2|d) be? Show your hand calculations (you may use a calculator).
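A hand calculation of this kind can be checked with a short script. The problem statement gives only the two means, so the sketch below assumes an identity covariance matrix for both clusters and equal mixing weights of 0.5; with a different starting covariance or different weights the numbers change. The function names are hypothetical.

```python
import math

def gaussian_2d_identity(x, mean):
    """Density of a 2-D Gaussian with identity covariance (assumed)."""
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return math.exp(-0.5 * sq_dist) / (2 * math.pi)

def responsibilities(x, means, weights):
    """p(c_k | x) for each component, via Bayes' rule."""
    weighted = [w * gaussian_2d_identity(x, m)
                for w, m in zip(weights, means)]
    total = sum(weighted)
    return [v / total for v in weighted]

d = (3, 8)
means = [(7, 7), (8, 8)]
weights = [0.5, 0.5]  # assumed equal mixing weights

p_c1, p_c2 = responsibilities(d, means, weights)
print(round(p_c1, 4), round(p_c2, 4))
```

Under these assumptions the squared distances from d to the two means are 17 and 25, so p(c1|d) = 1 / (1 + exp(-4)), which is about 0.982, and p(c2|d) is about 0.018.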

b) For the multivariate GMM algorithm, the initial means of the Gaussian components are selected randomly by picking one of the data items to be the mean of each category. Thus, there is a chance that the initial means all belong to the same class, or that no data item from one of the classes is chosen as an initial mean. Can the GMM still converge and create meaningful clusters of the classes involved? In your implementation, set the initial means to data items from the same class of the Iris dataset and show the resulting clustering.

c) How did we initialize the mean and covariance of each cluster in Assignment 2? Search to see whether there is a better strategy for selecting the initial means and covariances. Explain the strategy briefly.

d) For high-dimensional data, the GMM algorithm can be problematic because it needs to compute the inverse and determinant of the covariance matrix. Despite the advantages of GMM, such as its probabilistic interpretation and robustness against observation noise, maximum-likelihood estimation for GMM does not perform well in high-dimensional settings. If you are given high-dimensional data (e.g., a cancer dataset covering 5 types of cancer), describe how you would reduce the dimensionality of the data before applying GMM.

e) How does GMM compare with simple K-means clustering? Briefly describe the ideas behind the K-means++ and K-medoids algorithms.
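As a point of reference for the K-means++ part of this question, the seeding step is commonly described as: pick the first center uniformly at random, then pick each further center with probability proportional to its squared distance from the nearest center chosen so far. The sketch below follows that description with NumPy. The function name `kmeans_pp_init` and the toy data are hypothetical, and only the seeding is shown, not the subsequent K-means iterations.

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """Choose k initial centers from the rows of X, K-means++ style.

    Assumes X has at least two distinct rows, so the squared
    distances never sum to zero.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    first = rng.integers(n)
    centers = [X[first]]
    # Squared distance from every point to its nearest chosen center
    d2 = np.sum((X - X[first]) ** 2, axis=1)
    for _ in range(1, k):
        probs = d2 / d2.sum()
        idx = rng.choice(n, p=probs)
        centers.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centers)

# Toy data: two tight groups far apart (made up for illustration)
X = np.array([[0.0, 0.0], [0.0, 0.1], [10.0, 10.0], [10.0, 10.1]])
centers = kmeans_pp_init(X, k=2, rng=0)
```

Because points far from the existing centers are much more likely to be drawn, the second center in this toy example almost always lands in the other group. The same idea is often used to choose starting means for a GMM.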

Problem #3: For data visualization, we have discussed four algorithms in class. Principal Component Analysis (PCA) is one of them: we use the first two principal components to visualize the data in two dimensions. In Assignment 4, you programmed PCA from scratch. PCA is also available preprogrammed in the sklearn library. Create a new project called DataVisualization. Add a file called Utils.py with the following code in it.

a) The sklearn library provides the MDS algorithm, which can be called in much the same way as the code for PCA, t-SNE, or UMAP. Add a file called mds.py to the project, write the code for MDS visualization in it, and show the resulting plot for the cancer dataset.
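One possible shape for such a file is sketched below. The Utils.py helpers and the course's cancer dataset are not reproduced in this document, so the sketch substitutes sklearn's built-in `load_breast_cancer` data purely as a stand-in, and the function names `embed_mds` and `plot_embedding` are hypothetical. Standardizing the features first is also an assumption rather than a requirement of MDS.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.manifold import MDS
from sklearn.preprocessing import StandardScaler

def embed_mds(X, random_state=0):
    """Project X to two dimensions with metric MDS."""
    X_scaled = StandardScaler().fit_transform(X)
    mds = MDS(n_components=2, random_state=random_state)
    return mds.fit_transform(X_scaled)

def plot_embedding(points, labels, title):
    """Scatter plot of a 2-D embedding, colored by class label."""
    import matplotlib.pyplot as plt  # imported here so embedding works without it
    plt.scatter(points[:, 0], points[:, 1], c=labels, s=8, cmap="viridis")
    plt.title(title)
    plt.xlabel("MDS dimension 1")
    plt.ylabel("MDS dimension 2")
    plt.show()
```

A typical call would load the data with `data = load_breast_cancer()`, compute `points = embed_mds(data.data)`, and then call `plot_embedding(points, data.target, "MDS")`. MDS works on the full pairwise distance matrix, so it can be slow on large datasets such as MNIST; subsampling the rows first is a common workaround.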

b) If you compare the PCA visualization with the MDS visualization in 2-D, neither appears to separate the data into distinct class clusters. Why do PCA and MDS behave similarly in this respect? Verify this on the MNIST dataset as well. Note that the code for PCA, t-SNE, and UMAP already contains commented-out lines for testing the visualization on the MNIST dataset; you only need to uncomment them.

c) t-SNE has the perplexity parameter, and UMAP has the n_neighbors and min_dist parameters. Briefly explain the purpose of each, and then determine the best perplexity for the MNIST and cancer datasets. Similarly, determine the best UMAP parameters for the MNIST and cancer datasets, and show the plots.

d) In your opinion, which of the four algorithms performs best on the MNIST dataset, which performs worst, and why? Similarly, which of the four algorithms performs best on the cancer dataset, which performs worst, and why?

Problem #4: What is the difference between singular value decomposition (SVD) and eigendecomposition? Briefly explain by describing the steps of the two algorithms.
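A small numerical experiment can make the relationship between the two decompositions concrete. The sketch below uses random data (made up for illustration) and relies on two standard facts: eigendecomposition applies to square matrices such as a covariance matrix, while SVD applies to any rectangular matrix, including the centered data matrix itself. For centered data, the squared singular values divided by n - 1 match the eigenvalues of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))   # 50 samples, 4 features (toy data)
Xc = X - X.mean(axis=0)        # center each feature
n = Xc.shape[0]

# Route 1: eigendecomposition of the (square, symmetric) covariance matrix
cov = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the (rectangular) centered data matrix
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vals = S ** 2 / (n - 1)

print(np.allclose(eigvals, svd_vals))                  # same variances
print(np.allclose(np.abs(eigvecs.T), np.abs(Vt)))      # same axes up to sign
```

The sign of each direction is arbitrary in both routes, which is why the comparison uses absolute values. The SVD route never forms the covariance matrix explicitly.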

Problem #5: Write a 2-3 page summary of the given paper.

Attachment:- Cyber Forensics.rar




