Detecting Spam Email (from the UCI Machine Learning Repository):
A team at Hewlett Packard collected data on a large number of email-messages from their postmaster and personal email for the purpose of finding a classifier that can separate spam email-messages from non-spam ones (AKA "ham"). The spam concept is diverse: it includes advertisements for products or websites, "make money fast" schemes, chain letters, pornography, and so on. The definition used here is "unsolicited commercial e-mail". The file Spambase.xls contains information on 4601 email-messages, of which 1813 are tagged "spam". There are 57 predictors, most of which measure the average number of times a certain word (e.g., mail, George) or symbol (e.g., #, !) appears in the email. A few predictors are related to the number and length of capitalized words.
1. To reduce the number of predictors to a manageable size, examine how each predictor differs between the spam and non-spam emails by comparing the spam-class average and non-spam-class average. Which are the 11 predictors that appear to vary the most between spam and non-spam emails? From these 11, which words/signs occur more often in spam?
2. Partition the data into training and validation sets, then perform a discriminant analysis on the training data using only the 11 predictors.
3. If we are interested mainly in detecting spam messages, is this model useful? Use the confusion matrix, lift chart, and decile chart for the validation set for the evaluation.
4. In the sample, almost 40% of the email-messages were tagged as spam. However, suppose that the actual proportion of spam messages in these email accounts is 10%. Compute the constants of the classification functions to account for this information.
5. A spam filter that is based on your model is used, so that only messages that are classified as non-spam are delivered, while messages that are classified as spam are quarantined. In this case, misclassifying a non-spam email (as spam) has much more serious consequences than missing a spam message. Suppose that the cost of quarantining a non-spam email is 20 times that of not detecting a spam message. Compute the constants of the classification functions to account for these costs (assume that the proportion of spam is reflected correctly by the sample proportion).
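
The adjustments asked for in questions 4 and 5 can be sketched as follows. This is a minimal illustration, not the output of any particular software package. It assumes the fitted constant of each classification function already includes the log of the sample proportion of its class, so a new prior is applied by subtracting that log and adding the log of the new prior, and a misclassification cost is applied by adding the log of the cost of misclassifying a member of that class. The helper `adjust_constant` and the `fitted` constants are hypothetical placeholders; substitute the constants from your own discriminant analysis. If your software assumed equal priors instead, the shifts differ only by a term common to both classes, which does not change the classification.

```python
import math

def adjust_constant(constant, sample_prior, target_prior=None, misclassification_cost=1.0):
    """Shift one classification-function constant for a new prior and/or cost.

    Assumes `constant` already contains ln(sample_prior) for its class.
    """
    if target_prior is None:
        target_prior = sample_prior  # keep the sample proportion
    return (constant
            - math.log(sample_prior)
            + math.log(target_prior)
            + math.log(misclassification_cost))

# Sample proportions stated in the problem: 1813 spam out of 4601 messages.
n_total, n_spam = 4601, 1813
sample_prior = {"spam": n_spam / n_total, "nonspam": 1 - n_spam / n_total}

# Hypothetical fitted constants, for illustration only.
fitted = {"spam": -3.0, "nonspam": -1.5}

# Question 4: the true proportion of spam is taken to be 10%.
target_prior = {"spam": 0.10, "nonspam": 0.90}
q4 = {k: adjust_constant(fitted[k], sample_prior[k], target_prior[k]) for k in fitted}

# Question 5: misclassifying a non-spam message costs 20 times as much as
# misclassifying a spam message; priors stay at the sample proportions.
cost = {"spam": 1.0, "nonspam": 20.0}
q5 = {k: adjust_constant(fitted[k], sample_prior[k], misclassification_cost=cost[k])
      for k in fitted}

print("Question 4 constants:", q4)
print("Question 5 constants:", q5)
```

Only the constants change; the coefficients on the 11 predictors stay as fitted. Lowering the spam prior or raising the cost of misclassifying non-spam both raise the non-spam constant relative to the spam constant, so fewer messages end up classified as spam.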