Solution-Detecting Spam Email, A team at Hewlett Packard


Reference no: EM133300734

Detecting Spam Email (from the UCI Machine Learning Repository):

A team at Hewlett Packard collected data on a large number of email messages from their postmaster and personal email for the purpose of finding a classifier that can separate spam from non-spam (also known as "ham") messages. The spam concept is diverse: it includes advertisements for products or websites, "make money fast" schemes, chain letters, pornography, and so on. The definition used here is "unsolicited commercial e-mail". The file Spambase.xls contains information on 4601 email messages, of which 1813 are tagged "spam". There are 57 predictors, most of which are the average number of times a certain word (e.g., mail, George) or symbol (e.g., #, !) appears in the email. A few predictors relate to the number and length of capitalized words.

1. To reduce the number of predictors to a manageable size, examine how each predictor differs between the spam and non-spam emails by comparing the spam-class average and non-spam-class average. Which are the 11 predictors that appear to vary the most between spam and non-spam emails? From these 11, which words/signs occur more often in spam?

2. Partition the data into training and validation sets, then perform a discriminant analysis on the training data using only the 11 predictors.

3. If we are interested mainly in detecting spam messages, is this model useful? Evaluate it using the confusion matrix, lift chart, and decile chart for the validation set.

4. In the sample, almost 40% of the email-messages were tagged as spam. However, suppose that the actual proportion of spam messages in these email accounts is 10%. Compute the constants of the classification functions to account for this information.

5. A spam filter based on your model is used, so that only messages classified as non-spam are delivered, while messages classified as spam are quarantined. In this case, misclassifying a non-spam email (as spam) has much more serious consequences. Suppose that the cost of quarantining a non-spam email is 20 times that of not detecting a spam message. Compute the constants of the classification functions to account for these costs (assume that the proportion of spam is reflected correctly by the sample proportion).
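For question 1, the comparison of class averages could be sketched roughly as follows. This is only an illustration: the function name `rank_by_class_mean_gap`, the three predictor names, and the toy rows are all made up and are not taken from Spambase.xls. Note also that a raw difference favours predictors measured on a large scale (such as the capital-letter run lengths), so a relative or standardized difference may be a fairer ranking.

```python
from statistics import mean

def rank_by_class_mean_gap(rows, labels, names, top_n):
    """Rank predictors by |spam mean - non-spam mean|, largest gap first."""
    spam = [r for r, y in zip(rows, labels) if y == 1]
    ham = [r for r, y in zip(rows, labels) if y == 0]
    gaps = []
    for j, name in enumerate(names):
        spam_avg = mean(r[j] for r in spam)
        ham_avg = mean(r[j] for r in ham)
        gaps.append((name, spam_avg, ham_avg, spam_avg - ham_avg))
    gaps.sort(key=lambda g: abs(g[3]), reverse=True)
    return gaps[:top_n]

# Made-up toy rows (NOT values from Spambase); columns are hypothetical predictors.
names = ["free", "george", "!"]
rows = [
    [0.9, 0.0, 0.6],
    [0.7, 0.0, 0.4],
    [0.1, 1.2, 0.1],
    [0.0, 0.8, 0.0],
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = non-spam

for name, spam_avg, ham_avg, gap in rank_by_class_mean_gap(rows, labels, names, top_n=2):
    direction = "more common in spam" if gap > 0 else "more common in non-spam"
    print(f"{name}: spam={spam_avg:.2f} non-spam={ham_avg:.2f} ({direction})")
```

With the real data, `top_n=11` would give the shortlist asked for, and the sign of the gap shows which words or symbols occur more often in spam.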
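For question 2, one possible sketch of the partition and the discriminant analysis is shown below. It assumes the usual linear discriminant setup (a pooled within-class covariance matrix, with each class constant containing the log of the training-set class proportion). The data are synthetic, and the helper names `solve`, `fit_classification_functions`, and `classify` are hypothetical; a statistics package would normally do this step.

```python
import math
import random

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_classification_functions(rows, labels):
    """Return {class: (constant, coefficients)} for a linear discriminant model."""
    p, n = len(rows[0]), len(rows)
    classes = sorted(set(labels))
    groups = {k: [r for r, y in zip(rows, labels) if y == k] for k in classes}
    means = {k: [sum(r[j] for r in g) / len(g) for j in range(p)]
             for k, g in groups.items()}
    # Pooled within-class covariance matrix.
    pooled = [[0.0] * p for _ in range(p)]
    for k, g in groups.items():
        for r in g:
            d = [r[j] - means[k][j] for j in range(p)]
            for i in range(p):
                for j in range(p):
                    pooled[i][j] += d[i] * d[j] / (n - len(classes))
    funcs = {}
    for k in classes:
        coef = solve(pooled, means[k])
        # The constant includes ln(prior), estimated by the training proportion.
        const = (-0.5 * sum(c * mu for c, mu in zip(coef, means[k]))
                 + math.log(len(groups[k]) / n))
        funcs[k] = (const, coef)
    return funcs

def classify(funcs, row):
    """Assign the class whose classification function scores highest."""
    scores = {k: c0 + sum(c * x for c, x in zip(coef, row))
              for k, (c0, coef) in funcs.items()}
    return max(scores, key=scores.get)

# Synthetic two-predictor data (NOT Spambase): 1 = spam, 0 = non-spam.
rng = random.Random(0)
data = [([rng.gauss(2.0, 1.0), rng.gauss(0.5, 1.0)], 1) for _ in range(100)]
data += [([rng.gauss(0.0, 1.0), rng.gauss(1.5, 1.0)], 0) for _ in range(150)]
rng.shuffle(data)
cut = int(0.6 * len(data))  # 60% training, 40% validation (an arbitrary split)
train, valid = data[:cut], data[cut:]

funcs = fit_classification_functions([x for x, _ in train], [y for _, y in train])
accuracy = sum(classify(funcs, x) == y for x, y in valid) / len(valid)
print(f"validation accuracy: {accuracy:.2f}")
```

For the assignment itself, `rows` would hold only the 11 selected predictors, and the model would be fitted on the training partition alone.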
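For question 3, the confusion matrix and the bar heights of a decile chart can be computed from the validation-set labels and the model's spam scores. The sketch below uses invented labels and scores, and the function names `confusion_matrix` and `decile_lift` are hypothetical. The 0.5 cutoff is likewise just an assumption for illustration.

```python
def confusion_matrix(actual, predicted):
    """Counts keyed by (actual, predicted), with 1 = spam and 0 = non-spam."""
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts

def decile_lift(actual, scores, bins=10):
    """Spam rate in each score-ranked bin divided by the overall spam rate."""
    ranked = [a for _, a in sorted(zip(scores, actual), key=lambda t: t[0], reverse=True)]
    overall = sum(ranked) / len(ranked)
    size = len(ranked) / bins
    lifts = []
    for b in range(bins):
        chunk = ranked[round(b * size):round((b + 1) * size)]
        lifts.append(sum(chunk) / len(chunk) / overall)
    return lifts

# Invented validation results (NOT from Spambase): 20 messages, 8 of them spam.
actual = [1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [(19 - i) / 20 for i in range(20)]  # 0.95, 0.90, ..., 0.00
predicted = [1 if s >= 0.5 else 0 for s in scores]

cm = confusion_matrix(actual, predicted)
sensitivity = cm[(1, 1)] / (cm[(1, 1)] + cm[(1, 0)])  # share of spam that is caught
print("confusion matrix:", cm)
print(f"sensitivity: {sensitivity:.3f}")
print("decile lift:", [round(v, 2) for v in decile_lift(actual, scores)])
```

A model is useful for detecting spam when the top deciles show lift well above 1 and sensitivity is high. The false positives, counted in `cm[(0, 1)]`, matter for the cost question that follows.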
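For questions 4 and 5, the adjustments to the classification-function constants can be sketched as below. Two assumptions are made here and should be checked against the software actually used: that each fitted constant already contains ln(prior) for the priors the software assumed, and that costs are handled by the common textbook rule of adding ln(cost of misclassifying a member of that class) to that class's constant. The starting constants are hypothetical placeholders, not values fitted to Spambase.

```python
import math

def adjust_for_priors(constants, assumed_priors, true_priors):
    """Replace the ln(prior) term inside each constant with ln(true prior)."""
    return {k: c - math.log(assumed_priors[k]) + math.log(true_priors[k])
            for k, c in constants.items()}

def adjust_for_costs(constants, misclassification_costs):
    """Add ln(cost of misclassifying a class-k member) to class k's constant."""
    return {k: c + math.log(misclassification_costs[k])
            for k, c in constants.items()}

# Hypothetical fitted constants (placeholders, not Spambase results).
constants = {"spam": -3.10, "nonspam": -1.25}

# Sample proportions from the problem statement: 1813 spam out of 4601 messages.
sample_priors = {"spam": 1813 / 4601, "nonspam": (4601 - 1813) / 4601}

# Question 4: the true spam proportion is taken to be 10%.
q4 = adjust_for_priors(constants, sample_priors, {"spam": 0.10, "nonspam": 0.90})

# Question 5: quarantining a non-spam message costs 20 times a missed spam message.
q5 = adjust_for_costs(constants, {"spam": 1.0, "nonspam": 20.0})

print("prior-adjusted constants:", q4)
print("cost-adjusted constants:", q5)
```

If the software fitted the model with equal priors instead, passing 0.5 for each class as `assumed_priors` gives the same classifications, since a common shift in both constants does not change which function scores highest.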

