Text Classification with Naive Bayes

In this assignment, you will implement the Naive Bayes classification method and use it for sentiment classification of customer reviews. Write a report containing your answers, including the visualizations. Submit your report and your Python code/notebook.

Preliminaries

Read up on the Naive Bayes classifier: how to apply the Naive Bayes rule, and how to estimate the probabilities you need.
If you wish, you may also have a look at the following classic paper:

• Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan: Thumbs up? Sentiment Classification using Machine Learning Techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002).
The dataset we are using was originally created for the experiments described in the following paper. The research described there addresses the problem of domain adaptation, such as adapting a classifier trained on book reviews to work with camera reviews.
• John Blitzer, Mark Dredze, and Fernando Pereira: Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007).

Preparatory remarks

Frequency-counting in Python. The Counter class from Python's standard collections module is a special type of dictionary that is adapted for computing frequencies. In the following example, we compute the frequencies of words in a collection of two short documents. We use Counter in three different ways, but the end results are the same (freqs1, freqs2, and freqs3 are identical at the end). A Counter will not raise a KeyError if you look up a word that it hasn't seen before; it returns 0 instead.

from collections import Counter

example_documents = ['the first document'.split(), 'the second document'.split()]

# Way 1: increment the count one word at a time.
freqs1 = Counter()
for doc in example_documents:
    for w in doc:
        freqs1[w] += 1

# Way 2: update with one document (a list of words) at a time.
freqs2 = Counter()
for doc in example_documents:
    freqs2.update(doc)

# Way 3: build the Counter from a generator over all words.
freqs3 = Counter(w for doc in example_documents for w in doc)

print(freqs1)
print(freqs1['the'])        # 2
print(freqs1['neverseen'])  # 0, not a KeyError

Logarithmic probabilities. If you multiply many small probabilities, you may run into problems with numeric precision: the product underflows to zero. To avoid this, I recommend that you work with the logarithms of the probabilities instead of the probabilities themselves. To compute the logarithm in Python, use the log function from the numpy library.

The logarithm has the mathematical property that np.log(P1 * P2) = np.log(P1) + np.log(P2). So if you use log probabilities, all multiplications (for instance, in the Naive Bayes probability formula) are replaced by sums.

If you'd like to come back from log probabilities to normal probabilities, you can apply the exponential function, which is the inverse of the logarithm: prob = np.exp(logprob). (However, if the log probability is too small, exp will just return zero.)
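
To illustrate with a minimal example (the probability values here are made up just to show the underflow):

import numpy as np

p1, p2 = 1e-200, 1e-200
print(p1 * p2)                     # 0.0: the product underflows
logprob = np.log(p1) + np.log(p2)  # add log probabilities instead of multiplying
print(logprob)                     # about -921.0, easily representable
print(np.exp(logprob))             # 0.0: too small to convert back to a probability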

Reading the review data

This is a collection of customer reviews from six of the review topics used in the paper by Blitzer et al., (2007) mentioned above. The data has been formatted so that there is one review per line, and the texts have been split into separate words ("tokens") and lowercased. Here is an example of a line.
music neg 544.txt i was misled and thought i was buying the entire cd and it contains one song
A line in the file is organized in columns:
• 0: topic category label (books, camera, dvd, health, music, or software)
• 1: sentiment polarity label (pos or neg)
• 2: document identifier
• 3 and on: the words in the document
Here is some Python code to read the entire collection.

def read_documents(doc_file):
    # One review per line: column 0 is the topic label, column 1 the sentiment
    # label, column 2 the document identifier, and the rest are the words.
    docs = []
    labels = []
    with open(doc_file, encoding='utf-8') as f:
        for line in f:
            words = line.strip().split()
            docs.append(words[3:])
            labels.append(words[1])
    return docs, labels
The function removes the document identifier and the topic label, which you don't need unless you solve the first optional task. Then we split the data into a training part and an evaluation part. For instance, we may use 80% for training and the remainder for evaluation.
all_docs, all_labels = read_documents('all_sentiment_shuffled.txt')

split_point = int(0.80*len(all_docs))
train_docs = all_docs[:split_point]
train_labels = all_labels[:split_point]
eval_docs = all_docs[split_point:]
eval_labels = all_labels[split_point:]

Estimating parameters for the Naive Bayes classifier

Write a Python function that uses a training set of documents to estimate the probabilities in the Naive Bayes model. Return some data structure containing the probabilities or log probabilities. The function should take two input parameters: a list of documents and another list with the corresponding polarity labels. It could look something like this:
def train_nb(documents, labels):
    ...
    (return the data you need to classify new instances)
Hint 1. In this assignment, it is acceptable if you assume that we will always use the pos and neg categories.
Hint 2. Some sort of smoothing will probably improve your results. You can implement the smoothing either in train_nb or in score_doc_label that we discuss below.
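
For concreteness, here is a minimal sketch of what train_nb could look like, assuming the fixed pos and neg categories (hint 1) and add-alpha smoothing applied later at scoring time (hint 2). The dictionary layout used here is just one possible choice for the returned data structure, not a required one.

import numpy as np
from collections import Counter

def train_nb(documents, labels, alpha=1.0):
    # Collect word frequencies separately for each polarity class.
    word_counts = {'pos': Counter(), 'neg': Counter()}
    label_counts = Counter(labels)
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc)
    # The vocabulary spans both classes; its size is needed for smoothing.
    vocab_size = len(set(word_counts['pos']) | set(word_counts['neg']))
    return {
        'labels': ('pos', 'neg'),
        'alpha': alpha,  # add-alpha smoothing parameter (an assumed choice)
        'vocab_size': vocab_size,
        'word_counts': word_counts,
        'total_words': {l: sum(word_counts[l].values()) for l in ('pos', 'neg')},
        'log_prior': {l: np.log(label_counts[l] / len(labels)) for l in ('pos', 'neg')},
    }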

Classifying new documents

Write a function that applies the Naive Bayes formula to compute the logarithm of the probability of observing the words in a document together with a sentiment polarity label. <SOMETHING> refers to what you returned in train_nb.
def score_doc_label(document, label, <SOMETHING>):
    ...
    (return the log probability)
Sanity check 1. Try applying score_doc_label to a few very short documents; to convert the log probability back into a probability, apply np.exp or math.exp. For instance, let's consider small documents of length 1. The probability of a positive document containing just the word "great" should be a small number; depending on your choice of smoothing parameter, it will probably be around 0.001-0.002. In any case, it should be higher than the probability of a negative document with the same word. Conversely, if you try the word "bad" instead, the negative score should be higher than the positive one.
Sanity check 2. Your function score_doc_label should not crash for the document ['a', 'top-quality', 'performance'].
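
Continuing the hypothetical model layout from the train_nb sketch above, score_doc_label could be written as follows. Because word_counts[label] is a Counter, a previously unseen word such as "top-quality" simply gets count 0 and contributes the smoothed probability alpha / (total + alpha * V), so the function does not crash (sanity check 2).

def score_doc_label(document, label, model):
    # log P(label) plus the sum of log P(word | label) over the document,
    # with add-alpha smoothing over the joint vocabulary.
    alpha = model['alpha']
    counts = model['word_counts'][label]
    denom = model['total_words'][label] + alpha * model['vocab_size']
    score = model['log_prior'][label]
    for w in document:
        score += np.log((counts[w] + alpha) / denom)
    return score

model = train_nb(train_docs, train_labels)
print(np.exp(score_doc_label(['great'], 'pos', model)))  # sanity check 1: a small number
print(np.exp(score_doc_label(['great'], 'neg', model)))  # should be smaller than the positive score
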
Next, based on the function you just wrote, write another function that classifies a new document.
def classify_nb(document, <SOMETHING>):
    ...
    (return the guess of the classifier)
Again, apply this function to a few very small documents and make sure that you get the output you'd expect.
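
Under the same assumptions, classify_nb reduces to picking the label with the higher score:

def classify_nb(document, model):
    # Return the polarity label with the highest log probability.
    return max(model['labels'],
               key=lambda label: score_doc_label(document, label, model))

print(classify_nb(['great', 'movie'], model))  # expect 'pos'
print(classify_nb(['bad', 'quality'], model))  # expect 'neg'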

Evaluating the classifier

Write a function that classifies each document in the test set and returns the list of predicted sentiment labels.
def classify_documents(docs, <SOMETHING>):
    ...
    (return the classifier's predictions for all documents in the collection)
Next, we compute the accuracy, i.e. the number of correctly classified documents divided by the total number of documents.
def accuracy(true_labels, guessed_labels):
    ...
    (return the accuracy)
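
Both functions are short; a possible sketch, again in terms of the model dictionary assumed above:

def classify_documents(docs, model):
    # Predict a sentiment label for every document in the collection.
    return [classify_nb(doc, model) for doc in docs]

def accuracy(true_labels, guessed_labels):
    # Fraction of documents whose guessed label matches the true label.
    correct = sum(t == g for t, g in zip(true_labels, guessed_labels))
    return correct / len(true_labels)

predictions = classify_documents(eval_docs, model)
print(accuracy(eval_labels, predictions))
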
What accuracy do you get when evaluating the classifier on the test set?

Error analysis

Find a few misclassified documents and comment on why you think they were hard to classify. For instance, you may select a few short documents where the probabilities were particularly high in the wrong direction.

Cross-validation

Since our estimate of the accuracy is based on a fairly small test set, the interval estimate was quite wide. We will now use a trick to get a more reliable estimate and a tighter interval.

In cross-validation, we divide the data into N parts (folds) of equal size. We then carry out N evaluations: each fold in turn becomes the test set, while the other folds form the training set. Finally, we combine the results of the N evaluations. This trick gives us predictions for the whole dataset, not just a small test set.
Here is a code stub that shows the idea:
for fold_nbr in range(N):
    split_point_1 = int(float(fold_nbr)/N*len(all_docs))
    split_point_2 = int(float(fold_nbr+1)/N*len(all_docs))

    train_docs_fold = all_docs[:split_point_1] + all_docs[split_point_2:]
    train_labels_fold = all_labels[:split_point_1] + all_labels[split_point_2:]
    eval_docs_fold = all_docs[split_point_1:split_point_2]
    ...
    (train a classifier on train_docs_fold and train_labels_fold)
    (apply the classifier to eval_docs_fold)
...
(combine the outputs of the classifiers in all folds)

Implement the cross-validation method. Then estimate the accuracy and compute a new interval estimate. A typical value of N would be between 4 and 10.
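
One way to complete the stub, assuming the train_nb and classify_documents sketches above: since every document lands in the evaluation fold exactly once, the per-fold predictions can simply be concatenated and compared against all_labels.

N = 10  # any value between 4 and 10 is typical
all_predictions = []
for fold_nbr in range(N):
    split_point_1 = int(float(fold_nbr) / N * len(all_docs))
    split_point_2 = int(float(fold_nbr + 1) / N * len(all_docs))
    # Everything outside the fold is training data; the fold itself is evaluated.
    train_docs_fold = all_docs[:split_point_1] + all_docs[split_point_2:]
    train_labels_fold = all_labels[:split_point_1] + all_labels[split_point_2:]
    eval_docs_fold = all_docs[split_point_1:split_point_2]
    fold_model = train_nb(train_docs_fold, train_labels_fold)
    all_predictions.extend(classify_documents(eval_docs_fold, fold_model))

print(accuracy(all_labels, all_predictions))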
