Machine Learning
Task 1 - Data Loading and data preparation
Dataset
We have given you a dataset of several thousand single-sentence reviews collected from three domains: imdb.com, amazon.com, yelp.com. Each review consists of a sentence and a binary label indicating the emotional sentiment of the sentence (1 for reviews expressing positive feelings; 0 for reviews expressing negative feelings). All the provided reviews in the training and test set were scraped from websites whose assumed audience is primarily English speakers, but they may of course contain slang, misspellings, some foreign characters, and many other properties that make working with natural language data challenging (and fun!).
Your goal is to develop a binary classifier that can correctly identify the sentiment of a new sentence.
Dataset acknowledgment
This dataset comes from research work by D. Kotzias, M. Denil, N. De Freitas, and P. Smyth described in the KDD 2015 paper 'From Group to Individual Labels using Deep Features'. We are grateful to these authors for making the dataset available.
Provided data
You are given the data in CSV file format, with 2,400 input/output pairs in the training set and 600 inputs in the test set.
Training set of 2400 examples
x_train.csv : input data
Column 1: 'website_name' : one of ['imdb', 'amazon', 'yelp']
Column 2: 'text' : string sentence which represents the raw review
y_train.csv : binary labels to predict
Column 1: 'is_positive_sentiment' : 1 = positive sentiment, 0 = negative
Test set of 600 examples
x_test.csv : input data
Suggested Way to Load Data into Python
We suggest loading the sentence data using the read_csv method in Pandas.
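For example, a minimal loading sketch (assuming the three CSV files sit in your working directory) might look like:

    import pandas as pd

    # Load the provided CSV files (assumed to be in the current working directory).
    x_train = pd.read_csv('x_train.csv')   # columns: 'website_name', 'text'
    y_train = pd.read_csv('y_train.csv')   # column: 'is_positive_sentiment'
    x_test = pd.read_csv('x_test.csv')     # columns: 'website_name', 'text'

    print(x_train.shape, y_train.shape, x_test.shape)  # expect (2400, 2) (2400, 1) (600, 2)
    print(x_train['text'].head())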
Pre-processing
We'll often refer to each review or sentence as a single "document". Our goal is to classify each document as positive or negative. We suggest that you remove all the punctuation and convert upper case to lower case for each example.
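One possible cleaning step, assuming the data has been loaded as above, is to lower-case each sentence and strip ASCII punctuation:

    import string
    import pandas as pd

    def clean_text(sentence: str) -> str:
        # Lower-case the sentence and remove ASCII punctuation (one possible cleaning choice).
        return sentence.lower().translate(str.maketrans('', '', string.punctuation))

    x_train = pd.read_csv('x_train.csv')
    # Store the cleaned sentences in a new column; the column name is our own choice.
    x_train['clean_text'] = x_train['text'].apply(clean_text)
    print(x_train[['text', 'clean_text']].head())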
Task 2 - Feature representation
There are many possible approaches to feature representation: transforming a natural language document (often represented as an ordered, variable-length list of words) into a feature vector x_n of a standard length. As a data scientist, you should be able to work with whatever feature representation suits your task. In this assignment you will learn about the Bag-of-Words feature representation, a popular one in Natural Language Processing (NLP), and use it to construct the input features for training your classifier.
Background on Bag-of-Words Representations
The "Bag-of-Words" (BoW) representation assumes a fixed, finite-size vocabulary of V possible words is known in advance, with a defined index order (e.g. the first word is "stegosaurus", the second word is "dinosaur", etc.).
Each document is represented as a count vector of length V, where entry at index v gives the number of times that the vocabulary word with index v appears in the document.
You have many design decisions to make when applying a BoW representation:
How big is your vocabulary?
Do you exclude rare words (e.g. those appearing in fewer than 10 documents)?
Do you exclude common words (like 'the' or 'a', or appearing in more than 50% of documents)?
Do you only use single words ("unigrams")? Or should you consider some bigrams (e.g. 'New York' or 'not bad')?
Do you keep the count values, or only store present/absent binary values?
Do you use smart reweighting techniques like term-frequency/inverse-document-frequency?
The key constraint with BoW representations is that each input feature must easily map to one human-readable unigram, bigram, or n-gram in a finite vocabulary.
You should feel free to take advantage of the many tools that sklearn provides related to BoW representations:
User Guide for Bag of Words tools in sklearn.feature_extraction.text
sklearn.feature_extraction.text.CountVectorizer
sklearn.feature_extraction.text.TfidfVectorizer
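As an illustration of how the design questions above map onto these tools, the sketch below builds a small vocabulary with CountVectorizer; the toy corpus and the parameter values are placeholders, not recommendations:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    corpus = [
        "the movie was not bad at all",
        "the battery died after two days",
        "great food and friendly service",
    ]

    # Each constructor argument corresponds to one of the design questions above;
    # the specific values here are illustrative only.
    vectorizer = CountVectorizer(
        lowercase=True,       # case standardisation
        ngram_range=(1, 2),   # unigrams and bigrams
        min_df=1,             # exclude words appearing in fewer documents than this
        max_df=0.5,           # exclude words appearing in more than 50% of documents
        binary=False,         # True would store presence/absence instead of counts
    )
    X = vectorizer.fit_transform(corpus)            # sparse count matrix, shape (n_docs, V)
    print(X.shape)
    print(vectorizer.get_feature_names_out()[:10])  # human-readable vocabulary entries (sklearn >= 1.0)

    # TfidfVectorizer accepts the same options and additionally applies tf-idf reweighting.
    X_tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=1).fit_transform(corpus)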
Task 3 - Classification and Evaluation
Now that you've finished Task 2 and obtained the BoW feature representation, you can train a classifier of your choice by carrying out the following steps. Please note that you must choose an appropriate classifier for this task based on your understanding of the lectures.
Splitting the training data into a training set and a validation set. You may need to try different split ratios to find the best choice.
You must describe the method you used to split the data, to ensure the reproducibility of your work, in the report required in Task 4.
Performing the following steps for the chosen classifier on your data splits:
Identifying the method in the "scikit-learn" package which implements the chosen model.
Selecting appropriate model parameters and using them to train the model via the method identified above. You must elaborate on and justify your approach to parameter selection in the report required in Task 4.
You should use best practices in hyperparameter selection to avoid overfitting and generalize well to new data. Within your hyperparameter selection, you should use cross-validation over multiple folds to assess the range of possible performance numbers that might be observed on new data. You should use at least 3-fold cross-validation to perform the hyperparameter search (a code sketch combining these steps appears after the metric list below).
Evaluating the performance of the model on the validation set and the test set, respectively, in terms of:
Confusion matrix
Classification accuracy
Precision
Recall
F1 score
and reporting them in the report required in Task 4.
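The sketch below pulls these steps together under one set of assumptions: logistic regression as the classifier (an illustrative choice only; the assignment leaves the model to you), an 80/20 train/validation split, a small placeholder hyperparameter grid, and accuracy as the search metric. Treat it as a template, not the required solution:

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.metrics import (confusion_matrix, accuracy_score,
                                 precision_score, recall_score, f1_score)

    texts = pd.read_csv('x_train.csv')['text']
    labels = pd.read_csv('y_train.csv')['is_positive_sentiment']

    # Hold out a validation set; the 80/20 ratio and random_state are arbitrary
    # choices that you should document for reproducibility.
    X_tr, X_val, y_tr, y_val = train_test_split(
        texts, labels, test_size=0.2, random_state=0, stratify=labels)

    # BoW feature extraction and classifier chained in one pipeline.
    pipeline = Pipeline([
        ('bow', CountVectorizer(lowercase=True)),
        ('clf', LogisticRegression(max_iter=1000)),
    ])

    # 3-fold cross-validated grid search over a small, illustrative hyperparameter grid.
    param_grid = {
        'bow__ngram_range': [(1, 1), (1, 2)],
        'bow__min_df': [1, 2, 5],
        'clf__C': [0.01, 0.1, 1.0, 10.0],
    }
    search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy')
    search.fit(X_tr, y_tr)
    print('best params:', search.best_params_)

    # Evaluate the refit best model on the held-out validation set.
    y_pred = search.predict(X_val)
    print('confusion matrix:\n', confusion_matrix(y_val, y_pred))
    print('accuracy :', accuracy_score(y_val, y_pred))
    print('precision:', precision_score(y_val, y_pred))
    print('recall   :', recall_score(y_val, y_pred))
    print('F1 score :', f1_score(y_val, y_pred))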
Task 4 - Report
In this task, you are asked to write a report to elaborate your analyses and findings in Tasks 1, 2 and 3. Your report should contain the following sections:
Bag-of-Words Design Decision Description
Well-written paragraph describing BoW, n-grams and your chosen BoW feature representation pipeline, with sufficient detail that another student in this class could reproduce it. You are encouraged to use just plain English prose, but you might include a brief, well-written pseudocode block if you think it is helpful.
You should describe and justify all major decisions, such as:
how did you "clean" and "standardize" the data? (punctuation, upper vs. lower case, etc)
how did you determine the final vocabulary set? did you exclude words, and if so how?
what was your final vocabulary size (or ballpark size(s), if size varies across folds because it depends on the training set)?
did you use unigrams or bigrams?
did you use counts or binary values or something else?
how did you handle out-of-vocabulary words in the test set?
Cross Validation and Hyperparameter Selection Design Description
Well-written paragraph describing how you will use cross-validation and hyperparameter selection to assess and refine the classifier pipeline you'll develop.
You should describe and justify all major decisions, such as:
What performance metric will your search try to optimize on heldout data?
What hyperparameter search strategy will you use?
What is the source of your heldout data for performance estimates? (how many folds? how big is each fold? how do you split the folds?).
Given a selected hyperparameter configuration created using CV by training models across several folds, how will you then build a "final" model to apply on the test set?
Hyperparameter Selection Figure for Classifier
Using your BoW preprocessing plus any classifier of your choice, your goal is to train a model that achieves the best performance on held-out data.
You should use at least 3 fold cross validation to perform a hyperparameter search. Your report should include a figure and paragraph summarizing this search.
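One way to produce such a figure, assuming a logistic regression classifier with its inverse regularisation strength C as the swept hyperparameter (both illustrative choices), is sketched below:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    texts = pd.read_csv('x_train.csv')['text']
    labels = pd.read_csv('y_train.csv')['is_positive_sentiment']
    X = CountVectorizer(lowercase=True).fit_transform(texts)

    # Sweep the hyperparameter and record 3-fold cross-validation accuracy for each value.
    C_values = np.logspace(-3, 3, 7)
    mean_scores, std_scores = [], []
    for C in C_values:
        scores = cross_val_score(LogisticRegression(C=C, max_iter=1000), X, labels,
                                 cv=3, scoring='accuracy')
        mean_scores.append(scores.mean())
        std_scores.append(scores.std())

    # Plot mean accuracy with error bars, labelled so the figure is self-explanatory.
    plt.errorbar(C_values, mean_scores, yerr=std_scores, marker='o')
    plt.xscale('log')
    plt.xlabel('C (inverse regularisation strength)')
    plt.ylabel('Mean 3-fold CV accuracy')
    plt.title('Hyperparameter search for the sentiment classifier')
    plt.savefig('hyperparameter_search.png', dpi=150, bbox_inches='tight')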
Analysis of Predictions for the Classifier
For evaluating your classifier's performance, use the confusion matrix, accuracy, precision, recall, and F1-score. In a table or figure, show some representative examples of false positives and false negatives for your chosen classifier. If possible, try to characterize what kinds of mistakes it is making.
To give specific examples, you could look at any of these questions:
does it do better on longer sentences or shorter sentences?
does it do better on a particular kind of review (e.g. amazon, imdb, or yelp)?
does it do better on sentences without negation words ("not", "didn't", "shouldn't", etc.)?
Do you notice anything about these sentences that you could use to improve performance?
Performance on Test Set
Apply your classifier to the test sentences in x_test.csv. In your report, include a summary paragraph stating your ultimate test-set performance (based on the metrics you used for the validation set), compare it to your earlier estimates of held-out performance from cross-validation, reflect on any differences, and discuss the potential reasons.
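A minimal sketch of this final step, assuming a CountVectorizer-plus-logistic-regression pipeline with placeholder hyperparameters standing in for whatever your cross-validation search selected:

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    texts = pd.read_csv('x_train.csv')['text']
    labels = pd.read_csv('y_train.csv')['is_positive_sentiment']
    x_test = pd.read_csv('x_test.csv')

    # Refit the chosen pipeline on the full training set; the hyperparameters below
    # are placeholders for the configuration selected by your search.
    final_model = Pipeline([
        ('bow', CountVectorizer(lowercase=True, ngram_range=(1, 2), min_df=2)),
        ('clf', LogisticRegression(C=1.0, max_iter=1000)),
    ])
    final_model.fit(texts, labels)

    # Predict sentiment for the 600 test sentences; the output filename is our own choice.
    test_predictions = final_model.predict(x_test['text'])
    pd.DataFrame({'is_positive_sentiment': test_predictions}).to_csv(
        'y_test_predictions.csv', index=False)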
Note: Any graph must contain appropriate titles, axis labels, etc. to make it self-explanatory. Graphs should be clear and legible.
The report must be saved in the PDF format and named "report.pdf" for submission.
It MUST be written in the single-column format with font size between 10 and 12 points and no more than 8 pages (including tables, graphs and/or references). Penalties will apply if the report does not satisfy these requirements. Moreover, the quality of the report will be considered when marking, e.g. organisation, clarity, and grammatical correctness.
Please remember to cite any sources which you've referred to when doing your work!