Assignment: Tweet Sentiment Analysis
Introduction
Sentiment analysis is an application of Natural Language Processing (a branch of Artificial Intelligence) that is concerned with detecting the sentiment of text. A common dimension for measuring sentiment uses the labels POSITIVE, NEGATIVE and NEUTRAL; there are many other possibilities as well (e.g. how strong the sentiment is, how active vs subdued it is, etc.). Figure 1 contains two sample tweets about the current series Falcon and the Winter Soldier, one positive and one negative. Social media is a particularly popular arena for deploying sentiment analysis: companies want to know how their products are being perceived, etc. Consequently, there are many organisations offering apps or services for building them; a screenshot from a demo of such an app is given in Figure 2.
The earliest and simplest techniques for carrying out sentiment analysis (although this type of approach is still in fact widely used) just carried out keyword matching in the text, based on words from a source of words that have known sentiment (a SENTIMENT LEXICON). Often, these lexicons don't have extensive coverage: there are many words with sentiment that aren't included in them, particularly in the case of social media text, where misspellings, abbreviations and slang are common. Consequently, there are other approaches to the task: there's a large class of machine learning techniques applied, as well as other techniques like LABEL PROPAGATION, where sentiment labels are propagated through a graph structure.
In this assignment, you'll work with a set of real tweets collected by researchers who developed one of the first approaches to sentiment analysis of tweets, and build your own tweet sentiment analyser. Early stages of the assignment just use a keyword-based approach, building up to a simple version of label propagation later.
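To make the keyword-based idea concrete, here is a minimal sketch in Python. It is an illustration only, not the method required by the tasks: the toy lexicon, the function name `keyword_polarity`, and the rule of labelling ties (and tweets with no known words) as neutral are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical toy lexicon; a real sentiment lexicon is far larger.
TOY_LEXICON = {
    "cool": "positive",
    "love": "positive",
    "awful": "negative",
    "hate": "negative",
}


def keyword_polarity(text, lexicon):
    """Label text by counting the lexicon words of each sentiment it contains."""
    # Lowercase and pull out runs of letters/apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(lexicon[w] for w in words if w in lexicon)
    if counts["positive"] > counts["negative"]:
        return "positive"
    if counts["negative"] > counts["positive"]:
        return "negative"
    # Assumption: ties and tweets with no known words count as neutral.
    return "neutral"


print(keyword_polarity("Lyx is cool", TOY_LEXICON))
```

The sketch also shows the coverage problem mentioned above: any misspelling or slang term missing from the lexicon simply contributes nothing to the count.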
Data
There are four sorts of data you'll be using.
Tweets. Early sentiment analysis work included the collection of a set of tweets, some for training a machine learning model for sentiment analysis, and some for evaluating how good that model is. We'll be using that same data; it includes the following information for each tweet:
» the GOLD POLARITY of the tweet (0 = negative, 2 = neutral, 4 = positive, = not given)
» the ID of the tweet (e.g. 2087)
» the DATE of the tweet (e.g. Sat May 16 23:58:44 UTC 2009)
» the QUERY (e.g. lyx)
» the USER that tweeted (e.g. robotickilldozr)
» the TEXT of the tweet (e.g. Lyx is cool)
We'll be ignoring the query. I've written code to read in the CSV file that the data is stored in. The starting sample data you'll be working with consists of 10 tweets, with details as in Figure 3.
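The reading code is already provided in the bundle, so the following Python sketch is only meant to show how the six fields fit together. It assumes the CSV columns appear in the order listed above and are quoted; the name `TweetRecord` is hypothetical (it is not the provided Tweet class), and the sample row is assembled from the example values above with a made-up gold polarity of 4.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class TweetRecord:  # hypothetical name, not the Tweet class from the bundle
    polarity: str  # kept as a string; the "not given" encoding is not assumed
    tweet_id: int
    date: str
    user: str
    text: str


def read_tweets(handle):
    """Read rows of polarity, ID, date, query, user, text (assumed column order)."""
    records = []
    for polarity, tweet_id, date, _query, user, text in csv.reader(handle):
        # The query column is read but deliberately discarded.
        records.append(TweetRecord(polarity, int(tweet_id), date, user, text))
    return records


# Illustrative row; the leading "4" is an assumed value, not taken from the data.
SAMPLE = (
    '"4","2087","Sat May 16 23:58:44 UTC 2009",'
    '"lyx","robotickilldozr","Lyx is cool"\n'
)
tweets = read_tweets(io.StringIO(SAMPLE))
print(tweets[0])
```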
Basic Sentiment Words. There's a widely used subjectivity and sentiment lexicon that I've extracted data from. Each line consists of a word, followed by the typical sentiment of that word without any additional context, indicated by the string positive or negative, e.g.
Finegrained Sentiment Words. The full lexicon from above also includes information about the strength of the sentiment: weaksubj indicates weak sentiment, and strongsubj indicates strong sentiment.
type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative
type=weaksubj len=1 word1=abandonment pos1=noun stemmed1=n priorpolarity=negative
type=weaksubj len=1 word1=abandon pos1=verb stemmed1=y priorpolarity=negative
type=strongsubj len=1 word1=abase pos1=verb stemmed1=y priorpolarity=negative
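Entries in this key=value format can be split apart quite directly. The Python sketch below is one possible way to do it; the function names are hypothetical, and keeping only the first entry seen for a word that is listed under several parts of speech is an assumption for the example, not a rule from the tasks.

```python
def parse_lexicon_entry(line):
    """Split one 'key=value key=value ...' entry into a dict."""
    # Fields without '=' are skipped rather than treated as errors.
    return dict(
        field.split("=", 1) for field in line.split() if "=" in field
    )


def load_finegrained(lines):
    """Map each word to a (strength, polarity) pair."""
    lexicon = {}
    for line in lines:
        if not line.strip():
            continue
        entry = parse_lexicon_entry(line)
        # Assumption: if a word occurs more than once, keep the first entry.
        lexicon.setdefault(
            entry["word1"], (entry["type"], entry["priorpolarity"])
        )
    return lexicon


LINES = [
    "type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative",
    "type=strongsubj len=1 word1=abase pos1=verb stemmed1=y priorpolarity=negative",
]
lexicon = load_finegrained(LINES)
print(lexicon["abase"])
```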
Inverse Index. The credit-level tasks and above will require constructing a graph, linking tweets that share words. I've constructed some inverse indexes that, for each word, give the IDs of tweets that contain that word.
sleep 1467814783,1467816665,1467818603
thanks 1467811594
falling 1467819022
go 1467810917,1467815924
. . .
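As a rough illustration of how an inverse index leads to a graph, the Python sketch below links every pair of tweets that appear under the same word. It assumes each line is a word, whitespace, then comma-separated IDs, as in the sample above; the function names and the adjacency-set representation are choices made for this example, not the structure required by the tasks.

```python
from collections import defaultdict
from itertools import combinations


def parse_inverse_index(lines):
    """Map each word to the list of tweet IDs that contain it."""
    index = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, ids = parts
        index[word] = [int(tweet_id) for tweet_id in ids.split(",")]
    return index


def build_graph(index):
    """Link every pair of tweets that share at least one word."""
    neighbours = defaultdict(set)
    for ids in index.values():
        for a, b in combinations(ids, 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours


LINES = [
    "sleep 1467814783,1467816665,1467818603",
    "thanks 1467811594",
    "falling 1467819022",
    "go 1467810917,1467815924",
]
graph = build_graph(parse_inverse_index(LINES))
print(sorted(graph[1467814783]))
```

In this representation, a tweet whose words are shared with no other tweet (such as the one listed under thanks) has no edges and so does not appear in the graph at all.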
Your Tasks
For your tasks, you'll be adding attributes and methods to existing classes given in the code bundle accompanying these specs. Where it's given, you should use exactly the method stub provided for implementing your tasks. Don't change the names or the parameters. You can add more functions if you like.
The two classes provided are Tweet and TweetCollection. The former represents an individual tweet, and the latter a collection of them.
Note that the Tweet class contains two enumerated types: Polarity represents the possible sentiment polarity values for a tweet (POSitive, NEGative, NEUTral or NONE); and Strength, for the strength of polarity (WEAK, STRONG), for the Distinction-level tasks.
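For readers who want to see the shape of these two types, here is a hedged Python analogue. It is not the code from the bundle: the member names POS, NEG and NEUT are inferred from the capitalisation above, and the helper `polarity_from_gold`, which maps the gold codes 0, 2 and 4 and treats anything else as NONE, is an assumption for illustration.

```python
from enum import Enum


class Polarity(Enum):
    # Member names inferred from "POSitive, NEGative, NEUTral or NONE".
    POS = "positive"
    NEG = "negative"
    NEUT = "neutral"
    NONE = "none"


class Strength(Enum):
    # Values reuse the lexicon's own strength strings.
    WEAK = "weaksubj"
    STRONG = "strongsubj"


# Gold polarity codes as listed in the data description.
GOLD_CODES = {"0": Polarity.NEG, "2": Polarity.NEUT, "4": Polarity.POS}


def polarity_from_gold(code):
    """Translate a gold polarity field into a Polarity value."""
    # Assumption: any unrecognised or empty code means "not given".
    return GOLD_CODES.get(code.strip(), Polarity.NONE)


print(polarity_from_gold("4"), Strength("strongsubj"))
```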
Attachment:- Tweet Sentiment Analysis.rar