Tweet Sentiment Analysis Assignment

Reference no: EM132897555

Assignment: Tweet Sentiment Analysis

1 Introduction

Sentiment analysis is an application of Natural Language Processing (a branch of Artificial Intelligence) that is concerned with detecting the sentiment of text. A common dimension for measuring sentiment uses labels POSITIVE, NEGATIVE and NEUTRAL; there are many other possibilities as well (e.g. how strong the sentiment is, how active vs subdued it is, etc). Figure 1 contains two sample tweets about the current series Falcon and the Winter Soldier, one positive and one negative. Social media is a particularly popular arena for deploying sentiment analysis: companies want to know how their products are being perceived, etc. Consequently, there are many organisations offering apps or services for building them; a screenshot from a demo of such an app is given in Figure 2.
The earliest and simplest techniques for carrying out sentiment analysis (although this type of approach is still in fact widely used) just carried out keyword matching in the text, based on words from a source of words that have known sentiment (a SENTIMENT LEXICON). Often, these lexicons don't have extensive coverage: there are many words with sentiment that aren't included in them, particularly in the case of social media text, where misspellings, abbreviations and slang are common. Consequently, there are other approaches to the task: there's a large class of machine learning techniques applied, as well as other techniques like LABEL PROPAGATION, where sentiment labels are propagated through a graph structure.
In this assignment, you'll work with a set of real tweets collected by researchers who developed one of the first approaches to sentiment analysis of tweets, and build your own tweet sentiment analyser. Early stages of the assignment just use a keyword-based approach, building up to a simple version of label propagation later.

Your Tasks

For your tasks, you'll be adding attributes and methods to existing classes given in the code bundle accompanying these specs. Where it's given, you should use exactly the method stub provided for implementing your tasks. Don't change the names or the parameters. You can add more functions if you like.

The two classes provided are Tweet and TweetCollection. The former represents an individual tweet, and the latter a collection of them.

Note that the Tweet class contains two enumerated types: Polarity represents the possible sentiment polarity values for a tweet (POSitive, NEGative, NEUTral or NONE); and Strength, for the strength of polarity (WEAK, STRONG), for the Distinction-level tasks.

Pass Level

To achieve at least a Pass (≥ 50%) for the assignment, you should do all of the following. You'll be basically implementing a simple keyword-based method for sentiment analysis of tweets, counting up the numbers of positive and negative words in a tweet to determine the PREDICTED POLARITY of the tweet. (This differs from the GOLD POLARITY, which is what has been decided as the true polarity of the tweet; you're going to try to see how well you can predict it based on the content of the tweet.)

T1 You will choose appropriate representations for the Tweet class. You may or may not choose to base it on other classes I've supplied (Vertex, VertexIDList). Material from weeks 9-11 of lectures will be particularly relevant in helping you decide.
You'll need to write a constructor based on your chosen representation that instantiates an empty tweet.
public Tweet(String p, String i, String d, String u, String t) {
// Constructor

// TODO
}

T2 You'll also need to do the same for the TweetCollection class. You might want to look ahead at the Credit-level tasks, which require the class to have some graph-like properties, to make the decision here. (Alternatively, you can just start with some underlying representation that will let you implement all of the Pass-level task functions, and then revise later.)
public TweetCollection() {
// Constructor

// TODO
}

Also write the following two functions. (Note that the code bundle also includes return null; statements in these functions. This is so that the class still compiles even when there are functions not yet implemented.)
public Tweet getTweetByID (String ID) {
// PRE: -
// POST: Returns the Tweet object with that tweet ID

// TODO
}

public Integer numTweets() {
// PRE: -
// POST: Returns the number of tweets in this collection

// TODO
}

T3 Write some getter functions for the properties of the tweet passed in via the constructor, implemented using your chosen representation for tweets.
public Polarity getGoldPolarity() {
// PRE: -
// POST: Returns the gold polarity of the tweet

// TODO
}

public String getID() {
// PRE: -
// POST: Returns ID of tweet

// TODO
}

public String getDate() {
// PRE: -
// POST: Returns date of tweet

// TODO
}

public String getUser() {
// PRE: -
// POST: Returns identity of tweeter

// TODO
}

public String getText() {
// PRE: -
// POST: Returns text of tweet as a single string

// TODO
}

Also write a getter and setter function for predicted polarity, which you will use when trying to predict the polarity of a tweet based on the content of its text.
public Polarity getPredictedPolarity() {
// PRE: -
// POST: Returns the predicted polarity of the tweet

// TODO
}

public void setPredictedPolarity(Polarity p) {
// PRE: -
// POST: Sets the predicted polarity of the tweet

// TODO
}

Note that I have provided the implementation of a function public String[] getWords(). This takes the text of a tweet and splits it into words ('tokenises' it) in a standard way, which is returned as an array of String. You will want to use this for other functions when you check whether a tweet contains a particular word.

T4 I've supplied most of the content of a function public void ingestTweetsFromFile(String fInName)

that will read in the content from a specified .csv file fInName, and for each line in that file it instantiates a new Tweet using the constructor (see footnote 10). You need to add code to insert that into whatever representation you have chosen for your collection of tweets from Task T2.

T5 Write a function in TweetCollection that will read in a file of basic sentiment words (i.e. words paired with their sentiment, as described in Sec 2), and store them in whatever representation you choose for sentiment words.
public void importBasicSentimentWordsFromFile (String fInName) throws IOException {
// PRE: -
// POST: Read in and store basic sentiment words in appropriate data type

// TODO
}

Also write a getter function. If w represents a word that does not have an associated sentiment, the function should return NONE.
public Polarity getBasicSentimentWordPolarity(String w) {
// PRE: w not null, basic sentiment words already read in from file
// POST: Returns polarity of w

// TODO

return null;
}
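The "return NONE for unknown words" requirement falls out naturally from a map lookup with a default value. The following is only a minimal sketch under assumptions, not the required representation: the class LexiconSketch, its put and polarityOf methods, and the local copy of the Polarity enum are all hypothetical names, and the word "good" is just an illustrative entry. File parsing is left out, since the file format is the one described in Sec 2.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; not part of the code bundle.
final class LexiconSketch {
    // Local copy of the polarity labels, mirroring the enum in Tweet.
    enum Polarity { POS, NEG, NEUT, NONE }

    private final Map<String, Polarity> words = new HashMap<>();

    // Stores (or overwrites) the polarity recorded for a word.
    void put(String word, Polarity p) {
        words.put(word, p);
    }

    // Returns the stored polarity, or NONE for a word with no entry.
    Polarity polarityOf(String word) {
        return words.getOrDefault(word, Polarity.NONE);
    }
}
```

With this sketch, a word that was never stored simply comes back as NONE, so callers do not need a separate containsKey check.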

Footnote 10: The csv reader requires the opencsv jarfile, which you'll have to include as a library in the Eclipse Java project, along with one of its dependencies. See the notes on the iLearn page with the code bundle for help with doing this.

T6 In TweetCollection, write the following function that will assign predicted sentiments based on the content of the tweet. To assign sentiment, use the following rule:
• If there are no positive or negative words in the tweet, assign predicted sentiment NONE.
• If there are more positive than negative words, assign predicted sentiment POS.
• If there are more negative than positive words, assign predicted sentiment NEG.
• Otherwise, assign NEUT.
public void predictTweetSentimentFromBasicWordlist () {
// PRE: Basic word sentiment already imported
// POST: For all tweets in collection, tweet annotated with predicted sentiment
// based on count of sentiment words in sentWords

// TODO
}

Consider the text of tweet 1467811594 (see Figure 3), with text "@LOLTrish hey long time no see! Yes.. Rains a bit ,only a bit LOL , I'm fine thanks , how's you ?". Words from the file basic-sent-words.txt are indicated as negative or positive. This tweet would then get the predicted polarity POS. (Note that the matching with words in the sentiment words file should be done using the tokenisation provided by getWords().)
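The four-way rule for T6 can be sketched for a single tokenised tweet as follows. This is an illustration under assumptions rather than the expected solution: KeywordRuleSketch, predict and the local Polarity enum are hypothetical names, the lexicon is passed in as a plain map, and the sketch counts every token occurrence (so a repeated word counts twice), which the rule above does not spell out.

```java
import java.util.Map;

// Hypothetical stand-alone sketch of the counting rule; not part of the code bundle.
final class KeywordRuleSketch {
    // Local copy of the polarity labels, mirroring the enum in Tweet.
    enum Polarity { POS, NEG, NEUT, NONE }

    // Applies the rule to the tokens of one tweet.
    static Polarity predict(String[] words, Map<String, Polarity> lexicon) {
        int pos = 0;
        int neg = 0;
        for (String w : words) {
            Polarity p = lexicon.getOrDefault(w, Polarity.NONE);
            if (p == Polarity.POS) {
                pos++;
            } else if (p == Polarity.NEG) {
                neg++;
            }
        }
        if (pos == 0 && neg == 0) return Polarity.NONE; // no sentiment words at all
        if (pos > neg) return Polarity.POS;
        if (neg > pos) return Polarity.NEG;
        return Polarity.NEUT; // equal, non-zero counts
    }
}
```

Note that NEUT is only reachable when the tweet contains at least one sentiment word and the two counts tie; a tweet with no sentiment words gets NONE, not NEUT.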

T7 Write a function that calculates the accuracy of your sentiment predictions for each tweet.
Accuracy is defined as follows:

• Count up the number of tweets for which the predicted polarity is the same as the gold polarity, as long as this is not NONE (NUMCORRECT).
• Count up the number of tweets for which a prediction is made, i.e. not NONE (NUMPREDICTED).
• Accuracy is the proportion NUMCORRECT / NUMPREDICTED. If NUMPREDICTED is 0, the function should return 0.
public Double accuracy () {
// PRE: -
// POST: Calculates and returns accuracy of labelling

// TODO
}

For the small sample of 10 tweets, and the sentiment words in basic-sent-words.txt, you should get an accuracy of 0.4. (See Fig 3.)
(Note: Don't include tweets in either numerator or denominator that have gold polarity NONE.)
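The accuracy definition, including the zero-denominator case and the exclusion of tweets whose gold polarity is NONE, can be sketched over two parallel arrays. The class AccuracySketch, the method signature and the array-based input are assumptions made for illustration; your own version will work over whatever collection representation you chose in T2.

```java
// Hypothetical stand-alone sketch; not part of the code bundle.
final class AccuracySketch {
    // Local copy of the polarity labels, mirroring the enum in Tweet.
    enum Polarity { POS, NEG, NEUT, NONE }

    // gold[i] and predicted[i] describe the same tweet.
    static double accuracy(Polarity[] gold, Polarity[] predicted) {
        int numCorrect = 0;
        int numPredicted = 0;
        for (int i = 0; i < gold.length; i++) {
            if (gold[i] == Polarity.NONE) continue;      // excluded from both counts
            if (predicted[i] == Polarity.NONE) continue; // no prediction was made
            numPredicted++;
            if (predicted[i] == gold[i]) numCorrect++;
        }
        // Cast before dividing so the result is not truncated by integer division.
        return numPredicted == 0 ? 0.0 : (double) numCorrect / numPredicted;
    }
}
```

The cast to double matters: dividing two int values directly would truncate every result below 1 down to 0.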

T8 Write a function that calculates the coverage of your sentiment predictions for each tweet.
Coverage is defined as follows:
• As in Task T7, count up the number of tweets for which a prediction is made, i.e. not NONE (NUMPREDICTED).
• Coverage is the proportion NUMPREDICTED / total number of tweets.
public Double coverage () {
// PRE: -
// POST: Calculates and returns coverage of labelling

// TODO
}

For the small sample of 10 tweets, and the sentiment words in basic-sent-words.txt, you should get a coverage of 0.5. (See Fig 3.)
(Note: Don't include tweets in either numerator or denominator that have gold polarity NONE.)
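Coverage can be sketched in the same style as accuracy. Again this is only an illustration: CoverageSketch and its array-based signature are hypothetical, and returning 0 when there are no countable tweets is an assumption, since the definition above does not cover that case.

```java
// Hypothetical stand-alone sketch; not part of the code bundle.
final class CoverageSketch {
    // Local copy of the polarity labels, mirroring the enum in Tweet.
    enum Polarity { POS, NEG, NEUT, NONE }

    // gold[i] and predicted[i] describe the same tweet.
    static double coverage(Polarity[] gold, Polarity[] predicted) {
        int total = 0;
        int numPredicted = 0;
        for (int i = 0; i < gold.length; i++) {
            if (gold[i] == Polarity.NONE) continue; // excluded from both counts
            total++;
            if (predicted[i] != Polarity.NONE) numPredicted++;
        }
        // Assumption: an empty denominator yields 0, by analogy with accuracy.
        return total == 0 ? 0.0 : (double) numPredicted / total;
    }
}
```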

Credit Level

To achieve at least a Credit (≥ 65%) for the assignment, you should do the following. You should also have completed all the Pass-level tasks.
In the approach to sentiment labelling from the Pass-level tasks, you'll notice that you ended up with quite a few unlabelled tweets, because several of them didn't contain words from the sentiment lexicon. As noted in Sec 1, another kind of technique is to propagate sentiment labels from one tweet to another (similar) one.
In the Credit-level tasks, essentially what you'll be doing is building a graph that links together 'similar' tweets (for some definition of similarity), and then identifying CONNECTED COMPONENTS in the graph with a view to propagating sentiment labels via the edges in the graph within those connected components.

T9 Implement functions in class Tweet for handling neighbours. You may need to augment your representation of tweets to do this.
public void addNeighbour(String ID) {
// PRE: -
// POST: Adds a neighbour to the current tweet as part of graph structure

// TODO
}

public Integer numNeighbours() {
// PRE: -
// POST: Returns the number of neighbours of this tweet

// TODO
}

public void deleteAllNeighbours() {
// PRE: -
// POST: Deletes all neighbours of this tweet

// TODO
}

public Vector<String> getNeighbourTweetIDs () {
// PRE: -
// POST: Returns IDs of neighbouring tweets as vector of strings

// TODO
}

public Boolean isNeighbour(String ID) {
// PRE: -
// POST: Returns true if ID is neighbour of the current tweet, false otherwise

// TODO
}
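One possible way to back the neighbour functions is a set of tweet IDs. The sketch below is an assumption-laden illustration, not the required design: NeighbourSketch is a hypothetical stand-in for the neighbour-related part of Tweet, it uses primitive return types for brevity, and it treats adding the same ID twice as a single neighbour, which the stubs above do not specify.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.Vector;

// Hypothetical stand-alone sketch; not part of the code bundle.
final class NeighbourSketch {
    // A LinkedHashSet ignores duplicate IDs and keeps insertion order.
    private final Set<String> neighbours = new LinkedHashSet<>();

    void addNeighbour(String id) {
        neighbours.add(id);
    }

    int numNeighbours() {
        return neighbours.size();
    }

    void deleteAllNeighbours() {
        neighbours.clear();
    }

    boolean isNeighbour(String id) {
        return neighbours.contains(id);
    }

    // Returns a copy, so callers cannot modify the internal set.
    Vector<String> getNeighbourTweetIDs() {
        return new Vector<>(neighbours);
    }
}
```

A set is convenient here because two tweets that share several words would otherwise be added as neighbours several times.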

T10 You'll be constructing a graph of tweets by adding an edge between two tweets if they share a word. For example, in the sample tweets of Fig 3, the tweets with IDs 1467811184 ("my whole body . . . ") and 1467811372 ("@Kwesidei not the whole crew") share the word whole, and so there will be an edge between the two.

I have constructed an inverse index (see Sec 2) that contains all relevant words from a set of tweets, and following that the list of tweets in which each word occurs (see footnote 11).
In TweetCollection, write a function that reads in the contents of this file, and returns this information as a map from strings (the words) to a vector of strings (a vector of the IDs of the tweets that contain the word).
public Map<String, Vector<String>> importInverseIndexFromFile (String fInName) throws IOException {
// PRE: -
// POST: Read in and returned contents of file as inverse index
// invIndex has words w as key, IDs of tweets that contain w as value

// TODO
}

T11 Now write the function that constructs that graph in TweetCollection.
public void constructSharedWordGraph(Map<String, Vector<String>> invIndex) {
// PRE: invIndex has words w as key, IDs of tweets that contain w as value
// POST: Graph constructed, with tweets as vertices,
// and edges between them if they share a word

// TODO
}

For the running example, the graph should look as in Fig 4.

After this function is run, queries to tweets about neighbours should return appropriate responses. For example, d.getTweetByID("1467810672").numNeighbours() should return 1.
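Building the shared-word graph from an inverse index amounts to linking every pair of tweet IDs that appear under the same word, while skipping IDs that are not in the collection. The sketch below assumes a separate adjacency map rather than neighbour lists stored on the tweets; SharedWordGraphSketch, build, the knownIds parameter and the IDs t1, t2, t3, t9 used to exercise it are all hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.Vector;

// Hypothetical stand-alone sketch; not part of the code bundle.
final class SharedWordGraphSketch {
    // Builds an undirected adjacency map over the tweet IDs in knownIds.
    static Map<String, Set<String>> build(Map<String, Vector<String>> invIndex,
                                          Set<String> knownIds) {
        Map<String, Set<String>> adj = new TreeMap<>();
        for (String id : knownIds) {
            adj.put(id, new TreeSet<>()); // every tweet is a vertex, even if isolated
        }
        for (Vector<String> ids : invIndex.values()) {
            // Visiting all ordered pairs (a, b) adds each edge in both directions.
            for (String a : ids) {
                if (!knownIds.contains(a)) continue; // ID not in this collection
                for (String b : ids) {
                    if (a.equals(b) || !knownIds.contains(b)) continue;
                    adj.get(a).add(b);
                }
            }
        }
        return adj;
    }
}
```

For example, an index entry mapping "whole" to the IDs t1, t2 and t9, with only t1, t2 and t3 in the collection, links t1 and t2, leaves t3 isolated, and ignores t9.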

Footnote 11: Note that the inverse index might refer to some tweets that are not actually in the graph. For the specific files used in the running example (i.e. training-10.csv, inv-index-50.txt) you'll find some tweet IDs in the inverse index that don't appear in the graph; the ones of use to you for this task are the ones that do appear in the graph. (inv-index-50.txt was in fact constructed from training-50.csv, of which training-10.csv is a subset.)

T12 As noted above, you'll be propagating sentiment labels across connected components. Write a function that, according to whatever graph representation you have chosen, annotates tweets as belonging to a particular connected component.
public void annotateConnectedComponents() {
// PRE: -
// POST: Annotates graph so that it is partitioned into components

// TODO
}

(Note: This won't be tested directly; it will just be tested indirectly via the functions below.)

T13 Write a function that, after components have been identified as in T12, counts the number of connected components.
public Integer numConnectedComponents() {
// PRE: Connected components have been annotated
// POST: Returns the number of connected components

// TODO
}

For the running example, the answer would be 7.
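One common way to annotate and count connected components is a breadth-first search started from each vertex that has not been labelled yet. The following sketch uses that technique under assumptions: it works on a stand-alone adjacency map rather than your own graph representation, and ComponentSketch, annotate and numComponents are hypothetical names. Depth-first search would serve equally well.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-alone sketch; not part of the code bundle.
final class ComponentSketch {
    // Labels every vertex with a component number 0, 1, 2, ... using BFS.
    static Map<String, Integer> annotate(Map<String, Set<String>> adj) {
        Map<String, Integer> component = new HashMap<>();
        int next = 0;
        for (String start : adj.keySet()) {
            if (component.containsKey(start)) continue; // already reached earlier
            Deque<String> queue = new ArrayDeque<>();
            queue.add(start);
            component.put(start, next);
            while (!queue.isEmpty()) {
                String v = queue.remove();
                for (String w : adj.getOrDefault(v, Set.of())) {
                    if (!component.containsKey(w)) {
                        component.put(w, next);
                        queue.add(w);
                    }
                }
            }
            next++;
        }
        return component;
    }

    // Counts the distinct component numbers that were assigned.
    static int numComponents(Map<String, Integer> component) {
        return new HashSet<>(component.values()).size();
    }
}
```

An isolated vertex forms a component of its own, so a graph with one edge a-b and a separate vertex c has two components.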

T14 Write a function that, after components have been identified as in T12, counts the number of times a particular sentiment label appears in a connected component, where the particular connected component is identified by a tweet ID contained in that component.
public Integer componentSentLabelCount(String ID, Polarity p) {
// PRE: Graph components are identified, ID is a valid tweet
// POST: Returns count of labels corresponding to Polarity p in component containing ID

// TODO
}

For the running example, componentSentLabelCount("1467811372", Polarity.POS), for instance, would give the value 1. There are two tweets in that component, with tweet IDs 1467811372 and 1467811184. You'll see from Figure 3 that the latter tweet has predicted polarity POS and there is no prediction (NONE) for the former tweet.
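Counting a label within a component can be sketched as a filter over the component annotation. The sketch assumes two plain maps (tweet ID to component number, and tweet ID to predicted polarity) in place of your own representation; ComponentCountSketch and count are hypothetical names, and a tweet with no entry in the predictions map is treated as NONE.

```java
import java.util.Map;

// Hypothetical stand-alone sketch; not part of the code bundle.
final class ComponentCountSketch {
    // Local copy of the polarity labels, mirroring the enum in Tweet.
    enum Polarity { POS, NEG, NEUT, NONE }

    // Counts tweets in the component containing id whose predicted label is p.
    static int count(String id, Polarity p,
                     Map<String, Integer> component,
                     Map<String, Polarity> predicted) {
        Integer target = component.get(id);
        int n = 0;
        for (Map.Entry<String, Integer> e : component.entrySet()) {
            boolean sameComponent = e.getValue().equals(target);
            Polarity label = predicted.getOrDefault(e.getKey(), Polarity.NONE);
            if (sameComponent && label == p) n++;
        }
        return n;
    }
}
```

With tweets 1467811372 (no prediction) and 1467811184 (POS) in one component, counting POS from either ID gives 1, in line with the running example.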

(High) Distinction Level

To achieve at least a Distinction (75 - 100%) for the assignment, you should do the following. You should also have completed all the Credit-level tasks.
The main goal for this level is to propagate sentiment labels via the edges in the graph defined above, and the majority labels in those connected components. Additionally, there will be a task on using a richer sentiment labelling scheme.

T15 Write a function to propagate a particular polarity p across a particular component. The component can be identified by the ID of any tweet in that component; and the function has a binary flag to indicate whether the tweet should only be labelled with polarity p if its existing label is NONE, or whether it should always be labelled with p regardless of its existing label.
public void propagateLabelAcrossComponent(String ID, Polarity p, Boolean keepPred) {
// PRE: ID is a tweet id in the graph
// POST: Labels tweets in component with predicted polarity p
// (if keepPred is true, only tweets with predicted polarity NONE; otherwise all tweets)

// TODO
}

For example, propagateLabelAcrossComponent("1467811184", Polarity.NEUT, Boolean.TRUE) would result in the tweet with ID 1467811184 keeping its predicted polarity of POS and the tweet with ID 1467811372 being labelled with polarity NEUT.
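The effect of the keepPred flag can be sketched as follows. As before, this is an illustration under assumptions: the component annotation and the predictions are held in two plain maps, PropagateSketch and propagate are hypothetical names, and a tweet with no entry in the predictions map is treated as NONE.

```java
import java.util.Map;

// Hypothetical stand-alone sketch; not part of the code bundle.
final class PropagateSketch {
    // Local copy of the polarity labels, mirroring the enum in Tweet.
    enum Polarity { POS, NEG, NEUT, NONE }

    // Labels tweets in the component containing id with polarity p.
    static void propagate(String id, Polarity p, boolean keepPred,
                          Map<String, Integer> component,
                          Map<String, Polarity> predicted) {
        Integer target = component.get(id);
        for (Map.Entry<String, Integer> e : component.entrySet()) {
            if (!e.getValue().equals(target)) continue; // different component
            Polarity current = predicted.getOrDefault(e.getKey(), Polarity.NONE);
            // With keepPred set, existing predictions are left untouched.
            if (!keepPred || current == Polarity.NONE) {
                predicted.put(e.getKey(), p);
            }
        }
    }
}
```

Calling the sketch with keepPred set to false instead overwrites every label in the component, including ones that were already predicted.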

T16 The rule for propagating sentiment across a connected component involves determining the majority sentiment of that component, as follows (analogous to the tweet-labelling rules of T6):
• If there are no positive or negative tweets in the component, majority sentiment is NONE.
• If there are more positive than negative tweets, majority sentiment is POS.
• If there are more negative than positive tweets, majority sentiment is NEG.
• Otherwise, NEUT.
Then propagate that across the component as in T15.
public void propagateMajorityLabelAcrossComponents(Boolean keepPred) {
// PRE: Components are identified
// POST: Tweets in each component are labelled with the majority sentiment for that component
// Majority label is defined as whichever of POS or NEG has the larger count;
// if POS and NEG are both zero, majority label is NONE
// otherwise, majority label is NEUT
// If keepPred is True, only tweets with predicted label None are labelled in this way
// otherwise, all tweets in the component are labelled in this way

// TODO
}

In the running example, for keepPred == True, the only tweet to gain a new predicted polarity is the one with ID 1467811372, which becomes POS.
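The majority rule itself is the T6 rule applied to tweet labels rather than words. A minimal sketch, assuming the predicted labels of one component have already been gathered into a collection (MajorityLabelSketch and majority are hypothetical names):

```java
import java.util.Collection;

// Hypothetical stand-alone sketch; not part of the code bundle.
final class MajorityLabelSketch {
    // Local copy of the polarity labels, mirroring the enum in Tweet.
    enum Polarity { POS, NEG, NEUT, NONE }

    // Returns the majority sentiment of one component's predicted labels.
    static Polarity majority(Collection<Polarity> labels) {
        int pos = 0;
        int neg = 0;
        for (Polarity p : labels) {
            if (p == Polarity.POS) {
                pos++;
            } else if (p == Polarity.NEG) {
                neg++;
            }
        }
        if (pos == 0 && neg == 0) return Polarity.NONE; // NEUT and NONE labels do not vote
        if (pos > neg) return Polarity.POS;
        if (neg > pos) return Polarity.NEG;
        return Polarity.NEUT; // equal, non-zero counts
    }
}
```

For a component whose labels are POS and NONE, the majority is POS, which is why the tweet with ID 1467811372 becomes POS in the running example. It is worth computing the majority for a component before changing any label in it, so that newly propagated labels do not influence the count.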

T17 There is a file available of finegrained sentiment (see Sec 2). Write functions, analogous to those of T5, as follows.
public void importFinegrainedSentimentWordsFromFile (String fInName) throws IOException {
// PRE: -
// POST: Read in and store finegrained sentiment words in appropriate data type

// TODO
}

public Polarity getFinegrainedSentimentWordPolarity(String w) {
// PRE: w not null, finegrained sentiment words already read in from file
// POST: Returns polarity of w

// TODO
}

public Strength getFinegrainedSentimentWordStrength(String w) {
// PRE: w not null, finegrained sentiment words already read in from file
// POST: Returns strength of w

// TODO
}

Note that in the file of finegrained sentiment, there may be multiple occurrences of individual words with different sentiment. (For example, fun is both negative and positive.) This can be because when words are used as different parts of speech (e.g. nouns, verbs) they have different sentiment. Since we're ignoring parts of speech, just use the last sentiment mentioned in the file for a particular word.
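The "last sentiment mentioned wins" behaviour comes for free if entries are stored in a map in file order, because a later put replaces the earlier value for the same key. The sketch below assumes the rows have already been parsed into word and polarity strings (the actual file format is the one described in Sec 2); LastEntryWinsSketch and load are hypothetical names, and the order of the two fun rows used to exercise it is illustrative only. Strength can be handled in the same way.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch; not part of the code bundle.
final class LastEntryWinsSketch {
    // Local copy of the polarity labels, mirroring the enum in Tweet.
    enum Polarity { POS, NEG, NEUT, NONE }

    // Each row is an already-parsed {word, polarity name} pair, in file order.
    static Map<String, Polarity> load(List<String[]> rows) {
        Map<String, Polarity> lexicon = new HashMap<>();
        for (String[] row : rows) {
            // put replaces any earlier entry for the same word.
            lexicon.put(row[0], Polarity.valueOf(row[1]));
        }
        return lexicon;
    }
}
```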

T18 There are two strengths of finegrained sentiment, STRONG and WEAK. Write a function that adapts the method from T6, but that assigns weights to negative and positive words depending on whether they are strong or weak.
public void predictTweetSentimentFromFinegrainedWordlist (Integer strongWeight,
Integer weakWeight) {
// PRE: Finegrained word sentiment already imported
// POST: For all tweets, tweet is annotated with predicted sentiment
// based on weighted count of sentiment words in sentWords

// TODO
}

In the running example, for the tweet with ID 1467810672 ("is upset that . . . "), and assigning weight 2 to strong sentiment and 1 to weak sentiment, the negative words would have weight 5 in total (upset weak, cry and blah strong), and the positive words 2 in total (might strong).
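The weighted totals in this example can be reproduced with a small sketch. It is an illustration under assumptions: WeightedRuleSketch, Entry, totals and predict are hypothetical names, the local enums mirror Polarity and Strength from Tweet, and the lexicon contains only the four words named above rather than the contents of the real finegrained file.

```java
import java.util.Map;

// Hypothetical stand-alone sketch; not part of the code bundle.
final class WeightedRuleSketch {
    enum Polarity { POS, NEG, NEUT, NONE }
    enum Strength { WEAK, STRONG }

    // Polarity and strength recorded for one lexicon word.
    static final class Entry {
        final Polarity polarity;
        final Strength strength;

        Entry(Polarity polarity, Strength strength) {
            this.polarity = polarity;
            this.strength = strength;
        }
    }

    // Returns {positive total, negative total} for one tokenised tweet.
    static int[] totals(String[] words, Map<String, Entry> lexicon,
                        int strongWeight, int weakWeight) {
        int pos = 0;
        int neg = 0;
        for (String w : words) {
            Entry e = lexicon.get(w);
            if (e == null) continue; // not a sentiment word
            int weight = (e.strength == Strength.STRONG) ? strongWeight : weakWeight;
            if (e.polarity == Polarity.POS) {
                pos += weight;
            } else if (e.polarity == Polarity.NEG) {
                neg += weight;
            }
        }
        return new int[] {pos, neg};
    }

    // Applies the T6 rule to the weighted totals instead of raw counts.
    static Polarity predict(String[] words, Map<String, Entry> lexicon,
                            int strongWeight, int weakWeight) {
        int[] t = totals(words, lexicon, strongWeight, weakWeight);
        if (t[0] == 0 && t[1] == 0) return Polarity.NONE;
        if (t[0] > t[1]) return Polarity.POS;
        if (t[1] > t[0]) return Polarity.NEG;
        return Polarity.NEUT;
    }
}
```

With weights 2 and 1 and the four words above, the sketch gives a positive total of 2 and a negative total of 5, so applying the T6 rule to the weighted totals yields NEG.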

Bonus

This section is not worth any marks: you can get 100% by completing the above functions correctly. This is only an additional task for anyone interested.
T19 Write your own function to assign predicted sentiment to tweets. The goal is to produce an assignment that has high accuracy and high coverage. (Obviously, it should not use the tweet's gold sentiment. The marking of this function will use data that does not make available the gold sentiment.)
public void myTweetSentimentPredictor () {
// PRE: -
// POST: All tweets are annotated with sentiment

// TODO
}

4 What To Hand In
In the submission page on iLearn for this assignment you must include the following:
Submit a zip file consisting of all the Java classes (i.e. the .java files) in the package from the original assignment code bundle.

Instructions that you should follow on how to create the zipfile are available in iLearn: you'll find them with all the assignment 2 material.
Your file must leave unchanged the specification of already implemented functions, and include your implementations of your selection of method stubs outlined above.
Do not change the names of the method stubs because the auto-tester assumes the names given. Do not change the package statement. You may however include additional auxiliary methods if you need them.
Please note that we are unable to check individual submissions and so it is very important to abide by the above submission instructions.

5 Changelog
• 6/5/21: Assignment released.
• 11/5/21:
- Added edge case for Task T7.
- Corrected Tasks T15 and T16 to be consistent with code bundle.
- Corrected which tweet was POS and which NONE in Task T14.
- Made the final task a bonus one rather than one counting for marks.
• 17/5/21: Added footnote with more detail on inverse indexes.
