Reference no: EM132389326
1. Preprocessing
1a: Write a function preprocess(text, stop_words) which performs these steps:
- word tokenization (with NLTK)
- remove punctuation
- remove "stop words" from the text; the stop words are given as a set and should be matched case-insensitively.
To get all points, use one or more list comprehensions to achieve the filtering.
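A minimal sketch of 1a. NLTK's word_tokenize (after downloading the 'punkt' data) would normally do the tokenization; a regex stands in here only so the example runs without the NLTK data files. A single list comprehension performs both the punctuation and stop-word filtering:

```python
import re

def preprocess(text, stop_words):
    """Tokenize, drop punctuation tokens, drop stop words (case-insensitive)."""
    stop_lower = {w.lower() for w in stop_words}
    # Stand-in for nltk.tokenize.word_tokenize: words or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # One list comprehension filters out pure-punctuation tokens and stop words.
    return [t for t in tokens
            if any(c.isalnum() for c in t) and t.lower() not in stop_lower]
```

Example: `preprocess("The cat sat, happily!", {"the"})` keeps only the content words.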
When using Topic Models, it is common to chop long texts into chunks with a fixed number of tokens. For example, we might want to chop a novel into chunks of exactly 1000 tokens (regardless of where sentences or chapters end).
1b: Write a function chunker(tokens, n) that takes a list of tokens and returns a list of chunks of n tokens each. Hint: the extra optional arguments to range() might come in handy.
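A sketch of 1b using range()'s optional step argument, as the hint suggests:

```python
def chunker(tokens, n):
    """Split a token list into consecutive chunks of n tokens each.

    The final, shorter chunk is kept here; drop it instead if the
    assignment requires every chunk to have exactly n tokens."""
    # range(start, stop, step): each i is the start index of one chunk.
    return [tokens[i:i + n] for i in range(0, len(tokens), n)]
```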
1c: Integrate the previous two functions into a function chunk_text(input_filename, output_filename, stop_words, chunk_length) that:
- reads a file
- calls the preprocess function on the result
- calls the chunker function on that result
- writes those chunks to a file, with one chunk per line.
2. Movie scripts
The year is 1994. You are the agent of Tom Cruise. Tom is offered the part of "Ethan" in the movie Mission Impossible. However, Tom only wants the part if his role is so big that he has more than 2.5x as many lines as the second-most prominent character. Unfortunately, you don't have the time to read the script (it is in the attached file mi.txt), so you will have to write a program that counts the number of lines each character has. Fortunately, movie scripts are formatted in a very particular way, with a certain number of spaces for each type of line. Recall the exercise in week 3 about Romeo & Juliet which solved a related problem.
2a: Your mission, should you choose to accept it, is to plot the top 20 characters with the most lines of dialogue in the script in a bar plot. The plot should have names on the y-axis and the number of lines for each character on the x-axis. Note that "ETHAN (CONT'D)" is not a name, the part in parentheses should be stripped off; for simplicity, count it as a separate line of dialogue even though it indicates a continuation of a previous line.
2b: Now show a similar bar plot for the sequel, Mission Impossible 2: mi2.txt. Make sure your code is re-usable so that you don't have to repeat a lot of code to do this.
2c: Adapt your function so that it produces a Series with the lines of dialogue; the index should have the name of the character that's speaking. Show the first 5 lines in the script.
3. Tweets
We will look at tweets related to a crisis to analyze how different parties communicate about crises. In particular, we will look at the 2013 NY train crash. The data is in the directory 2013_NY_train_crash.
3a: Load the file called '2013_NY_train_crash-tweets_labeled.csv' into a dataframe called tweets. You may want to rename the columns, because the column names include leading spaces, which is error-prone.
This file does not contain timestamps. For that, load the other file '2013_NY_train_crash-tweetids_entire_period.csv'. Pass the option parse_dates=['Timestamp'] to properly load the timestamps as times instead of strings. This file contains duplicate rows which cause problems. Find a Pandas method to drop the duplicate rows.
Now take the column with the "Timestamp" and add it to the tweets DataFrame as a new column. Note that the timestamps are in the UTC timezone, not local NY time (EST timezone).
You should now have a DataFrame with tweets, timestamps and three other columns with manually annotated labels about each tweet. Show the first 5 rows in the dataframe.
3b: we are interested in knowing how different parties report about victims. For example: who is quicker to report on a disaster, the media or outsiders? Does the former react to the latter, or vice versa?
Select all tweets about "Affected individuals". For the resulting tweets we are interested in contrasting those that are close in time to the disaster (before 16:00 UTC) with tweets that were sent later. Add a column 'later' indicating whether the tweet was sent after '2013-12-01 16:00:00'. Note that you can compare the Timestamp column to a string with this time to achieve this: tweets['Timestamp'] > '2013-12-01 16:00:00'
Create a contingency table and a bar plot showing the number of tweets depending on the source and whether the tweet is 'later' or not.
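A sketch of 3b. The label column names ('Category', 'Source') are assumptions to be matched against the actual annotation columns; pd.crosstab builds the contingency table, and its .plot.bar() draws the grouped bar plot:

```python
import pandas as pd

def victims_table(tweets):
    """Contingency table of Source vs. 'later' for 'Affected individuals' tweets.

    Column names 'Category' and 'Source' are assumptions; adjust to the data."""
    affected = tweets[tweets["Category"] == "Affected individuals"].copy()
    # Comparing a datetime column to a time string works directly in pandas.
    affected["later"] = affected["Timestamp"] > "2013-12-01 16:00:00"
    return pd.crosstab(affected["Source"], affected["later"])

# victims_table(tweets).plot.bar() then shows the counts per source and period.
```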
3c: think of a simple hypothesis to explore on this dataset and show the results.
Attachment:- Assignment Files.rar