Reference no: EM132389326
1. Preprocessing
1a: Write a function preprocess(text, stop_words) which performs these steps:
- word tokenization (with NLTK)
- remove punctuation
- remove "stop words" from the text; the stop words are given as a set and should be matched case-insensitively.
To get all points, use one or more list comprehensions to achieve the filtering.
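A minimal sketch of 1a. NLTK's word_tokenize (after downloading the 'punkt' data) would normally do the tokenization; a regex stands in here only so the example runs without the NLTK data files. A single list comprehension performs both the punctuation and stop-word filtering:

```python
import re

def preprocess(text, stop_words):
    """Tokenize, drop punctuation tokens, drop stop words (case-insensitive)."""
    stop_lower = {w.lower() for w in stop_words}
    # Stand-in for nltk.tokenize.word_tokenize: words or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # One list comprehension filters out pure-punctuation tokens and stop words.
    return [t for t in tokens
            if any(c.isalnum() for c in t) and t.lower() not in stop_lower]
```

Example: `preprocess("The cat sat, happily!", {"the"})` keeps only the content words.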
When using Topic Models, it is common to chop long texts into chunks with a fixed number of tokens. For example, we might want to chop a novel into chunks of exactly 1000 tokens (regardless of where sentences or chapters end).
1b: Write a function chunker(tokens, n) that takes a list of tokens and returns a list of chunks of n tokens each. Hint: the extra optional arguments to range() might come in handy.
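A sketch of 1b using range()'s optional step argument, as the hint suggests:

```python
def chunker(tokens, n):
    """Split a token list into consecutive chunks of n tokens each.

    The final, shorter chunk is kept here; drop it instead if the
    assignment requires every chunk to have exactly n tokens."""
    # range(start, stop, step): each i is the start index of one chunk.
    return [tokens[i:i + n] for i in range(0, len(tokens), n)]
```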
1c: Integrate the previous two functions into a function chunk_text(input_filename, output_filename, stop_words, chunk_length) that:
- reads a file
- calls the preprocess function on the result
- calls the chunker function on that result
- writes those chunks to a file, with one chunk per line.
2. Movie scripts
The year is 1994. You are the agent of Tom Cruise. Tom is offered the part of "Ethan" in the movie Mission Impossible. However, Tom only wants the part if his role is so big that he has more than 2.5x as many lines as the second-most prominent character. Unfortunately, you don't have the time to read the script (it is in the attached file mi.txt), so you will have to write a program that counts the number of lines each character has. Fortunately, movie scripts are formatted in a very particular way, with a certain number of spaces for each type of line. Recall the exercise in week 3 about Romeo & Juliet which solved a related problem.
2a: Your mission, should you choose to accept it, is to plot the top 20 characters with the most lines of dialogue in the script in a bar plot. The plot should have names on the y-axis and the number of lines for each character on the x-axis. Note that "ETHAN (CONT'D)" is not a name, the part in parentheses should be stripped off; for simplicity, count it as a separate line of dialogue even though it indicates a continuation of a previous line.
2b: Now show a similar bar plot for the sequel, Mission Impossible 2: mi2.txt. Make sure your code is re-usable so that you don't have to repeat a lot of code to do this.
2c: Adapt your function so that it produces a Series with the lines of dialogue; the index should have the name of the character that's speaking. Show the first 5 lines in the script.
3. Tweets
We will look at tweets related to a crisis to analyze how different parties communicate about crises. In particular, we will look at the 2013 NY train crash. The data is in the directory 2013_NY_train_crash.
3a: Load the file called '2013_NY_train_crash-tweets_labeled.csv' into a dataframe called tweets. You may want to rename the columns, because the column names include leading spaces, which is error-prone.
This file does not contain timestamps. For that, load the other file '2013_NY_train_crash-tweetids_entire_period.csv'. Pass the option parse_dates=['Timestamp'] to properly load the timestamps as times instead of strings. This file contains duplicate rows which cause problems. Find a Pandas method to drop the duplicate rows.
Now take the column with the "Timestamp" and add it to the tweets DataFrame as a new column. Note that the timestamps are in the UTC timezone, not local NY time (EST timezone).
You should now have a DataFrame with tweets, timestamps and three other columns with manually annotated labels about each tweet. Show the first 5 rows in the dataframe.
3b: we are interested in knowing how different parties report about victims. For example: who is quicker to report on a disaster, the media or outsiders? Does the former react to the latter, or vice versa?
Select all tweets about "Affected individuals". For the resulting tweets we are interested in contrasting those that are close in time to the disaster (before 16:00 UTC) with tweets that were sent later. Add a column 'later' indicating whether the tweet was sent after '2013-12-01 16:00:00'. Note that you can compare the Timestamp column to a string with this time to achieve this: tweets['Timestamp'] > '2013-12-01 16:00:00'
Create a contingency table and a bar plot showing the number of tweets depending on the source and whether the tweet is 'later' or not.
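A sketch of 3b. The label column names ('Category', 'Source') are assumptions to be matched against the actual annotation columns; pd.crosstab builds the contingency table, and its .plot.bar() draws the grouped bar plot:

```python
import pandas as pd

def victims_table(tweets):
    """Contingency table of Source vs. 'later' for 'Affected individuals' tweets.

    Column names 'Category' and 'Source' are assumptions; adjust to the data."""
    affected = tweets[tweets["Category"] == "Affected individuals"].copy()
    # Comparing a datetime column to a time string works directly in pandas.
    affected["later"] = affected["Timestamp"] > "2013-12-01 16:00:00"
    return pd.crosstab(affected["Source"], affected["later"])

# victims_table(tweets).plot.bar() then shows the counts per source and period.
```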
3c: think of a simple hypothesis to explore on this dataset and show the results.
Attachment:- Assignment Files.rar