Create a contingency table and a bar plot

Reference no: EM132389326

1. Preprocessing

1a: Write a function preprocess(text, stop_words) which performs these steps:

  • word tokenization (with NLTK)
  • remove punctuation
  • remove "stop words" from the text; the stop words are given as a set; the words should be matched case insensitively.

To get all points, use one or more list comprehensions to achieve the filtering.
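The steps of 1a could be sketched as follows. This is only one possible reading, not a reference solution: it assumes NLTK's regex-based wordpunct_tokenize (chosen here because it needs no downloaded model; word_tokenize would be the more usual choice), and it falls back to an equivalent regular expression when NLTK is not installed. Treating a token as punctuation when it contains no letter or digit is also an assumption.

```python
import re

try:
    # wordpunct_tokenize is regex-based and needs no downloaded NLTK data.
    from nltk.tokenize import wordpunct_tokenize
except ImportError:
    # Assumed fallback with an equivalent pattern, so the sketch runs without NLTK.
    def wordpunct_tokenize(text):
        return re.findall(r"\w+|[^\w\s]+", text)


def preprocess(text, stop_words):
    """Tokenize text, drop punctuation tokens, and drop stop words."""
    lowered_stops = {word.lower() for word in stop_words}
    tokens = wordpunct_tokenize(text)
    # Keep only tokens that contain at least one letter or digit.
    tokens = [tok for tok in tokens if any(ch.isalnum() for ch in tok)]
    # Compare in lower case so that "The" matches the stop word "the".
    return [tok for tok in tokens if tok.lower() not in lowered_stops]


example_tokens = preprocess("The cat, and the Dog!", {"the", "and"})
```

Both filters are list comprehensions, and the original casing of the surviving tokens is kept; only the comparison is case insensitive.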

When using Topic Models, it is common to chop long texts into chunks with a fixed number of tokens. For example, we might want to chop a novel into chunks of exactly 1000 tokens (regardless of where sentences or chapters end).

1b: Write a function chunker(tokens, n) that takes a list of tokens (such as the output of preprocess) and returns a list of chunks of n tokens each. Hint: the extra optional arguments to range() might come in handy.
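A minimal sketch of the hint about range(): its third argument is a step, so range(0, len(tokens), n) yields the start index of every chunk. The sketch assumes that a shorter final chunk is kept when the number of tokens is not a multiple of n; the assignment does not say whether it should be dropped.

```python
def chunker(tokens, n):
    """Split a list of tokens into consecutive chunks of n tokens each."""
    # range(start, stop, step) gives the index at which each chunk begins.
    return [tokens[i:i + n] for i in range(0, len(tokens), n)]


example_chunks = chunker(["a", "b", "c", "d", "e", "f", "g"], 3)
```

Here the seven tokens end up in two full chunks of three and one trailing chunk with the single leftover token.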

1c: Integrate the previous two functions into a function chunk_text(input_filename, output_filename, stop_words, chunk_length) that:

  • reads a file
  • calls the preprocess function on the result
  • calls the chunker function on that result
  • writes those chunks to a file, with one chunk per line.

2. Movie scripts

The year is 1994. You are the agent of Tom Cruise. Tom is offered the part of "Ethan" in the movie Mission Impossible. However, Tom only wants the part if his role is so big that he has more than 2.5x as many lines as the second-most prominent character. Unfortunately, you don't have the time to read the script (it is in the attached file mi.txt), so you will have to write a program that counts the number of lines each character has. Fortunately, movie scripts are formatted in a very particular way, with a certain number of spaces for each type of line. Recall the exercise in week 3 about Romeo & Juliet which solved a related problem.

2a: Your mission, should you choose to accept it, is to plot the top 20 characters with the most lines of dialogue in the script in a bar plot. The plot should have names on the y-axis and the number of lines for each character on the x-axis. Note that "ETHAN (CONT'D)" is not a name, the part in parentheses should be stripped off; for simplicity, count it as a separate line of dialogue even though it indicates a continuation of a previous line.
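The counting step could be sketched like this. The function name count_character_lines and the parameter name_indent are hypothetical, and the indentation of 20 spaces in the toy input is made up for illustration; the real number of leading spaces has to be read off mi.txt.

```python
import re
from collections import Counter


def count_character_lines(lines, name_indent):
    """Count how often each character name introduces a line of dialogue."""
    counts = Counter()
    for line in lines:
        stripped = line.strip()
        indent = len(line) - len(line.lstrip(" "))
        # Assumption: character names sit at one fixed indentation level.
        if stripped and indent == name_indent:
            # Drop a trailing parenthetical such as "(CONT'D)".
            name = re.sub(r"\s*\(.*?\)\s*$", "", stripped)
            if name:
                counts[name] += 1
    return counts


# Toy input, not taken from the actual script.
sample = [
    " " * 20 + "ETHAN\n",
    " " * 10 + "Hello.\n",
    "\n",
    " " * 20 + "ETHAN (CONT'D)\n",
    " " * 10 + "Still here.\n",
    " " * 20 + "CLAIRE\n",
    " " * 10 + "Hi.\n",
]
counts = count_character_lines(sample, name_indent=20)
top = counts.most_common(20)
```

Because the function takes any iterable of lines, an open file object can be passed in directly, and most_common(20) returns the (name, count) pairs needed for the bar plot.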

2b: Now show a similar bar plot for the sequel, Mission Impossible 2: mi2.txt. Make sure your code is re-usable so that you don't have to repeat a lot of code to do this.

2c: Adapt your function so that it produces a Series with the lines of dialogue; the index should have the name of the character that's speaking. Show the first 5 lines in the script.

3. Tweets

We will look at tweets related to a crisis to analyze how different parties communicate about crises. In particular, we will look at the 2013 NY train crash. The data is in the directory 2013_NY_train_crash.

3a: Load the file called '2013_NY_train_crash-tweets_labeled.csv' into a dataframe called tweets. You may want to rename the columns, because the column names include leading spaces, which is error-prone.

This file does not contain timestamps. For those, load the other file, '2013_NY_train_crash-tweetids_entire_period.csv'. Pass the option parse_dates=['Timestamp'] to properly load the timestamps as times instead of strings. This file contains duplicate rows, which cause problems. Find a Pandas method to drop the duplicate rows.

Now take the column with the "Timestamp" and add it to the tweets DataFrame as a new column. Note that the timestamps are in the UTC timezone, not local NY time (EST timezone).

You should now have a DataFrame with tweets, timestamps and three other columns with manually annotated labels about each tweet. Show the first 5 rows in the dataframe.

3b: We are interested in knowing how different parties report about victims. For example: who is quicker to report on a disaster, the media or outsiders? Does the former react to the latter, or vice versa?

Select all tweets about "Affected individuals". For the resulting tweets we are interested in contrasting those that are close in time to the disaster (before 16:00 UTC) with tweets which are sent later. Add a column 'later' indicating whether the tweet was sent after '2013-12-01 16:00:00'. Note that you can compare the Timestamp column to a string with this time to achieve this: tweets['Timestamp'] > '2013-12-01 16:00:00'

Create a contingency table and a bar plot showing the number of tweets depending on the source and whether the tweet is 'later' or not.
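A small sketch of the 'later' column and the contingency table, using pandas.crosstab. The rows below are toy data made up for illustration, and the column name 'Source' and its values are assumptions; the real names come from the labeled CSV file.

```python
import pandas as pd

# Toy data for illustration only; 'Source' and its values are assumed names.
tweets = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2013-12-01 14:30:00",
        "2013-12-01 15:45:00",
        "2013-12-01 17:10:00",
        "2013-12-02 09:00:00",
    ]),
    "Source": ["Media", "Outsiders", "Media", "Media"],
})

# Comparing a datetime column with a string parses the string as a timestamp.
tweets["later"] = tweets["Timestamp"] > "2013-12-01 16:00:00"

# Rows are sources, columns are the two values of 'later' (False, True).
table = pd.crosstab(tweets["Source"], tweets["later"])
```

With matplotlib installed, table.plot.bar() would then draw one group of bars per source, split by 'later'.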

3c: Think of a simple hypothesis to explore on this dataset and show the results.

Attachment:- Assignment Files.rar
