Create topics along with the probability distribution

Reference no: EM132917315

Final Assignment

Part 1:

Step 1: Read the Tripadvisor hotel reviews dataset

Step 2: Create a diagram of the variable "Score" to see whether the majority of customer ratings are positive or negative.
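
One possible way to draw the diagram is a bar chart of score counts. This is only a sketch: the small data frame below is a made-up stand-in for the real reviews file, and the output file name score_distribution.png is an assumption, not part of the assignment.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the real dataset; the "Score" column name follows the assignment text.
df = pd.DataFrame({"Score": [5, 4, 5, 1, 3, 5, 2, 4]})

# Count how many reviews received each score, ordered from lowest to highest.
score_counts = df["Score"].value_counts().sort_index()

# Draw one bar per score value and save the figure to disk.
fig, ax = plt.subplots()
ax.bar(score_counts.index.astype(str), score_counts.values)
ax.set_xlabel("Score")
ax.set_ylabel("Number of reviews")
ax.set_title("Distribution of customer scores")
fig.savefig("score_distribution.png")  # hypothetical file name
plt.close(fig)
```

With the real data, the toy frame would be replaced by the frame read in Step 1.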

Step 3: Create wordclouds to see the most frequently used words in the reviews and save them.
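
A wordcloud is a picture of word frequencies, so the step can be sketched as counting words and then rendering the counts. The stopword list, the sample reviews, and the helper names word_frequencies and save_wordcloud are illustrative assumptions. Rendering relies on the third-party wordcloud package, which is imported only inside the helper.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real run would use a fuller one.
DEFAULT_STOPWORDS = frozenset({"the", "a", "and", "was", "is", "to", "of"})

def word_frequencies(texts, stopwords=DEFAULT_STOPWORDS):
    """Count lower-cased word occurrences across all texts, skipping stopwords."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts

def save_wordcloud(frequencies, path):
    """Render the frequencies as an image (needs the third-party wordcloud package)."""
    from wordcloud import WordCloud  # imported lazily so counting works without it
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(path)

# Made-up reviews standing in for the "Review" column.
reviews = ["The room was clean and the staff was friendly", "Clean room, great location"]
freqs = word_frequencies(reviews)
# save_wordcloud(freqs, "wordcloud_all.png")  # hypothetical file name; needs wordcloud installed
```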

Step 4: Do Sentiment analysis with VADER
• Apply the model to the dataset
• Label reviews with compound > 0 as positive sentiment and compound < 0 as negative sentiment, and remove reviews with compound = 0
• Export the labelled reviews to CSV files (Part 2 uses positive.csv)
• Now that the reviews are classified as positive or negative, build a wordcloud for each class
• Plot the distribution of sentiment across the dataset and save the diagram
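
The labelling rule in Step 4 can be sketched as follows. The compound scores in the toy frame are invented so that the rule can be shown without the VADER lexicon installed. The helper add_compound shows one way the scores might be computed with NLTK's SentimentIntensityAnalyzer, which needs the vader_lexicon resource. The helper names and the file name negative.csv are assumptions; positive.csv is the name used in Part 2.

```python
import pandas as pd

def label_from_compound(compound):
    """Map a VADER compound score to a label; None marks neutral rows to drop."""
    if compound > 0:
        return "positive"
    if compound < 0:
        return "negative"
    return None

def add_compound(df, text_col="Review"):
    """Add a 'compound' column using NLTK's VADER (needs nltk and vader_lexicon)."""
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    sia = SentimentIntensityAnalyzer()
    out = df.copy()
    out["compound"] = out[text_col].apply(lambda t: sia.polarity_scores(t)["compound"])
    return out

# Toy frame with invented compound scores, standing in for add_compound(real_df).
scored = pd.DataFrame({
    "Review": ["great stay", "dirty room", "it was a hotel"],
    "compound": [0.62, -0.44, 0.0],
})

# Label each review, then drop the neutral ones (compound = 0).
scored["sentiment"] = scored["compound"].apply(label_from_compound)
labelled = scored.dropna(subset=["sentiment"]).reset_index(drop=True)

# Export one CSV file per class.
positive = labelled[labelled["sentiment"] == "positive"]
negative = labelled[labelled["sentiment"] == "negative"]
positive.to_csv("positive.csv", index=False)
negative.to_csv("negative.csv", index=False)  # hypothetical file name
```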

Step 5: Building the classification model
Build the sentiment analysis model! This model will take reviews in as input.
It will then come up with a prediction on whether the review is positive or negative.
This is a classification task, so you will train a simple logistic regression model to do it.

Step 6: Split the Dataframe
The new data frame should only have two columns - "Review", and "sentiment" (the target variable).

Training the sentiment analysis model
80% of the data will be used for training, and 20% will be used for testing.
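
A minimal sketch of the column selection and the 80/20 split, assuming scikit-learn's train_test_split. The toy frame, the fixed random_state, and the stratified sampling are illustrative choices that the assignment does not prescribe.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the labelled reviews from Step 4.
df = pd.DataFrame({
    "Review": [f"review text {i}" for i in range(10)],
    "compound": [0.5, -0.3] * 5,
    "sentiment": ["positive", "negative"] * 5,
})

# Keep only the text and the target variable.
model_df = df[["Review", "sentiment"]]

# Hold out 20% of the rows for testing; stratify keeps the class balance in both parts.
train_df, test_df = train_test_split(
    model_df, test_size=0.2, random_state=42, stratify=model_df["sentiment"]
)
```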

Step 7: Create a bag of words
Use a count vectorizer from the Scikit-learn library.
Convert the text into a bag-of-words model since the logistic regression algorithm cannot understand text.

Step 8: Logistic Regression
Split target and independent variables
Fit model on data
Make predictions

Step 9: Test the accuracy of your model
Find accuracy, precision, recall
Create the classification report
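
Steps 8 and 9 might look like the sketch below. The tiny hand-written corpus, the max_iter value, and treating "positive" as the positive class for precision and recall are all assumptions. With the real data, the inputs would be the training and test splits from Step 6.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    precision_score,
    recall_score,
)

# Made-up training and test data: independent variable (text) and target (label).
train_reviews = [
    "great hotel lovely staff", "wonderful stay great room", "lovely clean room",
    "terrible hotel rude staff", "awful stay dirty room", "dirty noisy room",
]
train_labels = ["positive"] * 3 + ["negative"] * 3
test_reviews = ["great lovely stay", "awful dirty hotel"]
test_labels = ["positive", "negative"]

# Bag-of-words features, fitted on the training text only.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_reviews)
X_test = vectorizer.transform(test_reviews)

# Fit the classifier and predict labels for the held-out reviews.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, train_labels)
predictions = model.predict(X_test)

# Evaluate the predictions against the true labels.
accuracy = accuracy_score(test_labels, predictions)
precision = precision_score(test_labels, predictions, pos_label="positive")
recall = recall_score(test_labels, predictions, pos_label="positive")
report = classification_report(test_labels, predictions)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
print(report)
```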

Part 2: Topic Modelling

LDA
Step 1: Import the positive.csv dataset you have created in Part 1
Step 2: Applying LDA on the "Review" column
Step 3: Define number of topics as 5
Step 4: Create topics along with the probability distribution for each word in our vocabulary for each topic.
Step 5: Print the 10 words with the highest probabilities for each of the five topics
Step 6: Add a column to the original data frame that will store the topic for the reviews.
Step 7: Save the new dataset as: reviews_topic(lda).csv

Non-Negative Matrix Factorization (NMF)
Step 1: Import the positive.csv dataset you have created in Part 1
Step 2: Apply Non-Negative Matrix Factorization (NMF) on the dataset
Step 3: Define number of topics as 5
Step 4: Create topics along with the probability distribution for each word in our vocabulary for each topic.
Step 5: Print the 10 words with the highest probabilities for each of the five topics
Step 6: Add a column to the original data frame that will store the topic for the reviews.
Step 7: Save the new dataset as: reviews_topic(nmf).csv

Attachment:- Reviews Assignment.rar
