Derive insights by applying big data distributed processing


Reference no: EM132323737

Data Warehousing and Big Data Assignment - Big Data Analytics

The assignment focuses on Big Data analytics on unstructured text data using Microsoft Azure. You are required to derive insights by applying big data distributed processing and machine learning techniques.

Dataset - TripAdvisor reviews

The dataset contains approximately 10,000 reviews of hotels in Vietnam. The fields are:

Field name - Description

PostId - Unique ID for each review.

Subject - The heading of the review.

Rating - The rating given by the user, ranging from 1 to 5.

Hotel - Name of the hotel.

Hotel-Star - Star rating of the hotel.

Review - Textual description of the review. The review text has been pre-processed by removing non-alphanumeric characters and truncating it to a maximum of 1000 characters.

Sentiment - Sentiment value extracted for the 'review' field using the Text Analytics API in Microsoft Cognitive Services. The sentiment score ranges from 0 to 1, 0 being negative and 1 being positive. (You may verify the sentiment score against the subject and review fields.)

Sentiment Category - Sentiment class generated based on the sentiment value.

> 0.9: Very positive

> 0.75 and <= 0.9: Positive

> 0.5 and <= 0.75: Neutral

<= 0.5: Negative
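The banding above can be sketched in a few lines of Python. This is only an illustration, assuming the bands are meant to read as > 0.9, > 0.75 up to 0.9, > 0.5 up to 0.75, and <= 0.5; the function name `sentiment_category` is hypothetical and not part of the assignment files.

```python
def sentiment_category(score: float) -> str:
    """Map a sentiment score in [0, 1] to its category band.

    Assumed bands: > 0.9 Very positive, > 0.75 Positive,
    > 0.5 Neutral, otherwise Negative.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"sentiment score out of range: {score}")
    if score > 0.9:
        return "Very positive"
    if score > 0.75:
        return "Positive"
    if score > 0.5:
        return "Neutral"
    return "Negative"


# Boundary values fall into the lower band because the upper bounds are inclusive.
for value in (0.95, 0.9, 0.75, 0.5):
    print(value, sentiment_category(value))
```

A check like this can help when verifying the Sentiment Category column against the Sentiment column in the dataset.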

What you are required to do -

1. HDInsight to aggregate reviews

Develop an aggregate of these reviews using your knowledge of Hadoop and MapReduce in Microsoft HDInsight.

a) Follow the same approach as the Big Data analytics workshop (using the wordcount method in HDInsight) to determine the contributory words for each level of rating and sentiment category.

b) Present the workflow of using HDInsight (you may use screen captures) along with a summary of findings for each level of rating and sentiment category.
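As a rough local illustration of the wordcount idea (not the HDInsight workflow from the workshop), the map and reduce steps can be simulated in plain Python. The stopword list, the function names, and the three sample rows are made up for this sketch; only the field names Rating, Sentiment Category and Review come from the dataset description.

```python
import re
from collections import Counter, defaultdict

# Illustrative stopword list only; a real run would need a fuller one.
STOPWORDS = {"the", "a", "an", "and", "was", "is", "to", "of", "in", "it"}


def map_review(group_key, review_text):
    """Map step: emit ((group, word), 1) for every non-stopword token."""
    for word in re.findall(r"[a-z0-9]+", review_text.lower()):
        if word not in STOPWORDS:
            yield (group_key, word), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each (group, word) key."""
    totals = defaultdict(Counter)
    for (group, word), count in pairs:
        totals[group][word] += count
    return totals


def top_words_by_group(rows, group_field, n=3):
    """Run map then reduce and keep the n most frequent words per group."""
    pairs = (
        pair
        for row in rows
        for pair in map_review(row[group_field], row["Review"])
    )
    totals = reduce_counts(pairs)
    return {group: counter.most_common(n) for group, counter in totals.items()}


# Made-up sample rows, not taken from the TripAdvisor dataset.
rows = [
    {"Rating": 5, "Sentiment Category": "Very positive",
     "Review": "great staff and great pool"},
    {"Rating": 5, "Sentiment Category": "Very positive",
     "Review": "great location friendly staff"},
    {"Rating": 1, "Sentiment Category": "Negative",
     "Review": "dirty room and rude staff"},
]

print(top_words_by_group(rows, "Rating", n=2))
print(top_words_by_group(rows, "Sentiment Category", n=2))
```

Grouping by a different field only requires changing `group_field`, which mirrors running the wordcount job once per rating level and once per sentiment category.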

2. Azure Machine Learning for sentiment analysis

Use Azure ML Studio to analyse customer reviews based on sentiment score. Use the 'review' field for text clustering. In the Filter Based Feature Selection module, use the 'sentiment' field so that reviews are clustered based on sentiment score. Download the cluster outputs into a CSV file to interpret the results and derive insights. You will need to calibrate the algorithmic parameters by trying different values for Number of Centroids and Distance Metric to derive meaningful clusters. Do not select sentiment, rating or postid as columns to train the clustering model; use only the preprocessed hashing features.
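To make the two preprocessing ideas concrete, here is a small stand-alone Python sketch of feature hashing followed by a correlation-based filter against the sentiment score. It does not reproduce the Azure ML Studio modules: the MD5 bucket hashing, the Pearson scoring, the function names, and the sample reviews are all assumptions chosen for illustration.

```python
import hashlib
import math
import re


def hash_features(text, n_bits=4):
    """Hash each token into one of 2**n_bits buckets and count hits per bucket."""
    vector = [0] * (2 ** n_bits)
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % len(vector)
        vector[bucket] += 1
    return vector


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(vectors, target, keep):
    """Return indices of the `keep` columns most correlated with the target."""
    n_cols = len(vectors[0])
    scores = [abs(pearson([v[j] for v in vectors], target)) for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])


# Made-up reviews and sentiment scores, not taken from the dataset.
reviews = ["great pool friendly staff", "dirty room rude staff",
           "great location clean room", "noisy dirty bathroom"]
sentiments = [0.95, 0.10, 0.90, 0.05]

vectors = [hash_features(text) for text in reviews]
chosen = select_features(vectors, sentiments, keep=4)
# Sentiment guides the selection but is not itself a training column.
reduced = [[v[j] for j in chosen] for v in vectors]
print(chosen, reduced)
```

Note that the sentiment score is used only to rank the hashed columns; the reduced vectors passed on to clustering contain hashed features alone, in line with the instruction to exclude sentiment, rating and postid from training.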

Provide the following:

a) A screen capture of the completed model diagram.

b) Details of parameters used for 1) feature hashing module, 2) filter based feature selection module and 3) k-means clustering module.

c) Details of the approach you chose for clustering and interpretation of clusters.
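The effect of the two parameters named in the task, the number of centroids and the distance metric, can be explored with a small pure-Python k-means. This is a minimal sketch of Lloyd's algorithm under stated assumptions, not the Azure ML Studio K-Means module: the function names, the random initialisation, and the toy points are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; treated as 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def kmeans(points, k, distance=euclidean, iterations=20, seed=0):
    """Lloyd's algorithm; returns (labels, centroids).

    Centroids are plain means, which is only an approximation when the
    cosine metric is used.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy 2-D points standing in for hashed review features.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
for k in (2, 3):
    for metric in (euclidean, cosine_distance):
        labels, _ = kmeans(points, k, distance=metric)
        print(k, metric.__name__, labels)
```

Comparing the label lists across settings is a small-scale version of the calibration the task asks for: the same points can fall into quite different groups depending on the metric and the number of centroids.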

3. Findings

Summarise your findings from 1) and 2), on user rating, hotel rating and sentiment towards accommodation options in Vietnam. Consider the challenges you faced in conducting Big Data analytics on a real-life text dataset.

Deliverables -

1. A report on the three activities.

The report should be compiled in Microsoft Word only, font size 11.
Report should not exceed 10 pages. Diagrams, tables and any other visualisations/ screen captures should be in the main body of the report.
Make realistic assumptions on missing information and state these in the report.

2. A compressed folder of any other files that would be useful to assess your work.

Attachment:- Assignment Files.rar



BUS5WB - Data Warehousing and Big Data Assignment, La Trobe University, Australia.



Assignment Type: Individual.

IMPORTANT: Student accounts have limited Azure credits. You must create and decommission (delete) the HDInsight clusters each time you attempt the assignment. If you are planning to work on the assignment across multiple days, remember to delete the clusters after each session and recreate them for the next.


Marking Criteria -

HDInsight to aggregate reviews (5 marks) - A complete attempt at deriving insights using HDInsight.

Azure Machine Learning for sentiment analysis (10 marks) - A complete attempt at clustering and cluster analysis.

Findings (10 marks) - A comprehensive effort, accounting for all findings and further analysis.
