Derive insights by applying big data distributed processing

Reference no: EM132323737

Data Warehousing and Big Data Assignment - Big Data Analytics

The assignment focuses on Big Data analytics on unstructured text data using Microsoft Azure. You are required to derive insights by applying big data distributed processing and machine learning techniques.

Dataset - TripAdvisor reviews

The dataset contains approximately 10,000 reviews of hotels in Vietnam. The fields are:

Field name - Description

PostId - Unique ID for each review.

Subject - The heading of the review.

Rating - The rating given by the user, ranging from 1 to 5.

Hotel - Name of the hotel.

Hotel-Star - Star rating of the hotel.

Review - Textual description of the review. The review text has been pre-processed by removing non-alphanumeric characters and truncating it to a maximum of 1000 characters.

Sentiment - Sentiment value extracted for the 'review' field using the Text Analytics API in Microsoft Cognitive Services. The sentiment score ranges from 0 to 1, with 0 being negative and 1 being positive. (You may verify the sentiment score against the subject and review fields.)

Sentiment Category - Sentiment class generated based on the sentiment value.

> 0.9: Very positive

> 0.75 and <= 0.9: Positive

> 0.5 and <= 0.75: Neutral

<= 0.5: Negative
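
The banding can be checked against the dataset with a small helper. The following is a minimal sketch, assuming the bands are read as: above 0.9 is very positive, above 0.75 up to 0.9 is positive, above 0.5 up to 0.75 is neutral, and 0.5 or below is negative. The function name `sentiment_category` is hypothetical and not part of the assignment files.

```python
def sentiment_category(score: float) -> str:
    """Map a sentiment score in [0, 1] to a sentiment category.

    Assumed bands (upper bounds inclusive):
    > 0.9 very positive, (0.75, 0.9] positive,
    (0.5, 0.75] neutral, <= 0.5 negative.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"sentiment score out of range: {score}")
    if score > 0.9:
        return "Very positive"
    if score > 0.75:
        return "Positive"
    if score > 0.5:
        return "Neutral"
    return "Negative"


# Boundary values fall into the lower band because upper bounds are inclusive.
for value in (0.95, 0.9, 0.75, 0.5):
    print(value, sentiment_category(value))
```

Comparing the output of such a helper with the 'Sentiment Category' column is one way to confirm how the boundary values were handled in the supplied data.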

What you are required to do -

1. HDInsight to aggregate reviews

Develop an aggregate of these reviews using your knowledge of Hadoop and MapReduce in Microsoft HDInsight.

a) Follow the same approach as the Big Data analytics workshop (using the wordcount method in HDInsight) to determine the contributory words for each level of rating and sentiment category.

b) Present the workflow of using HDInsight (you may use screen captures) along with a summary of findings for each level of rating and sentiment category.
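
The word count logic behind part a) can be prototyped locally before running it on a cluster. The sketch below is not the workshop's HDInsight job; it is an in-process imitation of the map, shuffle and reduce phases, keyed by rating or sentiment category. The stop-word list, the function names and the three sample rows are illustrative assumptions, not data from the TripAdvisor file.

```python
import re
from collections import Counter, defaultdict
from itertools import groupby
from operator import itemgetter

# Illustrative stop words only; a real run would use a fuller list.
STOPWORDS = {"the", "a", "and", "was", "is", "to", "of", "in", "it", "we"}


def mapper(record, key_field):
    """Map phase: emit ((group, word), 1) for every kept word in a review."""
    group = record[key_field]
    for word in re.findall(r"[a-z0-9]+", record["Review"].lower()):
        if word not in STOPWORDS:
            yield (group, word), 1


def reducer(key, counts):
    """Reduce phase: sum the counts emitted for one (group, word) key."""
    return key, sum(counts)


def word_counts(records, key_field):
    """Run map, shuffle (sort by key) and reduce in a single process."""
    pairs = sorted(
        (pair for rec in records for pair in mapper(rec, key_field)),
        key=itemgetter(0),
    )
    result = defaultdict(Counter)
    for key, grouped in groupby(pairs, key=itemgetter(0)):
        (group, word), total = reducer(key, (n for _, n in grouped))
        result[group][word] = total
    return result


# Made-up toy rows that only mimic the dataset's field names.
reviews = [
    {"Rating": 5, "Sentiment Category": "Very positive",
     "Review": "Great staff and great breakfast"},
    {"Rating": 5, "Sentiment Category": "Positive",
     "Review": "Clean room friendly staff"},
    {"Rating": 1, "Sentiment Category": "Negative",
     "Review": "Dirty room and rude staff"},
]

by_rating = word_counts(reviews, "Rating")
by_category = word_counts(reviews, "Sentiment Category")
for rating, counter in sorted(by_rating.items()):
    print(rating, counter.most_common(3))
```

Calling `word_counts` once with "Rating" and once with "Sentiment Category" gives the two groupings the task asks for; on HDInsight the same split would typically be achieved by filtering or keying the input per group before the word count job.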

2. Azure Machine Learning for sentiment analysis

Use Azure ML Studio to analyse customer reviews based on sentiment score. Use the 'review' field for text clustering. In the Filter Based Feature Selection module, use the 'sentiment' field so that reviews are clustered based on sentiment score. Download the cluster outputs as a CSV file to interpret the results and derive insights. You will need to calibrate the algorithm parameters by trying different values for Number of Centroids and Distance Metric to derive meaningful clusters. Exclude sentiment, rating and postid from the columns selected to train the clustering model; use only the preprocessed hashing features.

Provide the following,

a) A screen capture of the completed model diagram.

b) Details of parameters used for 1) the feature hashing module, 2) the filter based feature selection module and 3) the k-means clustering module.

c) Details of the approach you chose for clustering and interpretation of clusters.
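
To reason about the three modules before building the experiment, the pipeline can be imitated in plain Python. This is a rough sketch under stated assumptions, not the Azure ML Studio implementation: CRC32 stands in for the module's hash function, Pearson correlation is assumed as the feature scoring method, k-means is seeded with the first k rows, and all names and sample reviews are made up for illustration.

```python
import math
import re
import zlib


def hash_features(text, bits=4):
    """Hash word counts into a fixed-length vector of 2**bits slots."""
    vec = [0.0] * (1 << bits)
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(word.encode("utf-8")) % len(vec)] += 1.0
    return vec


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def select_features(vectors, target, keep):
    """Indices of the `keep` columns most correlated with the target."""
    columns = list(zip(*vectors))
    ranked = sorted(range(len(columns)),
                    key=lambda j: abs(pearson(columns[j], target)),
                    reverse=True)
    return sorted(ranked[:keep])


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, distance=euclidean, iterations=20):
    """Lloyd's algorithm, seeded with the first k vectors."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: distance(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids


# Made-up reviews and scores, used only to exercise the pipeline.
reviews = ["clean room friendly staff", "dirty room rude staff",
           "great breakfast friendly staff", "rude staff dirty bathroom"]
sentiment = [0.92, 0.10, 0.88, 0.15]

hashed = [hash_features(r, bits=4) for r in reviews]
kept = select_features(hashed, sentiment, keep=4)
# Sentiment only guides feature selection; it is not a training column.
reduced = [[v[j] for j in kept] for v in hashed]
labels, _ = kmeans(reduced, k=2)
print(kept, labels)
```

Note that the sentiment score is used only to rank the hashed features and is left out of the vectors passed to `kmeans`, which mirrors the instruction to train on the hashing features alone. Varying `k` and swapping the `distance` function correspond to the Number of Centroids and Distance Metric settings.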

3. Findings

Summarise your findings from 1) and 2), on user rating, hotel rating and sentiment towards accommodation options in Vietnam. Consider the challenges you faced in conducting Big Data analytics on a real-life text dataset.
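
One simple way to relate user rating, hotel star rating and sentiment is to average the sentiment score within each group. The helper below is a minimal sketch; the function name and the sample rows are hypothetical and only reuse the dataset's field names.

```python
from collections import defaultdict
from statistics import mean


def mean_sentiment_by(records, field):
    """Average the Sentiment score for each distinct value of `field`."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[field]].append(rec["Sentiment"])
    return {key: round(mean(vals), 3) for key, vals in sorted(groups.items())}


# Made-up rows for illustration only.
rows = [
    {"Rating": 5, "Hotel-Star": 4, "Sentiment": 0.92},
    {"Rating": 5, "Hotel-Star": 3, "Sentiment": 0.88},
    {"Rating": 1, "Hotel-Star": 3, "Sentiment": 0.10},
]

print(mean_sentiment_by(rows, "Rating"))
print(mean_sentiment_by(rows, "Hotel-Star"))
```

Groups where the average sentiment disagrees with the rating are good candidates for closer reading in the findings.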

Deliverables -

1. A report on the three activities.

  • The report should be compiled in Microsoft Word only, font size 11.
  • Report should not exceed 10 pages. Diagrams, tables and any other visualisations/ screen captures should be in the main body of the report.
  • Make realistic assumptions on missing information and state these in the report.

2. A compressed folder of any other files that would be useful to assess your work.


BUS5WB - Data Warehousing and Big Data Assignment, La Trobe University, Australia.


Assignment Type: Individual.

IMPORTANT: Student accounts have limited Azure credits. You must create and decommission (delete) the HDInsight clusters each time you attempt the assignment. If you plan to work on the assignment across multiple days, remember to delete the cluster at the end of each session and recreate it at the start of the next.


Marking Criteria -

  • HDInsight to aggregate reviews (5 marks): a complete attempt at deriving insights using HDInsight.
  • Azure Machine Learning for sentiment analysis (10 marks): a complete attempt at clustering and cluster analysis.
  • Findings (10 marks): a comprehensive effort, accounting for all findings and further analysis.
