Data warehousing and big data assignment

Reference no: EM133164790

BUS5WB Data Warehousing and Big Data Assignment

The third assignment focuses on Big Data analytics on unstructured text data using Microsoft Azure. You are required to derive insights by applying big data distributed processing and machine learning techniques.

Dataset 1 - Amazon Reviews

The dataset contains approximately 10,000 reviews of Amazon products. The fields are:

What you are required to do

1 HDInsight to Analyse Reviews
Develop an aggregate of these reviews using your knowledge of Hadoop and MapReduce in Microsoft HDInsight.

a) Follow the same approach as the Big Data Analytics Workshop (using the wordcount method in HDInsight) to determine the contributory words for each level of rating.
b) Present the workflow of using HDInsight (you may use screen captures) along with a summary of findings and any insights for each level of rating. MapReduce documentation for HDInsight is available here.

You may either create your own Hadoop Cluster or make use of the one provided to run your analysis. The details of the cluster will be provided on the LMS under the section for Assignment 3.
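
As a rough illustration of the word-count approach in part (a), the sketch below mimics the map and reduce steps in plain Python, keyed by rating. It is not the workshop's HDInsight job: the field names (`rating`, review text), the stop-word list, and the sample reviews are all assumptions made up for illustration, since the dataset fields are not listed here.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately short stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "is", "it", "this", "to", "of", "i", "was", "for"}

def map_review(rating, text):
    """Map step: emit ((rating, word), 1) for every non-stop word in a review."""
    for word in re.findall(r"[a-z']+", text.lower()):
        if word not in STOP_WORDS:
            yield (rating, word), 1

def reduce_counts(pairs):
    """Reduce step: sum the counts for each (rating, word) key."""
    totals = defaultdict(Counter)
    for (rating, word), count in pairs:
        totals[rating][word] += count
    return totals

def top_words(reviews, n=3):
    """Return the n most frequent words for each rating level."""
    pairs = (pair for rating, text in reviews for pair in map_review(rating, text))
    totals = reduce_counts(pairs)
    return {rating: counter.most_common(n) for rating, counter in totals.items()}

# Made-up reviews for illustration only (not taken from the assignment dataset).
sample = [
    (5, "Great tablet, great price"),
    (5, "Great battery"),
    (1, "Broken screen, broken charger"),
]
result = top_words(sample, n=1)
print(result)
```

Keying the map output on the (rating, word) pair rather than on the word alone is what lets a single pass produce a separate word ranking for each rating level.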

2 Azure Databricks for Big Data Processing
Use the period of data allocated to you (a single year) from the New York City Taxi & Limousine Commission dataset on Azure Databricks to answer the questions below:

a) Plot a visual showing, by month, the total fare amount generated by taxi trips with 4 or fewer passengers that were paid for by credit card. (You will have 12 records)

b) Plot a visual showing, for each month of the year assigned to you, the average cost per mile of taxi rides that travelled more than 5 miles but less than 20 miles, grouped by whether the trip was to the airport. (You will have 24 records)
c) Plot a visual showing the average number of single-passenger taxi trips for each day of the week. (You will have 7 records)
d) What are the top 10 most profitable routes (in terms of source and destination) for a taxi? (You will have 10 records)

For each of the questions above provide:
• A screenshot of the visual
• A table of the values
• The code that you used to generate it
You will make use of the Azure Databricks cluster which is allocated to you. The details of the cluster will be provided on the LMS under the section Assignment 3. The year allocated to you for analysis will also be shared with you on the LMS.
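
The filter-then-group logic behind question 2(a) can be sketched in plain Python as below. This is only a small illustration, not the Databricks notebook code: the column names (`tpep_pickup_datetime`, `passenger_count`, `payment_type`, `fare_amount`) and the code `1` for credit card are assumptions about the dataset schema and should be checked against the data provided, and the sample rows and year are made up.

```python
from collections import defaultdict
from datetime import datetime

CREDIT_CARD = 1  # assumed payment_type code for credit card; verify against the schema

def monthly_card_fares(trips):
    """Sum fare_amount per pickup month for card-paid trips with 4 or fewer passengers."""
    totals = defaultdict(float)
    for trip in trips:
        if trip["passenger_count"] <= 4 and trip["payment_type"] == CREDIT_CARD:
            pickup = datetime.strptime(trip["tpep_pickup_datetime"], "%Y-%m-%d %H:%M:%S")
            totals[pickup.month] += trip["fare_amount"]
    # Sort by month so the result lines up with a month-by-month plot.
    return dict(sorted(totals.items()))

# Made-up rows for illustration only (not real trip records).
sample_trips = [
    {"tpep_pickup_datetime": "2019-01-05 08:15:00", "passenger_count": 2, "payment_type": 1, "fare_amount": 12.5},
    {"tpep_pickup_datetime": "2019-01-20 18:40:00", "passenger_count": 1, "payment_type": 1, "fare_amount": 7.5},
    {"tpep_pickup_datetime": "2019-01-21 09:00:00", "passenger_count": 5, "payment_type": 1, "fare_amount": 30.0},
    {"tpep_pickup_datetime": "2019-02-02 11:00:00", "passenger_count": 1, "payment_type": 2, "fare_amount": 9.0},
    {"tpep_pickup_datetime": "2019-02-03 12:30:00", "passenger_count": 3, "payment_type": 1, "fare_amount": 20.0},
]
totals = monthly_card_fares(sample_trips)
print(totals)
```

On the cluster the same filter and group-by would be expressed through the DataFrame or SQL interface rather than a Python loop; the sketch only makes the two filter conditions and the grouping key explicit. A full year of data should yield one total per month, which matches the 12 records expected.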

3 Azure Machine Learning for Prediction

Based on the year assigned to you in the New York City Taxi Dataset (as given in question 2 above) use Azure ML Studio to build a model that predicts the total ride duration of taxi trips in New York City.

Provide the following:
a) A screen capture of the completed model diagram and any decisions you made in training the model. For example, the rationale for the components used, and how many records were used for training and how many for testing.
b) A set of metrics which presents how effective your model is.
c) Which features were most influential in driving your model?
d) Using your model, predict the total trip duration for the trips given below.

You will make use of the Azure Machine Learning Studio that has been allocated to you. Information regarding accessing the application can be found in the LMS under the section Assignment 3.

The datasets required for training and testing are available in Azure Machine Learning Studio. Further information has been provided in the LMS under the section Assignment 3.
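
For part (b) of question 3, the effectiveness of a regression model such as a trip-duration predictor is commonly summarised with mean absolute error (MAE), root mean squared error (RMSE), and the coefficient of determination (R²). The sketch below computes these by hand in plain Python so the definitions are explicit. The function name and the sample durations are hypothetical, and the metrics reported by Azure ML Studio may be named or laid out differently.

```python
import math

def regression_metrics(actual, predicted):
    """Compute MAE, RMSE and R^2 for paired actual and predicted values."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)        # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}

# Made-up trip durations in minutes, for illustration only.
actual_minutes = [10.0, 20.0, 30.0, 40.0]
predicted_minutes = [12.0, 18.0, 33.0, 37.0]
metrics = regression_metrics(actual_minutes, predicted_minutes)
print(metrics)
```

MAE and RMSE are in the same units as the target (here minutes), which makes them easy to interpret for trip duration, while R² indicates how much of the variance in duration the model explains.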





All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd