Analyze the Yelp 2016 challenge dataset

Reference no: EM131030607

I need urgent help with my Big Data assignment, which is on Hadoop. It needs another .tgz file to pull the data; I will share the file with you as soon as you contact me. Can you get it done in Hadoop, R, or Apache Spark?

You must use Hadoop technologies to analyze the Yelp 2016 challenge dataset: https://www.yelp.com/dataset_challenge.

You may use Hadoop, R, or Apache Spark.

Specifically, you must provide the answers (and code) to the following questions:

Summarize the number of reviews by US city, by business category.

Rank the cities by number of stars, for each category.

What is the average rating (stars) for businesses within 800 ft of Times Square, by type? For this problem, assume Times Square is at lat: 40° 45' 32.0256'' N, lon: 73° 59' 6.4680'' W, and treat 800 ft as a square extending 10 arc-seconds in each direction.

Rank reviewers by number of reviews. For the top 10 reviewers, show their average number of stars, by category.

For the top 10 and bottom 10 food businesses in Times Square (in terms of stars), summarize rating by hour of day.
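As a rough illustration of the first question (number of reviews by city and by business category), the logic can be sketched in plain Python before porting it to Hadoop or Spark. This is only a sketch under assumptions: it assumes the business and review files are JSON-lines, that business records carry `business_id`, `city`, and a `categories` list, and that review records carry `business_id`. The function name and the sample records are hypothetical, and the "US city" filter is left out.

```python
import json
from collections import Counter

def count_reviews_by_city_category(business_lines, review_lines):
    """Count reviews per (city, category) pair from JSON-lines records."""
    # Build a lookup table: business_id -> (city, categories).
    # In Spark or MapReduce this would be the small side of a join.
    businesses = {}
    for line in business_lines:
        b = json.loads(line)
        businesses[b["business_id"]] = (b["city"], b.get("categories") or [])

    counts = Counter()
    for line in review_lines:
        r = json.loads(line)
        info = businesses.get(r["business_id"])
        if info is None:
            continue  # review refers to a business that was not loaded
        city, categories = info
        # A review counts once for every category of its business.
        for category in categories:
            counts[(city, category)] += 1
    return counts

# Hand-made sample records (hypothetical, not taken from the dataset).
sample_businesses = [
    json.dumps({"business_id": "b1", "city": "Pittsburgh",
                "categories": ["Food", "Coffee & Tea"]}),
    json.dumps({"business_id": "b2", "city": "Pittsburgh",
                "categories": ["Food"]}),
]
sample_reviews = [
    json.dumps({"business_id": "b1", "stars": 4}),
    json.dumps({"business_id": "b1", "stars": 5}),
    json.dumps({"business_id": "b2", "stars": 3}),
]

counts = count_reviews_by_city_category(sample_businesses, sample_reviews)
for (city, category), n in sorted(counts.items()):
    print(city, category, n)
```

The same join-then-group pattern would carry over to the other questions by swapping the count for an average of `stars` or by grouping on a reviewer identifier instead of the city.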

Changes: Since the Yelp academic dataset does not include New York, we need to amend the coordinates:

Center: Carnegie Mellon University, Pittsburgh, PA
Latitude: 40° 26' 28'' N, Longitude: 079° 56' 34'' W

Decimal Degrees: Latitude: 40.4411801, Longitude: -79.9428294
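The degrees-minutes-seconds values and the decimal degrees above can be cross-checked with a small conversion. The sketch below uses a hypothetical helper, `dms_to_decimal`, and assumes the usual convention that southern and western hemispheres are negative.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative by convention.
    return -value if hemisphere in ("S", "W") else value

cmu_lat = dms_to_decimal(40, 26, 28, "N")
cmu_lon = dms_to_decimal(79, 56, 34, "W")
print(round(cmu_lat, 4), round(cmu_lon, 4))
```

The results agree with the stated decimal degrees to about four decimal places; the small remaining difference comes from the DMS values being rounded to whole seconds.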

The bounding box for the midterm extends ~5 miles from the center, which we will loosely define as 5 arc-minutes. So the bounding box is a square, 10 arc-minutes on each side (of longitude and latitude), with CMU at the center.
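A minimal sketch of the bounding-box test, assuming each business record exposes its latitude and longitude in decimal degrees. The function name `in_bounding_box` and the constant names are hypothetical; the center and the 5 arc-minute half-side come from the text above.

```python
CENTER_LAT = 40.4411801    # CMU latitude in decimal degrees
CENTER_LON = -79.9428294   # CMU longitude in decimal degrees
HALF_SIDE_DEG = 5 / 60     # 5 arc-minutes in each direction

def in_bounding_box(lat, lon,
                    center_lat=CENTER_LAT,
                    center_lon=CENTER_LON,
                    half_side=HALF_SIDE_DEG):
    """Return True if (lat, lon) lies inside the square around the center."""
    return (abs(lat - center_lat) <= half_side
            and abs(lon - center_lon) <= half_side)

print(in_bounding_box(40.45, -79.95))   # a point close to the center
print(in_bounding_box(40.60, -79.94))   # a point well north of the box
```

In Spark or a MapReduce mapper, the same predicate could serve as the filter applied to business records before any per-category aggregation.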

Provide suitable statistical analysis of your results with R.

Provide visualizations for results (distributions, graphs, maps, in R).






All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd