Write a MapReduce program in Hadoop

Reference no: EM131910715

Mining Big Data Assignment: Basics and Map-Reduce

Exercise 1 - Suspected Pairs

Using the information from the first lecture (or Section 1.2.3 in the textbook), what would be the number of suspected pairs if the following changes were made to the data? All changes should be applied at once.

1. The number of days of observation was raised to 5000.

2. The number of people observed was raised to 5 billion (and there were therefore 500,000 hotels).

3. We only reported a pair as suspect if they were at the same hotel at the same time on four different days.
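The calculation behind this exercise can be sketched as a small Python function. This is an illustrative sketch, not the official solution. The function name `expected_suspected_pairs` and its parameters are hypothetical, and the baseline values (one billion people, 1000 days, 100,000 hotels, each person visiting a hotel one day in 100, two shared days) are assumed from the textbook example in Section 1.2.3. It uses the large-n approximation C(n, k) ≈ n^k / k!.

```python
from math import factorial


def expected_suspected_pairs(people, days, hotels, visit_prob, days_together):
    """Expected number of pairs seen at the same hotel on `days_together` days.

    All parameter names are illustrative; the model assumes people choose
    days and hotels independently and uniformly at random.
    """
    # Probability that two given people both visit a hotel on a given day
    # and that they pick the same hotel.
    same_hotel_one_day = visit_prob * visit_prob / hotels
    # Probability that this coincidence happens on each of k specific days.
    same_hotel_all_days = same_hotel_one_day ** days_together
    # Large-n approximations: C(n, 2) ~ n^2 / 2 and C(d, k) ~ d^k / k!.
    pairs_of_people = people ** 2 / 2
    sets_of_days = days ** days_together / factorial(days_together)
    return pairs_of_people * sets_of_days * same_hotel_all_days


# Baseline values assumed from the textbook example.
baseline = expected_suspected_pairs(
    people=10**9, days=1000, hotels=10**5, visit_prob=0.01, days_together=2
)
print(round(baseline))
```

With the assumed baseline values the result is about 250,000 pairs. Passing the three modified values from the list above (and leaving `visit_prob` unchanged) gives the figure the exercise asks for.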

Exercise 2 - Hadoop

For this exercise, you have to set up and configure your system to use Hadoop. Follow the instructions in the Stanford document (in the attached file) and set up the virtual machine as described in Section 1. Run the example program of Section 2 and carry out the different steps given in that section.

  • Write your own Hadoop Map-Reduce job that outputs the number of words that start with each letter (see Sections 2.5 and 3 of the Stanford document).
  • Run your job on the file in standalone mode and pseudo-distributed mode and record the output.
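Before writing the Hadoop job, it can help to prototype the map and reduce logic outside Hadoop. The sketch below simulates both phases in plain Python; it is not Hadoop code, and the function names are hypothetical. It assumes words are split on whitespace, that counting is case-insensitive, and that tokens starting with a non-letter are skipped. None of these choices is fixed by the exercise text.

```python
from collections import defaultdict


def map_first_letter(line):
    """Map step: emit (letter, 1) for every token that starts with a letter."""
    for word in line.split():
        first = word[0].lower()  # assumption: counting is case-insensitive
        if first.isalpha():      # assumption: skip tokens like "42" or "(x)"
            yield first, 1


def reduce_counts(pairs):
    """Shuffle and reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_first_letters(lines):
    """Run the simulated map phase over all lines, then reduce."""
    pairs = (pair for line in lines for pair in map_first_letter(line))
    return reduce_counts(pairs)


sample = ["Hadoop handles big data", "Big data, big clusters"]
letter_counts = count_first_letters(sample)
for letter in sorted(letter_counts):
    print(letter, letter_counts[letter])
```

In a real Hadoop job the mapper would emit the same (letter, 1) pairs and the reducer would sum them; the framework performs the grouping that `reduce_counts` does here with a dictionary.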

Exercise 3 - Friend Recommendation System (Stanford)

Write a MapReduce program in Hadoop that implements a simple People You Might Know social network friendship recommendation algorithm. The key idea is that if two people have a lot of mutual friends, then the system should recommend that they connect with each other. You have to run the program on the system setup in Exercise 2 in order to receive points for this exercise.

Input: Download the input file. It contains the adjacency list and has multiple lines in the following format:

<User><TAB><Friends>

Here, <User> is a unique integer ID corresponding to a unique user and <Friends> is a comma-separated list of unique IDs corresponding to the friends of the user with the unique ID <User>. Note that the friendships are mutual (i.e., edges are undirected): if A is a friend of B, then B is also a friend of A.

Algorithm: Let us use a simple algorithm such that, for each user U, the algorithm recommends N = 10 users who are not already friends with U but have the largest number of mutual friends with U.

Output: The output should contain one line per user in the following format:

<User><TAB><Recommendations>

where <User> is a unique ID corresponding to a user and <Recommendations> is a comma-separated list of unique IDs corresponding to the algorithm's recommendations of people that <User> might know, ordered by decreasing number of mutual friends. Even if a user has fewer than 10 second-degree friends, output all of them in decreasing order of the number of mutual friends. If there are recommended users with the same number of mutual friends, then output those user IDs in numerically ascending order.

Also, please provide a description of how you are going to use MapReduce jobs to solve this problem. Do not write more than 3 to 4 sentences for this: we only want a very high-level description of your strategy to tackle this problem.

Note: It is possible to solve this question with a single MapReduce job. But if your solution requires multiple MapReduce jobs, then that is fine too.
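One possible single-job strategy can be prototyped in plain Python before porting it to Hadoop. This is a sketch under stated assumptions, not the reference solution. The mapper emits, for each user, a marker for every direct friend and, for every pair of that user's friends, a candidate pair that shares this user as a mutual friend. The reducer counts candidates, drops direct friends, and sorts. The function names and the `(candidate, is_direct)` value layout are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def map_user(line):
    """Map step: emit direct-friend markers and mutual-friend candidates."""
    user_field, _, friends_field = line.strip().partition("\t")
    user = int(user_field)
    friends = [int(f) for f in friends_field.split(",") if f.strip()]
    # Marking the user as "direct" guarantees a key (and an output line)
    # even for users with no friends, and never recommends a user to itself.
    yield user, (user, True)
    for friend in friends:
        yield user, (friend, True)
    # Every pair of this user's friends has this user as a mutual friend.
    for a, b in combinations(friends, 2):
        yield a, (b, False)
        yield b, (a, False)


def reduce_user(values, limit=10):
    """Reduce step: count mutual friends, drop direct friends, rank."""
    direct = set()
    mutual = defaultdict(int)
    for candidate, is_direct in values:
        if is_direct:
            direct.add(candidate)
        else:
            mutual[candidate] += 1
    # Decreasing mutual-friend count, ties broken by ascending user ID.
    ranked = sorted(
        (c for c in mutual if c not in direct),
        key=lambda c: (-mutual[c], c),
    )
    return ranked[:limit]


def recommend(lines, limit=10):
    """Simulate the shuffle by grouping mapper output by key."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_user(line):
            grouped[key].append(value)
    return {user: reduce_user(grouped[user], limit) for user in sorted(grouped)}


sample = ["1\t2,3", "2\t1,4", "3\t1,4", "4\t2,3", "5\t"]
for user, recs in recommend(sample).items():
    print(f"{user}\t{','.join(map(str, recs))}")
```

The direct-friend markers let the reducer exclude existing friends without a second job, which is why a single MapReduce pass can suffice. The sample adjacency list is made up for illustration and is unrelated to the real input file.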

For your submission

  • Include your source code
  • Include in your writeup a short paragraph describing your algorithm to tackle this problem.
  • Include in your writeup the recommendations for the users with the following user IDs: 924, 8941, 8942, 9019, 9020, 9021, 9022, 9990, 9992, 9993.

Exercise 4 - MapReduce

This exercise has 4 parts. You will write and implement two MapReduce programs. Both are a bit challenging, but they will help you better understand how MapReduce is implemented. After you write the programs, you will need to answer some questions about them.

Remember that neither problem is case sensitive, so transform words to lowercase or uppercase. Also remember to use the StringTokenizer to find the correct answers.

Part 1: Write a program that processes the FirstInputFile and the SecondInputFile (attached). This program should count the number of words with a specific number of letters in these files, for example, the number of words with 4 letters, 5 letters, and so on. If a word is repeated 20 times in the text, count it 20 times.
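The counting logic for Part 1 can be sketched in plain Python as a simulation of the map and reduce phases; this is not Hadoop code and the function names are hypothetical. It assumes tokens are split on whitespace only, which approximates the default behaviour of Java's StringTokenizer, so punctuation attached to a word counts toward its length. Whether the expected answers treat punctuation this way is an assumption.

```python
from collections import Counter


def map_word_lengths(line):
    """Map step: emit (length, 1) for every whitespace-separated token."""
    for token in line.split():
        yield len(token), 1


def count_word_lengths(lines):
    """Simulated shuffle and reduce: sum the emitted ones per length."""
    totals = Counter()
    for line in lines:
        for length, one in map_word_lengths(line):
            totals[length] += one
    return dict(totals)


sample = ["big data is big", "Data mining"]
length_counts = count_word_lengths(sample)
for length in sorted(length_counts):
    print(length, length_counts[length])
```

Here repeated words are counted every time they occur, so "big" contributes twice to length 3. Lowercasing does not change a word's length, so it is omitted in this part.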

Part 2: Answer Questions 1-6.

Q1: How many words are there with length 10 in FirstInputFile?

Q2: How many words are there with length 4 in FirstInputFile?

Q3: What is the longest word length and what is its frequency in FirstInputFile?

Q4: How many words are there with length 2 in SecondInputFile?

Q5: How many words are there with length 5 in SecondInputFile?

Q6: What is the most frequent length and what is its frequency in SecondInputFile?

Part 3: Write a second program that again processes the FirstInputFile (attached) and the SecondInputFile (attached). This time, in addition to counting the number of words with a specific number of letters, count a word only once even if it is repeated several times. Your output should therefore be the frequency of each word length, with each distinct word counted once. Note: You may need to use 2 MapReduce jobs.
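A two-job approach to Part 3 can be sketched in plain Python as follows; this simulates the idea rather than using Hadoop, and the function names are hypothetical. The first job keys on the lowercased word so that duplicates collapse into one record per distinct word. The second job keys on word length and counts. As in Part 1, splitting on whitespace only is an assumption.

```python
from collections import Counter


def job1_distinct_words(lines):
    """First job: keying on the lowercased word collapses duplicates."""
    return {token.lower() for line in lines for token in line.split()}


def job2_length_counts(words):
    """Second job: key on word length and count one per distinct word."""
    return dict(Counter(len(word) for word in words))


def count_distinct_word_lengths(lines):
    """Chain the two simulated jobs."""
    return job2_length_counts(job1_distinct_words(lines))


sample = ["big data is big", "Data mining"]
distinct_counts = count_distinct_word_lengths(sample)
for length in sorted(distinct_counts):
    print(length, distinct_counts[length])
```

In this sample "big" and "data" each appear twice (once with different capitalization) but are counted once, so lengths 3 and 4 each have a count of 1.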

Part 4: Answer Questions 7-12.

Q7: How many words are there with length 10 in FirstInputFile?

Q8: How many words are there with length 4 in FirstInputFile?

Q9: What is the most frequent length and what is its frequency in FirstInputFile?

Q10: How many words are there with length 5 in SecondInputFile?

Q11: How many words are there with length 2 in SecondInputFile?

Q12: What is the second-most frequent length and what is its frequency in SecondInputFile?

Exercise 5 - Summary of 2.4 and 2.5

For this exercise you have to read Section 2.3.9-2.3.11, 2.4, and 2.5 in Leskovec, Rajaraman, Ullman (second edition, 2014).

  • Summarize the content of 2.4 in your own words (600 words).
  • Summarize the content of 2.5 in your own words (600 words).

Attachment:- Assignment File.rar




Procedure for handing in the assignment

Work should be handed in using Canvas, with only one submission per group. The submission should include:

  • A PDF file of your solutions for the theoretical assignments
  • All source files
  • Descriptions as required in the statement of the exercises
  • Hadoop outputs for the exercises
  • A README.txt file containing instructions to run the code and the names, student numbers, and email addresses of the group members

In addition, there will be a discussion session where you have to explain your solutions.
