Determine the page which has the earliest edit time

Reference no: EM132011781

Assignment -

In this problem you will be working with data from a collection of Wikipedia edit logs. The file that you will be working with is enwiki-20080103.main.bz2. This file is bzip2 compressed and is about 8.5 GB; it decompresses to a little over 300 GB. You will want a part of the file to use while developing your program. Using bzcat you can decompress the file a little at a time. Once you have decompressed some of it, you can recompress it using the bzip2 command. The output you create in this part will be used in the second part of the project. You should familiarize yourself with this file before planning out your code.
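As an alternative to the command-line tools, a development sample can be cut with Python's standard bz2 module. The following is a minimal sketch, not the instructor's method: the function name write_sample, the sample file name, and the line limit are all hypothetical choices, and it assumes the log is line-oriented text.

```python
import bz2

def write_sample(src_path, dst_path, max_lines=1_000_000):
    """Copy the first max_lines lines of a bzip2 file into a smaller bzip2 file.

    Hypothetical helper: the name and the default limit are illustrative only.
    """
    written = 0
    # bz2.open streams the data, so the full file is never decompressed at once.
    with bz2.open(src_path, "rt", encoding="utf-8", errors="replace") as src, \
         bz2.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in src:
            if written >= max_lines:
                break
            dst.write(line)
            written += 1
    return written

# Example call (the output file name is made up for illustration):
# write_sample("enwiki-20080103.main.bz2", "sample.bz2", max_lines=500_000)
```

Cutting on a fixed line count may split the last record in the sample, so a program developed against it should tolerate a truncated final record.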

Make sure you get the correct data set, as there is more than one at this link. You may carry out this project using whatever method you like. I developed two solutions: one using Hadoop MapReduce with the Java API on AWS, and another using only Python on a computer with a large amount of main memory (64 GB). If you do not own such a computer, you can obtain an AWS instance with the desired specifications.

WARNING: Although this data contains only text, it includes offensive material. You are likely to find this once you begin to extract the link data. Given the size of the dataset, I am not completely aware of everything that one might find. If you believe that this is likely to be an issue for you, please let me know.

1. Execute a job whose output is a file in which each line consists of seven tab-separated fields: Article Name, Number of Edits of the Article, Number of Major Edits of the Article, Number of Out links, Number of In links, Number of Distinct Editors, and the time of the earliest edit. Here Article Name means anything that is an article that was edited or something that was linked to by an actual article, so it can include things like image files. Do not include External links. Some of the fields may therefore not exist for a given name. Use 0 for all missing fields except the earliest edit; if no edit time can be associated with a name, use a value that indicates absence.
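The aggregation in problem 1 can be sketched in plain Python as below. This is only an illustration under stated assumptions: the input record shape (article, editor, timestamp, is_minor, links) is hypothetical and presumes the raw log has already been parsed, "major" is taken to mean "not flagged minor", link counts are counts of distinct link partners across all revisions, timestamps are ISO 8601 strings that sort chronologically, and "NA" is an arbitrary marker for a missing edit time.

```python
MISSING_TIME = "NA"  # assumed marker for "no edit time known"

def summarize(edits):
    """Build the seven-field table from parsed edit records.

    edits: iterable of (article, editor, timestamp, is_minor, links) tuples.
    This record shape is a hypothetical intermediate format, not the raw log.
    """
    n_edits, n_major, earliest = {}, {}, {}
    editors, out_links, in_links = {}, {}, {}
    names = set()

    for article, editor, timestamp, is_minor, links in edits:
        names.add(article)
        n_edits[article] = n_edits.get(article, 0) + 1
        if not is_minor:  # assumption: major edit means not flagged minor
            n_major[article] = n_major.get(article, 0) + 1
        editors.setdefault(article, set()).add(editor)
        if article not in earliest or timestamp < earliest[article]:
            earliest[article] = timestamp
        for target in links:  # external links are assumed to be filtered out already
            names.add(target)
            out_links.setdefault(article, set()).add(target)
            in_links.setdefault(target, set()).add(article)

    rows = []
    for name in sorted(names):
        rows.append((
            name,
            n_edits.get(name, 0),
            n_major.get(name, 0),
            len(out_links.get(name, ())),
            len(in_links.get(name, ())),
            len(editors.get(name, ())),
            earliest.get(name, MISSING_TIME),
        ))
    return rows

def to_tsv_line(row):
    """Format one row as a tab-separated output line."""
    return "\t".join(str(field) for field in row)
```

Holding every set in memory matches the single-machine approach only when enough RAM is available; the same per-article counters map naturally onto a reduce step if MapReduce is used instead.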

2. Determine the directed graph which associates articles to the objects to which they link, excluding External Links. Each line of output will be an article followed by an object to which it links where the two fields are tab separated.

In this part you will work with the data you generated in problems 1 and 2 above. For these problems you are free to use any method you like. In fact, you are encouraged to choose whatever tool you feel best fits the problem.

3. Using the data from problem 2 of the first part of the project, remove all edges which do not connect actual pages to one another. By actual page we simply mean something which is the subject of an edit, not something that is only linked to by pages. Then perform PageRank on the topic graph. Use a β of 0.85. Submit a file which gives each page together with its PageRank. Fields should be tab-separated and the data should be sorted by PageRank in descending order.
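For a graph small enough to fit in memory, problem 3 can be sketched as a power iteration. The assignment does not say how to treat pages with no out links, so this sketch assumes their rank is spread evenly over all pages; the function name, the convergence tolerance, and the iteration cap are also illustrative choices rather than requirements.

```python
def pagerank(edges, pages, beta=0.85, tol=1e-10, max_iter=200):
    """Return (page, rank) pairs sorted by rank, descending.

    edges: iterable of (source, target) pairs from problem 2.
    pages: names that were the subject of at least one edit.
    """
    pages = set(pages)
    if not pages:
        return []
    # Keep only edges whose two endpoints are both actual pages.
    out = {p: set() for p in pages}
    for src, dst in edges:
        if src in pages and dst in pages:
            out[src].add(dst)

    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Assumed convention: rank on dangling pages is shared by everyone.
        dangling = sum(rank[p] for p in pages if not out[p])
        base = (1.0 - beta) / n + beta * dangling / n
        new = {p: base for p in pages}
        for p, targets in out.items():
            if targets:
                share = beta * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break
    # Sort by rank descending; ties are broken by name for stable output.
    return sorted(rank.items(), key=lambda kv: (-kv[1], kv[0]))
```

Each result pair can then be written as one tab-separated line. For the full Wikipedia graph a sparse-matrix or distributed implementation is likely to be more practical than nested Python dictionaries.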

4. In this problem you will think of, and execute, a series of queries on your table data from problem 1 of the first part. A page refers to something which was the subject of at least one edit. First perform the following:

(a) Determine the page which has been edited the greatest number of times.

(b) Determine the page which has the largest number of distinct editors.

(c) Determine the page which has the earliest edit time.

(d) Determine the object which has the largest number of in links.

(e) Determine the page which has the largest number of out links.

(f) Determine the number of pages which have no out links.
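Queries (a) through (f) can be sketched directly over the seven-field table from problem 1. The sketch below makes several assumptions that are not fixed by the assignment: the missing-time marker is the string "NA", edit times are ISO 8601 strings that sort chronologically, ties are resolved by taking the first row encountered, and the dictionary keys and function names are hypothetical.

```python
def parse_table(lines, missing_time="NA"):
    """Parse tab-separated problem 1 output lines into dictionaries."""
    rows = []
    for line in lines:
        name, edits, major, outs, ins, editors, first = line.rstrip("\n").split("\t")
        rows.append({
            "name": name,
            "edits": int(edits),
            "major": int(major),
            "out": int(outs),
            "in": int(ins),
            "editors": int(editors),
            "first": None if first == missing_time else first,
        })
    return rows

def basic_queries(rows):
    """Answer queries (a) to (f); a page is a row with at least one edit."""
    pages = [r for r in rows if r["edits"] > 0]
    dated = [r for r in pages if r["first"] is not None]
    return {
        "a_most_edits": max(pages, key=lambda r: r["edits"])["name"],
        "b_most_editors": max(pages, key=lambda r: r["editors"])["name"],
        "c_earliest_edit": min(dated, key=lambda r: r["first"])["name"],
        # Query (d) asks about objects, so it ranges over every row.
        "d_most_in_links": max(rows, key=lambda r: r["in"])["name"],
        "e_most_out_links": max(pages, key=lambda r: r["out"])["name"],
        "f_pages_without_out_links": sum(1 for r in pages if r["out"] == 0),
    }
```

Loading the table into a relational database or a data-frame library would work equally well; plain Python is used here only to keep the sketch self-contained.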

Now think of four more queries to perform. In your submission include the results of each of the ten queries you performed. Also include a description of the four additional queries you performed.






All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd