Determine the page which has the earliest edit time

Reference no: EM132011781

Assignment -

In this problem you will be working with data from a collection of Wikipedia edit logs. The file that you will be working with is enwiki-20080103.main.bz2. This file is bzip2 compressed and is about 8.5GB; it decompresses to a little over 300GB. You will want a part of the file to use while developing your program. Using bzcat you can decompress the file a little at a time, and once you have decompressed some of it you can recompress it using the bzip2 command. The output you create in this part will be used in the second part of the project. You should familiarize yourself with this file before planning out your code.
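
For those working only in Python, a development sample can also be cut with the standard bz2 module instead of the command line. The following is a minimal sketch, not a prescribed method: the function name write_sample and the output file name are hypothetical, and it assumes the data is line-oriented text that can be read as UTF-8 (undecodable bytes are replaced).

```python
import bz2
import itertools


def write_sample(src_path, dst_path, max_lines):
    """Copy the first max_lines lines of a bzip2 file into a new bzip2 file.

    The source is decompressed as a stream, so the full file is never
    expanded on disk or held in memory.
    """
    count = 0
    with bz2.open(src_path, "rt", encoding="utf-8", errors="replace") as src, \
         bz2.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in itertools.islice(src, max_lines):
            dst.write(line)
            count += 1
    return count


if __name__ == "__main__":
    import os
    # Hypothetical output name; the sample size is arbitrary.
    if os.path.exists("enwiki-20080103.main.bz2"):
        print(write_sample("enwiki-20080103.main.bz2", "sample.bz2", 1_000_000))
```

Because the sample is taken from the start of the file, it covers only the first articles in the log, so it is suitable for checking parsing logic rather than for estimating statistics.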

Make sure you get the correct data set as there is more than one at this link. You may carry out this project using whatever method you like. I developed two solutions: one using Hadoop MapReduce with the Java API on AWS, and another using only Python on a computer with a large amount of main memory (64 GB). If you do not own such a computer you can obtain an AWS instance with the desired specifications.

WARNING: Although this data contains only text data there is offensive material contained in it. You are likely to find this once you begin to extract the link data. Given the size of the dataset I am not completely aware of everything which one might find. If you believe that this is likely to be an issue for you please let me know.

1. Execute a job whose output is a file of lines, each consisting of seven tab-separated fields: Article Name, Number of Edits of the Article, Number of Major Edits of the Article, Number of Out Links, Number of In Links, Number of Distinct Editors, and the time of the earliest edit. Here Article Name means anything that is an article that was edited or something that was linked to by an actual article, so it can include things like image files. Do not include external links. Because some names are only ever linked to, some of the fields may not exist for them. Use 0 for every missing field except the earliest edit time, where you should use something indicating absence if no edit time can be associated with the name.
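
As an illustration of the aggregation in problem 1, here is a minimal in-memory sketch. It does not parse the raw log; it assumes a parser has already produced one tuple per edit of the form (article, editor, timestamp, is_minor, out_links). That tuple layout, the names build_table and format_row, and the "NA" marker are all hypothetical. It further assumes that a major edit is any edit not flagged as minor, that links are counted as distinct (source, target) pairs rather than once per revision, and that timestamps are ISO 8601 strings, which sort correctly as plain text.

```python
from collections import defaultdict

MISSING_TIME = "NA"  # hypothetical marker for "no edit time known"


def build_table(edits):
    """Aggregate per-name statistics from an iterable of edit tuples.

    Each edit is (article, editor, timestamp, is_minor, out_links).
    Returns rows of the seven fields, sorted by name.
    """
    n_edits = defaultdict(int)
    n_major = defaultdict(int)
    editors = defaultdict(set)
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    earliest = {}
    names = set()

    for article, editor, timestamp, is_minor, links in edits:
        names.add(article)
        n_edits[article] += 1
        if not is_minor:
            n_major[article] += 1
        editors[article].add(editor)
        if article not in earliest or timestamp < earliest[article]:
            earliest[article] = timestamp
        for target in links:
            names.add(target)  # linked-to objects get a row too
            out_links[article].add(target)
            in_links[target].add(article)

    return [
        (name, n_edits[name], n_major[name], len(out_links[name]),
         len(in_links[name]), len(editors[name]),
         earliest.get(name, MISSING_TIME))
        for name in sorted(names)
    ]


def format_row(row):
    """Join the seven fields with tabs, as the output format requires."""
    return "\t".join(str(field) for field in row)
```

Holding every set in memory is only realistic on a machine like the 64 GB one mentioned above; a MapReduce solution would compute the same counts by keying records on the article name.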

2. Determine the directed graph which associates articles with the objects to which they link, excluding external links. Each line of output will be an article followed by an object to which it links, with the two fields tab separated.

In this part you will work with the data you generated in problems 1 and 2 above. For these problems you are free to use any method you like. In fact you are encouraged to choose whatever tool you feel best fits the problem.

3. Using the data from problem 2 of the first part of the project, remove all edges which do not connect actual pages to one another. By actual page we simply mean something which is the subject of an edit, not something that is only linked to by pages. Then perform PageRank on the topic graph. Use a β of 0.85. Submit a file which gives each page together with its PageRank. Fields should be tab separated and the data should be sorted by PageRank in descending order.
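
The filtering and ranking steps in problem 3 can be sketched with a plain power iteration. This is one possible approach under stated assumptions, not a required implementation: the function name pagerank and its tol and max_iter parameters are hypothetical, duplicate edges are collapsed, and rank that would be lost at pages with no out links is redistributed uniformly together with the teleport mass (the assignment does not say how dead ends should be treated).

```python
def pagerank(edges, pages, beta=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank restricted to edges between actual pages.

    edges: iterable of (source, target) pairs from the problem 2 output.
    pages: iterable of names that were the subject of at least one edit.
    Returns (page, rank) pairs sorted by rank in descending order.
    """
    pages = set(pages)
    if not pages:
        return []
    # Drop edges touching anything that is not an actual page.
    kept = {(s, t) for s, t in edges if s in pages and t in pages}
    out = {p: [] for p in pages}
    for s, t in kept:
        out[s].append(t)

    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        new = {p: 0.0 for p in pages}
        for s, targets in out.items():
            if targets:
                share = beta * rank[s] / len(targets)
                for t in targets:
                    new[t] += share
        # Teleport mass plus mass lost at dead ends, spread uniformly.
        leaked = 1.0 - sum(new.values())
        for p in pages:
            new[p] += leaked / n
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break
    return sorted(rank.items(), key=lambda kv: (-kv[1], kv[0]))
```

Each returned pair can be written as a tab-separated line to produce the submission file. For the full graph, a sparse-matrix or distributed implementation would likely be needed; the dictionary version is meant for checking results on a sample.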

4. In this problem you will think of, and execute, a series of queries on your table data from problem 1 of the first part. A page refers to something which was the subject of at least one edit. First perform the following:

(a) Determine the page which has been edited the greatest number of times.

(b) Determine the page which has the largest number of distinct editors.

(c) Determine the page which has the earliest edit time.

(d) Determine the object which has the largest number of in links.

(e) Determine the page which has the largest number of out links.

(f) Determine the number of pages which have no out links.
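
Since any tool may be used here, one option is to run these queries directly in Python over the seven-field table from problem 1. The sketch below covers the listed queries and the earliest-edit query named in the page title. The function names, dictionary keys, and the "NA" absence marker are hypothetical, ties are broken by file order, and edit times are assumed to be ISO 8601 strings that sort correctly as text.

```python
def parse_table(lines):
    """Parse tab-separated problem 1 output lines into dictionaries."""
    rows = []
    for line in lines:
        name, edits, major, out_l, in_l, editors, earliest = \
            line.rstrip("\n").split("\t")
        rows.append({
            "name": name,
            "edits": int(edits),
            "major_edits": int(major),
            "out_links": int(out_l),
            "in_links": int(in_l),
            "editors": int(editors),
            "earliest": earliest,
        })
    return rows


def run_queries(rows, missing="NA"):
    """Answer the basic queries; a page is a row with at least one edit."""
    pages = [r for r in rows if r["edits"] > 0]
    timed = [r for r in pages if r["earliest"] != missing]
    return {
        "most_edited": max(pages, key=lambda r: r["edits"])["name"],
        "most_editors": max(pages, key=lambda r: r["editors"])["name"],
        "earliest_edit": min(timed, key=lambda r: r["earliest"])["name"],
        # In links are asked for over all objects, not only pages.
        "most_in_links": max(rows, key=lambda r: r["in_links"])["name"],
        "most_out_links": max(pages, key=lambda r: r["out_links"])["name"],
        "pages_without_out_links": sum(
            1 for r in pages if r["out_links"] == 0),
    }
```

Loading the same file into a SQL database or a dataframe library would work equally well; the point is only that each query reduces to a maximum, minimum, or count over one column.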

Now think of four more queries to perform. In your submission include the results of each of the ten queries you performed. Also include a description of the four additional queries you performed.
