Reference no: EM132011781
Assignment -
In this problem you will be working with data from a collection of Wikipedia edit logs. The file that you will be working with is enwiki-20080103.main.bz2. This file is bzip2 compressed and is about 8.5GB. The file decompresses to a little over 300GB. You will want a part of the file to use while developing your program. Using bait you can decompress the file a little at a time. Once you have decompressed some of it you can recompress it using the bz2 command. The output you create in this part will be used in the second part of the project. You should familiarize yourself with this file before planning out your code.
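For development, one way to carve out a small sample is with Python's bz2 module. This is a minimal sketch only; the sample file name and the number of lines kept are placeholders, not part of the assignment.

    import bz2

    # Carve out the first 100,000 lines of the full dump into a small,
    # recompressed sample for development. The sample file name and the
    # line count are placeholders, not part of the assignment.
    with bz2.open("enwiki-20080103.main.bz2", "rt", encoding="utf-8",
                  errors="replace") as full, \
            bz2.open("sample.main.bz2", "wt", encoding="utf-8") as sample:
        for line_number, line in enumerate(full):
            if line_number >= 100_000:
                break
            sample.write(line)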
Make sure you get the correct data set, as there is more than one at this link. You may carry out this project using whatever method you like. I developed two solutions: one using Hadoop MapReduce with the Java API on AWS, and another using only Python on a computer with a large amount of main memory (64 GB). If you do not own such a computer you can obtain an AWS instance with the desired specifications.
WARNING: Although this dataset contains only text, there is offensive material in it. You are likely to find this once you begin to extract the link data. Given the size of the dataset, I am not completely aware of everything one might find. If you believe this is likely to be an issue for you, please let me know.
1. Execute a job whose output is a file containing lines of tab-separated fields, where the fields give the following: Article Name, Number of Edits of the Article, Number of Major Edits of the Article, Number of Out Links, Number of In Links, Number of Distinct Editors, and the time of the earliest edit. Thus there are seven tab-separated fields. Here by Article Name we mean anything that is an article that was edited or something that was linked to by an actual article, so an Article Name can include things like image files. Do not include External links. Some of the fields may therefore not exist: use 0 for all missing fields except the earliest edit time, where you should use something indicating absence if no edit time can be associated with the article. (A minimal sketch of writing this output appears after problem 2 below.)
2. Determine the directed graph which associates articles to the objects to which they link, excluding External Links. Each line of output will be an article followed by an object to which it links, where the two fields are tab separated.
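As a format illustration only (parsing the edit log itself is left to you), the problem 1 table could be written as below. The stats dictionary and its keys are hypothetical names used for this sketch, and the "NONE" marker for a missing earliest edit time is just one possible choice. The problem 2 edge list can be written the same way, with two tab-separated fields per line.

    # Hypothetical sketch: stats maps an article name to whatever counts were
    # accumulated for it while scanning the edit log. Only the output format
    # (seven tab-separated fields per line) comes from the assignment itself.
    def write_article_table(stats, out_path="article_table.tsv"):
        with open(out_path, "w", encoding="utf-8") as out:
            for name, s in stats.items():
                fields = [
                    name,
                    s.get("edits", 0),
                    s.get("major_edits", 0),
                    s.get("out_links", 0),
                    s.get("in_links", 0),
                    s.get("distinct_editors", 0),
                    s.get("earliest_edit", "NONE"),  # placeholder for "no edit time"
                ]
                out.write("\t".join(str(f) for f in fields) + "\n")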
In this part you will work with the data you generated in problems 1 and 2 above. For these problems you are free to use any method you like; in fact, you are encouraged to choose whatever tool you feel best fits each problem.
3. Using the data from problem 2 of the first part of the project, remove all edges which do not connect actual pages to one another. By actual page we simply mean something which is the subject of an edit, not something that is only linked to by pages. Then perform PageRank on the resulting graph, using a β of 0.85. Submit a file which gives each page together with its PageRank. Fields should be tab separated and the data should be sorted by PageRank in descending order. (A minimal PageRank sketch appears after problem 4 below.)
4. In this problem you will think of, and execute, a series of queries on your table data from problem 1 of the first part. A page refers to something which was the subject of at least one edit. First perform the following (a sketch of query (a) appears after this problem):
(a) Determine the page which has been edited the most number of times.
(b) Determine the page which has the largest number of distinct editors.
(c) Determine the page which has the earliest edit time.
(d) Determine the object which has the largest number of in links.
(e) Determine the page which has the largest number of out links.
(f) Determine the number of pages which have no out links.
Now think of four more queries to perform. In your submission include the results of each of the ten queries you performed. Also include a description of the four additional queries you performed.
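For problem 3, a minimal in-memory PageRank sketch is given below. It assumes the problem 2 edge list has already been filtered down to edges between actual pages and written as a tab-separated file; the file names, the iteration count, and the uniform redistribution of dangling-page mass are assumptions of this sketch, not requirements of the assignment.

    # Minimal power-iteration PageRank over a tab-separated edge list.
    # beta = 0.85 as required; file names, iteration count, and the
    # dangling-page treatment (uniform redistribution) are assumptions.
    def pagerank(edge_file="page_graph.tsv", out_file="pagerank.tsv",
                 beta=0.85, iterations=50):
        out_links = {}
        nodes = set()
        with open(edge_file, encoding="utf-8") as f:
            for line in f:
                src, dst = line.rstrip("\n").split("\t")
                out_links.setdefault(src, []).append(dst)
                nodes.update((src, dst))

        n = len(nodes)
        rank = {node: 1.0 / n for node in nodes}
        for _ in range(iterations):
            new_rank = {node: (1.0 - beta) / n for node in nodes}
            dangling = 0.0
            for node, r in rank.items():
                targets = out_links.get(node)
                if targets:
                    share = beta * r / len(targets)
                    for t in targets:
                        new_rank[t] += share
                else:
                    dangling += beta * r  # page with no out links
            for node in nodes:
                new_rank[node] += dangling / n  # spread dangling mass uniformly
            rank = new_rank

        # Write pages sorted by PageRank, descending, tab separated.
        with open(out_file, "w", encoding="utf-8") as out:
            for node, r in sorted(rank.items(), key=lambda kv: -kv[1]):
                out.write(f"{node}\t{r}\n")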
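For problem 4, each of the listed queries can be answered with a single pass over the problem 1 table. As one example, query (a) might look like the following; the file name, column order, and "NONE" edit-time marker match the earlier sketch and are assumptions.

    # Query (a): find the page edited the most times by scanning the
    # problem 1 table. A page is anything with at least one edit.
    best_page, best_edits = None, 0
    with open("article_table.tsv", encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            name, edits = fields[0], int(fields[1])
            if edits > best_edits:  # edits >= 1 means it is a page
                best_page, best_edits = name, edits
    print(best_page, best_edits)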
Attachment: Assignment Files.rar