Reference no: EM131212170
This exercise is based on the entity-resolution problem of Example 22.9. For concreteness, suppose that the only pairs of records that could possibly be within total edit distance 5 of each other consist of a true copy of a record and a corrupted version of that record. In the corrupted version, each of the three fields is changed independently. 50% of the time, a field has no change. 20% of the time, there is a change resulting in edit distance 1 for that field. There is a 20% chance of edit distance 2 and a 10% chance of edit distance 10. Suppose there are one million pairs of this kind in the dataset.
a) How many of the million pairs are within total edit distance 5 of each other?
b) If we hash each field to a large number of buckets, as suggested by Example 22.9, how many of these one million pairs will hash to the same bucket for at least one of the three hashings?
c) How many false negatives will there be; that is, how many of the one million pairs are within total edit distance 5, but will not hash to the same bucket for any of the three hashings?
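One way to check the arithmetic for parts (a) to (c) is to enumerate all 4 x 4 x 4 combinations of per-field edit distances and add up their probabilities. The sketch below does this under two assumptions that the exercise does not state explicitly: an unchanged field always hashes to the same bucket as its copy, and two different field values never collide (with a million buckets the collision chance is about one in a million and is ignored). The names FIELD_DIST and expected_counts are illustrative only.

```python
from fractions import Fraction
from itertools import product

# Per-field edit-distance distribution given in the exercise.
FIELD_DIST = {
    0: Fraction(5, 10),
    1: Fraction(2, 10),
    2: Fraction(2, 10),
    10: Fraction(1, 10),
}
N_PAIRS = 1_000_000
THRESHOLD = 5


def expected_counts(field_dist=FIELD_DIST, n_pairs=N_PAIRS,
                    threshold=THRESHOLD, fields=3):
    """Return expected counts for parts (a), (b), (c) as a tuple."""
    within = shared = false_neg = Fraction(0)
    for combo in product(field_dist, repeat=fields):
        # Fields change independently, so probabilities multiply.
        p = Fraction(1)
        for d in combo:
            p *= field_dist[d]
        close = sum(combo) <= threshold
        # Assumption: a pair shares a bucket iff some field is unchanged.
        same_bucket = 0 in combo
        if close:
            within += p
        if same_bucket:
            shared += p
        if close and not same_bucket:
            false_neg += p
    return (int(within * n_pairs),
            int(shared * n_pairs),
            int(false_neg * n_pairs))


if __name__ == "__main__":
    print(expected_counts())  # (721000, 875000, 56000)
```

The same figures follow by hand: a pair is within distance 5 unless some field has distance 10 or all three have distance 2, giving 0.9^3 - 0.2^3 = 0.721; at least one field is unchanged with probability 1 - 0.5^3 = 0.875; and the false negatives are the pairs whose fields all have distance 1 or 2 but are not all 2, giving 0.4^3 - 0.2^3 = 0.056.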
Example 22.9
Suppose for concreteness that records are as in the running example of Section 21.7: name-address-phone triples, where each of the three fields is a character string. Suppose also that we define records to be similar if the sum of the edit distances of their three corresponding pairs of fields is no greater than 5. Let us use a hash function h that hashes the name field of a record to one of a million buckets. How h works is unimportant, except that it must be a good hash function - one that distributes names roughly uniformly among the buckets. But we do not stop here. We also hash the records to another set of a million buckets, this time using the address, and a suitable hash function on addresses. If h operates on arbitrary strings, we can even reuse h. Then, we hash records a third time to a million buckets, using the phone number. Finally, we examine each bucket in each of the three hash tables, a total of 3,000,000 buckets. For each bucket, we compare each pair of records in that bucket, and we report any pair that has total edit distance 5 or less. Suppose there are $n$ records. Assuming even distribution of records in each hash table, there are $n/10^6$ records in each bucket. The number of pairs of records in each bucket is approximately $n^2/(2 \times 10^{12})$. Since there are $3 \times 10^6$ buckets, the total number of comparisons is about $1.5n^2/10^6$. And since there are about $n^2/2$ pairs of records, we have managed to look at only fraction $3 \times 10^{-6}$ of the pairs of records, a big improvement.
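The procedure in Example 22.9 can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's implementation: it assumes records are (name, address, phone) tuples of strings, uses CRC32 modulo a million as a stand-in for the "good hash function" h, and takes edit distance to be Levenshtein distance (insertions, deletions, substitutions). The names bucket, edit_distance, and similar_pairs are hypothetical.

```python
import zlib
from collections import defaultdict
from itertools import combinations

NUM_BUCKETS = 1_000_000  # one million buckets per hash table


def edit_distance(a, b):
    """Levenshtein distance between strings a and b (assumed metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def bucket(field):
    """Stand-in for h: map a string to one of NUM_BUCKETS buckets."""
    return zlib.crc32(field.encode("utf-8")) % NUM_BUCKETS


def similar_pairs(records, threshold=5):
    """Return index pairs (i, j), i < j, that share a bucket in at least
    one of the three hash tables and have total edit distance <= threshold."""
    found = set()
    for f in range(3):  # one hash table per field
        table = defaultdict(list)
        for idx, rec in enumerate(records):
            table[bucket(rec[f])].append(idx)
        for members in table.values():
            # Only pairs inside the same bucket are ever compared.
            for i, j in combinations(members, 2):
                if (i, j) in found:
                    continue
                total = sum(edit_distance(x, y)
                            for x, y in zip(records[i], records[j]))
                if total <= threshold:
                    found.add((i, j))
    return found


if __name__ == "__main__":
    recs = [
        ("John Smith", "12 Oak St", "555-1234"),
        ("Jon Smith", "12 Oak St", "555-1234"),   # one typo in the name
        ("Mary Jones", "9 Elm Rd", "555-9999"),   # unrelated record
    ]
    print(similar_pairs(recs))
```

A pair whose three fields all differ slightly is never placed in a common bucket (barring a chance collision), so it is never compared; this is the source of the false negatives asked about in the exercise.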