How many of these one million pairs will hash to the same bucket?

Reference no: EM131212170

This exercise is based on the entity-resolution problem of Example 22.9. For concreteness, suppose that the only pairs of records that could possibly be within total edit distance 5 of each other consist of a true copy of a record and a corrupted version of that record. In the corrupted version, each of the three fields is changed independently. 50% of the time, a field has no change. 20% of the time, there is a change resulting in edit distance 1 for that field. There is a 20% chance of edit distance 2 and a 10% chance of edit distance 10. Suppose there are one million pairs of this kind in the dataset.

a) How many of the million pairs are within total edit distance 5 of each other?

b) If we hash each field to a large number of buckets, as suggested by Example 22.9, how many of these one million pairs will hash to the same bucket for at least one of the three hashings?

c) How many false negatives will there be; that is, how many of the one million pairs are within total edit distance 5, but will not hash to the same bucket for any of the three hashings?
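
One way to approach parts (a) to (c) is to enumerate all 4 x 4 x 4 combinations of per-field edit distances and add up the probabilities of the combinations that satisfy each condition. The Python sketch below does this with exact fractions. It rests on two assumptions that the exercise does not state: two versions of a field land in the same bucket only when the field is unchanged (accidental collisions among a large number of buckets are ignored), and the results are read as expected counts. The function and variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# Per-field edit-distance distribution given in the exercise.
FIELD_DIST = {
    0: Fraction(50, 100),   # field unchanged
    1: Fraction(20, 100),
    2: Fraction(20, 100),
    10: Fraction(10, 100),
}
N_PAIRS = 1_000_000
THRESHOLD = 5  # records are "similar" if the total edit distance is <= 5


def expected_counts(n_pairs=N_PAIRS, threshold=THRESHOLD):
    """Expected number of pairs in each category, as exact fractions."""
    within = same_bucket = false_neg = Fraction(0)
    for combo in product(FIELD_DIST, repeat=3):
        # The three fields are corrupted independently, so multiply.
        prob = Fraction(1)
        for d in combo:
            prob *= FIELD_DIST[d]
        is_within = sum(combo) <= threshold
        # Assumption: a shared bucket requires an identical field.
        shares_bucket = any(d == 0 for d in combo)
        if is_within:
            within += prob
        if shares_bucket:
            same_bucket += prob
        if is_within and not shares_bucket:
            false_neg += prob
    return {
        "within_threshold": within * n_pairs,
        "same_bucket": same_bucket * n_pairs,
        "false_negatives": false_neg * n_pairs,
    }


for name, value in expected_counts().items():
    print(name, value)
# within_threshold 721000
# same_bucket 875000
# false_negatives 56000
```

Under these assumptions the numbers can be checked by hand: a pair is within distance 5 only if no field has distance 10 and the fields are not all at distance 2, giving 0.9^3 - 0.2^3 = 0.721; at least one field is unchanged with probability 1 - 0.5^3 = 0.875; and a false negative needs every field at distance 1 or 2 but not all at 2, giving 0.4^3 - 0.2^3 = 0.056.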

Example 22.9

Suppose for concreteness that records are as in the running example of Section 21.7: name-address-phone triples, where each of the three fields is a character string. Suppose also that we define records to be similar if the sum of the edit distances of their three corresponding pairs of fields is no greater than 5. Let us use a hash function $h$ that hashes the name field of a record to one of a million buckets. How $h$ works is unimportant, except that it must be a good hash function, one that distributes names roughly uniformly among the buckets. But we do not stop here. We also hash the records to another set of a million buckets, this time using the address and a suitable hash function on addresses. If $h$ operates on any strings, we can even use $h$. Then, we hash records a third time to a million buckets, using the phone number. Finally, we examine each bucket in each of the three hash tables, a total of 3,000,000 buckets. For each bucket, we compare each pair of records in that bucket, and we report any pair that has total edit distance 5 or less.

Suppose there are $n$ records. Assuming even distribution of records in each hash table, there are $n/10^6$ records in each bucket. The number of pairs of records in each bucket is approximately $n^2/(2 \times 10^{12})$. Since there are $3 \times 10^6$ buckets, the total number of comparisons is about $1.5n^2/10^6$. And since there are about $n^2/2$ pairs of records, we have managed to look at only a fraction $3 \times 10^{-6}$ of the pairs of records, a big improvement.
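
The procedure in Example 22.9 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the textbook's implementation: Python's built-in `hash` reduced modulo one million stands in for the hash function h, edit distance is taken to be Levenshtein distance (insertions, deletions, and substitutions each cost 1), which the example itself does not specify, and the sample records are made up.

```python
from collections import defaultdict
from itertools import combinations

NUM_BUCKETS = 1_000_000  # one million buckets per hash table
THRESHOLD = 5            # maximum total edit distance for "similar"


def edit_distance(a, b):
    """Levenshtein distance (assumed metric; each edit costs 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def total_distance(r, s):
    """Sum of edit distances over the three corresponding fields."""
    return sum(edit_distance(x, y) for x, y in zip(r, s))


def similar_pairs(records):
    """Report index pairs that share a bucket in some table and are similar."""
    found = set()
    for field in range(3):  # one hash table per field
        buckets = defaultdict(list)
        for idx, rec in enumerate(records):
            # Built-in hash() is a stand-in for the hash function h.
            buckets[hash(rec[field]) % NUM_BUCKETS].append(idx)
        for members in buckets.values():
            for i, j in combinations(members, 2):
                if (i, j) not in found and \
                        total_distance(records[i], records[j]) <= THRESHOLD:
                    found.add((i, j))
    return found


# Hypothetical sample data, for illustration only.
records = [
    ("Alice Smith", "12 Oak St", "555-1234"),
    ("Alice Smyth", "12 Oak St", "555-1234"),  # shares address and phone with record 0
    ("Alise Smyth", "12 Oak Rd", "555-1235"),  # every field differs from record 0
    ("Bob Jones", "9 Elm Ave", "555-9999"),
]
pairs = similar_pairs(records)
```

Records 0 and 1 share an identical address and phone, so they meet in a bucket and are reported. Records 0 and 2 are at total distance 5 but have no identical field, so unless two different strings happen to collide they are never compared, which is the kind of false negative part (c) of the exercise asks about.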




All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd