Fix hash probes #5

Merged
hiway merged 4 commits into hiway:master from cxsmith:fix_hash_probes
Jul 4, 2021

Conversation

@cxsmith
Contributor

@cxsmith cxsmith commented Apr 14, 2019

I changed the probe logic to make collisions less likely. Using a linear combination is susceptible to false positives: if hash2 has a high GCD with bloom_filter.num_bits_m, it will not produce very many different hash values (it will produce precisely min(num_probes_k, num_bits_m/GCD(num_bits_m, hash2)) of them), meaning that there are fewer opportunities for it to produce a hash that's not in the filter.
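To make the count above concrete, here is a minimal sketch. It assumes the old scheme computes each probe as `(hash1 + i * hash2) % num_bits_m`; the helper names `linear_probes` and `distinct_probe_count` are hypothetical and are not taken from the library.

```python
from math import gcd


def linear_probes(hash1, hash2, num_bits_m, num_probes_k):
    """Bit positions visited by an assumed linear-combination scheme."""
    return [(hash1 + i * hash2) % num_bits_m for i in range(num_probes_k)]


def distinct_probe_count(hash2, num_bits_m, num_probes_k):
    """Closed form quoted in the description above."""
    return min(num_probes_k, num_bits_m // gcd(num_bits_m, hash2))


# Illustrative sizes only; they are not the library's defaults.
num_bits_m, num_probes_k = 1024, 7
for hash2 in (3, 256, 512, 1024):
    probes = linear_probes(5, hash2, num_bits_m, num_probes_k)
    # The enumerated probes agree with the closed form for every hash2.
    assert len(set(probes)) == distinct_probe_count(hash2, num_bits_m, num_probes_k)
    print(hash2, sorted(set(probes)))
```

With these illustrative numbers, a hash2 that shares a large factor with num_bits_m collapses the probe to a handful of positions, and a hash2 that is a multiple of num_bits_m collapses it to a single position.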

Ideally this should use Python's random() functionality, so long as the random generator uses all the entropy of the seeds, but it looks like it uses only an int's worth: https://docs.python.org/3/library/random.html#random.seed
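For illustration only, a sketch of what the random-based idea might look like. This is not what the PR implements; `rng_probes` is a hypothetical helper, and deriving the seed from a SHA-256 digest of the key is an assumption made here. How much seed entropy the generator actually retains is the open question raised above, and this sketch does not settle it.

```python
import hashlib
import random


def rng_probes(key, num_bits_m, num_probes_k):
    """Hypothetical: derive probe positions from a generator seeded by the key."""
    # Assumption: hash the key to an integer and use that integer as the seed.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    # Positions are drawn independently, so repeats are possible but not
    # forced by any arithmetic relationship with num_bits_m.
    return [rng.randrange(num_bits_m) for _ in range(num_probes_k)]


print(rng_probes("example-key", 1024, 7))
```

The same key always yields the same probe sequence within a given Python installation, which is the property a Bloom filter needs for lookups to match inserts.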

cxsmith added 4 commits April 14, 2019 14:01
- The test filter's max_elements parameter is set to twice the
  number of elements that are being tested as false positives. This
  doesn't test the functionality of the filter, since max_elements
  should be related to the number of elements that we expect to put into
  the filter.
- The tests currently aren't aggressive enough with respect to the
  number of false positives being checked. Added a test that checks a
  million false positives, which fails given expectations.
- The previous linear-combination hashing algorithm was flawed because
  the Hamming weight of the entire probe is guaranteed to be equal to
  min(num_bits_m / gcd(num_bits_m, hash2), num_probes_k), since we're
  treating hash2 as a generator of a cyclic subgroup of the integers
  mod num_bits_m. The order of that subgroup is
  num_bits_m / gcd(num_bits_m, hash2), so the Hamming weight of the
  probe may be significantly smaller than num_probes_k, especially if
  hash2 is a multiple of num_bits_m (which will happen 1 out of
  num_bits_m times). A probe with a low Hamming weight is much more
  likely to be a positive, and therefore also much more likely to be a
  false positive.
@Hello1024

@hiway Any chance of merging this... This bug bites lots of users in ways that are painful...

remram44 added a commit to remram44/python-bloom-filter that referenced this pull request May 5, 2021
Fix hash probes

See pull request hiway#5
remram44 added a commit to remram44/python-bloom-filter that referenced this pull request May 5, 2021
Fix hash probes

See pull request hiway#5
@hiway hiway merged commit 7652efa into hiway:master Jul 4, 2021
3 participants