"

Bloom Filter

A Bloom filter implements a set and has the following key properties:


The last point means the following:

Implementation

A Bloom Filter is implemented using a bit vector, v, of length m, and a k hash functions, h1, h2, …, hk which returns valid vector indexes, 0…m − 1.

To insert an element x, set bits h1(x), h2(x), …, hk(x) to 1.

insert(x):
    for hash in [h1(x), h2(x), …, hk(x)]
        vector[hash] = 1

To see if an element x is stored in the set, check that the bits h1(x), h2(x), …, hk(x) are set to 1.

contains(x):
    for hash in [h1(x), h2(x), …, hk(x)]
        if vector[hash] == 0
            return NOT_IN_SET
    return MAYBE_IN_SET

If we are unlucky, the hash values of the already inserted elements covers all hash values of some not-yet-inserted element e. Contains(e) will then falsly conclude that e looks to be in the set, i.e. give a false positive. Hence the return code MAYBE_IN_SET.

The probability of false positives can be made arbitrarily small by increasing the size of the bit vector.

Interactive Example

Here’s an interactive demo of a Bloom Filter that uses 3 hash functions and 40 bits.

Insert:
Contains:



To illustrate a false positive, clear the bits, add the names “Brian” and “John”. Then lookup the name “Michael”.

Probability of false positives

Assuming uniform and independent hashing, the probability of a specific bit not being set to 1 after a single use of a hash function is

1
1

m

After inserting n elements using k hashes each, the probability of a specific bit still being 0 is thus


1
1

m


kn

Conversely, the probability of a specific bit being set to 1, is

1

1
1

m


kn

Since a false positive requires k random bits to be set to 1, the probability of a false positive after inserting n elements is


1

1
1

m


kn


k

Using the formula for calculating e

e

=
 
lim

x → ∞


1
1

x


x

…we can approximate the probability of false positives as

(

1 − e

kn / m

)

k

Optimal value for k

By fixing the n / m ratio in the previous equation, we can solve for k, and find the optimal value for it. That is, if you for example know you can spare 1 MB for the bit vector, and you know that you will insert ~10,000 elements, you can calculate the number of hash functions that minimizes the false positives in subsequent lookups.

Given n and m we want to minimize

= 
(

1 − e

kn / m

)

k

= 

e

ln 
(

1 − e

kn / m

)

k

= 

e

k

ln 
(

1 − e

kn / m

)

In other words, we want to minimize

k

ln 
(

1 − e

kn / m

)

By simple rules of powers and logarithms, we know that k = −m / n ln(ekn / m), so we can rewrite the above as follows:

− 

m

n

 ln 
(

e

kn / m

)
 ln 
(

1 − e

kn / m

)

By symmetry, we see that this is minimized when ekn / m = 1 − ekn / m, i.e. when ekn / m = 0.5. Intuitively this means that we should chose a k such that after insterting n elements, half of the m bits are set to true. Solving for k gives us

k

=

m ln 2

n

With this value for k the optimal false positive rate is approximately 0.6185m / n.

Chosing m and k

How many bits to use is a tradeoff between space and accuracy.

Smaller bit vector
Bigger bit vector
 
 
Less memory required
Lower risk of false positives.

Since the Bloom filter does not remember the actual elements that have been added, there’s no way to increase the size of the bit vector after the fact, and rehash as done in hash sets.

How many hash functions to use is a tradeoff between capacity and accuracy.

Fewer hash functions
More hash functions
 
 
Fills up the bit vector with ones at a slower pace.
Lower risk of false positives.

Here’s an interactive graph based on the equations from the previous section to help you decide on size of bit vector and number of hash functions:

Number of bits (m)

m = 

In bytes:

Number of hash functions (k)

k = 

This in the optimal choice for n = 

Not the optimal choice for any number of elements.

Acceptable false positive rate (p)

p =  %

Allows for up to elements.
Allows no elements to be added.
1 0 p n 1 2 3 4 5 6 7 8 9

Creating k different hash functions

You could use k different hash algorithms, such as SHA1, MD5, Murmur and so on. A simpler way to create k hash functions however, is to pick one, say MD5, and use

Warning: It would be a mistake to use the similar looking functions:

  • h1(x) = MD5(x) + 1

  • h2(x) = MD5(x) + 2

  • hk(x) = MD5(x) + k

The whole point of using more than one hash function, is that it should lower the probability of having a full collision of two distinct elements. If u ≠ v, and h1(u) happens to equal h1(v) we are “saved” by the fact that h2(u) most likely does not equal h2(v), and we would avoid a false positive. This would not be the case if using the bad hash functions described above.

Deleting Elements

Bloom filters does not support deleting elements, as resetting bits to 0 could affect the presense of other, unrelated, elements. There are a few tricks but, spoiler alert, there’s no silver bullet. They all have various drawbacks.

Secondary Bloom filters

To remove an element, the element is added to a secondary Bloom filter. An element is considered to be in set if it is in the first Bloom filter and not in the second Bloom filter.

Composite.contains(x) =

Additions.contains(x) and not Deletions.contains(x)

This approach has two downsides:

Counting Bloom filters

A counting Bloom filter uses counters instead of bits. Insertion is done by incrementing counters instead of setting bits to 1. If a counter has the value 3, it means that 3 elements hashed to that index.

With this type of Bloom filter you can undo insertions by decrementing counters. There are some caveats though:

Full article here: Counting Bloom Filter

Union and Intersection

Union and intersection of Bloom filters can be implemented by or'ing and and'ing the bit vectors.

Both Bloom filters must have the same bit vector size and must have used the same hash functions.

Or'ing two bit vectors results in a bit vector that is identical to what it would have looked like if all elements had been added to the same Bloom filter from scratch.

BloomFilter(S1) or BloomFilter(S2) = BloomFilter(S1 ∪ S2)

This is not the case when and'ing two bit vectors. Consider this example: If “Brian” is added to Bloom filter A, and “John” is added to Bloom filter B, and they happen to have one bit in common, the intersection of A and B will still have that bit set (even though A and B have no elements in common).

BloomFilter(S1) and BloomFilter(S2) BloomFilter(S1 ∩ S2)

This means that the intersection of two Bloom filters may have a higher false positive rate than it would have if it had been constructed from scratch (but it will not have any false negatives).

Applications

Seeing a couple of applications where Bloom filters really shine helps you recognize when to use them in the future.

Bypassing Expensive Computations

Suppose you have a big key-value store. So big that doesn’t fit in memory and has to be stored on disk. To avoid unnecessary disk access, a Bloom filter can be put in front:

key Bloom filter key is probably in set key is definitely not in set Lengthy lookup key not found key found null value

All keys in the on-disk table are also stored in the Bloom filter. The Bloom filter is memory efficient and can store orders of magnitude more keys in memory. False positives does not cause bugs, only occational unnecessary on-disk lookups.

Burton H. Bloom gives a very similar example based on a hyphenation algorithm in the original paper on Bloom filters. The hypenation algorithm should handle 500.000 words, 450.000 of which follow a simple set of rules. The remaining 50.000 words requires a special, more expensive, processing. The set of 50.000 words is deemed too large to fit in memory (paper published in 1970!) A Bloom filter, which can hold all 50.000 words in memory, is used to determine which code path to take. False positives are acceptable as that merely causes a longer (but still correct) computation for simple words.

Weak Password Filters

The Obvious Password Utility System (OPUS) uses a Bloom filter to store simple passwords such as dictionary words. When a user selects a new password, the Bloom filter is checked to see if it should be rejected due to being too easy to crack.

New password Bloom filter Not in set Probably in set Acceptable strength Reject, ask user for better password

A false positive simply means that a strong password is falsely rejected and that the user needs come up with another one. This approach is memory efficient and fast and also allows for adding more passwords to the filter such as previously used passwords.

Web Caches and One-Hit-Wonders

Approximately two thirds of all web requests are so called one-hit-wonders; requests which are done once and then never again. These requests are meaningless to cache. To minimize the amount of unnecessary caching, content delivery networks use Bloom filters to implement a cache-on-second-hit rule. New requests are added to the Bloom filter, and if a request comes in that exists in the filter, it means it’s not a one-hit-wonder, and the response is cached.

Request Bloom filter Request not in set (Not seen before) Request probably in set (Probably seen before) Add to Bloom filter Do not cache response Cache response

According to Algorithmic Nuggets in Content Delivery from Akamai their rate of disk writes is reduced by 44% using this technique.

The approach can be taken further using counting Bloom filters. Only 10% of requests are performed more than four times.

Sliding Bloom filters can be used to evict old requests and keep the rate of false positives low.

Comments

Be the first to comment!