Bloom Filter
A Bloom filter implements a set and has the following key properties:
- It is space efficient
- It supports insert and contains, both of which run in constant time
- It does not support remove
- It is based on hash functions
- The contains function may give false positives (but never false negatives).
The last point means the following:
- If contains(x) returns true, then x is probably in the set
- If contains(x) returns false, then x is definitely not in the set
Implementation
A Bloom filter is implemented using a bit vector, v, of length m, and k hash functions, h1, h2, …, hk, which return valid vector indexes, 0…m − 1.
To insert an element x, set bits h1(x), h2(x), …, hk(x) to 1.
insert(x):
    for hash in [h1(x), h2(x), …, hk(x)]
        vector[hash] = 1
To see if an element x is stored in the set, check that the bits h1(x), h2(x), …, hk(x) are set to 1.
contains(x):
    for hash in [h1(x), h2(x), …, hk(x)]
        if vector[hash] == 0
            return NOT_IN_SET
    return MAYBE_IN_SET
If we are unlucky, the hash values of the already inserted elements cover all hash values of some not-yet-inserted element e. contains(e) will then falsely conclude that e appears to be in the set, i.e. give a false positive. Hence the return code MAYBE_IN_SET.
The probability of false positives can be made arbitrarily small by increasing the size of the bit vector.
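As a concrete sketch, the pseudocode above can be turned into a minimal Python class. The MD5-plus-counter scheme used here to derive the k indexes is just one possible choice (it is discussed further down in this article), and the class and method names are ours:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit vector of length m and k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.vector = [0] * m  # the bit vector v

    def _hashes(self, x):
        # Derive k indexes in 0..m-1 by appending a counter to the input
        # before hashing (one element maps to k bit positions).
        for i in range(1, self.k + 1):
            digest = hashlib.md5((x + str(i)).encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, x):
        for h in self._hashes(x):
            self.vector[h] = 1

    def contains(self, x):
        # May give false positives, never false negatives.
        return all(self.vector[h] == 1 for h in self._hashes(x))

bf = BloomFilter(m=40, k=3)
bf.insert("Brian")
bf.insert("John")
print(bf.contains("Brian"))  # True: inserted elements are always found
```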
Interactive Example
Here’s an interactive demo of a Bloom Filter that uses 3 hash functions and 40 bits.
To illustrate a false positive, clear the bits, add the names “Brian” and “John”. Then lookup the name “Michael”.
Probability of false positives
Assuming uniform and independent hashing, the probability of a specific bit not being set to 1 after a single use of a hash function is

$$1 - \frac{1}{m}$$
After inserting n elements using k hashes each, the probability of a specific bit still being 0 is thus

$$\left(1 - \frac{1}{m}\right)^{kn}$$
Conversely, the probability of a specific bit being set to 1 is

$$1 - \left(1 - \frac{1}{m}\right)^{kn}$$
Since a false positive requires k specific bits to all be set to 1, the probability of a false positive after inserting n elements is

$$\left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k}$$
Using the limit formula for e…

$$e = \lim_{x \to \infty} \left(1 - \frac{1}{x}\right)^{-x}$$
…we can approximate the probability of false positives as

$$\left(1 - e^{-kn/m}\right)^{k}$$
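The approximation is easy to evaluate numerically. A small Python sketch (the parameter values are only illustrative):

```python
import math

def false_positive_rate(m, n, k):
    """Approximate false positive rate: (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k

# 10,000 elements in a 1 MB (8,388,608-bit) vector with 3 hash functions:
p = false_positive_rate(m=8 * 1024 * 1024, n=10_000, k=3)
print(f"{p:.2e}")
```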
Optimal value for k
By fixing the n / m ratio in the previous equation, we can solve for k and find its optimal value. That is, if, for example, you know you can spare 1 MB for the bit vector, and you know that you will insert ~10,000 elements, you can calculate the number of hash functions that minimizes the false positives in subsequent lookups.
Given n and m we want to minimize

$$\left(1 - e^{-kn/m}\right)^{k}$$

Since the logarithm is monotonically increasing, minimizing this expression is equivalent to minimizing its logarithm:

$$\ln\left(\left(1 - e^{-kn/m}\right)^{k}\right) = k \ln\left(1 - e^{-kn/m}\right)$$

In other words, we want to minimize

$$k \ln\left(1 - e^{-kn/m}\right)$$
By simple rules of powers and logarithms, we know that $k = -\frac{m}{n} \ln\left(e^{-kn/m}\right)$, so we can rewrite the above as follows:

$$-\frac{m}{n} \ln\left(e^{-kn/m}\right) \ln\left(1 - e^{-kn/m}\right)$$
By symmetry, we see that this is minimized when $e^{-kn/m} = 1 - e^{-kn/m}$, i.e. when $e^{-kn/m} = 0.5$. Intuitively this means that we should choose a k such that after inserting n elements, half of the m bits are set to 1. Solving for k gives us
$$k = \frac{m}{n} \ln 2$$
With this value for k, the optimal false positive rate is approximately $0.6185^{m/n}$.
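As a sketch, both formulas are straightforward to compute in Python (rounding k to the nearest whole number, and reusing the 1 MB / ~10,000-element example from above):

```python
import math

def optimal_k(m, n):
    """Optimal number of hash functions: k = (m/n) * ln 2."""
    return max(1, round(m / n * math.log(2)))

def optimal_false_positive_rate(m, n):
    """False positive rate at the optimal k: approximately 0.6185^(m/n)."""
    return 0.6185 ** (m / n)

m, n = 8 * 1024 * 1024, 10_000  # 1 MB bit vector, ~10,000 elements
print(optimal_k(m, n))  # 581
print(optimal_false_positive_rate(m, n))  # vanishingly small
```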
Choosing m and k
How many bits to use is a tradeoff between space and accuracy.
Since the Bloom filter does not remember the actual elements that have been added, there’s no way to increase the size of the bit vector after the fact, and rehash as done in hash sets.
How many hash functions to use is a tradeoff between capacity and accuracy.
Here’s an interactive graph based on the equations from the previous section to help you decide on size of bit vector and number of hash functions:
Creating k different hash functions
You could use k different hash algorithms, such as SHA-1, MD5, Murmur and so on. A simpler way to create k hash functions, however, is to pick one, say MD5, and use

- h1(x) = MD5(x + 1)
- h2(x) = MD5(x + 2)
- ⋮
- hk(x) = MD5(x + k)
Warning: It would be a mistake to use the similar-looking functions:

- h1(x) = MD5(x) + 1
- h2(x) = MD5(x) + 2
- ⋮
- hk(x) = MD5(x) + k
The whole point of using more than one hash function is that it should lower the probability of a full collision between two distinct elements. If u ≠ v, and h1(u) happens to equal h1(v), we are “saved” by the fact that h2(u) most likely does not equal h2(v), and we avoid a false positive. This would not be the case with the bad hash functions above: if MD5(u) = MD5(v), then MD5(u) + i = MD5(v) + i for every i, so all k functions collide at once.
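To make the difference concrete, here is a Python sketch of both schemes using MD5 from the standard library (the function names are ours). In the bad scheme, all k indexes are offsets of a single digest, so one colliding digest means all k positions collide:

```python
import hashlib

def md5_index(x, m):
    """Map a string to a bit index in 0..m-1 via MD5."""
    return int(hashlib.md5(x.encode()).hexdigest(), 16) % m

def good_hashes(x, k, m):
    # h_i(x) = MD5(x + i): each counter value yields a fresh digest.
    return [md5_index(x + str(i), m) for i in range(1, k + 1)]

def bad_hashes(x, k, m):
    # h_i(x) = MD5(x) + i: k consecutive offsets of one digest.
    base = md5_index(x, m)
    return [(base + i) % m for i in range(1, k + 1)]

print(good_hashes("Brian", 3, 40))  # three MD5-derived indexes
print(bad_hashes("Brian", 3, 40))   # three consecutive indexes (mod m)
```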
Deleting Elements
Bloom filters do not support deleting elements, as resetting bits to 0 could affect the presence of other, unrelated, elements. There are a few tricks but, spoiler alert, there’s no silver bullet: they all have various drawbacks.
Secondary Bloom filters
To remove an element, the element is added to a secondary Bloom filter. An element is considered to be in the set if it is in the first Bloom filter and not in the second Bloom filter:
Additions.contains(x) and not Deletions.contains(x)
This approach has two downsides:
- Once an element has been deleted, you can’t add it back again, since you can’t delete it from the second Bloom filter. You could add a third Bloom filter to track deletions in the second filter (then a fourth filter for tracking deletions in the third, …), but at this point you’re probably better off with a different data structure.
- A false positive in the second filter means that you erroneously report the element as deleted. This results in a false negative in the composite filter, i.e. contains(x) would report “Probably not in the set” even though x is in the set.
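A minimal Python sketch of the two-filter approach. It also demonstrates the first downside: once deleted, an element cannot be re-added. The MD5-based hashing and all names are our own choices:

```python
import hashlib

def indexes(x, k, m):
    """k MD5-derived bit indexes for x."""
    return [int(hashlib.md5((x + str(i)).encode()).hexdigest(), 16) % m
            for i in range(1, k + 1)]

class DeletableBloomFilter:
    """Two plain Bloom filters: one for additions, one for deletions."""

    def __init__(self, m=40, k=3):
        self.m, self.k = m, k
        self.added = [0] * m
        self.deleted = [0] * m

    def insert(self, x):
        for h in indexes(x, self.k, self.m):
            self.added[h] = 1

    def remove(self, x):
        for h in indexes(x, self.k, self.m):
            self.deleted[h] = 1

    def contains(self, x):
        idx = indexes(x, self.k, self.m)
        in_added = all(self.added[h] for h in idx)
        in_deleted = all(self.deleted[h] for h in idx)
        return in_added and not in_deleted

dbf = DeletableBloomFilter()
dbf.insert("Brian")
print(dbf.contains("Brian"))  # True
dbf.remove("Brian")
print(dbf.contains("Brian"))  # False
dbf.insert("Brian")           # try to add it back...
print(dbf.contains("Brian"))  # still False: the deletion cannot be undone
```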
Counting Bloom filters
A counting Bloom filter uses counters instead of bits. Insertion is done by incrementing counters instead of setting bits to 1. If a counter has the value 3, it means that 3 elements hashed to that index.
With this type of Bloom filter you can undo insertions by decrementing counters. There are some caveats though:
- You must only delete elements that you know have previously been inserted. If you insert “Brian” and “John”, and then delete “Michael”, it may look like “Brian” and “John” are no longer in the set.
- Counters are usually implemented using just a few bits (typically 3 or 4), so a counter may max out under heavy load. When that happens, operations no longer commute and the semantics degrade: containment (for a given threshold) may give false negatives, and deletions may cause unrelated elements to look deleted.
Full article here: Counting Bloom Filter
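A Python sketch of a counting Bloom filter with saturating 4-bit counters. The counter width and the MD5-based hashing are illustrative choices:

```python
import hashlib

class CountingBloomFilter:
    """Bloom filter variant using small counters instead of single bits."""

    MAX = 15  # 4-bit counters saturate at 2^4 - 1

    def __init__(self, m=40, k=3):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _hashes(self, x):
        for i in range(1, self.k + 1):
            yield int(hashlib.md5((x + str(i)).encode()).hexdigest(), 16) % self.m

    def insert(self, x):
        for h in self._hashes(x):
            if self.counters[h] < self.MAX:  # saturate rather than overflow
                self.counters[h] += 1

    def remove(self, x):
        # Only safe for elements known to have been inserted!
        for h in self._hashes(x):
            if self.counters[h] > 0:
                self.counters[h] -= 1

    def contains(self, x):
        return all(self.counters[h] > 0 for h in self._hashes(x))

cbf = CountingBloomFilter()
cbf.insert("Brian")
cbf.insert("John")
cbf.remove("John")
print(cbf.contains("Brian"))  # True: unaffected by removing "John"
```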
Sliding Bloom Filter
Sliding Bloom filters allow you to evict old elements. A sliding Bloom filter is composed of two (or more) Bloom filters. At any given time one of the Bloom filters is recognized as the master, and the other filters are slaves. The role of master rotates among all filters.
New elements are inserted into the master. Once a certain amount time has passed, or once the master has become “full”, the oldest slave filter is cleared out, and the master role rotates to the next filter. An element is considered to be in the combined filter if it's in any of the sub filters.
Here's an illustration of a sliding Bloom filter with two sub filters:
A sliding Bloom filter is effectively a sliding window view of the last elements inserted.
Full article here: Sliding Bloom Filter
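A Python sketch of a sliding Bloom filter with two sub filters. Here the master rotates after a fixed number of insertions (a simplifying assumption; time-based rotation works the same way):

```python
import hashlib

class SlidingBloomFilter:
    """Two Bloom filters; the master rotates when it becomes full."""

    def __init__(self, m=40, k=3, capacity=100):
        self.m, self.k, self.capacity = m, k, capacity
        self.filters = [[0] * m, [0] * m]
        self.master = 0   # index of the filter currently receiving inserts
        self.count = 0    # elements inserted into the current master

    def _hashes(self, x):
        for i in range(1, self.k + 1):
            yield int(hashlib.md5((x + str(i)).encode()).hexdigest(), 16) % self.m

    def insert(self, x):
        if self.count == self.capacity:
            # Rotate: the oldest filter is cleared and becomes the master.
            self.master = 1 - self.master
            self.filters[self.master] = [0] * self.m
            self.count = 0
        for h in self._hashes(x):
            self.filters[self.master][h] = 1
        self.count += 1

    def contains(self, x):
        # In the combined filter if in any of the sub filters.
        return any(all(f[h] for h in self._hashes(x)) for f in self.filters)

sbf = SlidingBloomFilter(capacity=2)
for name in ["Brian", "John", "Michael", "Sarah", "Kate"]:
    sbf.insert(name)
print(sbf.contains("Kate"))   # True: in the current master
print(sbf.contains("Sarah"))  # True: still in the previous filter
```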
Union and Intersection
Union and intersection of Bloom filters can be implemented by or'ing and and'ing the bit vectors.
Both Bloom filters must have the same bit vector size and must have used the same hash functions.
Or'ing two bit vectors results in a bit vector that is identical to what it would have looked like if all elements had been added to the same Bloom filter from scratch.
BloomFilter(S1) or BloomFilter(S2) = BloomFilter(S1 ∪ S2)
This is not the case when and'ing two bit vectors. Consider this example: If “Brian” is added to Bloom filter A, and “John” is added to Bloom filter B, and they happen to have one bit in common, the intersection of A and B will still have that bit set (even though A and B have no elements in common).
BloomFilter(S1) and BloomFilter(S2) ≠ BloomFilter(S1 ∩ S2)
This means that the intersection of two Bloom filters may have a higher false positive rate than it would have if it had been constructed from scratch (but it will not have any false negatives).
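In Python the two operations are element-wise or/and over the bit vectors (assuming, as stated above, equal size and identical hash functions):

```python
def union(bits_a, bits_b):
    """OR the vectors: exactly the filter you'd get by inserting S1 u S2."""
    return [a | b for a, b in zip(bits_a, bits_b)]

def intersection(bits_a, bits_b):
    """AND the vectors: may keep bits no common element would have set,
    so the false positive rate can exceed that of a from-scratch filter."""
    return [a & b for a, b in zip(bits_a, bits_b)]

a = [1, 0, 1, 1, 0]
b = [1, 1, 0, 1, 0]
print(union(a, b))         # [1, 1, 1, 1, 0]
print(intersection(a, b))  # [1, 0, 0, 1, 0]
```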
Applications
Seeing a couple of applications where Bloom filters really shine helps you recognize when to use them in the future.
Bypassing Expensive Computations
Suppose you have a key-value store so big that it doesn’t fit in memory and has to be stored on disk. To avoid unnecessary disk access, a Bloom filter can be put in front:
All keys in the on-disk table are also stored in the Bloom filter. The Bloom filter is memory efficient and can hold orders of magnitude more keys in memory. False positives do not cause bugs, only occasional unnecessary on-disk lookups.
Burton H. Bloom gives a very similar example, based on a hyphenation algorithm, in the original paper on Bloom filters. The hyphenation algorithm should handle 500,000 words, 450,000 of which follow a simple set of rules. The remaining 50,000 words require special, more expensive, processing. The set of 50,000 words is deemed too large to fit in memory (the paper was published in 1970!). A Bloom filter, which can hold all 50,000 words in memory, is used to determine which code path to take. False positives are acceptable, as they merely cause a longer (but still correct) computation for some simple words.
Weak Password Filters
The Obvious Password Utility System (OPUS) uses a Bloom filter to store simple passwords such as dictionary words. When a user selects a new password, the Bloom filter is checked to see if it should be rejected due to being too easy to crack.
A false positive simply means that a strong password is falsely rejected and that the user needs to come up with another one. This approach is memory efficient and fast, and it also allows for adding more passwords to the filter, such as previously used passwords.
Web Caches and One-Hit-Wonders
Approximately two thirds of all web requests are so called one-hit-wonders; requests which are done once and then never again. These requests are meaningless to cache. To minimize the amount of unnecessary caching, content delivery networks use Bloom filters to implement a cache-on-second-hit rule. New requests are added to the Bloom filter, and if a request comes in that exists in the filter, it means it’s not a one-hit-wonder, and the response is cached.
According to Algorithmic Nuggets in Content Delivery from Akamai, their rate of disk writes is reduced by 44% using this technique.
The approach can be taken further using counting Bloom filters. Only 10% of requests are performed more than four times.
Sliding Bloom filters can be used to evict old requests and keep the rate of false positives low.