2-Choice Hashing

2-choice hashing, like 2-Left Hashing, is more of a variation that can be applied to other hash table implementations than a full implementation in itself.

It's commonly used with separate chaining, but can technically be used with any resolution strategy.

Insertion

In 2-choice hashing you use two hash functions. When inserting a key, you compute the candidate slot according to each hash function, then insert the key into the slot with the fewest collisions. Ties are broken randomly.

[Figure: Insertion using 2-Choice Hashing — to insert x, both candidate buckets h1(x) and h2(x) are examined, and x is placed in the one with fewer collisions.]

By always having a "backup bucket", the insert operation can almost always avoid piling keys onto an already overloaded bucket.
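The insert step can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the table size, the salted hash functions, and the use of lists as chains are all assumptions made for the example.

```python
import random

M = 16  # number of buckets (illustrative)

# Two "independent" hash functions, simulated here by salting (illustrative).
def h1(key):
    return hash(("salt-1", key)) % M

def h2(key):
    return hash(("salt-2", key)) % M

# Separate chaining: each slot holds a list (chain) of keys.
table = [[] for _ in range(M)]

def insert(key):
    i, j = h1(key), h2(key)
    # Place the key in the less loaded of the two candidate buckets.
    if len(table[i]) < len(table[j]):
        table[i].append(key)
    elif len(table[j]) < len(table[i]):
        table[j].append(key)
    else:
        # Tie: pick one of the two buckets at random.
        table[random.choice((i, j))].append(key)
```

Note that every key always ends up in one of its two candidate buckets, which is what makes lookup cheap later.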

Lookup

When looking up a key, you hash the key with both hash functions and look through both buckets.

Removal

Removal works like lookup: hash the key with both hash functions, and remove it from whichever of the two buckets contains it.
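Lookup and removal both amount to probing the two candidate buckets. A self-contained sketch (the table layout and salted hash functions are illustrative assumptions, as before):

```python
M = 16

def h1(key):
    return hash(("salt-1", key)) % M

def h2(key):
    return hash(("salt-2", key)) % M

table = [[] for _ in range(M)]

def lookup(key):
    # A present key must be in one of its two candidate buckets.
    return key in table[h1(key)] or key in table[h2(key)]

def remove(key):
    # Probe both buckets, just like lookup; delete on first match.
    for i in (h1(key), h2(key)):
        if key in table[i]:
            table[i].remove(key)
            return True
    return False
```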

Complexity

Assuming the hash computation and load comparison run in constant time, the runtime of the insert operation is unaffected. There's a factor-two increase in worst-case lookup and removal time, but the Big-O complexity is unaffected.

The longest chain will, with high probability, contain no more than Θ(log log n) keys. (Compare this with the case of a single hash function, where the longest chain is expected to hold Θ(log n / log log n) keys.) This follows directly from a more general statistical concept called The Power of Two Choices, and is often discussed in the context of load balancing.

The intuition for why performance improves with this approach is that one always "takes the edge off" the worst-case scenario. Outliers are avoided, much as truncated means discard extreme scores in sports that rely on panels of judges.

Hash Functions should be independent

In languages like Java and C#, where one hash function is always available, it is tempting to derive the second hash function from the first, for example by using something like:

h2(x) = h1(x) xor 0xCAFEBABE         (bad idea!)

This is unfortunately not as good as two independent hash functions. Intuitively, h2 always provides the same "backup bucket" whenever h1 produces an unfortunate result: one "mistake" always follows in the footsteps of the last time that mistake was made. The randomness is thus removed, which hurts the spread of the keys.
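The failure is mechanical: if two keys collide under h1, then xor-ing both values with the same constant makes them collide under the derived h2 as well, so the backup bucket offers no escape. A tiny illustration (the colliding hash value 5 is made up for the example):

```python
M = 16
C = 0xCAFEBABE

def h2_from_h1(h1_value):
    # Derived "second" hash: h1 xor a constant (the bad idea).
    return (h1_value ^ C) % M

# If two keys collide under h1, they are forced into the same
# backup bucket too -- the second choice adds no new randomness.
h1_x = h1_y = 5  # pretend h1(x) == h1(y) == 5
assert h2_from_h1(h1_x) == h2_from_h1(h1_y)
```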

D-Choice Hashing

The approach can be generalized to an arbitrary number d of hash functions. Unfortunately, additional hash functions only decrease the maximum chain length by a constant factor: with high probability the bound becomes Θ(log log n / log d).
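A d-choice insert is a direct generalization of the two-choice version. Again a sketch under illustrative assumptions (table size, salted hash family, random tie-breaking):

```python
import random

M = 16  # number of buckets (illustrative)
D = 4   # number of hash functions

def h(k, key):
    # A family of D "independent" hash functions via salting (illustrative).
    return hash((k, key)) % M

table = [[] for _ in range(M)]

def insert(key):
    candidates = [h(k, key) for k in range(D)]
    least = min(len(table[i]) for i in candidates)
    # Break ties among equally loaded candidate buckets at random.
    ties = [i for i in candidates if len(table[i]) == least]
    table[random.choice(ties)].append(key)
```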

Related Techniques

2-Left Hashing and Cuckoo Hashing are examples of hash table techniques that rely on two hash functions.

Cuckoo Hashing relies heavily on the assumption that the two hash functions are independent. Implementations of Cuckoo Hashing usually reserve a few slots (called the stash) for dealing with collisions, taking the idea of avoiding the very worst case one step further.
