Cuckoo Hashing

Cuckoo Hashing is a technique for implementing a hash table. As opposed to most other hash tables, it achieves constant time worst-case complexity for lookups.

Collisions are handled by evicting existing keys and moving them from one array to the other. This resembles the way a cuckoo chick pushes out an egg from the nest to make room for itself, hence the name Cuckoo Hashing.

Representation

It is implemented using two arrays of equal size and two hash functions:

Example: A Cuckoo Hash table with room for 16 keys.

h1 h2

Each hash function is associated with one of the arrays, i.e. it can be thought of as two separate sub hash tables.

It is similar to open addressing in the sense that each array slot can hold at most one key at a time.

Insertion

A new element is always inserted in the first hash table. Should a collision occurr, the existing element is kicked out and inserted in the second hash table. Should that in turn cause a collision, the second existing element will be kicked out and inserted in the first hash table, and so on. This continues until an empty bucket is found.

Example: The key a is inserted in the Cuckoo Hash table below. Keys get kicked around until a free slot is found.

h1 h2 a b c d e f g

If the number of displacements reaches a certain threshold (for instance due to a cycle among the inserted keys) rehashing takes place.

Rehashing is a linear operation, so worst-case complexity is O(n). Just as with other hashing techniques however, the ammortized run time can be shown to be O(1). The proof for this is non-trivial. See the original paper for details.

The performance degrades significantly as the load factor surpasses 50%. Higher load factor than this is not even considered in the original paper.

Lookup

If a key exists, it will be stored in its original bucket, either in the first array or the second one. In other words, at most two lookups are needed to figure out if the key exists or not, thus the worst-case complexity is O(1).

Removal

Removal is done simply by clearing the bucket storing the key. Again, worst-case complexity is O(1).

As opposed to other open addressing schemes there are no chains and no need to use deleted markings or so called tombstones (cf. section on removal in Hash Tables: Open Addressing)

Stashing

There's a small probability that a cycle is formed among the first few elements inserted. This may trigger a rehash even at a low load factor. To mitigate this, a constant-sized array called the stash can be used.

When a key is to be inserted, and a free bucket can't be found, the key is stored in the stash instead. The lookup algorithm is modified to search in the stash in addition to the two arrays. Rehashing is performed when a key can't be inserted and the stash is full.

Even with a stash of just three or four cells, rehashing can be postponed significantly and allow the hash table to function with higher load factors. Intuitively, the stash "takes the edge off" of the worst case scenario. Conceptually this is similar to the cellar in Coalesced Hashing and the improvements achieved by using 2-Choice Hashing.

With a fixed size stash, the runtime overhead is of O(1) for all operations, i.e. the Big-O complexity is unaffected.

D-Cuckoo Hashing

Cuckoo hashing can be generalized to use an arbitrary but fixed number of internal hash tables.

Example: A Cuckoo Hash table with 4 "layers".

h1 h2 h3 h4

Then a key in layer i is evicted, it is inserted in layer i + 1 (mod number of layers).

As with other techniques leveraging multiple hash functions, more hash functions increases the spread and reduces the likelyhood of rehash for a given number of insertions. The improvement going from two layers to three (and beyond) is however smaller than the improvement gained when going from one to two.

Comments

Be the first to comment!