Hopscotch Hashing

A Hopscotch hash table is based on open addressing i.e. it has an array of buckets and stores at most one key-value pair in each bucket.

Upon collisions, Hopscotch hashing aims to keep key-value pairs close to the original bucket (in it's neighborhood). This keeps the chains short and achieves good memory locality.

Neighborhoods

The neighborhood of bucket i are the H consecutive buckets starting in i. H is a configurable constant.

A key-value pair will always be stored within the neighborhood of the bucket it originally hashes to. (During lookup, only the buckets in the relevant neighborhood are considered.)

Neighborhoods Overlap

The neighborhoods of consecutive buckets overlap. Each bucket is part of H neighborhoods.

Optimal value for H: The hash table can't handle more than H collisions in one bucket, so H shouldn't be too small. Also, if H is too large, lookups will be less efficient. In practice, a good choice for H is one that optimizes the usage of the cache lines. Good performance can be achieved if an entire neighborhood can be fetched in one memory roundtrip. The implementation in the original paper uses H = 32.

Insertion

Suppose a key, k, is to be inserted, and that it hashes to bucket i. Bucket i is referred to as the home bucket of k.

Case: Home bucket is empty

If the home bucket is empty, the key is placed there and we're done.

Note: Any bucket in the neighborhood of i is a valid bucket for k.

If bucket i + d is empty, and d < H, then k can, if needed later, be moved down to i + d, effectively moving the empty space upwards.

Case: Home bucket is occupied

If bucket i is not empty, then we search for an empty bucket linearly, starting from i (linear probing). Suppose we find an empty bucket at index j.

If j is within in the neighborhood of i we put k there and return.

If j is outside of i’s neighborhood, we try to move an element between i and j downwards so that the empty space moves upwards, closer to i. We repeat this process until the empty space has been moved into the neighborhood of i, and then use that space for k.

For the sake of this example, let's assume that i = 17 and j = 24 and that we have neighborhoods of size 4. The home bucket of each key will be noted below the key.

The key to move down to j must have j in the neighborhood of its home bucket. Since no “leap” can be longer than H − 1 we first examine the key in bucket 24 − (H − 1) = 21 (key e). We see that 21 is the home bucket of e. This means that j is within its neighborhood so we perform the swap.

The empty space (21) is still not within the neighborhood of i (17-20), so we repeat.

Looking at bucket 18 (key b) we see that its home bucket is 16. Moving it down to the empty space at index 21 would move it out of its neighborhood, so we let it be and look at the next bucket: bucket 19 (key c). The home bucket of c is 18, which means that 22 is within its neighborhood and it's safe to swap.

The empty space is now within the neighborhood of i. We use this space to insert the new key k. In summary, the full insertion looks as follows:

Initial state

e is moved to bucket 24

b can't be moved to bucket 21

c is moved to bucket 21

k is inserted in bucket 19

If this displacement algorithm fails (some part of the array doesn't have any swaps available) then rehashing kicks in.

Complexity

As with other types of hash tables, worst-case complexity is O(n). However, assuming the hash function disperses the keys evenly, the expected time complexity is constant. See Lemma 6 in the original paper.

Lookup

To look up a key, k, the home bucket, b, is located based on the hash of k. If the home bucket stores k we're done, otherwise we examine the other buckets in the neighborhood that has b as their home bucket.

Complexity

Since lookup examines at most H buckets (which is constant), lookup has a worst-case complexity of O(1).

Removal

The relevant bucket is located the same way as a lookup is performed. The key is removed by clearing the bucket. Any meta info about neighboring buckets (discussed further down) must be updated as well.

Hopscotch hashing does not require a special “deleted” marking as other open addressing techniques do, and does not suffer from problems with contamination. (See section on contamination in Hash Tables: Open vs Closed Addressing.)

Complexity

Similar to the analysis of lookup; O(1).

Rehashing

Rehashing works “as usual”. A larger array is allocated, each element is removed from the original array and inserted in the new array.

Keeping track of home buckets

To quickly find suitable candidates for swapping during insert, the implementation provided in the original paper on Hopscotch hashing stores meta information about each neighborhood in the home bucket. Two variants are suggested:

Each bucket has a bitmap of H bits. Bit d in the bitmap of bucket b is set if and only if bucket b + d is occupied by a key that has b as its home bucket.

The benefit of this representation is that the bucket itself stores information about all keys that belong to that bucket. This is useful for lookups and for finding keys eligible for swapping.
In the other solution, each bucket has a linked list of references to other buckets. The linked list in bucket b contains references to all buckets storing keys that have b as their home bucket.

With large neighborhoods this may be a more memory efficient representation. On the other hand it introduces some indirection which could impact performance.

Comments

Be the first to comment!