Hash Tables: Open Addressing

A hash table based on open addressing (sometimes referred to as closed hashing) stores all elements directly in the hast table array, i.e. it has at most one element per bucket. The benefits of this approach are:

For brief a comparison with closed addressing, see Open vs Closed Addressing.

Insertion

When inserting a key that hashes to an already occupied bucket, i.e. a collision occurs, the search for an empty bucket proceeds through a predefined search sequence. The first empty bucket found is used for the new key.

Example: Inserting key k using linear probing. (Other probing techniques are described later on.)

insert(k) hash(k) = third bucket ? Occupied ? Occupied ? Occupied Empty, insert k here

Rehashing ensures that an empty bucket can always be found.

Lookup

When looking up a key, the same search sequence is used. The search terminates when the key is found, or an empty bucket is found in which case the key does not exist in the table.

Example: Here's how a successful lookup could look:

lookup(k) hash(k) = third bucket x Wrong key y Wrong key z Wrong key k Key found!

Example: Here's how an usuccessful lookup could look:

lookup(k) hash(k) = third bucket x Wrong key y Wrong key z Wrong key Empty: Key not found

Removal

Since the lookup algorithm terminates if an empty bucket is found, care must be taken when removing elements. If a bucket is simply cleared out, it can create a gap in the search sequence, and cause the lookup algorithm to terminate too early.

For this reason, buckets are typically not cleared, but instead marked as "deleted". Such buckets, called tombstones, do not cause lookups to terminate early, and can be reused by the insert algorithm

Contamination

As data is inserted and deleted over and over, empty buckets are gradually replaced by tombstones. As the sequences of non-empty buckets get longer, the performance of lookups degrade. This phenomenon is called contamination, and the only way to recover from it is to rehash.

Various Probing Techniques

The order in which insert and lookup scans the array varies between implementations. A few common techniques are described below. (All indexes are modulo the array length.)

Linear Probing

If a collision occurs in bucket i, the search sequence continues with

This approach achieves good cache performance since the probing sequence is linear in memory.

A problem however, is that it tends to create long sequences of occupied buckets. The reason is that an existing chain will act as a "net" and catch many of the new keys, which will be appended to the chain and exacerbate the problem.

Example: Consider the probabilities for which bucket the next key will end up in, in the following situation:

10% 10% 0% (occupied) ? 0% (occupied) ? 0% (occupied) ? 40% 10% 10% 10% 10%

In other words, long chains get longer and longer, which is bad for performance since the average number of buckets scanned during insert and lookup increases. The phenomenon is called primary clustering or just clustering.

Quadratic Probing

With quadratic probing a search sequence starting in bucket i proceeds as follows:

This creates larger and larger gaps in the search sequence and avoids primary clustering.

If one key hashes to the same bucket as another key, the search sequence for the second key will go in the footsteps of the first one. If this happens repeatedly (for example due to a poorly implemented hash function) long chains will still form, and cause performance to degrade. The phenomenon is called secondary clustering.

Double Hashing

With double hashing, another hash function, h2 is used to determine the size of the steps in the search sequence. If h2(key) = j the search sequence starting in bucket i proceeds as follows:

(If j happens to evaluate to a multiple of the array length, 1 is used instead.)

This approach is worse than the previous two regarding memory locality and cache performance, but avoids both primary and secondary clustering.

Comparison of Probing Techniques

Linear Probing
Quadratic Probing
Double Hashing

Complexity

The naive open addressing implementation described so far have the usual properties of a hash table. Insert, lookup and remove all have O(n) as worst-case complexity and O(1) as expected time complexity (under the simple uniform hashing assumption).

See separate article, Hash Tables: Complexity, for details.

Variations of Open Addressing

There are many, more sophisticated, techniques based on open addressing. The main objective is often to mitigate clustering, and a common theme is to move around existing keys when inserting a new key. With clever key displacement algorithms, keys can end up closer to the buckets they originally hashed to, and thus improve memory locality and overall performance.

Examples of open addressing techniques (strongly recommended reading):

Comments

Be the first to comment!