Hash Tables: Complexity

This article is written with separate chaining (closed addressing) in mind, specifically implementations based on arrays of linked lists. Most of the analysis, however, also applies to other techniques, such as basic open addressing implementations.

Only operations that scale with the number of elements n are considered in the analysis below. In particular, the hash function is assumed to run in constant time.

Length of chains

As is clear from the way lookup, insert and remove work, the run time is proportional to the number of keys in the given chain. So, to analyze the complexity, we need to analyze the length of the chains.
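
To make this concrete, here is a minimal sketch of such a table: the buckets are a Python list of lists of (key, value) pairs, and chain_for picks the chain a key hashes to. The names and representation are illustrative, not taken from any particular library.

    # Minimal sketch: m buckets, each bucket a chain of (key, value) pairs.
    # The cost of an operation is driven by the length of the chain it scans.

    def new_table(m):
        return [[] for _ in range(m)]

    def chain_for(buckets, key):
        """Return the chain (bucket) that key hashes to."""
        return buckets[hash(key) % len(buckets)]

    def lookup(buckets, key):
        for k, v in chain_for(buckets, key):   # time proportional to chain length
            if k == key:
                return v
        return None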

Worst Case

If we're unlucky with the keys we encounter, or if we have a poorly implemented hash function, all keys may hash to the same bucket.

This means that the worst-case complexity of a hash table is the same as that of a linked list: O(n) for insert, lookup and remove.
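
As an illustration, here is a deliberately pathological (and purely hypothetical) bucket-index function that sends every key to the same bucket; with it, a single chain holds all n entries and every operation degenerates to a linear scan.

    def bad_bucket_index(key, m):
        # Worst case on purpose: the key is ignored, so every entry
        # ends up in bucket 0 and that chain grows to length n.
        return 0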

This is, however, a pathological situation, and the theoretical worst case is often uninteresting in practice. When discussing the complexity of hash tables, the focus is usually on expected run time.

Uniform Hashing

The expected length of any given linked list depends on how the hash function spreads out the keys among the buckets. For the purpose of this analysis, we will assume that we have an ideal hash function. This is a common assumption to make. So common, in fact, that it has a name:

Simple Uniform Hashing Assumption

In a hash table with m buckets, each key is hashed to any given bucket…

  • …with the same probability, 1 / m
  • …independently of which bucket any other key is hashed to.

With SUHA the keys are distributed uniformly and the expected length of any given linked list is therefore n / m.
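
One way to see this: under SUHA, each of the n keys lands in any particular bucket with probability 1 / m, so by linearity of expectation the expected number of keys in that bucket is

    n · (1 / m) = n / m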

The Magic Part

As you may recall, the n / m ratio is called the load factor, and rehashing guarantees that it is bounded by the configured load factor limit. (See Hash Table Load Factor and Capacity.) Since the load factor limit is constant, the expected length of all chains can be considered constant.
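
A sketch of the check that typically triggers rehashing; the threshold and names here are illustrative (0.75 is a common default, used by Java's HashMap, for instance).

    LOAD_FACTOR_LIMIT = 0.75   # illustrative; the exact limit is implementation-defined

    def needs_rehash(n, m):
        """True when the load factor n / m exceeds the configured limit."""
        return n / m > LOAD_FACTOR_LIMIT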

A common misconception is that SUHA implies constant-time worst-case complexity. SUHA, however, does not say that all keys will be distributed uniformly, only that the probability distribution is uniform. Even with a uniform probability, it is still possible for all keys to end up in the same bucket, so the worst-case complexity is still linear.

Insert

An insertion will search through one bucket linearly to see if the key already exists. This runs in O(n / m), which we know from the previous section is O(1). If the key is found, its value is updated; if not, a new node is appended to the list. Either way, this part runs in O(1).
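
A sketch of insert under the same assumed list-of-lists representation as above, again illustrative rather than any specific library's implementation:

    def insert(buckets, key, value):
        chain = buckets[hash(key) % len(buckets)]
        for i, (k, _) in enumerate(chain):   # expected O(n / m) scan
            if k == key:
                chain[i] = (key, value)      # key exists: update in O(1)
                return
        chain.append((key, value))           # key absent: append in O(1)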

If we're unlucky, rehashing is required before all of that. Since rehashing performs n constant-time insertions, it runs in Θ(n).
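
A sketch of rehashing under the same assumptions; doubling the capacity is just one common growth strategy.

    def rehash(buckets):
        new_buckets = [[] for _ in range(2 * len(buckets))]   # e.g. double the capacity
        for chain in buckets:                  # reinsert all n entries: Θ(n)
            for key, value in chain:
                new_buckets[hash(key) % len(new_buckets)].append((key, value))
        return new_buckets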

That being said, rehashes are rare. In fact, they are so rare that, on average, insertion still runs in constant time. We say that the amortized time complexity for insert is O(1).

Proof: Suppose we set out to insert n elements and that rehashing occurs at each power of two. Let's also assume that n is a power of two, so we hit the worst-case scenario and have to rehash on the very last insertion.

We would have to rehash after inserting elements 1, 2, 4, …, and n. Since each rehashing reinserts all current elements, we would do, in total, 1 + 2 + 4 + 8 + … + n = 2n − 1 extra insertions due to rehashing. In other words, all the rehashing needed incurs an average overhead of less than 2 extra insertions per element.

We conclude that despite the growing cost of rehashing, the average number of insertions per element stays constant.
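
A quick numeric check of the argument, with n chosen as a power of two to match the assumption above:

    n = 1 << 20                                              # 1,048,576 insertions
    rehash_cost = sum(2**k for k in range(n.bit_length()))   # 1 + 2 + 4 + … + n

    assert rehash_cost == 2 * n - 1
    print(rehash_cost / n)   # just under 2 extra insertions per element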

Lookup

A lookup will search through the chain of one bucket linearly. This is in O(n / m) which, again, is O(1).

Removal

A removal will search through one bucket linearly. If the key is found, it is “unlinked” in constant time, so remove runs in O(1) as well.
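
A removal sketch under the same assumed representation. In a real linked-list chain the unlink step itself is O(1); a Python list shifts the short tail instead, which doesn't change the analysis for constant-length chains.

    def remove(buckets, key):
        chain = buckets[hash(key) % len(buckets)]
        for i, (k, _) in enumerate(chain):   # expected O(n / m) scan
            if k == key:
                del chain[i]                 # unlink the matching entry
                return True
        return False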

If one wants to reclaim unused memory, removal may require allocating a smaller array and rehashing into it. In this case, removal runs in O(n) in the worst case and in O(1) amortized.

Traversal

There's no way to know which buckets are empty and which are not, so all buckets must be traversed. This means traversal is Θ(n + m).
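
A traversal sketch, again under the assumed representation: all m buckets are visited, empty or not, plus all n entries, hence Θ(n + m).

    def traverse(buckets):
        for chain in buckets:            # visits all m buckets
            for key, value in chain:     # n entries in total across all chains
                yield key, value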

In practice this is only relevant if the hash table is initialized with a very large capacity. After the first rehashing the number of buckets can be considered linearly proportional to the number of items, and traversal is Θ(n).

One can avoid traversing the empty buckets by using an additional linked list. For details, see the article Linked Hash Table.
