Hash Tables
Hash tables (also known as hash maps) are associative arrays, or dictionaries, that allow for fast insertion, lookup and removal regardless of the number of items stored.
Here’s an example of how they can be used:
HashTable phoneBook = new HashTable()
phoneBook.put("Adam", "(202) 555-0309")
phoneBook.put("Beth", "(335) 754-1215")
print("Adam’s number: " + phoneBook.get("Adam"))
Internally they are similar to card indexes: An item can be found quickly by first jumping to its approximate location, and then searching locally from there.
Representation
The simplest way to implement a hash table is to use an array of linked lists.
Each array cell is called a bucket, and each list node stores a key-value pair.
Following the analogy from the previous section, the array cells that can be accessed quickly can be thought of as index cards, and nodes in the list as data cards.
Finding the right bucket
Which bucket a given key belongs to is determined by a hash function as follows…
bucket index = hash(key) % length(bucket array)
…where % denotes the remainder operator.
For efficiency reasons, an additional variable is used to keep track of the current total number of key-value pairs.
There are other ways to implement hash tables, but this is the representation that will be used here to introduce the subject. See section Other Collision Resolution Strategies for alternatives.
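The representation described above can be sketched in Python as follows. The class and method names (`Node`, `HashTable`, `bucket_index`) are illustrative, not part of any standard API:

```python
class Node:
    """A key-value pair; nodes in the same bucket form a linked list."""
    def __init__(self, key, value, next=None):
        self.key, self.value, self.next = key, value, next

class HashTable:
    def __init__(self, num_buckets=8):
        self.buckets = [None] * num_buckets  # one linked list (possibly empty) per bucket
        self.size = 0                        # current total number of key-value pairs

    def bucket_index(self, key):
        # bucket index = hash(key) % length(bucket array)
        return hash(key) % len(self.buckets)
```

Each bucket starts out as `None` (an empty list), and the `size` counter plays the role of the additional variable mentioned above.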
Insertion
Here's how a mapping from key k to value v is inserted into a hash table: if the key is already present, the value of the existing node is updated; otherwise a new node is appended to the bucket's list.

i = hash(k) % length(buckets)
n = buckets[i]
last = NULL
while n ≠ NULL:
    if n.key == k:
        n.value = v
        return
    last = n
    n = n.next
new_node = Node(k, v)
if last == NULL:
    buckets[i] = new_node
else:
    last.next = new_node
size = size + 1
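The insertion steps above can be sketched in Python, operating directly on a bucket array (the `Node` class and `put` function are illustrative names):

```python
class Node:
    def __init__(self, key, value, next=None):
        self.key, self.value, self.next = key, value, next

def put(buckets, k, v):
    """Insert or update the mapping k -> v; return True if a new node was added."""
    i = hash(k) % len(buckets)
    n = buckets[i]
    last = None
    while n is not None:
        if n.key == k:         # key already present:
            n.value = v        # update the existing node's value
            return False
        last = n
        n = n.next
    new_node = Node(k, v)
    if last is None:
        buckets[i] = new_node  # the bucket was empty
    else:
        last.next = new_node   # append after the last node in the chain
    return True
```

The caller would also increment the size counter whenever `put` returns `True`.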
Lookup
Here's how the value for a key k is retrieved from a hash table:

i = hash(k) % length(buckets)
n = buckets[i]
while n ≠ NULL:
    if n.key == k:
        return n.value
    n = n.next
return NOT_FOUND
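In Python this lookup could look as follows (a sketch; `Node` and `get` are illustrative names, and `None` stands in for NOT_FOUND):

```python
class Node:
    def __init__(self, key, value, next=None):
        self.key, self.value, self.next = key, value, next

def get(buckets, k, not_found=None):
    """Return the value mapped to k, or not_found if k is absent."""
    i = hash(k) % len(buckets)   # jump to the right bucket...
    n = buckets[i]
    while n is not None:         # ...then scan its linked list
        if n.key == k:
            return n.value
        n = n.next
    return not_found
```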
Remove
Here's how the key-value pair for a key k is removed from a hash table:

i = hash(k) % length(buckets)
n = buckets[i]
last = NULL
while n ≠ NULL:
    if n.key == k:
        if last == NULL:
            buckets[i] = n.next
        else:
            last.next = n.next
        size = size - 1
        return
    last = n
    n = n.next
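A Python sketch of the removal (again with illustrative names; the node is unlinked by redirecting either the bucket head or the previous node's `next` pointer):

```python
class Node:
    def __init__(self, key, value, next=None):
        self.key, self.value, self.next = key, value, next

def remove(buckets, k):
    """Unlink the node for key k; return True if a node was removed."""
    i = hash(k) % len(buckets)
    n = buckets[i]
    last = None
    while n is not None:
        if n.key == k:
            if last is None:
                buckets[i] = n.next  # n was the first node in the bucket
            else:
                last.next = n.next   # bypass n in the list
            return True
        last = n
        n = n.next
    return False                     # key not present
```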
Rehashing
As the lengths of the linked lists grow, the average lookup time increases.
To keep the linked lists short and lookups fast, the number of buckets must be increased (and nodes redistributed). This is called rehashing.
new_buckets = new Bucket[new_length]
for i in 0 .. length(buckets) - 1:
    while buckets[i] ≠ NULL:
        n = buckets[i]
        buckets[i] = n.next
        h = hash(n.key)
        j = h % new_length
        n.next = new_buckets[j]
        new_buckets[j] = n
buckets = new_buckets
During rehashing the number of buckets is increased by some factor, typically 2. As shown in the article Hash Tables: Complexity, it is important that the growth is exponential.
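The rehashing loop can be sketched in Python like this (illustrative names; each node is popped off its old bucket and pushed onto the head of its new one, so no new nodes are allocated):

```python
class Node:
    def __init__(self, key, value, next=None):
        self.key, self.value, self.next = key, value, next

def rehash(buckets, new_length):
    """Redistribute all nodes into a new bucket array and return it."""
    new_buckets = [None] * new_length
    for i in range(len(buckets)):
        while buckets[i] is not None:
            n = buckets[i]
            buckets[i] = n.next            # pop the node off the old bucket...
            j = hash(n.key) % new_length   # ...recompute its bucket index...
            n.next = new_buckets[j]        # ...and push it onto the new bucket
            new_buckets[j] = n
    return new_buckets
```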
Traversal
for i in 0 .. length(buckets) - 1:
    n = buckets[i]
    while n ≠ NULL:
        process(n)
        n = n.next
Note that the order in which the key-value pairs are traversed is unpredictable due to the nature of hashing and rehashing.
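As a Python sketch, the traversal fits naturally into a generator (illustrative names):

```python
class Node:
    def __init__(self, key, value, next=None):
        self.key, self.value, self.next = key, value, next

def items(buckets):
    """Yield every (key, value) pair, bucket by bucket."""
    for i in range(len(buckets)):
        n = buckets[i]
        while n is not None:
            yield n.key, n.value
            n = n.next
```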
Load Factor and Capacity
The load factor is the average number of key-value pairs per bucket.
It is when the load factor reaches a given limit that rehashing kicks in. Since rehashing increases the number of buckets, it reduces the load factor.
The load factor limit is usually configurable and offers a tradeoff between time and space costs.
The default load factor limit for a Java HashMap is 0.75, and for a C# Hashtable it's 1.0.
The capacity is the maximum number of key-value pairs for the given load factor limit and current bucket count.
Since rehashing increases the number of buckets, it increases the capacity.
The default initial capacity for a Java HashMap is 12, and for a C# Hashtable it's 0, i.e. the bucket array is initialized lazily upon the first insertion.
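As a worked example of these definitions, using the Java HashMap defaults mentioned above (16 is HashMap's default bucket count; the variable names are illustrative):

```python
num_buckets = 16          # Java HashMap's default bucket count
load_factor_limit = 0.75  # Java HashMap's default load factor limit
size = 10                 # current number of key-value pairs

load_factor = size / num_buckets                 # 10 / 16 = 0.625
capacity = int(load_factor_limit * num_buckets)  # 0.75 * 16 = 12 pairs

# Rehashing kicks in once the load factor exceeds the limit:
needs_rehash = load_factor > load_factor_limit
```

This is why the default initial capacity of a Java HashMap comes out as 12: a bucket count of 16 times a load factor limit of 0.75.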
Example: Here’s the structure of a hash table configured with a load factor limit of 4.
Complexity Analysis
As is clear from the way insert, lookup and remove work, the run time is proportional to the length of the linked lists. In the worst case all keys hash to the same bucket, i.e. the whole data structure becomes equivalent to a single linked list, and all operations run in O(n).
Assuming, however, that keys are dispersed evenly among the buckets, it can be shown that the expected run time is O(1) for all operations.
See article Hash Tables: Complexity for a detailed analysis.
Other Collision Resolution Strategies
When two keys hash to the same bucket, i.e. when
hash(k₁) % length(buckets) == hash(k₂) % length(buckets) even though k₁ ≠ k₂,
it’s called a collision. The collision handling strategy described so far (one linked list per bucket) is an example of closed addressing using separate chaining. Instead of linked lists, one can also use binary search trees, or, as Java's HashMap does since Java 8, use a linked list up to a certain length and convert it to a balanced tree once a bucket grows beyond that threshold.
The alternative, open addressing, is to store all key-value pairs directly in the hash table array, i.e. there's at most one element per bucket. (The size of the array must always be at least as large as the number of elements stored.)
See Open vs Closed Addressing for a brief side-by-side comparison of the techniques or Open Addressing for details on open addressing.
A third option, which is more of theoretical interest but mentioned here for completeness, is to use a hash function that maps each key to a slot of its own, thus avoiding collisions altogether. This is called perfect hashing.