# Hash Tables

Hash tables (also known as hash maps) are associative arrays, or dictionaries, that allow for fast insertion, lookup and removal regardless of the number of items stored.

Here’s an example of how they can be used:

```
HashTable phoneBook = new HashTable()
phoneBook.put("Adam", "(202) 555-0309")
phoneBook.put("Beth", "(335) 754-1215")
print("Adam’s number: " + phoneBook.get("Adam"))
```

Internally they are similar to card indexes: An item can be found quickly by first jumping to its approximate location, and then searching locally from there.

## Representation

The simplest way to implement a hash table is to use an **array of linked lists**.

Each array cell is called a **bucket**, and each list node stores a **key-value pair**.

Following the analogy from the previous section, the array cells that can be accessed quickly can be thought of as index cards, and nodes in the list as data cards.
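As a sketch, this representation might look as follows in Python (the `Node` name and the bucket count of 8 are illustrative choices, not prescribed by the text):

```python
class Node:
    """A single key-value pair; nodes in the same bucket form a linked list."""
    def __init__(self, key, value, next=None):
        self.key = key
        self.value = value
        self.next = next

buckets = [None] * 8  # 8 empty buckets; each holds the head of a node chain
size = 0              # running count of key-value pairs stored
```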

## Finding the right bucket

Which bucket a given key belongs to is determined by a **hash function** as follows…

*bucket index* = hash(*key*) % length(*bucket array*)

…where % denotes the remainder operator.
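For example, with a simple polynomial string hash (the same scheme as Java's `String.hashCode`; the function name here is made up for illustration), the bucket index might be computed like this:

```python
def string_hash(s):
    # Polynomial rolling hash: h = s[0]*31^(n-1) + ... + s[n-1]
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) % 2**32  # keep the value bounded
    return h

num_buckets = 8
index = string_hash("Adam") % num_buckets  # always in range 0..7
```

Any deterministic function of the key works, as long as it spreads keys evenly across the buckets.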

For efficiency reasons, an additional variable is used to keep track of the current total number of key-value pairs.

There are other ways to implement hash tables, but this is the representation that will be used here to introduce the subject. See section Other Collision Resolution Strategies for alternatives.

## Insertion

Here's how a mapping from key `k` to value `v` is inserted into a hash table:

```
i = hash(k) % length(buckets)

// Search the bucket; if the key is found, update the existing node
last = NULL
n = buckets[i]
while n ≠ NULL:
    if n.key == k:
        n.value = v
        return
    last = n
    n = n.next

// Key not present: append a new node to the bucket
new_node = Node(k, v)
if last == NULL:
    buckets[i] = new_node
else:
    last.next = new_node
size = size + 1
```
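The insertion procedure can be sketched as a runnable Python function; `Node` and `put` are illustrative names, and Python's built-in `hash` stands in for the hash function:

```python
class Node:
    def __init__(self, key, value, next=None):
        self.key, self.value, self.next = key, value, next

def put(buckets, key, value):
    i = hash(key) % len(buckets)
    # Search the bucket; if the key is found, update the existing node
    last, n = None, buckets[i]
    while n is not None:
        if n.key == key:
            n.value = value
            return False          # no new pair added
        last, n = n, n.next
    # Key not present: append a new node to the bucket
    node = Node(key, value)
    if last is None:
        buckets[i] = node
    else:
        last.next = node
    return True                   # caller should increment size
```

Returning a boolean lets the caller maintain the size counter mentioned earlier.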

## Lookup

Here's how the value for a key `k` is retrieved from a hash table:

```
i = hash(k) % length(buckets)
n = buckets[i]
while n ≠ NULL:
    if n.key == k:
        return n.value
    n = n.next
return NOT_FOUND
```
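A minimal Python sketch of the lookup, under the same assumptions as before (illustrative names, built-in `hash` as the hash function):

```python
class Node:
    def __init__(self, key, value, next=None):
        self.key, self.value, self.next = key, value, next

NOT_FOUND = None  # sentinel for "no such key"

def get(buckets, key):
    n = buckets[hash(key) % len(buckets)]
    while n is not None:          # walk the chain in the key's bucket
        if n.key == key:
            return n.value
        n = n.next
    return NOT_FOUND
```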

## Remove

Here's how a key-value pair with key `k` is removed from a hash table:

```
i = hash(k) % length(buckets)
last = NULL
n = buckets[i]
while n ≠ NULL:
    if n.key == k:
        // Unlink the node from the chain
        if last == NULL:
            buckets[i] = n.next
        else:
            last.next = n.next
        size = size - 1
        return
    last = n
    n = n.next
```
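Sketched in Python under the same illustrative assumptions:

```python
class Node:
    def __init__(self, key, value, next=None):
        self.key, self.value, self.next = key, value, next

def remove(buckets, key):
    i = hash(key) % len(buckets)
    last, n = None, buckets[i]
    while n is not None:
        if n.key == key:
            # Unlink n from the chain
            if last is None:
                buckets[i] = n.next
            else:
                last.next = n.next
            return True           # caller should decrement size
        last, n = n, n.next
    return False                  # key was not present
```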

## Rehashing

As the lengths of the linked lists grow, the average lookup time increases.

To keep the linked lists short and lookups fast, the number of buckets must be increased (and nodes redistributed). This is called **rehashing**.

```
new_buckets = new Bucket[new_length]
for i in 0 … length(buckets) - 1:
    // Move each node of the old bucket to its new bucket
    while buckets[i] ≠ NULL:
        n = buckets[i]
        buckets[i] = n.next
        j = hash(n.key) % new_length
        n.next = new_buckets[j]
        new_buckets[j] = n
buckets = new_buckets
```

During rehashing the number of buckets is increased by some factor, typically 2. As shown in the article Hash Tables: Complexity, it is important that the growth is exponential.
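Rehashing can be sketched as follows in Python (illustrative names; built-in `hash` as the hash function). Each node is pushed onto the front of its new chain, which is one reason traversal order changes after a rehash:

```python
class Node:
    def __init__(self, key, value, next=None):
        self.key, self.value, self.next = key, value, next

def rehash(buckets, new_length):
    new_buckets = [None] * new_length
    for i in range(len(buckets)):
        # Move each node of the old bucket to its new bucket
        n = buckets[i]
        while n is not None:
            nxt = n.next
            j = hash(n.key) % new_length
            n.next = new_buckets[j]   # push onto front of the new chain
            new_buckets[j] = n
            n = nxt
    return new_buckets
```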

## Traversal

Here's how all key-value pairs in a hash table are traversed:

```
for i in 0 … length(buckets) - 1:
    n = buckets[i]
    while n ≠ NULL:
        process(n)
        n = n.next
```

Note that the order in which the key-value pairs are traversed is **unpredictable** due to the nature of hashing and rehashing.
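The traversal can be sketched as a Python generator (illustrative names). Note that in CPython, string hashing is even randomized between runs, so the bucket layout itself varies from run to run:

```python
class Node:
    def __init__(self, key, value, next=None):
        self.key, self.value, self.next = key, value, next

def items(buckets):
    """Yield every (key, value) pair, bucket by bucket."""
    for head in buckets:
        n = head
        while n is not None:
            yield n.key, n.value
            n = n.next
```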

## Load Factor and Capacity

The **load factor** is the average number of key-value pairs per bucket.

It is when the load factor reaches a given limit that rehashing kicks in. Since rehashing increases the number of buckets, it reduces the load factor.

The load factor limit is usually configurable and offers a **tradeoff between time and space costs**.

The default load factor for a Java `HashMap` is 0.75, and for a C# `Hashtable` it's 1.0.

The **capacity** is the maximum number of key-value pairs for the given load factor limit and current bucket count.

Since rehashing increases the number of buckets, it increases the capacity.

The default initial capacity for a Java `HashMap` is 12, and for a C# `Hashtable` it's 0, i.e. the bucket array is initialized lazily upon first insertion.
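The Java figures are consistent with each other: assuming `HashMap`'s default of 16 initial buckets (a detail of that implementation, stated here for illustration), a load factor limit of 0.75 gives a capacity of 12:

```python
# capacity = bucket count x load factor limit
load_factor_limit = 0.75   # Java HashMap's default load factor
bucket_count = 16          # Java HashMap's default initial bucket count
capacity = int(bucket_count * load_factor_limit)  # 12 pairs before rehashing
```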

**Example:** Here’s the structure of a hash table, configured with a load factor limit of 4.

## Complexity Analysis

As is clear from the way insert, lookup and remove work, the run time is proportional to the length of the linked lists. In the worst case all keys hash to the same bucket, i.e. the whole data structure becomes equivalent to a linked list. This means that all operations run in *O*(*n*) time.

Assuming, however, that keys are dispersed evenly among the buckets, it can be shown that the expected run time is *O*(1) for all operations.

See article Hash Tables: Complexity for a detailed analysis.

## Other Collision Resolution Strategies

When two keys hash to the same bucket, i.e. when

hash(*k*₁) ≡ hash(*k*₂) (mod *number of buckets*)

it’s called a **collision**. The collision handling strategy described so far (one linked list per bucket) is an example of **closed addressing** using separate chaining. Instead of linked lists, one can also use binary search trees, or, as Java 8's `HashMap` does, use a linked list up to a certain limit and then convert it to a BST when more elements are added.

The alternative, **open addressing**, is to store all key-value pairs directly in the hash table array, i.e. there's at most one element per bucket. (The size of the array must always be at least as large as the number of elements stored.)
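As an illustrative sketch of open addressing with linear probing (names are made up; the sketch assumes the table is never completely full, otherwise the probe loops would not terminate):

```python
def probe_put(table, key, value):
    """Insert with linear probing: on collision, try the next slot."""
    i = hash(key) % len(table)
    while table[i] is not None and table[i][0] != key:
        i = (i + 1) % len(table)  # slot taken by another key: step forward
    table[i] = (key, value)

def probe_get(table, key):
    i = hash(key) % len(table)
    while table[i] is not None:   # an empty slot ends the probe sequence
        if table[i][0] == key:
            return table[i][1]
        i = (i + 1) % len(table)
    return None
```

Deletion is trickier under open addressing (a removed slot must not break other keys' probe sequences, which is commonly solved with tombstone markers) and is omitted here.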

See Open vs Closed Addressing for a brief side-by-side comparison of the techniques or Open Addressing for details on open addressing.

A third option, which is of more theoretical interest but mentioned here for completeness, is to use a hash function that maps each key to a slot of its own, thus avoiding collisions altogether. This is called **perfect hashing**.