Unit 6: Hash tables

This is the archived website of IC 312 from the Fall 2014 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

**Credit**:
Gavin Taylor for the
original version of these notes.

Hashtables are one of the really simple and effective ideas in computer science. They show up absolutely everywhere, because they work SO well.

Suppose in our Map, we were going to get 100 pieces of data added, and the keys would be the integers 0-99. How would you store them? You'd just put them in an array, right? The runtime of insert, get, and remove, would all be O(1). That'd be wonderful!

Of course, this is unrealistic for a couple reasons. First, we don't always know how many pieces of data we have coming in, and we don't always have integral keys. And, even if we do have integral keys, odds are they're not all small, positive numbers (large numbers would require too large an array to be worth it).

An O(1) insert, get, and remove is quite the prize, so let's not give up so quickly. Let's look at these problems one at a time.

**My Keys Aren't Integers!**

Well... make them integers. Consider Strings, for example. We know that a char is actually just an integer. So, maybe, just add the characters in the string? That gives us an int! Job done!

We can imagine coming up with a similar function for any type of key. This function, unsurprisingly, is called a

**hash function**, and it takes in a key and returns an int.**My ints Are Really Big (or are negative)!**

Well, decide on a size for your array. Then, using absolute value and the modulo operator (%), you can make your ints be as small and positive as you need.

To be specific, if \(m\) is the size of your hashtable array, you can take any integer \(x\) and compute \({\rm abs}(x) \bmod m\). That will always give you an integer between 0 and \(m-1\) - exactly the range of valid array indexes!

**What if two keys hash to the same integer?**

Well yes that it a problem, because it means that more than one thing is supposed to be stored in one slot in the array. This situation is called a

*collision*, and it's going to be the trickiest part about making a good hashtable. There are two parts: first you want to choose a good hash function to avoid having very many collisions, and then you set up your hashtable so that it can handle any collisions that do happen.

If you took SI 110, then you've seen hash functions before, in the form of
*cryptographic hash functions* like MD5 or SHA-1. Remember the
Rubik's Cube? You learned that you could go from a string to a number (maybe
it was hexidecimal, but it was a number nonetheless), but it would be
extremely difficult to go from the number to the string that resulted in that
number. The words "hash function" just meant "a function that takes you from
a string to a number;" the cryptographic part is what put this extra
one-way-function requirement on us so that it can't be reversed.

In data structures, a hash function still has to do the job of going from any type of key to an integer, and the hash function will work best if it "scrambles" up the key in some way. But we don't care about our hash functions being one-way: there's absolutely no harm in someone being able to predict which key might hash to some integer.

So is any hash function as good as any other hash function? Well, no.
First of all, a hash function has to be **deterministic**, which is to say,
every time was call our hash function on our key, it has to result in the same
number. Second, we want our hash function to **minimize collisions**.
Suppose a hash function, given a random String, is more likely to return numbers between
10 and 20 than it is for any other location. That would be a bad hash
function, as there would be a huge number of collisions in indices between 10
and 20, while other parts of the hash table would be underused!

Writing good hash functions is an active area of research for computer scientists and number theorists, so we won't become experts at doing this. But there are a couple of general principles that are used to minimize collisions:

- The hash function should produce any integer in its range, with equal probability. (Like we just said.)
- The hash function should depend somehow on the entire key. For example, if you are hashing a string, you had better make it depend on every character in the string!
- The hash function should be somehow "strange", not conforming to any
patterns that might occur in the data itself. For example, if you are storing
prices at a convenience store, many of them will end in 99 cents. If you are storing
the displayed names of USNA students, many of them will start with MIDN.
A great hash function can take in a very non-random set of keys and produce a bunch
of integer hashes that
*look*totally random.

Java's Object class defines a `hashCode()` method, but it's not very
good, because at the Object class level, we don't know anything about the
object at all. This means they base it on the address the object is stored
at. This is problematic. Suppose we have stored a String "hello" in our
hashtable, and now want to find it by calling `get("hello")`. Because
these two objects are potentially stored at different address locations, we might get
different hashes, and this get wouldn't work.

Fortunately, for Strings, Java has fixed this problem by overriding this
method. Check
out the Javadoc! If you're implementing a hashtable for Strings, you
should feel free to use this. Another cool `hashCode()` method to
check out is the one for
Java's
List interface.
Yes, lists can be keys in a map!

Despite our best attempts, our hash function has resulted in a collision: two keys, after being absolute value'd and modulo'd, have resulted in the same hash integer. Even with the very best hash function, this is still going to happen unless we make the array ridiculously huge, as a consequence of the Birthday Problem. Just as there are lots and lots of ways to make self-balanced trees, there are lots and lots of ways to resolve collisions. We're going to talk about two of the most common: Open Addressing, and Separate Chaining.

**Open Addressing**

Array full at the index you'd like to insert something? Move to the next one. Your find, then, goes to where it'd like to be inserted, then moves along linearly, until it finds an empty space. That's called "linear probing", and it's fast and simple, but can result in clusters of dead space in the hash table. Many other, more sophisticated approaches to open addressing have been proposed which eliminate the clustering issue.

No matter how sophisticated, though, the problem remains: if the array isn't big enough, the array fills up, and now insert() and get() become O(n) operations. With a big enough array, with enough empty space, though, this works well.

**Separate Chaining**

In this approach, we have an array of Linked Lists (or Trees, or something). Upon hashing to a certain spot, you add that element to the data structure of all keys that have been hashed to that same index. get() just has to find that key in that data structure.

Linked lists work really well for separate chaining, because your

`set()`operation can just insert the new element at the front of the list. That guarantees that`set()`at least is always \(O(1)\) time.

There are so many different options and variations with hash tables
that it can make them kind of tough to talk about. So let's make a few
assumptions, which will hold now and forevermore in this class
*unless otherwise specified*:

- We have a great hash function. (To be precise, we would say that our hash function has a uniform random distribution over the range of hash table indexes.)
- There are \(n\) keys and the hashtable array size is \(m\).
The array size \(m\) is a
*constant multiple*of \(n\). - We are using separate chaining to resolve collisions, with linked lists used at each array index.

On average, there are \(\frac{n}{m}\) elements hashed into each index.
This number is known as the **load factor**.
Since we are assuming \(m\) is always a constant multiple or fraction of
\(n\), this means that there are \(O(1)\) elements in each bucket, on average.

But what do we do when \(n\) changes because new elements are inserted - do we have to change the size of the array every time? Of course not; we can amortize just like with array-based stacks and queues!

In practice, resizing the array usually happens whenever the load factor becomes larger than some set threshold ratio (Java uses a maximum load factor of .75 in its HashMap). At that point, you double the size of the array and rehash everything (see why everything has to be rehashed, and not just moved to the same index in the new array?). The cost of resizing is \(O(n)\), but we have to do it increasingly rarely as the array gets larger, and it works out to be amortized \(O(1)\) time per insertion.

To do an insert, we perform the hash (O(1)), and insert the element into
that spot in the array (if there's already something there, we just add our
new element to the front of the linked list at that index). So, the
worst-case cost of a `set`

operation is \(O(1)\).

To do a `get()`

,
we perform the hash (O(1)), and then get the element from the
data structure stored at that spot. Since the average bucket size is
\(O(1)\), the *average* cost of `get()`

is \(O(1)\).

Holy crap! Everything is O(1)! Why the hell did we just spend this long on Trees? Well, in the worst case, everything hashes to the same index, and get and remove are actually O(n). But also, there are other ADTs...

Ordered Maps are just an extension of Maps, with a few extra methods.

`first`- returns the element with the smallest key.`last`- returns the element with the largest key.`next(k)`- returns the element with the smallest key larger than k.`prev(k)`- returns the element with the largest key smaller than k.

Let's first look at the naive implementations:

Unsorted Linked List | Sorted Array | |
---|---|---|

get(key) | O(n) | O(log n) |

set(key, value) | O(1) | O(n) |

first() | O(n) | O(1) |

last() | O(n) | O(1) |

next(k) | O(n) | O(log n) |

prev(k) | O(n) | O(log n) |

Caution: if you're thinking that `first`

and `last`

could be \(O(1)\) for a linked list if we only kept track of the smallest
and largest things as we add them, think again! The problem is that you
can also have removals (by setting a key's value to `null`

), and then
if you remove the largest thing you're stuck!

Obviously, we're going to assume that we're using some flavor of our self-balancing trees: AVL, 2-3-4, Red-Black, or something else. These different flavors have slightly different heights, but these heights are all O(log n).

So, insert, get, and remove are easy - they're all O(log n). Let's consider our new methods.

Hopefully, you can write first() - it's just findMin(root). How long does this take? In the worst case, O(log n). last() is the same thing, just findMax.

To calculate next(k) (or prev(k)), we first have to find k, which is O(log n). If the node containing k has a right child, next() is found in findMin(n.getRight()). If it doesn't, it's the Node we most recently took a left at. In either case, that's an additional O(log n), in the worst case, making next and prev also O(log n).

Remember that if we have a well-built hash table, on average, we have O(1) performance in insert, get, and remove. Worst-case, we have O(n) for get and remove, while insert can still be O(1).

How long does first() or last() take? Unfortunately, because we have a good hash function that scatters hashes all over our array, we have NO clues as to where the smallest or largest keys might be. So, there's no better way to do this than just iterate through and look at everything. O(n).

Similarly, we have no guarantee that two keys that are close to each other will be hashed close to each other. So, again, we have to look at everything. O(n).

Balanced Tree | Hashtable | |
---|---|---|

get(key) | O(log n) | O(1) average, O(n) worst-case |

set(key,value) | O(log n) | O(1) |

first() | O(log n) | O(n) |

last() | O(log n) | O(n) |

next(k) | O(log n) | O(n) |

prev(k) | O(log n) | O(n) |