Hashtables

First, a note: I really like the chapter on Hashtables in your book. Check it out.

Hashtables are one of the really simple and effective ideas in computer science. They show up absolutely everywhere, because they work SO well.

Suppose in our Map, we were going to get 100 pieces of data added, and the keys would be the integers 0-99. How would you store them? You'd just put them in an array, right? The runtime of insert, get, and remove, would all be O(1). That'd be wonderful!

Of course, this is unrealistic for a couple reasons. First, we don't always know how many pieces of data we have coming in, and we don't always have integral keys. And, even if we do have integral keys, odds are they're not all small, positive numbers (large numbers would require too large an array to be worth it).

An O(1) insert, get, and remove is quite the prize, so let's not give up so quickly. Let's look at these problems one at a time.

My Keys Aren't Integers!

Well... make them integers. Consider Strings, for example. We know that a char is actually just an integer. So, maybe, just add the characters in the string? That gives us an int! Job done!
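Here's a minimal sketch of that idea, summing the characters of a String (a toy hash function, just to illustrate; the class and method names are mine):

```java
// A toy hash function for Strings: just add up the character values.
// Real hash functions are smarter than this; it's only here to show
// that we can always turn a key into an int.
public class CharSumHash {
    public static int hash(String key) {
        int sum = 0;
        for (char c : key.toCharArray()) {
            sum += c; // a char is just an integer under the hood
        }
        return sum;
    }
}
```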

We can imagine coming up with a similar function for any type of key. This function, unsurprisingly, is called a hash function: it takes in a key and returns an int.

My ints Are Really Big (or are negative)!

Well, decide on a size for your array. Then, using absolute value and the modulo operator (%), you can make your ints be as small and positive as you need. Was that hard? Jeez, you're such a whiner.
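A quick sketch of that step (names are mine). One wrinkle worth knowing: in Java, `Math.abs(hash) % capacity` can actually go wrong, because `Math.abs(Integer.MIN_VALUE)` is still negative; `Math.floorMod` sidesteps the problem.

```java
public class IndexFor {
    // Map an arbitrary int hash into a valid array index in [0, capacity).
    // Math.floorMod handles negative hashes safely; the naive
    // Math.abs(hash) % capacity breaks when hash == Integer.MIN_VALUE.
    public static int indexFor(int hash, int capacity) {
        return Math.floorMod(hash, capacity);
    }
}
```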

But won't this result in multiple keys ending up with the same index? Yes! This is called a collision, and avoiding and handling collisions is the only hard-ish part (not really that hard) of making a hashtable.

Making a Good Hash Function

So is any hash function as good as any other hash function? Well, no. First of all, a hash function has to be deterministic, which is to say, every time we call our hash function on a given key, it has to result in the same number. Second, we want our hash function to minimize collisions. If our hash function for Strings gave us nothing but integers between 10 and 50, no matter what the String, that would be a bad hash function, as there would be a huge number of collisions! Writing good hash functions is an active area of research for computer scientists and number theorists.

Java's Object class defines a hashCode() method, but it's not very good for our purposes, because at the Object class level, we don't know anything about the object's contents at all. This means the default implementation is typically based on the object's identity (roughly, the address it's stored at). This is problematic. Suppose we have stored a String "hello" in our hashtable, and now want to find it by calling get("hello") with a different String object that also contains "hello". Because these two objects live at different addresses, they would get different hashes, and the get wouldn't find anything.

Fortunately, for Strings, Java has overridden this method to avoid that problem. Check out the Javadoc! If you're implementing a hashtable for Strings, you should feel free to use this.
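A tiny demonstration of why the override matters: two distinct String objects with the same contents still agree on hashCode().

```java
public class StringHashDemo {
    public static void main(String[] args) {
        String a = "hello";
        String b = new String("hello"); // a distinct object, same contents
        // String overrides hashCode(), so equal contents give equal hashes,
        // even though a and b are different objects in memory.
        System.out.println(a.hashCode() == b.hashCode());
    }
}
```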

Handling Collisions

Despite our best attempts, our hash function has resulted in a collision: two keys, after being absolute value'd and modulo'd, have resulted in the same index. This happens surprisingly often, as the Birthday Problem predicts. We have two ways to handle this: Open Addressing, and Separate Chaining.

Open Addressing

Array full at the index you'd like to insert something? Move to the next one. Your find, then, goes to where the key would have been inserted, then moves along linearly, until it finds either the key or an empty space (an empty space means the key was never inserted). There are other, more sophisticated approaches as well.
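Here's a minimal sketch of this scheme (linear probing), assuming String keys and skipping resizing and removal entirely; the class and method names are mine.

```java
// A minimal sketch of open addressing with linear probing.
// Assumes String keys, no resizing, no removal, and that the array
// never completely fills (otherwise insert would loop forever).
public class LinearProbingSet {
    private final String[] slots;

    public LinearProbingSet(int capacity) {
        slots = new String[capacity];
    }

    private int indexFor(String key) {
        return Math.floorMod(key.hashCode(), slots.length);
    }

    public void insert(String key) {
        int i = indexFor(key);
        // Walk forward until we hit an empty slot (or the key itself).
        while (slots[i] != null && !slots[i].equals(key)) {
            i = (i + 1) % slots.length; // wrap around the end of the array
        }
        slots[i] = key;
    }

    public boolean contains(String key) {
        int i = indexFor(key);
        while (slots[i] != null) {
            if (slots[i].equals(key)) return true;
            i = (i + 1) % slots.length;
        }
        return false; // hit an empty slot: the key was never inserted
    }
}
```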

No matter how sophisticated, though, the problem remains: if the array isn't big enough, the array fills up, and now insert() and get() become O(n) operations. With a big enough array, though, this works well.

Separate Chaining

More commonly used, though, is separate chaining. In this, we have an array of Linked Lists (or Trees, or something). Upon hashing to a certain spot, you add that element to the data structure of all keys that have been hashed to that same index. get() just has to find that key in that data structure.
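A minimal sketch of separate chaining with an array of Linked Lists, again assuming String keys and a fixed capacity (names are mine):

```java
import java.util.LinkedList;

// A minimal sketch of separate chaining: an array of linked lists
// ("buckets"). Assumes String keys and a fixed capacity; no resizing.
public class ChainedSet {
    private final LinkedList<String>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainedSet(int capacity) {
        buckets = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    // All keys that hash to the same index share one bucket.
    private LinkedList<String> bucketFor(String key) {
        return buckets[Math.floorMod(key.hashCode(), buckets.length)];
    }

    public void insert(String key) {
        LinkedList<String> bucket = bucketFor(key);
        if (!bucket.contains(key)) {
            bucket.add(key);
        }
    }

    public boolean contains(String key) {
        return bucketFor(key).contains(key); // search only this one bucket
    }
}
```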

Runtime of Maps Implemented With Hashtables

We'll assume a good hash function, and an array of size k, into which we've hashed n keys. We'll also assume we've used Separate Chaining.

On average, there are \(\frac{n}{k}\) elements hashed into each index. This number is known as the load factor. If k was chosen with no knowledge of n, then this means there are O(n) elements stored in each bucket. However, if k was chosen to be some fraction of n, then there are O(1) elements stored in each bucket.

Fortunately, we can resize the array as things are added. In practice, this usually means that whenever the load factor grows beyond 0.75, we double the size of the array and rehash everything. So, we can assume there are O(1) elements stored in each index.
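The rehashing step can be sketched like this, for a separate-chaining table (the function name is mine). The key point: an index depends on the array length, so after doubling, every key may land somewhere new.

```java
import java.util.LinkedList;

// Sketch of the resize step for a separate-chaining table:
// double the array and rehash every key into the new, bigger array.
public class ResizeDemo {
    static LinkedList<String>[] resize(LinkedList<String>[] old) {
        @SuppressWarnings("unchecked")
        LinkedList<String>[] bigger = new LinkedList[old.length * 2];
        for (int i = 0; i < bigger.length; i++) {
            bigger[i] = new LinkedList<>();
        }
        for (LinkedList<String> bucket : old) {
            for (String key : bucket) {
                // Indices depend on the array length, so every key
                // must be rehashed against the new length.
                bigger[Math.floorMod(key.hashCode(), bigger.length)].add(key);
            }
        }
        return bigger;
    }
}
```

Rehashing is O(n), but it happens rarely enough (the table doubles each time) that insert is still O(1) on average.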

To do an insert, we perform the hash (O(1)), and insert the element into that spot in the array (O(1)). So, the runtime of insert is, indeed, O(1).

To do a get, we perform the hash (O(1)), and then get the element from the data structure stored at that spot (O(1)).

To do a remove, we perform the hash (O(1)), and remove it from the data structure at that index (O(1)).

Holy crap! Everything is O(1)! Why the hell did we just spend this long on Trees? Well, in the worst case, everything hashes to the same index, and get and remove are actually O(n). But also, there are other ADTs...