Unit 4: Hashing

This is the archived version of this course from the Spring 2013 semester. You might find something more recent by visiting my teaching page.

In this unit, we are going to look at techniques for storing data based on *hash functions*. We had a few problems earlier discussing the relationship between hash functions and random numbers. Basically, a hash function is supposed to *appear* random, without actually *being* random. One way to think of hash functions is as a way of getting "repeatable randomness".

We will use the probability techniques we have been learning to solve and understand problems with hash tables and other hashing-based structures. Initially we'll assume that the hash functions behave just like random numbers, and analyze these data structures as if they were randomized data structures. Then we'll look at how random numbers are actually used to *build* hash functions that guarantee the properties we need.

- Wikipedia: Hash table

Hash tables are all about using mathematical functions to implement lookup dictionaries faster than linked structures such as AVL trees or Skip Lists, at the cost of potentially using extra memory.

There are two main drawbacks to hash tables, which we won't dwell on, but which you should be aware of. First, iterating through the elements of a hash table *in order* is going to be very slow - it would require you to write out all the elements and then sort them. Since this is usually very easy to do in other dictionary data structures, it might be a reason not to choose hash tables if you need to iterate through the data with some high frequency.

The other drawback is the difficulty in choosing good hash functions and getting all the details right with the size, dealing with collisions, and so on. These difficulties are often ignored when teaching hash tables, to make things simpler. So of course we're going to drive right at them!

We will however make the following assumptions at all times:

- All the keys that are being hashed are integers. There is a lot of interesting theory in how to map other things like arrays or strings to integers so that they can be hashed. We will not talk about that step much here.
- Any collision resolution is done by *separate chaining* with linked lists, unless otherwise stated.

Our typical set-up for the hash table problem will be as follows: Each key is an integer from 0 up to \(m-1\), where \(m\) might be very large. This is usually called the "universe". Out of this size-\(m\) universe, \(n\) distinct keys are to be stored in the hash table. The size (a.k.a. number of "buckets" or "slots") in the hash table is \(s\). The *hash function* \(h(x)\) maps integers from the universe down to the range of hash table slots \(\{0,1,2,3,\ldots,s-1\}\).

Typically, the goal will be linear size (so that \(s\in O(n)\)) and \(O(1)\)-time to look up, insert, or delete something from the hash table.

In a hash table that uses separate chaining, the size of the largest bucket determines the worst-case cost of a single lookup, insert, or delete operation. So it makes sense to investigate this quantity as the most important one in measuring the performance of a hashing scheme.

To begin, assume that we have \(s=2n\) and \(h(x)\) behaves totally randomly. Even though hash functions are *never* randomized (you would never be able to look up the thing you just stored!), we often use the same techniques from randomized analysis to analyze their performance.

(Actually, a little later, we will use random numbers to *choose* a hash function, but still random numbers are never used *by* the hash function to compute its output.)
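To make this concrete, here is a minimal Python sketch of a separate-chaining hash table. The multiply-and-mod hash function is just an illustrative placeholder, not the "random-behaving" function the analysis assumes:

```python
class ChainedHashTable:
    """Minimal separate-chaining hash table sketch.

    The hash function is a fixed placeholder; in this unit we *model*
    h(x) as behaving randomly, but a real table needs a repeatable
    function so look-ups find what was stored."""

    def __init__(self, size):
        self.size = size
        self.buckets = [[] for _ in range(size)]

    def _h(self, x):
        # Placeholder hash: multiply by a fixed constant, then mod.
        return (2654435769 * x) % self.size

    def insert(self, x):
        b = self.buckets[self._h(x)]
        if x not in b:
            b.append(x)

    def search(self, x):
        return x in self.buckets[self._h(x)]

    def delete(self, x):
        b = self.buckets[self._h(x)]
        if x in b:
            b.remove(x)

t = ChainedHashTable(16)
t.insert(42)
t.insert(7)
assert t.search(42) and t.search(7)
t.delete(42)
assert not t.search(42)
```

Every operation costs the hash evaluation plus a scan of one bucket, which is why the largest bucket size is the quantity to study.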

Problem 76 looks at the "nuts and bolts" of the computation to show that, with probability at least \(1/2\), the largest bucket in this hash table will have

\[\Theta\left(\frac{\log n}{\log\log n}\right)\]

elements in it. In cat, the *expected* size of the largest bucket has that same asymptotic behavior as well.

Now the question becomes "so what?". Should we be happy with this max bucket size? Well, it means that the worst-case cost of all of the dictionary operations will be \(\Theta((\log n)/(\log\log n))\), and since \(\log\log n\) is a very slow-growing function, the worst case cost is very close to \(O(\log n)\). That's the same as AVL trees or Skip Lists! So this is really not very good; we haven't gained anything over the much simpler, and much smaller, tree-like dictionary structures.

And remember, this is the very best kind of hash function we could imagine, one that behaves just like random. If we don't choose such a great hash function, the behavior could be even worse. Fortunately, there are some pretty simple ways to make this work better!

- Wikipedia: 2-choice hashing

As we will see, the key to many improvements on hashing is: *use more than one hash function*. This approach is not an instant solution; there must be some clever, "intelligent" way to make use of more hash functions. And it also brings in the difficult challenge of choosing multiple hash functions *that have absolutely no relationship to each other* (mathematically).

With that in mind, let's look at a simple way of using two hash functions to improve the performance of a regular hash table. The basic idea is this: we will have two hash functions \(h_1(x)\) and \(h_2(x)\), each of which works the same (mapping integers from the entire universe into the range \(\{0,1,2,\ldots,s-1\}\) of hash table indices). Then element \(x\) can either be stored at position \(h_1(x)\) or position \(h_2(x)\). Specifically, here is how the two most important hash table functions would work:

- **Insertion**: To insert \(x\) into a 2-choice hash table, we compute the two hash values \(h_1(x)\) and \(h_2(x)\). Then, whichever of those two buckets has a smaller size determines which bucket \(x\) will be inserted into.
- **Search**: To search for \(x\) in a 2-choice hash table, just look through *both* buckets at \(h_1(x)\) and \(h_2(x)\), and return the location if \(x\) is found in either bucket.
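Here is a short Python sketch of these two operations. The two hash functions are illustrative stand-ins; the analysis assumes they behave independently and randomly:

```python
class TwoChoiceHashTable:
    """Sketch of 2-choice hashing with separate chaining.

    The two placeholder hash functions below stand in for the
    independent, random-behaving functions the analysis assumes."""

    def __init__(self, size):
        self.size = size
        self.buckets = [[] for _ in range(size)]

    def _h1(self, x):
        return (2654435769 * x) % self.size

    def _h2(self, x):
        return (40503 * x + 12345) % self.size

    def insert(self, x):
        i, j = self._h1(x), self._h2(x)
        if x in self.buckets[i] or x in self.buckets[j]:
            return  # already present
        # Put x in whichever of its two buckets is currently smaller.
        target = i if len(self.buckets[i]) <= len(self.buckets[j]) else j
        self.buckets[target].append(x)

    def search(self, x):
        return x in self.buckets[self._h1(x)] or x in self.buckets[self._h2(x)]

t = TwoChoiceHashTable(8)
t.insert(10)
t.insert(20)
assert t.search(10) and t.search(20)
```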

Once again, the performance will depend on the size of the *largest bucket* in this hash table. And again, we could get some very "unlucky" situations where all the elements inserted happen to hash to just a small number of buckets, which will then all be relatively large in size, destroying the performance of the hash table.

But now that there are two choices, and since we're assuming the hash functions have no *correlation* between them, the chance of this happening is greatly reduced. Think about it: even if we happen to pick a bunch of insertions whose \(h_1(x)\) values are all the same, their \(h_2(x)\) values should still be different, since the two are unrelated, and then everything will be fine.

What difference does this make? Well, it would seem logical to assume that, since there are 2 hash functions, the expected max bucket size is maybe half as large as what it would be with a single hash function. But actually, it's much, much better than that. The expected size of the largest bucket, in 2-choice hashing, turns out to be only

\[\Theta(\log\log n)\]

This is a *massive* improvement from the \(\Theta((\log n)/(\log\log n))\) of regular hashing. In fact, it's like improving a \(\Theta(n)\)-cost algorithm to \(\Theta(\log n)\). Like I said, massive! This should be surprising, that we get such a tremendous improvement with only 2 hash functions instead of 1. This idea gets used in a number of different places in randomized algorithms, so much so that it even has a name: **the power of two choices**.

Now you might think, if 2 hash functions gives that much improvement, 3 hash functions must be even better! Unfortunately, the answer here is negative. The improvement from 2-choice to 3-choice hashing is marginal, and the improvement from 3-choice to 4-choice is even less, and so on. Plus, every time we add another hash function, that increases the time of doing every operation, since you have to actually compute all those hash values, look through all those buckets, and so on. This is in fact the lesson of the "empirical study" in Problem 68: 2-choice hashing gives the best balance between bucket size and number of hash functions, both in theory and in practice.
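You can see the power of two choices experimentally with a quick simulation, using random numbers directly in place of "random-behaving" hash functions:

```python
import random

def max_bucket_1choice(n, s, rng):
    # Throw n keys into s buckets, one random choice per key.
    counts = [0] * s
    for _ in range(n):
        counts[rng.randrange(s)] += 1
    return max(counts)

def max_bucket_2choice(n, s, rng):
    # Two random choices per key; the key goes to the smaller bucket.
    counts = [0] * s
    for _ in range(n):
        i, j = rng.randrange(s), rng.randrange(s)
        counts[i if counts[i] <= counts[j] else j] += 1
    return max(counts)

rng = random.Random(2013)
n = 50000
m1 = max_bucket_1choice(n, 2 * n, rng)
m2 = max_bucket_2choice(n, 2 * n, rng)
# Typically m2 comes out much smaller than m1, matching the
# Theta(log log n) vs Theta(log n / log log n) analysis.
print("1-choice max bucket:", m1, " 2-choice max bucket:", m2)
```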

- Wikipedia: Set ADT

So far we have looked at hash tables as a way to implement the Dictionary ADT that supports the operations of search, insert, and delete.

But in many applications, we don't actually need all this functionality. For example, a password checker might be interested in whether your new password is the same as any of your previous passwords. This program doesn't need to know what your previous passwords are, it just needs to check whether this new password is among those previously entered.

What we are talking about here is a *set* ADT that supports just two operations:

- **Insert**: Add \(x\) to the set, if it's not already there.
- **Membership**: Test whether or not \(x\) is already in the set.

The important thing to notice here is that there is no way to actually retrieve an element of the set, once it has been inserted. We can only test whether a *given* element is present or not. So unlike in a dictionary, in a set we can get away with (potentially) smaller storage. Let's see how.

The first idea is to use a normal hash table, but instead of storing buckets at each position, we will just store true/false boolean values. So when a new element \(x\) is inserted, we compute \(h(x)\) and set that table position to "true", and to check membership we simply look up the true/false value stored in the hash value's position in the table.

Now consider the costs: since there are no buckets, every entry in the hash table just has \(O(1)\) size. Since it's also \(O(1)\) to compute the actual hash function, every operation on this kind of a hashed set is \(O(1)\) in the worst case. Hooray!

Except that there's a caveat: *we can get wrong answers*. As you know, the problem with all hash tables is *collisions*, when two different elements hash to the same value. So what happens if \(h(x) = h(y)\), and we insert \(x\), and then test \(y\) for membership? Well, since \(x\) was inserted, the hash table value at index \(h(x)\) will be "true", and since \(h(y)\) has the same hash value, the membership test for \(y\) will incorrectly return "true" as well.

This is called a *false positive* and it's going to cause problems with this particular idea for storing a set. In particular, the birthday problem (sometimes called the "birthday paradox") tells us that the table size must be \(s \in \Omega(n^2)\) in order to guarantee a small probability of a collision. But that's ridiculous - any savings we would have over a regular hash table (since we don't have to store entire buckets) would be immediately wiped out by this huge table size requirement. So far, it seems that regular hash tables are still the best way to implement the set ADT.
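A quick Python sketch shows the boolean-table idea and its false-positive failure mode (again with a placeholder hash function):

```python
class HashedSet:
    """Set membership via one hash function and a table of true/false
    flags -- O(1) operations, but a false positive occurs whenever a
    queried key collides with an inserted one."""

    def __init__(self, size):
        self.size = size
        self.bits = [False] * size

    def _h(self, x):
        # Placeholder hash with a fixed multiplier.
        return (2654435769 * x) % self.size

    def insert(self, x):
        self.bits[self._h(x)] = True

    def member(self, x):
        return self.bits[self._h(x)]

hs = HashedSet(100)
hs.insert(5)
assert hs.member(5)
assert not hs.member(6)   # hashes elsewhere, correctly rejected
assert hs.member(105)     # collides with 5: a false positive!
```

With this particular placeholder hash, any two keys that differ by a multiple of the table size collide, so false positives are easy to trigger; a random-behaving function would only push the problem down to the birthday-problem probability, not eliminate it.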

- Wikipedia: Bloom filter

But wait! We know a trick to improve the performance of hashing algorithms: *use multiple hash functions*. But we will have to change the way things work for sets. Just using the idea of 2-choice hashing directly doesn't even make sense any more, since there are no buckets whose sizes we can compare!

Instead, we will use \(k\) different hash functions \(h_1(x), h_2(x), \ldots, h_k(x)\), and on each insertion, we set the table entries at *every one* of the \(k\) hash values to "true". Membership tests will then involve looking up every one of the \(k\) hash table indices in the table, and checking whether they are all true.

Why is this any better than using a single hash function? We'll see the answer mathematically in a moment, but intuitively the reason is that we are now storing *patterns* instead of individual entries. Think back to the birthday problem: why is it so likely that 2 people will share the same birthday, even with a relatively small group of people compared to the number of possible birthdays? It's because the number of *combinations* of two people can be much larger than the actual number of people. And it's this number of potential pairs of individuals that makes the birthday problem work the way it does.

So we are now applying this same paradox *in our favor* to build a more compact data structure! The number of distinct hash values is going to be relatively small, resulting in a large number of collisions. But the number of *combinations* of hash values can still be very large, so if we use more hash functions, this number of combinations increases, and the number of collisions for *that same combination* should decrease correspondingly.

This data structure is called a *Bloom filter*, and its performance depends critically on two parameters: the number of hash functions \(k\), and the size of the table \(s\).
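Here is a minimal Bloom filter sketch in Python; the \(k\) "hash functions" are simple multiply-add-mod stand-ins for the independent functions the analysis assumes:

```python
class BloomFilter:
    """Bloom filter sketch with k illustrative hash functions.

    Each 'hash function' is a multiply-add-mod with its own fixed
    constants -- placeholders for the k independent, random-behaving
    functions the analysis assumes."""

    def __init__(self, size, k):
        self.size = size
        self.bits = [False] * size
        # One (multiplier, offset) pair per hash function.
        self.params = [(2 * i + 3, 7 * i + 1) for i in range(k)]

    def _hashes(self, x):
        return [(a * x + c) % self.size for (a, c) in self.params]

    def insert(self, x):
        for i in self._hashes(x):
            self.bits[i] = True

    def member(self, x):
        # All k bits must be set; any False bit proves x was never inserted.
        return all(self.bits[i] for i in self._hashes(x))

bf = BloomFilter(size=1000, k=5)
bf.insert(42)
bf.insert(9001)
assert bf.member(42) and bf.member(9001)
assert not bf.member(7)  # one of 7's k bits is unset, so no false positive here
```

Note that a membership test can still return a false positive if all \(k\) of a key's bits happen to be set by other insertions; the analysis below quantifies exactly how likely that is.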

To analyze how big we want these parameters to be, we need some math of course! Here's an overview of how we'll do this: First, we figure out the probability of a membership test returning a false positive. This will be a formula in terms of \(k\) and \(s\). Next, we set that probability to a constant value (assuming we want a small, constant failure probability), and figure out the optimal values of \(k\) and \(s\) to achieve that failure rate.

**Step 1: Probability of a table entry set to "true"**

The probability that any given entry in the table is set to "true" is the chance that *any* of the \(k\) hash functions, for *any* of the \(n\) insertions, returned that index. As is often the case, the easiest way to compute this probability is actually to do the opposite: compute the chance that a given entry is *not* set to "true". This means the probability that all of those \(nk\) hash function evaluations returned something other than the given entry. Since there are \(s\) possibilities for each hash function value, this works out to:

\[\mathrm{Pr}[\text{table entry is False}] = \left(1 - \frac{1}{s}\right)^{nk}\]

Now here we can use an important approximation. Remember from calculus that the special constant \(\mathbb{e}\) can be defined as:

\[\mathbb{e}= \lim_{n\to +\infty} \left(1 + \frac{1}{n}\right)^n\]

Well we can take the inverse of both sides to get:

\[\frac{1}{\mathbb{e}} = \lim_{n\to +\infty} \left(1 - \frac{1}{n}\right)^n\]

This tells us that, when the value of \(n\) is really large, the left and right-hand sides are going to be very close to each other. So we have a general approximation that's useful in many probability proofs:

\[\left(1 - \frac{1}{a}\right)^b \approx \mathbb{e}^{-b/a}\]

(if \(a\) and \(b\) are both large integers)
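A quick numeric check of this approximation, for one illustrative choice of \(a\) and \(b\):

```python
import math

# Compare (1 - 1/a)^b against e^{-b/a} for largish a and b.
a, b = 1000, 2000
exact = (1 - 1 / a) ** b
approx = math.exp(-b / a)
# The two agree to about three decimal places already at a = 1000.
assert abs(exact - approx) < 0.001
```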

Plugging this in to the probability above, we know that

\[\mathrm{Pr}[\text{table entry is False}] \approx \mathbb{e}^{-nk/s}\]

and therefore

\[\mathrm{Pr}[\text{table entry is True}] \approx 1 - \mathbb{e}^{-nk/s}\]

**Step 2: Probability of a false positive**

To get a false positive, the table entries at *all* \(k\) of the hash values must be "true". Using the formula above, we know that this probability is

\[\mathrm{Pr}[\text{false positive}] = p \approx \left(1 - \mathbb{e}^{-nk/s}\right)^k\]

**Step 3: Optimal value of k**

We naturally want the probability \(p\) above to be as small as possible. To achieve that, notice first that \(\mathbb{e}^{-nk/s}\) is going to be some number between 0 and 1, and therefore the base of this big power is going to be something between 0 and 1. If you differentiate the expression above with respect to \(k\), set the derivative to 0, and solve for \(k\) (which I'm not going to show you!), you will discover that this function is minimized when the base of the exponent equals one half:

\[1 - \mathbb{e}^{-nk/s} = \frac{1}{2}\]

Solve that for \(k\) to discover the optimal value for \(k\), given \(n\) and \(s\), is

\[k = \frac{s}{n}\ln 2\]

**Step 4: How to figure out s**

Given the optimal value of \(k\) that we've just figured out, we can plug this back into the formula for \(p\) to discover that

\[\mathrm{Pr}[\text{false positive}] = p \approx \left(\frac{1}{2}\right)^k = \frac{1}{2^{(s/n)\ln 2}}\]

Now solve that equation for \(s\). Take the log base 2 of both sides, then multiply both sides by \(n/\ln 2\) to get:

\[s = \frac{-n\lg p}{\ln 2}\]

**Summary of choosing Bloom filter parameters**

To summarize all this, if you want to choose your parameters for a Bloom filter, given the number of things that are going to be in it \(n\), here's what you do:

- Pick the desired failure probability \(p\)
- Compute the size of the Bloom filter \(s = (-n\lg p)/(\ln 2)\). Be careful about \(\lg\) and \(\ln\) - they aren't the same!
- Compute the optimal number of hash functions \(k = (s/n)\ln 2\)

And that's it! It's important to notice that, as our desired failure probability \(p\) decreases, \(-\lg p\) will increase, and so will \(s\) and \(k\) proportionally. This is what we expect!
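These three steps translate directly into code:

```python
import math

def bloom_parameters(n, p):
    """Given n items and a target false-positive rate p, return (s, k)
    using the formulas above: s = -n lg(p) / ln(2), k = (s/n) ln(2).
    Rounding (ceil for s, nearest integer for k) is a practical choice."""
    s = math.ceil(-n * math.log2(p) / math.log(2))
    k = round((s / n) * math.log(2))
    return s, k

# Example: 10,000 items with a 1% false-positive target.
s, k = bloom_parameters(10000, 0.01)
```

For these numbers the formulas give roughly 9.6 table entries per item and \(k = 7\) hash functions, a commonly quoted pairing for a 1% failure rate.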

- Wikipedia: Universal hashing

Now that we have all these fancy ways to use multiple hash functions and make things better, the question of choosing high-quality hash functions becomes really important. So how are we going to choose all these hash functions?

So far we've assumed that our hash functions are going to behave exactly like random numbers. Well, achieving this goal perfectly is an impossibility if we want the hash functions to also be fast. We saw in class why this is: because of the extremely large number of possible functions from the universe of \(m\) possible keys to the \(s\) possible hash values.

However, if we back off on the requirements slightly, it is possible to achieve what we are after. The key here is that we will *choose the hash function randomly* from a set, or "family", of possible hash functions. Once the hash function is chosen - presumably when the data structure is first initialized - the choice of hash function never changes. However, this single random choice allows us to use the power of randomization to prove good *expected* performance of the hashing techniques we have covered so far.

So what properties of random hash functions do we really want? The most basic would be that the probability of collisions in the random hash function is the same as if the hash values were just chosen randomly. This requirement is captured in the following definition:

**Definition**: For a universe of \(m\) integers, and a hash table size \(s\), a family of hash functions is called a *universal family* if, for a randomly chosen function \(h(x)\) from the family, and for any two distinct elements \(a, b\) in the universe of \(m\) integers, we have

\[\mathrm{Pr}[h(a) = h(b)] = \frac{1}{s}\]

Again, this corresponds exactly to what the probability would be if the individual hash *values* \(h(a)\) and \(h(b)\) were chosen randomly, so the statement above is as good as random in terms of collisions between a pair of elements.

Now the question is, what mysterious functions might satisfy this property? The simplest and perhaps most useful family of universal hash functions was discovered by Carter & Wegman, and it is structured as follows:

**Carter & Wegman Hash Functions**: Let \(p\) be any prime larger than \(m\). Then the Carter & Wegman family of hash functions consists of the functions \(h(x)\) which satisfy

\[h(x) = \left(\left(ax + c\right) \bmod p\right) \bmod s,\]

for an integer \(a\) chosen randomly from the set \(\{1,2,\ldots,p-1\}\) and an integer \(c\) chosen randomly from the set \(\{0,1,2,\ldots,p-1\}\). (Note that \(a\ne 0\): otherwise every key would hash to the same value.)

Note that the prime number \(p\) does *not* need to be chosen randomly - it should be the same for every possible hash function \(h(x)\). Note that sometimes these hash functions are written like \(h_{a,c}(x)\), to emphasize that the function itself will be different for every different choice of \(a\) and \(c\).

In class we saw the proof that the Carter & Wegman functions are universal. You should also notice the striking similarity of this hash function to the mixed LCG pseudo-random number generator from the beginning of the course. It's not a coincidence - good hash functions behave like random numbers!
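Drawing a random member of the Carter & Wegman family is just a few lines. The prime below is the Mersenne prime \(2^{61}-1\), one convenient choice of a prime larger than a 60-bit universe:

```python
import random

# Mersenne prime 2^61 - 1: comfortably larger than any 60-bit key universe.
P = 2305843009213693951

def make_carter_wegman(s, rng):
    """Draw one random member h(x) = ((a*x + c) mod p) mod s of the
    Carter & Wegman family (a != 0 so the function isn't constant)."""
    a = rng.randrange(1, P)
    c = rng.randrange(P)
    return lambda x: ((a * x + c) % P) % s

rng = random.Random(1)
h = make_carter_wegman(100, rng)
# Every output lands in the table range {0, ..., s-1}.
assert all(0 <= h(x) < 100 for x in range(1000))
```

The random choice happens once, when the function is created; after that, `h` is deterministic and repeatable, exactly as a hash function must be.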

Because of this, you should also not be surprised that there are more sophisticated families of universal hash functions based on more complicated formulas. A straightforward extension of the Carter & Wegman hash functions is the following family:

\[h(x) = \left(\left(a_0 + a_1x + a_2x^2 + \cdots + a_kx^k\right) \bmod p\right) \bmod s\]

where all of the "coefficients" \(a_0, a_1,\ldots, a_k\) are randomly chosen. These hash functions take a bit longer to construct and to compute, but they satisfy even stronger randomness properties. (In particular, they form a \(k\)-universal family, if you want to look up what that means.)

- Wikipedia: **nothing**!

The final hashing problem we will consider is called *static hashing*. It's just like the regular dictionary ADT problem, except that there are no "insert" or "delete" operations, only look-ups. In other words, the contents of the dictionary never change after it is constructed.

There are many applications of static hashing: any kind of database whose keys will never or rarely change. For example, a dictionary that actually stores definitions of English words would not change very frequently. Neither would a look-up table for chemical elements, or a database of mountain peak locations... you get the idea.

There is a very simple and fairly effective way to store this kind of data: just put it in a sorted array. Then running the *binary search* algorithm on the sorted array could achieve \(O(\log n)\) time for any look-up operation. Since this is close to the performance of most dictionary data structures, but far simpler, it seems like a good idea.

Well, as it turns out, there is a better way. *Perfect hashing* is the name given to a way of storing a set of \(n\) elements in a kind of a hash table that has \(O(n)\) total size and worst-case cost \(O(1)\) for any look-up operation. This solution was discovered by Fredman, Komlos, and Szemeredi in 1984, and is sometimes also called "FKS hashing".

Here is a first idea for perfect hashing: use a hash table large enough so that there will be no collisions among the \(n\) elements. Since the set of elements is unchanging, we can just try different randomly-chosen hash functions until we find one with no collisions.

But how hard will we have to search to find such a hash function? Well, that depends on the size of the table. The birthday problem has already shown us that, unless the table size is \(\Omega(n^2)\), there is a high likelihood that at least 2 of the elements will have a hash collision.

If we set \(s=n^2\), then the probability of any two given elements colliding is \(\frac{1}{s} = \frac{1}{n^2}\). Furthermore, the number of pairs of elements that *could* collide is \(\binom{n}{2} = \frac{n(n-1)}{2}\). Therefore, by a union bound, the probability that there is at least one collision is at most

\[\frac{n(n-1)}{2} \cdot \frac{1}{n^2} = \frac{n^2-n}{2n^2} \lt \frac{1}{2}.\]

Now a 50% chance of success doesn't seem very good, but by now we know that this probability decreases exponentially after repeated tries. The expected number of hash functions we will have to try before finding one that has no collisions (for \(s=n^2\)) is just \(O(1)\).
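Here is a sketch of this trial-and-error construction, using Carter & Wegman-style functions for the random draws:

```python
import random

P = 2305843009213693951  # Mersenne prime 2^61 - 1, larger than our keys

def build_collision_free_table(keys, rng):
    """Keep drawing random Carter & Wegman-style hash functions until
    none of the n keys collide in a table of size s = n^2.
    By the analysis above, this takes O(1) attempts in expectation."""
    n = len(keys)
    s = max(1, n * n)
    while True:
        a, c = rng.randrange(1, P), rng.randrange(P)
        h = lambda x, a=a, c=c: ((a * x + c) % P) % s
        table = [None] * s
        ok = True
        for x in keys:
            i = h(x)
            if table[i] is not None:  # collision: discard h, try again
                ok = False
                break
            table[i] = x
        if ok:
            return table, h

rng = random.Random(4)
keys = [3, 1, 4, 15, 9, 26, 53]
table, h = build_collision_free_table(keys, rng)
# Every key occupies its own slot, so look-ups are worst-case O(1)...
assert all(table[h(x)] == x for x in keys)
# ...but the table has n^2 = 49 slots for only 7 keys.
assert len(table) == 49
```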

So this idea satisfies the time requirements of perfect hashing: since every bucket has size 1 or 0, every look-up operation is just \(O(1)\) time in the worst case. But the *space* requirement of perfect hashing is definitely not satisfied here.

The big trick to get \(O(1)\) time and \(O(n)\) space for static hashing is to use a *2-level hash table*. Here's how it works: the top-level hash table will have size \(n\) and numerous collisions, so the elements are stored in buckets. But each bucket is itself another hash table, and the 2nd-level bucket hash tables are constructed in such a way that they don't have any collisions. So overall, the cost of any look-up is just computing two hash functions, and since there are no collisions on the second level, this is \(O(1)\) time in the worst case.

Specifically, there is a single top-level hash function \(h(x)\), which is chosen randomly according to the constraints that will be outlined below. After choosing \(h(x)\), there will be \(n\) buckets of varying sizes; call these sizes \(k_1, k_2,\ldots, k_n\).

Each second-level hash table (the buckets) will have its own size \(s_i\) *and* its own hash function \(h_i(x)\). We will use the idea from above to construct these so that there are no collisions among the \(k_i\) elements: set \(s_i=k_i^2\) and choose random hash functions \(O(1)\) times until we find one that produces no collisions.

There are two remaining questions: how the top-level hash function \(h(x)\) is chosen, and why in the world we would expect the total size of this structure to be only \(O(n)\).

The crucial value to consider is the expected total number of pairwise collisions in the top-level hash table. Since there are \(\binom{n}{2}\) pairs of elements, and each pair has a collision with probability \(1/n\), the total expected number of collisions is at most

\[\binom{n}{2}\cdot \frac{1}{n} = \frac{n-1}{2} \lt \frac{n}{2}\]

Since this is just the expected number, it won't come out this way for every choice of a random hash function. But after an expected \(O(1)\) choices of a random top-level hash function \(h(x)\), we will find one that has at most \(n/2\) total pairwise collisions.

Why is this important? Well, the total size of the data structure is the size of all the 2nd-level hash tables:

\[s_1 + s_2 + \cdots + s_n = k_1^2 + k_2^2 + \cdots + k_n^2\]

In class, we saw that each \(k_i^2\) corresponds to the number of pairwise collisions within that bucket. So the total size above is actually just equal to \(n\) plus 2 times the total number of pairwise collisions in the top-level hash table. Since \(h(x)\) was chosen so that this number of collisions would be only \(n/2\), the total size of the 2nd-level bucket tables is at most \(2n\), making the total size of this FKS perfect hashing structure \(O(n)\), as required. Hooray!
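The whole two-level construction can be sketched as follows. The threshold \(4n\) on \(\sum k_i^2\) is one convenient constant guaranteeing expected \(O(1)\) retries (the analysis above shows roughly \(2n\) is achievable in expectation), and this sketch stores only membership, not associated values:

```python
import random

P = 2305843009213693951  # Mersenne prime 2^61 - 1

def random_hash(s, rng):
    # One random draw from the Carter & Wegman family, range {0,...,s-1}.
    a, c = rng.randrange(1, P), rng.randrange(P)
    return lambda x: ((a * x + c) % P) % s

def build_fks(keys, rng):
    """Two-level FKS sketch: retry the top-level function until
    sum(k_i^2) is linear in n, then give each bucket its own
    collision-free second-level table of size k_i^2."""
    n = len(keys)
    while True:  # expected O(1) tries
        h = random_hash(n, rng)
        buckets = [[] for _ in range(n)]
        for x in keys:
            buckets[h(x)].append(x)
        if sum(len(b) ** 2 for b in buckets) <= 4 * n:
            break
    level2 = []
    for b in buckets:
        s_i = max(1, len(b) ** 2)
        while True:  # expected O(1) tries per bucket
            h_i = random_hash(s_i, rng)
            t = [None] * s_i
            ok = True
            for x in b:
                j = h_i(x)
                if t[j] is not None:
                    ok = False
                    break
                t[j] = x
            if ok:
                level2.append((h_i, t))
                break
    def member(x):
        # Two hash evaluations, no second-level collisions: O(1) worst case.
        h_i, t = level2[h(x)]
        return t[h_i(x)] == x
    return member

rng = random.Random(7)
keys = list(range(0, 100, 13))
member = build_fks(keys, rng)
assert all(member(x) for x in keys)
assert not member(5)
```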