Union-Finds

Our next graph algorithm is going to need a new, important data structure: the Union-Find (aka disjoint-set forest). You need a Union-Find when you have a set of things partitioned into disjoint subsets. A Union-Find must perform the following methods (their names will not surprise you):

  1. find(x): return the "representative" of the subset containing x.
  2. union(x,y): merge the subsets containing x and y into one subset.

So how does this work? Each subset will be represented with a tree, where each element stores a pointer to its parent, rather than parents storing pointers to children. The root of that tree, reachable by following parent pointers from any element of that subset, will be the "representative" of that tree, which is used to identify the subset. For example, consider the following tree:

Running find(x), where x is any of the vertices labeled A through F, will return the vertex labeled B. How long does find take? Well, it depends on the height of the tree.

Now, consider the following forest (so-called, because it's a bunch of trees).

This isn't very interesting; each vertex is in its own subset, making for some very boring trees. Now, we'll start calling unions. The way this will work is, the representative of the shorter tree will point to the representative of the taller tree. For example, let's call union(A,B). Now, these two are the same height (1), so it doesn't matter which one becomes the representative of the joined trees. Let's make it B:

Now, we'll call union(A,C). So, first we have to find the representative of A's subset (which is B), and the representative of C's subset (which is C). C's tree is shorter than A's tree, so we have the representative of C's tree (which is C) point to the representative of A's tree (which is B), getting this:

We'll now call union(D,E). Again, they are the same height (1), so we merge them like so:

We now call union(C,E). Again, both trees are the same height, so it doesn't matter if B or D is our new representative. Let's do it this way, so that B points to D:

We now call union(F,A). F's tree is shorter than A's tree, so we merge them like so:

Runtime

For both union and find, we have to find the representative from a given vertex. This takes time proportional to the height of the tree, \(O(h)\). Now, if we assume we started with a bunch of trees consisting of a single vertex, and if we stick to having the shorter tree point to the taller tree's representative on a union, it turns out that the height of our tree is \(O(\log(n))\)! So, both union and find are \(O(\log(n))\).
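A minimal Python sketch of this structure, using union by height as described above (the class name and the dict-based storage are my own choices; only the find/union behavior comes from the notes):

```python
# A minimal Union-Find sketch using union by height.

class UnionFind:
    def __init__(self, elements):
        # Every element starts as the root of its own one-vertex tree.
        self.parent = {x: x for x in elements}
        self.height = {x: 1 for x in elements}

    def find(self, x):
        # Follow parent pointers until we reach the representative (root).
        while self.parent[x] != x:
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return  # already in the same subset
        # The representative of the shorter tree points to the taller one.
        if self.height[rx] > self.height[ry]:
            rx, ry = ry, rx
        self.parent[rx] = ry
        # Heights only grow when we merge two trees of equal height.
        if self.height[rx] == self.height[ry]:
            self.height[ry] += 1

uf = UnionFind("ABCDEF")
for a, b in [("A", "B"), ("A", "C"), ("D", "E"), ("C", "E"), ("F", "A")]:
    uf.union(a, b)
# After the walkthrough's sequence of unions, all six vertices share
# a single representative.
print(len({uf.find(v) for v in "ABCDEF"}))  # → 1
```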

We can make this faster! There is no limit to the branching factor of our trees. The above tree is equivalent to this:

So, what if, every time we traverse a tree from a given node to the representative, we keep track of every node we see along the way? Once we find the representative, we set the parent of each of those nodes to be the representative. We just flattened the tree, making everything faster! Note this is not the same thing as guaranteeing a tree is of height 2; we only flatten the vertices we encounter during one of our unions or finds.
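This trick is known as path compression. Here is a sketch of a find that applies it, operating on a plain parent dictionary (the standalone-function form and the sample chain are my own choices for illustration):

```python
# find with path compression: after locating the root, re-point every
# node on the traversed path directly at the root.

def find(parent, x):
    path = []
    while parent[x] != x:
        path.append(x)
        x = parent[x]
    for node in path:  # flatten: everything we saw now points at the root
        parent[node] = x
    return x

# A chain A -> B -> C -> D gets flattened by a single find:
parent = {"A": "B", "B": "C", "C": "D", "D": "D"}
print(find(parent, "A"))          # → D
print(parent["A"], parent["B"])   # → D D
```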

It turns out that if we do this, our trees stay very, very short. In fact, the amortized runtime of union and find become \(O(\alpha(n))\), where \(\alpha\) is the inverse Ackermann function. Here's the Ackermann function:

\(A(x,y)= \begin{cases} y+1,&\text{if }x=0\\ A(x-1,1),&\text{if }y=0\\ A(x-1,A(x,y-1)),&\text{otherwise} \end{cases}\)

The inverse Ackermann function \(\alpha(n)\) is the smallest value of \(x\) such that \(A(x,x)\geq n\).
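To get a feel for the growth, here is a direct transcription of the recurrence (only usable for tiny inputs; anything much past \(A(3,y)\) exhausts the stack, or the universe):

```python
import sys
sys.setrecursionlimit(100000)  # A(3,3) alone recurses fairly deep

def A(x, y):
    # Direct transcription of the three cases of the definition.
    if x == 0:
        return y + 1
    if y == 0:
        return A(x - 1, 1)
    return A(x - 1, A(x, y - 1))

print(A(1, 1), A(2, 2), A(3, 3))  # → 3 7 61
```

Even along the diagonal the values explode: 3, 7, 61, and then \(A(4,4)\) is already astronomically large.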

Now, the Ackermann function grows spectacularly quickly, so the inverse Ackermann function grows spectacularly slowly. How quickly does the Ackermann function grow? \(A(4,2)=2^{65536}-3\), a number with 19,729 digits, and \(A(5,5)\) is far too large to write down at all. So, for any practical value of \(n\), \(\alpha(n)\leq 4\). Practically speaking, both methods of Union-Finds now run in constant time.

Minimum Spanning Trees

A spanning tree of a graph is a tree built from the vertices and edges of the graph that contains every vertex of the graph. Remember, a tree has to be connected and have no cycles.

On a weighted graph, a minimum spanning tree is a spanning tree whose edges have the smallest total weight.

Why would we care? Imagine we want to lay pipe to provide plumbing to every house on a block. What network of pipes would allow us to do this with the minimum amount of pipe?

There are two famous algorithms to solve this problem: Kruskal's Algorithm and Prim's Algorithm. Both are based around one simple fact: if you divide the vertices of any weighted graph into two sets \(V_1\) and \(V_2\), then of all edges that connect a vertex in \(V_1\) with a vertex in \(V_2\), the one with the smallest weight (which we will refer to as edge e) MUST appear in the minimum spanning tree. Why?

  1. Imagine a MST that does not contain e.
  2. We add e. We must have created a cycle.
  3. The cycle must cross between \(V_1\) and \(V_2\) more than once, so it contains another edge f between \(V_1\) and \(V_2\). We can remove f and still have a spanning tree, thanks to the presence of e.
  4. Either the total weight of the tree just decreased, in which case it was not a MST to start with, or it stayed the same (e and f had equal weight), in which case either edge gives a MST.

In this course, we will only learn Kruskal's.

Kruskal's Algorithm

  1. Create a forest F, where each individual vertex in the graph is a tree in and of itself.
  2. Fill a min-Priority Queue with all the edges in the graph.
  3. While the priority queue is not empty, and F consists of more than one tree,
    1. Remove the edge with minimum weight from the PQ. Let's call the vertices joined by this edge A and B.
    2. If find(A) != find(B) add the edge to the MST, and perform a union(A,B), combining two trees into one.

At the end of this, the edges chosen will form a single MST. Runtime? With a heap as the priority queue, we first heapify the edges (\(O(m)\)), then remove each edge one by one (\(O(m\log(m))\)), and for each perform two finds and possibly a union (\(O(\alpha(n))\), which, remember, is essentially 4). So, the total runtime is \(O(m\log(m))\).
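The steps above can be sketched in Python. The edge-list representation of (weight, u, v) tuples and the sample graph are my own choices; the Union-Find here repeats the union-by-height scheme from earlier for self-containedness:

```python
import heapq

def kruskal(vertices, edges):
    """Kruskal's algorithm. edges is a list of (weight, u, v) tuples."""
    # Step 1: a forest where every vertex is its own one-node tree.
    parent = {v: v for v in vertices}
    height = {v: 1 for v in vertices}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if height[rx] > height[ry]:
            rx, ry = ry, rx
        parent[rx] = ry
        if height[rx] == height[ry]:
            height[ry] += 1

    # Step 2: fill a min-priority queue with all the edges.
    pq = list(edges)
    heapq.heapify(pq)

    # Step 3: repeatedly take the cheapest edge; keep it only when its
    # endpoints lie in different trees (no cycle is created).
    mst = []
    while pq and len(mst) < len(vertices) - 1:
        w, a, b = heapq.heappop(pq)
        if find(a) != find(b):
            mst.append((w, a, b))
            union(a, b)
    return mst

edges = [(4, "A", "B"), (2, "A", "C"), (5, "B", "C"),
         (1, "C", "D"), (3, "B", "D")]
mst = kruskal("ABCD", edges)
print(sorted(mst))  # → [(1, 'C', 'D'), (2, 'A', 'C'), (3, 'B', 'D')]
```

The loop stops early once the MST has \(n-1\) edges, since a spanning tree of \(n\) vertices never needs more.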