Thursday, September 1, 2016

Maintaining Disjoint Sets under Union

Time for a new post. This time, we will cover an elegant data structure for maintaining a class of sets under the operation of union. Formally, we start with $n$ classes each consisting of a single element: $\{1\}, \{2\}, \dots, \{n\}$. These correspond to the equivalent classes of some equivalent relation. We would like to be able to quickly find which class an element belongs to and quickly join two classes together to form one class (destroying the old classes along the way). These two operations are known as find and union and the data structure is often times called union-find.

There are many ways to implement these operations but we will cut to the chase and implement find with path compression and union by rank.

Representation of a Class

We are going to use a tree to represent an equivalent class. Each node in the tree corresponds to an element in the class. Each node also points to its parent in the tree. The root of the tree points back to itself. The root of the tree also serves as the representative element of the class. The root node will also maintain the number of nodes in its tree (we will see why this is useful when we talk about union by rank). Initially, every element lives in a separate class, so the trees will look like this for $n=5$ (the size of the trees is dropped here).

Initial setting

Find with Path Compression

Given an element, we would like to find the class it belongs to (or the representative of that class since each class has a unique representative). Using the representation above, we just climb to the root by following parent pointers and return the root element. The path to the root could be long and so making several calls to find with the same argument could end up taking a lot of time. Path compression avoids this by making the nodes along the path (except for the root itself) point to the root. This renders subsequent calls to find very cheap in terms of running time. Here is a pictorial representation of path compression. The right tree is the new state we get after a call to $find(4)$. The path to the root is highlighted in red.

Find with path compression

Union by Rank

The union operation takes two disjoint classes as input and creates a new class that is the union of the elements in the input classes. An easy way to implement this is to take one of the root elements and make it a child of the other root element. The union by rank heuristic suggests that it is a good idea to make the root of the class with the smaller number of elements a child of the root of the class with the larger number of elements. In the following figure, the class $\{1,2\}$ is joined with the class $\{3,4,5\}$ by making $1$ a child of $3$.

Union by rank


For each node we need to know its parent node and the size of its subtree. We will use one array, $uf$ (for union-find), of integers to account for both pieces of data. $uf[x]$ gives the parent of a node when $x$ is not a root and it gives the size of the subtree when $x$ is a root. How do we differentiate between the two ? We simply negate the size when the element is a root. So if $uf[x]$ is negative we can say that $x$ is a root element and that $-uf[x]$ is the size of the tree rooted at $x$ and when $uf[x]$ is non-negative we can say that $x$ is a non-root element and that it is a child of $uf[x]$.

Initially, every element lives in its own tree and so is the root of that tree. We can write a function to initialize the array as follows:

The union operation is very easy to implement. First we find the root elements and then make the one with the smaller size a child of the one with the larger size.

The find operation is slightly harder to implement. We find the root element in a first pass and then make the nodes along the path children of the root in a second pass.

That is all for this post. I hope that it was helpful. As always, feedback and questions are welcome.