Union-find Algorithms

August 25, 2013

I've started following Coursera–Princeton's Algorithms, Part I.
Being a non-computer-scientist, I've been missing these theoretical foundations in some of my recent scientific work, so this course might fill a few gaps.
Too bad they don't offer a completion certificate (as other Coursera courses do), so I won't be submitting the programming exercises (which are in Java, btw).
Instead, I'll work through the assignments in Python, to my heart's content.

We start from the very top: $n$ objects belonging to $m \leq n$ disjoint sets (“connected components”, in graph terms).
Often one wishes to query whether two items belong to the same set ("find") and, where needed, to merge two sets ("union").
If sets are represented as trees, a common tree root node indicates that two given items belong to the same set.

Tree root-find with recursion
Data structure: $L$ is a pointer list, a list of integers in which each element points to its parent and a root points to itself. It can be traversed recursively to find the root (i.e. the root of element $i$ is $L[ ... L[ i ] ... ]$):

def root(L, ii):
    "return (ridx, rd) = (tree root of element ii, depth of ii)"
    ridx = ii
    rd = 0
    if L[ii] != ii:
        ridx, rd = root(L, L[ii])
        rd += 1
    return ridx, rd


The branch() method outputs the list of nodes connecting ii to its root:

def branch(L, ii):
    "return branch from ii to root as list"
    Lout = [ii]
    if L[ii] != ii:
        Lout = Lout + branch(L, L[ii])
    return Lout
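As a quick sanity check, here is what these two helpers return on a small hand-built pointer list (the definitions are repeated so the snippet runs on its own):

```python
def root(L, ii):
    "return (ridx, rd) = (tree root of element ii, depth of ii)"
    ridx = ii
    rd = 0
    if L[ii] != ii:
        ridx, rd = root(L, L[ii])
        rd += 1
    return ridx, rd

def branch(L, ii):
    "return branch from ii to root as list"
    Lout = [ii]
    if L[ii] != ii:
        Lout = Lout + branch(L, L[ii])
    return Lout

# a forest over 4 elements: the chain 2 -> 1 -> 0, plus the singleton 3
L = [0, 0, 1, 3]
print(root(L, 2))    # (0, 2): the root is element 0, two hops up
print(branch(L, 2))  # [2, 1, 0]
```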


There is an interesting discussion about recursion in Python and its real implications here, the take-home message being simply that one should not implement iterative procedures (‘loops’) as recursion.

Quick-Union
This algorithm needs the items to be arranged in a set of trees (a ‘forest’), so that the set query is simplified to an identity test of the respective tree roots.
Using the code above, this simply reads:

def qu(L, ii1, ii2):
    "attach tree of ii1 to root of ii2"
    ridx1, _ = root(L, ii1)
    ridx2, _ = root(L, ii2)
    L[ridx1] = ridx2
    return L


A shortcoming of this approach is that trees tend to get tall and skinny, so $k$ searches in them, in the worst case, require $O(k n)$ operations.
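The degenerate case is easy to provoke: unioning $0$–$1$, then $1$–$2$, and so on in order grows a single chain, so a find on the first element walks the whole structure. A small sketch (definitions repeated for self-containment):

```python
def root(L, ii):
    "return (tree root of element ii, depth of ii)"
    ridx, rd = ii, 0
    if L[ii] != ii:
        ridx, rd = root(L, L[ii])
        rd += 1
    return ridx, rd

def qu(L, ii1, ii2):
    "attach tree of ii1 to root of ii2"
    ridx1, _ = root(L, ii1)
    ridx2, _ = root(L, ii2)
    L[ridx1] = ridx2
    return L

n = 100
L = list(range(n))
for i in range(n - 1):
    qu(L, i, i + 1)   # always hang the current tree under the next element
print(root(L, 0))     # (99, 99): element 0 ends up at depth n-1
```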

Weighted Quick-Union
The complexity of a search on union-find trees can be kept manageable by weighting: in WQU, the smaller of the two trees is attached to the root of the larger one.
A more precise notion of "size" is in order, e.g. the total number of elements in the subtree, or the subtree depth.
If we settle for the latter definition (union-by-depth),

def wqu(L, ii1, ii2):
    """weighted quick union: attach the tree of the element
    with the smaller root distance under the other tree's root"""
    ridx1, rd1 = root(L, ii1)
    ridx2, rd2 = root(L, ii2)
    if rd1 <= rd2:
        L[ridx1] = ridx2
    else:
        L[ridx2] = ridx1
    return L


This balancing helps to keep the complexity of $k$ operations on $L$ at $O(n + k\log_2(n))$: a node's depth grows only when its tree is merged into one at least as large, which at least doubles the tree's size, and the size can double at most $\log_2(n)$ times.
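One caveat: root() returns the root distance of the query element, not the height of its tree, so the comparison above is only a proxy for the tree depth. The textbook variant, union by rank, stores an explicit per-root rank instead; here is a minimal sketch of it (root_ur, union_by_rank, depth and the rank list are my names and additions, not part of the code above):

```python
def root_ur(L, ii):
    # plain iterative find on the pointer list
    while L[ii] != ii:
        ii = L[ii]
    return ii

def union_by_rank(L, rank, ii1, ii2):
    # attach the root of lower rank under the root of higher rank;
    # on a tie, pick one root and bump its rank
    r1, r2 = root_ur(L, ii1), root_ur(L, ii2)
    if r1 == r2:
        return
    if rank[r1] < rank[r2]:
        L[r1] = r2
    else:
        L[r2] = r1
        if rank[r1] == rank[r2]:
            rank[r1] += 1

def depth(L, ii):
    # hops from ii to its root
    d = 0
    while L[ii] != ii:
        ii, d = L[ii], d + 1
    return d

n = 1024
L = list(range(n))
rank = [0] * n
for i in range(n - 1):
    union_by_rank(L, rank, i, i + 1)   # the adversarial order from before
print(max(depth(L, i) for i in range(n)))  # 1: the tree stays flat
```

The rank of a root bounds its tree's height, and a rank can only grow on a tie between equals, so the height never exceeds $\log_2(n)$.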

Weighted Quick-Union with Path Compression
An optimization on WQU is given by "path compression": at union time, all the nodes on the branch of the element being attached are re-pointed directly at the root of the other tree.

def wqupc2(L, ii1, ii2):
    "WQU with path compression"
    B1 = branch(L, ii1)
    B2 = branch(L, ii2)
    rd1, rd2 = len(B1), len(B2)
    ridx1, ridx2 = B1[-1], B2[-1]
    if rd1 <= rd2:
        for ii in B1:
            L[ii] = ridx2   # every node on the shorter branch now points at the root
    else:
        for ii in B2:
            L[ii] = ridx1
    return L


While doing what it says on the label, the code above makes two recursive calls to find the branch roots (with the known recursion-depth and memory limitations), stores both branches, and loops over the shorter of the two.
A smarter implementation is clearly necessary for large-scale applications.
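One common improvement, for instance, is an iterative find with "path halving", which compresses while it climbs and needs neither recursion nor branch storage (root_pc is my name for it, not part of the code above):

```python
def root_pc(L, ii):
    "iterative find with path halving"
    while L[ii] != ii:
        L[ii] = L[L[ii]]  # re-point ii to its grandparent, shortening the path
        ii = L[ii]
    return ii

L = [1, 2, 3, 3]      # the chain 0 -> 1 -> 2 -> 3
print(root_pc(L, 0))  # 3
print(L)              # [2, 2, 3, 3]: the path was halved in place
```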
If the search tree can be “flattened out” maximally, the complexity of a union search can be further reduced so as to approach (but never reach) the size $n$ of the data items.
That is, $k$ union-find operations on $n$ objects organized in a flattened tree need fewer than $c \left( n + k \log^*(n) \right)$ array accesses, where $\log^*()$ is the iterated log function (Hopcroft–Ullman, Tarjan).
BTW, the iterated logarithm is defined as the number of times $\log()$ has to be applied in order to obtain $\leq 1$, and is an extremely slowly growing function (e.g. $\log^*(2^{2^{16}})\sim 5$).
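That definition translates directly into code (log_star is my name for it):

```python
import math

def log_star(x, base=2.0):
    "iterated logarithm: how many times log must be applied until the result is <= 1"
    n = 0
    while x > 1:
        x = math.log(x, base)
        n += 1
    return n

print(log_star(2**16))     # 4: 65536 -> 16 -> 4 -> 2 -> 1
print(log_star(2**2**16))  # 5: an astronomically large input, a tiny output
```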

Results and conclusions
We list the theoretical worst-case computational burden, along with typical figures for the average branch depth of the trees generated by each algorithm (obtained with sets of $n=1000$ integers, after $10^5$ random union operations):

• Quick-union
Complexity of $k$ searches on $n$ items: $O(kn)$
Average branch depth range: 124 – 133
• Weighted quick-union
Complexity of $k$ searches on $n$ items: $O(n + k\log_2(n))$
Average branch depth range: 5.3 – 8.3
• Weighted quick-union with path compression
Complexity of $k$ searches on $n$ items: $O(n + k\log^*(n))$
Average branch depth: 2.01 (variations at the third decimal)
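The figures above came from a simulation along these lines (a sketch, not the original script; it uses fewer unions than the post's $10^5$ for speed, and the exact numbers depend on the random seed):

```python
import random
import sys

sys.setrecursionlimit(100000)  # quick-union chains can get deep

def root(L, ii):
    ridx, rd = ii, 0
    if L[ii] != ii:
        ridx, rd = root(L, L[ii])
        rd += 1
    return ridx, rd

def qu(L, ii1, ii2):
    L[root(L, ii1)[0]] = root(L, ii2)[0]

def wqu(L, ii1, ii2):
    ridx1, rd1 = root(L, ii1)
    ridx2, rd2 = root(L, ii2)
    if rd1 <= rd2:
        L[ridx1] = ridx2
    else:
        L[ridx2] = ridx1

def avg_depth(L):
    return sum(root(L, i)[1] for i in range(len(L))) / len(L)

random.seed(0)             # fixed seed so the run is reproducible
n, n_unions = 1000, 10**4
Lq, Lw = list(range(n)), list(range(n))
for _ in range(n_unions):
    i, j = random.randrange(n), random.randrange(n)
    qu(Lq, i, j)           # same random pairs fed to both algorithms
    wqu(Lw, i, j)
print(avg_depth(Lq), avg_depth(Lw))  # quick-union trees come out far deeper
```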

Path compression clearly minimizes the overhead associated with finding the set label of large collections of items.
Simple code leading to very deep mathematics and enabling huge-scale computation... my appetite, on the other hand, is growing quickly.