Union-find Algorithms

August 25, 2013

Started following Coursera-Princeton’s Algorithms 1.
Being a non-computer scientist, I started missing these theoretical foundations in some of my recent scientific work, so this course might fill a few gaps.
Too bad they don’t offer any completion certificate (as other Coursera courses do), therefore I won’t be submitting the programming exercises (Java, btw).
Instead, I’ll work on the assignments in Python, to my heart’s content.

We start from the very top: n objects belonging to m \leq n disjoint sets (“connected components”, in graph terms).
Often one wishes to query whether two items belong to the same set and, in case, merge two sets.
If sets are represented as trees, a common tree root node indicates that two given items belong to the same set.

Tree root-find with recursion
Data structure: L is a pointer list, a list of integers which can be traversed recursively to find the root (i.e. root of element i is L[ ... L[ i ] ... ] ):

def root(L,ii):
    "ridx = tree root of element ii"
    "rd = depth of ii"
    ridx = ii
    rd = 0
    if L[ii] != ii:
        ridx, rd = root(L, L[ii])
        rd += 1    
    return ridx, rd

The branch() method outputs the list of nodes connecting ii to its root:

def branch(L,ii):
    "return branch from ii to root as list"
    Lout=[ii]
    if L[ii] != ii:
        Lout = Lout + branch(L, L[ii]) 
    return Lout

There is an interesting discussion about recursion in Python and its real implications here, the take-home message being simply that one should not implement iterative procedures (‘loops’) as recursion.

Quick-Union
This algorithm needs the items to be arranged in a set of trees (a ‘forest’), so that the set query is simplified to an identity test of the respective tree roots.
Using the code above, this simply reads:

def qu(L, ii1, ii2):
    "attach subtree of ii2 to root of ii1"
    ridx1, rd1 = root(L, ii1)
    ridx2, rd2 = root(L, ii2)
    L[ridx1] = ridx2
    return L

A shortcoming of this approach is that trees tend to get tall and skinny, so k searches in them, in the worst case, require O(k n) operations.

Weighted Quick-Union
The complexity of a search on Union-Find trees can be kept manageable by weighting, i.e. one subtree in WQU is attached to the second’s root if the former’s size is smaller.
A more precise notion of size is in order: e.g. the total number of elements in the subtree or the subtree depth.
If we settle for the latter definition (union-by-depth),

def wqu(L, ii1, ii2):
    "weighted quick union"
    "attach subtree of ii2 to subtree of ii1,"
    "if the root distance of former is smaller"
    ridx1, rd1 = root(L, ii1)
    ridx2, rd2 = root(L, ii2)
    if rd1 <= rd2:
        L[ridx1] = ridx2
    else:
        L[ridx2] = ridx1
    return L  

This balancing helps to keep the complexity of a search on L at n+k\log_2(n) (where \log_2(n) is the average number of subtree size doublings).

Weighted Quick-Union with Path Compression
An optimization on WQU is given by “path compression”: at union time, all the parents of the node to be attached are set to be the root node of the larger tree.
A (single-pass) WQUPC code reads:

def wqupc2(L, ii1, ii2):
    "WQU with path compression"
    B1 = branch(L, ii1)
    B2 = branch(L, ii2)
    rd1 = len(B1)
    rd2 = len(B2)
    ridx1 = B1[-1]
    ridx2 = B2[-1]
    if rd1 <= rd2:
        for ii in B1:
            L[ii] = ridx2
    else:
        for ii in B2:
            L[ii] = ridx1
    return L   

While doing what it says on the label, the code above uses two recursive calls to find the branch roots (with the known memory and depth size limitations of recursion), storage of the branches and a loop along the smaller of the two branches.
A smarter implementation is clearly necessary for large-scale applications.
If the search tree can be “flattened out” maximally, the complexity of a union search can be further reduced so as to approach (but never reach) the size n of the data items.
That is, k union-find operations on n objects organized in a flattened tree need less than c \left( n + k \log^*_2(n) \right) array accesses, where \log^*() is the iterated log function (Hopcroft-Ullman, Tarjan).
BTW, the iterated logarithm is defined as the number of times \log() has to be applied in order to obtain \leq 1, and is an extremely slowly growing function (e.g. \log^*(2^{2^{16}})\sim 5).

Results and conclusions
We list the theoretical worst-case computational burden, along with typical figures for the average branch depth of the trees generated with each algorithm (obtained with sets of n=1000 integers, after 10^5 random union operations):

  • Quick-union
    Complexity of k searches on n items: O(k n)
    Average branch depth range: 124 – 133
  • Weighted quick-union
    Complexity of k searches on n items: O( n + k log2(n) )
    Average branch depth range: 5.3 – 8.3
  • Weighted quick-union with path compression
    Complexity of k searches on n items: O( n + k log*2(n) )
    Average branch depth: 2.01 (variations at the third decimal)

Path compression clearly minimizes the overhead associated with finding the set label of large collections of items.
Simple code leading to very deep mathematics and enabling huge-scale computation .. my appetite, on the other hand, is growing quickly.

Advertisements

One Response to “Union-find Algorithms”

  1. […] use two lists: L is a pointer array as in the previous post and O contains the state tags (‘#’ for closed, ‘.’ for open site). There […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: