Dear blog, I’ve got the hots for Haskell.
Its terse syntax is intimidating at first, but as you keep exploring, you start seeing its beauty and even possible ways to write a Hello World program.
But let's start from the easy bits: the language makes it trivial to compose functions and, in a sense, to specialize the bare language so that it more closely represents the semantics of the computational problem at hand.

One of the most basic higher-order functions is map, which takes a unary function f and a list l and returns a new list, obtained by applying f to every element of l:

map         :: (a -> b) -> [a] -> [b]
map f []     = []
map f (x:xs) = f x : map f xs

Silly examples!

> map (+1) [1,2,3]
[2,3,4]
> map (\c -> if c=='o' then 'i' else c ) "potatoes"
"pitaties"

Similarly, the right 'fold' foldr (called 'reduce' in other languages) takes a binary function f, a starting value v and a list, and combines the elements of the list with f recursively, starting from v:

foldr :: (a -> b -> b) -> b -> [a] -> b
foldr f v [] = v
foldr f v (x:xs) = f x (foldr f v xs)
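
For instance, foldr (+) 0 [1,2,3] expands to 1 + (2 + (3 + 0)) = 6.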

Binary functions? We're not restricted to arithmetic operations! If we take f in the example above to be the function composition operator (.), and the starting element v to be the identity function id, the Haskell typechecker will infer that foldr (.) id requires just a list of a -> a functions (i.e. functions that share the same type signature as id, in this case) and a value to apply the resulting function to.
So here's compose:

compose :: [a -> a] -> a -> a
compose = foldr (.) id
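
For instance, compose [(+1), (*2)] 3 evaluates to (+1) ((*2) 3), i.e. 7.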

This idea of representing n-ary functions as daisy chains of unary functions is called “currying”, and it is crucial for reasoning about programs: it is what lets us partially apply foldr to just two of its three arguments in the example above.
Function composition itself is implemented as follows, where “\u -> .. body ..” is the syntax for an anonymous (lambda) function of a variable u:

(.) :: (b -> c) -> (a -> b) -> a -> c
(.) f g = \x -> f (g x)

Another thing of beauty is unfold, which builds a data structure up from a seed element.
In this case it is specialized to lists, so the base case is the empty list [] and the structure constructor is the list 'cons' operator (:), which prepends an element to a list.

unfold :: (a -> Bool) -> (a -> b) -> (a -> a) -> a -> [b]
unfold p h t x
  | p x       = []
  | otherwise = h x : unfold p h t (t x)

See the type signature? We need a predicate p (a “decision” function from the domain to Booleans) that tells the recursion when to stop, and two functions h and t: h x produces each element of the output (the head), while t x provides the seed for the recursive call (the tail).

Unfold can be generalized to arbitrary recursive structures, such as trees, heaps etc.

As a quick example, a function that takes a positive integer and returns its digits as a list can be formulated very compactly in terms of unfold (note the edge case: an input of 0 yields the empty list rather than [0]):

toDigits :: Integer -> [Integer]
toDigits = reverse . unfold (==0) (`mod` 10) (`div` 10)
> toDigits 29387456
[2,9,3,8,7,4,5,6]

Next up: Monoids! Functors! Monads! “Hello World”! and other abstract nonsense.

  1. Computer program power consumption
    A programming language that “minimizes” power consumption through minimal interconnect usage (e.g. fewer memory accesses).
  2. Food sourcing power consumption
    Farmland supply to cities: how to optimize land usage? What part of the produce can be made local, e.g. grown at the consumer's, or switched to hydroponics, with its cultivation brought within the city itself?

Both these problems require a grammar of solutions, rather than a single instance, due to the diversity of the operating/boundary conditions that are encountered.
As such, I don’t think that a “proof of correctness” for either can be hoped for, but perhaps a number of heuristic checks might prove the point.
The former is addressed by a single technology, whereas the latter requires a diverse array of strategies.

General considerations

  • Area and land usage
    Arbitrary rearrangement of the resources is not trivial: CPUs are designed with CAD tools that favor periodicity and reuse, and farmland restricts supply due to physiological productivity/rest cycles.
  • Time and flow
    Time plays a part as well: the edges in these supply nets do not handle a constant flow. In the first case, storage is regulated by registers, queues and stacks, whereas in the second, the flowing entities are subject to seasonal variation, degrade with time etc.

This framework is intentionally generic in order to highlight similarities, and it is of course a work in progress.
Both these problems in fact have broad political implications, which leaves plenty of space for many juicy discussions. Looking forward.

Literature

  1. An article from the NYT: A Balance Between the Factory and the Local Farm (Feb. 2010) highlights both the high costs of local (i.e. small-scale) green production, citing The $64 Tomato, and the related climatic issues (e.g. cultivation on terrain located in the snow belt).
    The article closes with “Localism is difficult to scale up enough to feed a whole country in any season. But on the other extreme are the mammoth food factories in the United States. Here, frequent E. coli and salmonella bacteria outbreaks […] may be a case of a manufacturing system that has grown too fast or too large to be managed well.
    Somewhere, there is a happy medium.” — an optimum, if you will.

Side questions

  • Why do large-scale economies “work better”, i.e. have a higher monetary efficiency that drives down prices for the end user? More effective supply chains, waste minimization, minimization of downtime …

Recursive algorithms

September 14, 2013

A recursive algorithm calls itself with a subset of the initial data until its base case is reached (e.g. data cannot be split further).

The canonical example is the factorial operation: x! = x (x-1) (x-2) \cdots 2 .
A straightforward Python function that implements this idea follows:

def fact(x):
    if x <= 1:
        return 1
    else:
        return x*fact(x-1)

In practice, recursion shines in divide-and-conquer (DNQ) algorithms (i.e. those in which the initial problem is recursively split into more manageable sub-problems), whose running time can be estimated as discussed in the next section.

Analysis: the Master Theorem
An asymptotic estimate of the running time T of a recursive DNQ algorithm with subproblems of deterministically equal size is given by the recurrence \displaystyle T(n) \leq a T(\frac{n}{b}) + O(n^d), where n is the data size, a \geq 1 is the number of recursive calls per level (i.e. the branching factor of the recursion tree) and b > 1 is the associated reduction factor in input size. The term O(n^d) accounts for the work done outside the recursive calls (e.g. splitting the input and combining the sub-results).
One can then identify three cases (for simplicity, n is assumed to be a power of b):

  • If a = b^d then T(n) = O(n^d \log n)
  • If a < b^d then T(n) = O(n^d)
  • If a > b^d then T(n) = O(n^{\log_b a})
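
As a quick illustration (a standard textbook example, not discussed above): merge sort makes a = 2 recursive calls on halves of size n/2 (so b = 2), and the merge step costs O(n), i.e. d = 1; since a = b^d, the first case applies and T(n) = O(n \log n). A minimal Python sketch:

def merge_sort(xs):
    "a = 2 recursive calls on halves of the input (b = 2)"
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    # linear-time merge, i.e. d = 1; a = b**d, hence T(n) = O(n log n)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    return out + left[i:] + right[j:]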

Union-find Algorithms

August 25, 2013

Started following Coursera-Princeton’s Algorithms 1.
Being a non-computer scientist, I started missing these theoretical foundations in some of my recent scientific work, so this course might fill a few gaps.
Too bad they don’t offer any completion certificate (as other Coursera courses do), so I won’t be submitting the programming exercises (Java, btw).
Instead, I’ll work on the assignments in Python, to my heart’s content.

We start from the very top: n objects belonging to m \leq n disjoint sets (“connected components”, in graph terms).
Often one wishes to query whether two items belong to the same set and, if they do not, to merge their two sets.
If sets are represented as trees, a common tree root node indicates that two given items belong to the same set.

Tree root-find with recursion
Data structure: L is a pointer list, i.e. a list of integers in which L[i] is the parent of element i. It can be traversed recursively to find the root (the root of element i is L[ ... L[ i ] ... ], and an element r is a root when L[r] == r):

def root(L, ii):
    "return (ridx, rd): the tree root of element ii and its depth"
    ridx = ii
    rd = 0
    if L[ii] != ii:
        ridx, rd = root(L, L[ii])
        rd += 1
    return ridx, rd
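
A tiny walkthrough (the pointer list below is just an illustrative choice):

L = [0, 0, 1, 2]     # the chain 3 -> 2 -> 1 -> 0, rooted at 0
print(root(L, 3))    # (0, 3): element 3 has root 0 and sits at depth 3
print(root(L, 0))    # (0, 0): a root is its own parent, at depth 0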

The branch() function returns the list of nodes connecting ii to its root:

def branch(L,ii):
    "return branch from ii to root as list"
    Lout=[ii]
    if L[ii] != ii:
        Lout = Lout + branch(L, L[ii]) 
    return Lout
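
With the same illustrative list, branch() simply collects the nodes visited on the way to the root:

print(branch([0, 0, 1, 2], 3))    # [3, 2, 1, 0]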

There is an interesting discussion about recursion in Python and its real implications here, the take-home message being simply that one should not implement iterative procedures (‘loops’) as recursion.

Quick-Union
This algorithm needs the items to be arranged in a set of trees (a ‘forest’), so that the set query is simplified to an identity test of the respective tree roots.
Using the code above, this simply reads:

def qu(L, ii1, ii2):
    "quick union: attach the tree of ii1 to the root of ii2"
    ridx1, rd1 = root(L, ii1)
    ridx2, rd2 = root(L, ii2)
    L[ridx1] = ridx2
    return L
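
A minimal usage sketch (indices chosen arbitrarily), using root() to answer the connectivity query:

L = list(range(5))                       # five singleton sets: 0 1 2 3 4
L = qu(L, 0, 1)                          # merge the sets of 0 and 1
L = qu(L, 2, 3)                          # merge the sets of 2 and 3
print(root(L, 0)[0] == root(L, 1)[0])    # True: 0 and 1 now share a root
print(root(L, 0)[0] == root(L, 4)[0])    # False: 4 is still on its own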

A shortcoming of this approach is that trees tend to get tall and skinny, so k searches in them, in the worst case, require O(k n) operations.

Weighted Quick-Union
The complexity of searches on union-find trees can be kept manageable by weighting: of the two subtrees, WQU attaches the smaller one to the other's root.
A more precise notion of size is in order: e.g. the total number of elements in the subtree, or the subtree depth.
If we settle for the latter definition (union-by-depth), the code reads:

def wqu(L, ii1, ii2):
    "weighted quick union: attach the shallower tree to the root of the deeper one"
    ridx1, rd1 = root(L, ii1)
    ridx2, rd2 = root(L, ii2)
    if rd1 <= rd2:
        L[ridx1] = ridx2
    else:
        L[ridx2] = ridx1
    return L
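
A tiny walkthrough with arbitrary indices of my choosing, showing that attaching the shallower tree keeps the depth low:

L = list(range(4))    # four singleton sets
L = wqu(L, 0, 1)      # two singletons: 0 goes under 1
L = wqu(L, 2, 3)      # likewise, 2 goes under 3
L = wqu(L, 1, 3)      # equal depths: the tree rooted at 1 goes under root 3
print(root(L, 0))     # (3, 2): all four elements hang from root 3, at depth <= 2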

This balancing keeps the complexity of k searches on L at O(n + k\log_2(n)): the depth of any node is at most \log_2(n), since the subtree containing it at least doubles in size every time that node's depth increases by one.

Weighted Quick-Union with Path Compression
An optimization on WQU is given by “path compression”: at union time, every node on the path from the node to be attached up to its old root is pointed directly at the root of the larger tree.
A (single-pass) WQUPC code reads:

def wqupc2(L, ii1, ii2):
    "WQU with path compression"
    B1 = branch(L, ii1)
    B2 = branch(L, ii2)
    rd1 = len(B1)
    rd2 = len(B2)
    ridx1 = B1[-1]
    ridx2 = B2[-1]
    if rd1 <= rd2:
        for ii in B1:
            L[ii] = ridx2
    else:
        for ii in B2:
            L[ii] = ridx1
    return L   
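
A small illustration of the compression effect (example indices of my choosing): merging the two chains 0 -> 1 and 2 -> 3 leaves every node on the attached branch pointing directly at the new root.

L = [1, 1, 3, 3]          # two chains: 0 -> 1 and 2 -> 3
print(wqupc2(L, 0, 2))    # [3, 3, 3, 3]: 0 and its old root 1 now point straight at root 3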

While doing what it says on the label, the code above uses two recursive calls to find the branch roots (with the known memory and stack-depth limitations of recursion), stores both branches, and loops along the smaller of the two.
A smarter implementation is clearly necessary for large-scale applications.
If the search tree can be “flattened out” maximally, the complexity of a union search can be further reduced so as to approach (but never reach) the size n of the data items.
That is, k union-find operations on n objects organized in a flattened tree need less than c \left( n + k \log^*_2(n) \right) array accesses, where \log^*() is the iterated log function (Hopcroft-Ullman, Tarjan).
BTW, the iterated logarithm is defined as the number of times \log() has to be applied in order to obtain \leq 1, and is an extremely slowly growing function (e.g. \log^*(2^{2^{16}})\sim 5).
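
One possible smarter route (a sketch of my own, not part of the course material) is an iterative root-find that compresses the path as it walks, avoiding both recursion and the intermediate branch lists:

def find(L, ii):
    "iterative root-find with full path compression"
    # first pass: walk up to the root
    r = ii
    while L[r] != r:
        r = L[r]
    # second pass: point every node on the path directly at the root
    while L[ii] != r:
        L[ii], ii = r, L[ii]
    return r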

Results and conclusions
We list the theoretical worst-case computational burden, along with typical figures for the average branch depth of the trees generated by each algorithm (obtained with sets of n=1000 integers, after 10^5 random union operations; a sketch of the measurement procedure follows the list):

  • Quick-union
    Complexity of k searches on n items: O(k n)
    Average branch depth range: 124 – 133
  • Weighted quick-union
    Complexity of k searches on n items: O(n + k \log_2(n))
    Average branch depth range: 5.3 – 8.3
  • Weighted quick-union with path compression
    Complexity of k searches on n items: O(n + k \log^*_2(n))
    Average branch depth: 2.01 (variations at the third decimal)
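
Figures of this kind can be obtained with a short driver along the following lines (my reconstruction of the experiment, not the original script; the recursion limit is raised because root() and branch() are themselves recursive):

import random
import sys

sys.setrecursionlimit(10000)

def avg_branch_depth(L):
    "average number of hops from an element to its root"
    return sum(len(branch(L, i)) - 1 for i in range(len(L))) / len(L)

n = 1000
L = list(range(n))
for _ in range(10**5):
    L = qu(L, random.randrange(n), random.randrange(n))    # or wqu / wqupc2
print(avg_branch_depth(L))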

Path compression clearly minimizes the overhead associated with finding the set label of large collections of items.
Simple code leading to very deep mathematics and enabling huge-scale computation... my appetite, on the other hand, is growing quickly.