While virtualenv (VE) is a very valuable tool, one quickly realizes that there might be a need for some usability tweaks.

Namely, activating and deactivating a VE should be quick and intuitive, much in the same way as any other shell commands are.

Enter (*drumroll*) virtualenvwrapper. This tool allows you to create, activate and deactivate and remove VEs with a single command each.

  • mkvirtualenv
  • workon : if you switch between independent Python installations, workon  lets you see the available VEs and switch between them, rather than deactivating one VE and activating the next one.
  • rmvirtualenv

Very handy.

After installation of VEW, we need to set up a couple of environment variables in our .bashrc or .profile file, and then we’re good to go.

Physically, the VEs created with VEW all reside in a single folder, which should be hidden from regular usage by e.g. giving it a dotted name ( e.g. ~/.virtualenvs ). This effectively hides the VE engine room details from sight, so developers can better focus on the task at hand (or, as someone says, this tool “reduces cognitive load”).

You can also find a convincing screencast here.

So go ahead and try it !


Hello there Internets! So you’re starting up with python for data analysis and all that, yes?

Here I outline the installation steps and requirements for configuring a python library installation using virtualenv and pip that can be used for scientific applications (number crunching functionality i.e. linear algebra, statistics .. along with quick plotting of data etc.).

Python tends to have somewhat obscure policies for library visibility, which can be intimidating to a beginner. Virtualenv addresses these concerns and allows to maintain self-contained python installations, thus simplifying maintenance. It amounts to a number of hacks (with a number of caveats described here), but I find it to be very effective nonetheless, if you really need Python libraries in your project. In particular, it saved me from Python Package Hell, and I hope it will streamline your workflow as well.

I do not assume much knowledge on the part of the reader, however you are welcome to ask for clarifications in the comments and I’ll reply ASAP. In this tutorial we address UNIX-like operating systems (e.g. Linux distributions, OSX etc.). The tags delimited by angular brackets, <> are free for the user to customize.

1) virtualenv : First thing to install. (If you have already installed it, skip to point 2).

Do NOT use the system Python installation, it leads to all sorts of inconsistencies. Either

  • pip install virtualenv

OR “clone” (make a local copy) the github repository

2) create the virtualenv in a given directory (in this example the current directory, represented by . in UNIX systems):

  • virtualenv .

This will copy a number of commands (e.g. python, pip), configuration files and setup environment variables within the <venv> directory.

Alternatively, the virtualenv can be made to use system-wide installed packages with a flag. This option might lead to inconsistencies. Use at own risk:

  • virtualenv –system-site-packages .

3) Activate the virtualenv, which means parsing the activate script:

  • source /bin/activate

As a result of this step, the shell prompt should change and display (<venv>) 

4) Test the virtualenv, by verifying that pip and python refer to the newly-created local commands:

  • which pip
  • which python

should point to a /bin directory contained within the current virtualenv.

When you are done using the virtualenv, don’t forget to deactivate it. If necessary, rm -rf <venv> will delete the virtualenv, i.e. all the packages installed within it etc. Think twice before doing this.

5) Install all the things!

From now on, all install commands use pip, i.e. have the form pip install <package> , e.g. pip install scipy :

scipy (ships with numpy, so it is fundamental)

pandas (various helper functions for numerical data structures)

scikit-learn (machine learning libraries, can be handy)

matplotlib (plotting functions, upon which all python plotting is build)

pyreadline for tab completion of commands


ipython, esp. with browser-based notebooks. The install syntax will be

  • pip install “ipython[notebook]”

bokeh (pretty plots)

ggplot for those who need R-style plots. The requirements for ggplot are

  • matplotlib, pandas, numpy, scipy, statsmodels and patsy

6) Develop your scientific Python applications with this powerful array of technologies

7) Once you’re ready to distribute your application to third parties, freeze its dependencies using pip. This is another hack, but hey, we’re in a hurry to do science right? The following two statements represent the situation in which one needs to install the dependencies on a second computer, account or virtualenv.

  • pip freeze > requirements.txt
  • pip install -r requirements.txt

That’s it for now; comment if anything is unclear, or if you find errors, or would like to suggest alternative/improved recipes.


Recursive algorithms

September 14, 2013

A recursive algorithm calls itself with a subset of the initial data until its base case is reached (e.g. data cannot be split further).

The canonical example is the factorial operation: x! = x (x-1) (x-2) \cdots 2 .
A straightforward Python function that implements this idea follows:

def fact(x):
    if x <= 1:
        return 1
        return x*fact(x-1)

In practice, recursion actually can speed up the execution of divide-and-conquer (DNQ) algorithms (i.e. in which the initial problem is sequentially split in more manageable sub-problems), as discussed in the next section.

Analysis: the Master Theorem
An asymptotic estimate of the running time T of a recursive DNQ algorithm with deterministically equal size subproblems is given by the following formula: \displaystyle T(n) \leq a T(\frac{n}{b}) + O(n^d), where n is the data size, a \geq 1 and b >  1 are the number of recursive calls per level (i.e. the branching factor of the recursion tree) and the associated reduction in input size, and O(n^d) describes the computational effort associated with preprocessing (i.e. all that is done before the recursions begin).
One can then identify three cases (for simplicity, n is assumed to be a power of b):

  • If a = b^d then T(n) = O(n^d \log n)
  • If a < b^d then T(n) = O(n^d)
  • If a > b^d then T(n) = O(n^{\log_b a})

The first assignment for Algorithms 1 is an estimation for the percolation threshold in an n-by-n array of sites; the system is said to percolate whenever there exists an uninterrupted series of connections between the top “edge” and the bottom one.

We need a fast lookup for the neighborhood of any given site; the neighbors() method checks whether a site is in the middle or on the sides or corners of the grid:

def neighbors(ii, n):
    "returns list of neighbors of ii in n-by-n grid"
    l = []
    if mod(ii, n) > 0:
    if mod(ii-n+1, n) > 0 or ii == 0:
        l.append(ii + 1)
    if ii > n-1:
        l.append(ii - n)
    if ii < n*(n-1):
        l.append(ii + n)
    return l

I use two lists: L is a pointer array as in the previous post and O contains the state tags (‘#’ for closed, ‘.’ for open site). There are two extra sites, serving as roots for the top, resp. bottom sites.

def percolation_frac(L, O, sites):
    for ii, isite in enumerate(sites):
        O[isite] = '.'
        if isite < n:
            L[isite] = ridxTop
        elif n*(n-1) < isite <= N:
            L[isite] = ridxBot
        neighb = neighbors(isite, n)
        for inb in neighb:
            if O[inb]=='.':
                L = wqupc2(L, isite, inb)                
        if L[-1]==L[-2]: 
            """ percolation: 
            top and bottom virtual sites 
            belong to same component
            pf = float(ii) / N
            return pf

As there is no analytical means of predicting when the percolation phase transition is going to occur, we just run the experiment in Monte Carlo mode (many times with randomized initialization), and average the per-run percolation threshold estimate.
Theory tells us that in a square lattice, this value lies around 0.5927 .
My implementation bumps into a few false positives, possibly due to the order in which the union operation is performed, thereby skewing the average more toward 0.6something. Need to look into this.

Union-find Algorithms

August 25, 2013

Started following Coursera-Princeton’s Algorithms 1.
Being a non-computer scientist, I started missing these theoretical foundations in some of my recent scientific work, so this course might fill a few gaps.
Too bad they don’t offer any completion certificate (as other Coursera courses do), therefore I won’t be submitting the programming exercises (Java, btw).
Instead, I’ll work on the assignments in Python, to my heart’s content.

We start from the very top: n objects belonging to m \leq n disjoint sets (“connected components”, in graph terms).
Often one wishes to query whether two items belong to the same set and, in case, merge two sets.
If sets are represented as trees, a common tree root node indicates that two given items belong to the same set.

Tree root-find with recursion
Data structure: L is a pointer list, a list of integers which can be traversed recursively to find the root (i.e. root of element i is L[ ... L[ i ] ... ] ):

def root(L,ii):
    "ridx = tree root of element ii"
    "rd = depth of ii"
    ridx = ii
    rd = 0
    if L[ii] != ii:
        ridx, rd = root(L, L[ii])
        rd += 1    
    return ridx, rd

The branch() method outputs the list of nodes connecting ii to its root:

def branch(L,ii):
    "return branch from ii to root as list"
    if L[ii] != ii:
        Lout = Lout + branch(L, L[ii]) 
    return Lout

There is an interesting discussion about recursion in Python and its real implications here, the take-home message being simply that one should not implement iterative procedures (‘loops’) as recursion.

This algorithm needs the items to be arranged in a set of trees (a ‘forest’), so that the set query is simplified to an identity test of the respective tree roots.
Using the code above, this simply reads:

def qu(L, ii1, ii2):
    "attach subtree of ii2 to root of ii1"
    ridx1, rd1 = root(L, ii1)
    ridx2, rd2 = root(L, ii2)
    L[ridx1] = ridx2
    return L

A shortcoming of this approach is that trees tend to get tall and skinny, so k searches in them, in the worst case, require O(k n) operations.

Weighted Quick-Union
The complexity of a search on Union-Find trees can be kept manageable by weighting, i.e. one subtree in WQU is attached to the second’s root if the former’s size is smaller.
A more precise notion of size is in order: e.g. the total number of elements in the subtree or the subtree depth.
If we settle for the latter definition (union-by-depth),

def wqu(L, ii1, ii2):
    "weighted quick union"
    "attach subtree of ii2 to subtree of ii1,"
    "if the root distance of former is smaller"
    ridx1, rd1 = root(L, ii1)
    ridx2, rd2 = root(L, ii2)
    if rd1 <= rd2:
        L[ridx1] = ridx2
        L[ridx2] = ridx1
    return L  

This balancing helps to keep the complexity of a search on L at n+k\log_2(n) (where \log_2(n) is the average number of subtree size doublings).

Weighted Quick-Union with Path Compression
An optimization on WQU is given by “path compression”: at union time, all the parents of the node to be attached are set to be the root node of the larger tree.
A (single-pass) WQUPC code reads:

def wqupc2(L, ii1, ii2):
    "WQU with path compression"
    B1 = branch(L, ii1)
    B2 = branch(L, ii2)
    rd1 = len(B1)
    rd2 = len(B2)
    ridx1 = B1[-1]
    ridx2 = B2[-1]
    if rd1 <= rd2:
        for ii in B1:
            L[ii] = ridx2
        for ii in B2:
            L[ii] = ridx1
    return L   

While doing what it says on the label, the code above uses two recursive calls to find the branch roots (with the known memory and depth size limitations of recursion), storage of the branches and a loop along the smaller of the two branches.
A smarter implementation is clearly necessary for large-scale applications.
If the search tree can be “flattened out” maximally, the complexity of a union search can be further reduced so as to approach (but never reach) the size n of the data items.
That is, k union-find operations on n objects organized in a flattened tree need less than c \left( n + k \log^*_2(n) \right) array accesses, where \log^*() is the iterated log function (Hopcroft-Ullman, Tarjan).
BTW, the iterated logarithm is defined as the number of times \log() has to be applied in order to obtain \leq 1, and is an extremely slowly growing function (e.g. \log^*(2^{2^{16}})\sim 5).

Results and conclusions
We list the theoretical worst-case computational burden, along with typical figures for the average branch depth of the trees generated with each algorithm (obtained with sets of n=1000 integers, after 10^5 random union operations):

  • Quick-union
    Complexity of k searches on n items: O(k n)
    Average branch depth range: 124 – 133
  • Weighted quick-union
    Complexity of k searches on n items: O( n + k log2(n) )
    Average branch depth range: 5.3 – 8.3
  • Weighted quick-union with path compression
    Complexity of k searches on n items: O( n + k log*2(n) )
    Average branch depth: 2.01 (variations at the third decimal)

Path compression clearly minimizes the overhead associated with finding the set label of large collections of items.
Simple code leading to very deep mathematics and enabling huge-scale computation .. my appetite, on the other hand, is growing quickly.