Academic publishers behave more like libraries (hosting knowledge, curating collections).
All the _actual work_ (intellectual/experimental, writing, proofreading, peer review, typesetting) is done on a voluntary basis by mostly tax-funded academics. Therefore publishers should

0. die in a fire if unwilling to change, become tax-funded public institutions otherwise
1. provide free universal access to all publications.
2. OR, as a mutually exclusive alternative, publishers start _paying their suppliers_, like everyone else.

There, I said it.

You know why this doesn’t happen? Because academia is an ego- and jealousy-driven enterprise, and branding one’s work under prestigious logos is the only tangible* metric of success most academics can aspire to. We are nothing but neurotic shaved monkeys, deal with it.

edit: I’d like to deconstruct what I wrote above: is any of it true? And if so, does it necessarily have to be so? I.e. can this be turned into a positive statement: what drives academia (I’m referring to its research aspect only; let’s leave education aside for the time being), and why? To drive the human spirit forward by expanding knowledge and insight into the workings of the tangible (or intangible? here’s looking at you, theoreticians) world. To form the people who do so into heralds of positive change.

What do paywalled journals have to do with this? Why do we accept being reduced to currency, by an unfair economic lock-in mechanism? (This is what makes us neurotic, I think …)

* “impact factors” are b.s. numbers invented by pointy-haired management to rank clouds and solar flares by prettiness. Research is “invaluable” in the sense that 0. money goes in, 1. nothing comes out; i.e. any given publication has measure zero in terms of immediate usefulness. A single paper is NOT worth $25 of taxes or of someone’s attention; it only has worth in the context of all the others (at best**) and, in an extended sense, of all human knowledge.

What’s the origin, the source of “prestigiousness” for a journal? It’s a sort of self-fulfilling prophecy in which one’s work gains value purely by proximity to other “prestigious things”, think of it as a halo.
Sure, publishers contend that the curation process is expensive, but I’m pretty confident they have huge operating margins. Need to see the numbers, though.
edit: the citation graph is what counts, in a very literal sense. One way to make sense of the growing amount of literature is to keep track of the “hubs”, i.e. the most highly connected-to nodes: this is how the “most influential” works are recognized. That holds under “fairness” assumptions which might not always be met; excluding or including a citation has many psychological hooks that I don’t dare to fully expand on here. But the most obvious nonlinearity in the citation graph is whether the author is aware of a certain work at all.
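As a toy illustration (the paper names are entirely made up), spotting the hubs of a citation graph amounts to counting incoming links:

from collections import Counter

# toy citation graph: paper -> list of papers it cites (all names hypothetical)
cites = {'A': ['C'], 'B': ['C', 'D'], 'D': ['C'], 'E': ['C', 'D']}
indegree = Counter(ref for refs in cites.values() for ref in refs)
print(indegree.most_common(1))   # -> [('C', 4)]: 'C' is the hub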

Truth be told, an increasing number of for-profit publishers are graciously offering an “open access” option to authors. At a charge, of course: between €1,000 and €1,500 per article. Do you recognize this pattern? We’re getting s+++++d big time and have to say thank you as well!

(**) OTOH, there is such a thing as a b.s. publication, with 0 value, period. The flat price tag attached to any given paper says nothing about this vast, semi-invisible mass of b.s. clogging hard drives everywhere.

So let’s stop burying our research behind paywalls, break the addiction chain, do some actual good and open-source everything.

Instead, we’re stuck in a pusher-junkie situation, in which the substance is peer recognition, “visibility”. Immeasurable at best. Don’t you hate this state of things? Well, I do.

If you’re wondering about the cost of storage and infrastructure, Google’s rates are close to 2 dollar CENTS per GB per month. A color pdf with plenty of data inside is, say, 0.5 MB.
Say we define a “relevance lifetime” for a publication of 10 years (wild guess; it could be 1 month for biology, 50 years for civil engineering).
The hosting cost of a single pdf under these assumptions comes to 0.001171875 USD, roughly ONE TENTH OF A CENT.
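For the skeptical, here is the arithmetic in a few lines of Python (all inputs are the rough assumptions above):

rate = 0.02                 # USD per GB per month (rough Google figure)
size_gb = 0.5 / 1024        # a 0.5 MB pdf, in GB
months = 10 * 12            # assumed 10-year "relevance lifetime"
print(size_gb * rate * months)   # -> 0.001171875 USD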
So I’d dare say those publication prices go more toward Springer’s golden armchairs and fountains of infants’ blood than toward actual data hosting.

OTOH, you can never be sure about the future relevance of an article. A “relevance lifetime” could be a loose assumption, and we should never disregard or delete a paper just because of its age. However, it becomes increasingly “better known” (on average), so any market value we attach to it should decrease.

Coordinating peer review has a price, too: i.e. calling up those _volunteers_ and making them work faster. Automated reminders. The end.

I’ve completely avoided the problem of interpretation and information context so far. Value is subjective, but we have to deal with very objective monetary cost.
To a non-specialist, a Nature paper is worth exactly 0, apart from the pretty-picture “ooh!” value.
To a specialist, instead, what lies inside is not pure information, because of the interpretative “line noise” introduced by natural language.
Raw numerical data, too, has a context-dependent value; let that sink in. No two people share the same “universe of discourse”, the “set of all possible messages” introduced by C. Shannon. So how do we quantify the value of this? By the average number of “a-ha!” moments over the whole readership?

But I’m digressing. Academic publishers are a legalised scam, and we should stop feeding them.

While virtualenv (VE) is a very valuable tool, one quickly realizes that there might be a need for some usability tweaks.

Namely, activating and deactivating a VE should be quick and intuitive, much in the same way as any other shell commands are.

Enter (*drumroll*) virtualenvwrapper. This tool lets you create, activate, deactivate, and remove VEs with a single command each.

  • mkvirtualenv
  • workon : if you switch between independent Python installations, workon lets you list the available VEs and switch between them, rather than deactivating one VE and activating the next (see the sample session after this list).
  • rmvirtualenv
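
A typical session, assuming a project named <myproject> (pick any name you like):

  • mkvirtualenv <myproject>
  • workon <myproject>
  • deactivate
  • rmvirtualenv <myproject>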

Very handy.

After installing VEW, we need to set a couple of environment variables in our .bashrc or .profile file, and then we’re good to go.
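
A minimal setup looks like the following (the path to virtualenvwrapper.sh varies between systems; which virtualenvwrapper.sh will locate it):

  • export WORKON_HOME=$HOME/.virtualenvs
  • source /usr/local/bin/virtualenvwrapper.sh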

Physically, the VEs created with VEW all reside in a single folder, which should be hidden from regular usage, e.g. by giving it a dotted name (e.g. ~/.virtualenvs, the WORKON_HOME above). This effectively hides the VE engine-room details from sight, so developers can better focus on the task at hand (or, as someone put it, this tool “reduces cognitive load”).

You can also find a convincing screencast here.

So go ahead and try it!

Hello there Internets! So you’re starting up with python for data analysis and all that, yes?

Here I outline the installation steps and requirements for configuring a Python library installation using virtualenv and pip that can be used for scientific applications (number-crunching functionality, i.e. linear algebra, statistics, along with quick plotting of data, etc.).

Python tends to have somewhat obscure policies for library visibility, which can be intimidating to a beginner. Virtualenv addresses these concerns and lets you maintain self-contained Python installations, thus simplifying maintenance. It amounts to a number of hacks (with a number of caveats described here), but I find it very effective nonetheless, if you really need Python libraries in your project. In particular, it saved me from Python Package Hell, and I hope it will streamline your workflow as well.

I do not assume much knowledge on the part of the reader; however, you are welcome to ask for clarifications in the comments and I’ll reply ASAP. This tutorial addresses UNIX-like operating systems (e.g. Linux distributions, OSX, etc.). The tags delimited by angle brackets, <>, are free for the user to customize.

1) virtualenv : First thing to install. (If you have already installed it, skip to point 2).

Do NOT use the system Python installation; it leads to all sorts of inconsistencies. Either

  • pip install virtualenv

OR “clone” (make a local copy of) the github repository

2) create the virtualenv in a given directory (in this example the current directory, represented by . in UNIX systems):

  • virtualenv .

This will copy a number of commands (e.g. python, pip) and configuration files, and set up environment variables within the target directory (referred to as <venv> from here on).

Alternatively, the virtualenv can be made to use system-wide installed packages with a flag. This option might lead to inconsistencies. Use at own risk:

  • virtualenv --system-site-packages .

3) Activate the virtualenv, which means sourcing the activate script:

  • source <venv>/bin/activate

As a result of this step, the shell prompt should change and display (<venv>) 

4) Test the virtualenv, by verifying that pip and python refer to the newly-created local commands:

  • which pip
  • which python

should point to the <venv>/bin directory contained within the current virtualenv.

When you are done using the virtualenv, don’t forget to deactivate it. If necessary, rm -rf <venv> will delete the virtualenv, i.e. all the packages installed within it etc. Think twice before doing this.

5) Install all the things!

From now on, all install commands use pip, i.e. have the form pip install <package> , e.g. pip install scipy :

scipy (depends on numpy, which pip pulls in automatically; these two are the fundamentals)

pandas (data structures and various helper functions for working with numerical data)

scikit-learn (machine learning libraries, can be handy)

matplotlib (plotting functions, upon which most python plotting is built)

pyreadline for tab completion of commands

Additionally

ipython, esp. with browser-based notebooks. The install syntax will be

  • pip install "ipython[notebook]"

bokeh (pretty plots)

ggplot for those who need R-style plots. The requirements for ggplot are

  • matplotlib, pandas, numpy, scipy, statsmodels and patsy

6) Develop your scientific Python applications with this powerful array of technologies

7) Once you’re ready to distribute your application to third parties, freeze its dependencies using pip. This is another hack, but hey, we’re in a hurry to do science, right? The following two commands cover the situation in which one needs to reinstall the same dependencies on a second computer, account, or virtualenv (a sample of the generated file is shown after the list).

  • pip freeze > requirements.txt
  • pip install -r requirements.txt
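
For illustration, requirements.txt is just a plain list of pinned package versions, along these lines (the versions here are made up):

numpy==1.8.2
scipy==0.14.0
pandas==0.14.1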

That’s it for now; comment if anything is unclear, or if you find errors, or would like to suggest alternative/improved recipes.

Ciao!

Visual poetry

December 21, 2013

“It is said that paradise with virgins is delightful, I find only the juice of the grape enchanting! Take this penny and let go of a promised treasure, because the war drum sound is exhilarating only from a distance.” — Omar Khayyam (1048-1131), Iranian polymath and poet

The above is an example of a Ruba’i, a traditional Persian form of quatrain poetry. I find it beautiful on so many levels.

The size of what can be known

December 15, 2013

The Planck length is estimated at 1.616199(97) \times 10^{-35} meters, whereas the radius of the observable Universe (comoving distance to the Cosmic Microwave Background) is 46.6 \times 10^{9} light years, i.e. 4.41 \times 10^{26} meters. 

Both represent the metric limits of what we can perceive, regardless of the observation technique: the Planck length corresponds to the smallest measurable distance, whereas the observable radius of the Universe corresponds to the most ancient observable radiation (the CMB is the redshifted light released at recombination, when the Universe first became transparent to photons).
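
For a sense of scale, the ratio between the two limits spans about 61 orders of magnitude:

print(4.41e26 / 1.616199e-35)   # -> ~2.73e+61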

The existence of universes whose Planck length is larger than the observable size of ours (or whose observable universe is bounded by our Planck length) is not provable. A fractal nesting of turtles.

  • In case you manage to break your installation so badly that it won’t get much past the bootloader (say, when the contents of /etc/init/ are read .. runlevel 2?), you may want to modify the boot command line in order to gain shell access (choose the boot entry and press e), by appending init=/bin/bash at the end of the line starting with ‘linux’.
  • Your HD might be mounted in read-only mode at this stage, so you might want to remount it in read-write mode, like so: first note down the device name (e.g. /dev/sda3 ), and then call mount with the appropriate remount options: mount -o remount,rw /dev/sda3 .
  • You might need to know a couple of vi commands in order to edit the relevant configuration files (pray that you know what you’re doing). For example, i enters ‘insert’ mode, x deletes a character, and ESC returns vi to command mode, at which point you can either quit without saving with :q! or save and quit with :wq
  • Ubuntu 12 waits until the network interfaces listed in /etc/network/interfaces are brought up (see man ifup). One can override this setting by replacing start on (filesystem and static-network-up) or failsafe-boot with: start on (filesystem) or failsafe-boot in /etc/init/rc-sysinit.conf .
  • In general, having a proven working operating system on another partition/neighboring PC helps a lot. Happy sysadministration!

Area were an Italian progressive rock band, until the premature death of their frontman, Demetrio Stratos.
I find the sheer creativity and sense of freedom conveyed by their sound to be so refreshing.

Below, one of their albums. Enjoy!

The first assignment for Algorithms 1 is an estimation of the percolation threshold in an n-by-n array of sites; the system is said to percolate whenever there exists an uninterrupted series of connections between the top “edge” and the bottom one.

We need a fast lookup for the neighborhood of any given site; the neighbors() method checks whether a site is in the middle or on the sides or corners of the grid:

def neighbors(ii, n):
    "returns the list of grid neighbors of site ii in an n-by-n grid (row-major indexing)"
    l = []
    if ii % n > 0:           # not in the leftmost column
        l.append(ii - 1)
    if ii % n < n - 1:       # not in the rightmost column
        l.append(ii + 1)
    if ii > n - 1:           # not in the top row
        l.append(ii - n)
    if ii < n * (n - 1):     # not in the bottom row
        l.append(ii + n)
    return l
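
As a quick sanity check, the center site of a 3-by-3 grid (index 4) touches all four of its neighbors:

print(neighbors(4, 3))   # -> [3, 5, 1, 7], i.e. left, right, up, down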

I use two lists: L is a pointer array as in the previous post, and O contains the state tags (‘#’ for a closed site, ‘.’ for an open one). There are two extra virtual sites, serving as roots for the top and bottom rows, respectively.
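
For completeness, here is a minimal stand-in for wqupc2, consistent with how it is called below; the weighted version from the previous post also tracks subtree sizes, so treat this as a sketch rather than the original:

def root(L, i):
    "follow pointers up to the root of i, compressing the path along the way"
    while L[i] != i:
        L[i] = L[L[i]]   # path halving: point i to its grandparent
        i = L[i]
    return i

def wqupc2(L, p, q):
    "merge the components containing p and q; returns the updated pointer array"
    rp, rq = root(L, p), root(L, q)
    if rp != rq:
        L[rp] = rq   # attach one root under the other (no weighting in this sketch)
    return L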

def percolation_frac(L, O, sites):
    "opens sites one at a time; returns the open-site fraction at which the system first percolates"
    # assumes n, N = n*n, ridxTop, ridxBot are defined globally
    for ii, isite in enumerate(sites):
        O[isite] = '.'                   # open this site
        if isite < n:                    # top row: link to the top virtual site
            L[isite] = ridxTop
        elif isite >= n * (n - 1):       # bottom row (>=, so the corner site is included)
            L[isite] = ridxBot
        for inb in neighbors(isite, n):  # union with every open neighbor
            if O[inb] == '.':
                L = wqupc2(L, isite, inb)
        # percolation: top and bottom virtual sites belong to the same component
        if root(L, ridxTop) == root(L, ridxBot):
            return float(ii + 1) / N     # fraction of sites opened so far

As there is no analytical means of predicting when the percolation phase transition will occur, we simply run the experiment in Monte Carlo mode (many times, with randomized initialization) and average the per-run percolation threshold estimates.
Theory tells us that in a square lattice, this value lies around 0.5927 .
My implementation bumps into a few false positives, possibly due to the order in which the union operations are performed, thereby skewing the average toward 0.6-something. Need to look into this.
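
For reference, a minimal Monte Carlo driver along these lines (grid size and trial count are arbitrary choices; n, N and the virtual-site indices are the globals assumed above):

import random

n = 20                          # grid side
N = n * n
ridxTop, ridxBot = N, N + 1     # indices of the two virtual sites

def run_trial():
    L = list(range(N + 2))      # every site starts as its own root
    O = ['#'] * (N + 2)         # all sites closed
    sites = random.sample(range(N), N)   # open the sites in random order
    return percolation_frac(L, O, sites)

estimates = [run_trial() for _ in range(100)]
print(sum(estimates) / len(estimates))   # should land near 0.5927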