Git and Scientific Reproducibility

I firmly believe that scientists and engineers—particularly scientists, by the way—should learn about, and use, version control systems (VCS) for their work. Here is why.

I’ve been a user of free VCSs for a while now, beginning with my first exposure to CVS at CERN in 2002, through my discovery of Subversion during my doctoral years at EPFL, culminating in my current infatuation with Git as a front-end to Subversion. I’m now a complete convert to that system and could not imagine working without it. Every week I discover new use cases for this tool that I had not thought about before (and that I suspect the Git developers didn’t, either).

This week I found such a use case for Git: enforcing scientific reproducibility. Let me explain. I’m currently working on prototype software written in MATLAB that implements some advanced algorithms for the smart, predictive control of heating in buildings. As part of that work we need to evaluate several competing algorithm designs, and try out different parameters for the algorithms.

The traditional way of doing this is, of course, to set all your parameters right in your code for the first simulation, to run it, then to set the parameters right for the second one, to run it again, and so on. There are several problems with this approach.

First, you need a really good naming convention for the data you are going to generate to make sure that you know exactly which parameters you set for each run. And coming up with a good naming scheme for data files is not trivial.

Second, even if your data file naming convention is good enough that you can easily reproduce the experiment, how can you be sure that the settings are exactly right? That you didn’t, perhaps, tweak just that little extra configuration file just to work around that little bug in the software?

Third, how will you reproduce those results? Even assuming that you ran all your simulations based on a given, well-known revision number in your VCS (you do use a VCS, don’t you?), you will still need to dive into the code and set those configuration parameters yourself. A tedious, error-prone process, even if you manage to keep them all in one source file.

I think a system like Git solves all these problems. Here is how I did it.

I needed to run 7 simulations with different parameters, based on a stable version of our software, say r1409 in our Subversion repository.

I’m using Git as a front-end to Subversion. I began by creating a local branch (something Git, not Subversion, will let you do):

$ git checkout -b simulations_based_on_r1409

This will create a new branch from the current HEAD. Now the idea is to make a local commit on that local branch for each different set of parameters. Here is how:

  1. Edit your source code so that all parameters are set right.
  2. Commit the changes on your local branch:
    $ git commit -am "With parameter X set to Y"
    [simulations_based_on_r1409 66cea68] With parameter X set to Y
  3. Note the 7 characters (66cea68 above) next to the branch name. These are the first 7 characters of the SHA-1 hash of the commit you just made, which Git computes from the entire state of your project and its history.
  4. Run your simulation. Log the results, along with the short hash.
  5. Repeat the steps above for each different configuration you want to run the experiment with.
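The steps above can be sketched as a small shell loop. This is only an illustration under stated assumptions: the file name `params.m`, the values of X, the branch name `simulations_demo` and the arithmetic standing in for the simulation are all hypothetical, and the whole thing runs in a throwaway repository so that it does not touch a real project. The one thing it changes from the manual procedure is that it asks Git for the short hash with `git rev-parse --short HEAD` instead of copying it from the commit output.

```shell
# Minimal sketch of the commit-then-run loop, in a throwaway repository.
# params.m, the values of X and the "simulation" are hypothetical stand-ins.
set -e

repo=$(mktemp -d)      # scratch project standing in for the real one
logbook=$(mktemp)      # logbook kept outside the repository

git -C "$repo" init -q
# Identity for the scratch repository only, so that commits work anywhere.
git -C "$repo" config user.name "Demo User"
git -C "$repo" config user.email "demo@example.com"

# A stable baseline, then a local branch for the simulation runs.
echo "X = 0;" > "$repo/params.m"
git -C "$repo" add params.m
git -C "$repo" commit -q -m "Stable baseline"
git -C "$repo" checkout -q -b simulations_demo

printf 'hash\tX\tresult\n' > "$logbook"
for x in 23 42 64; do
    # Step 1: set the parameter in the source.
    echo "X = $x;" > "$repo/params.m"
    # Step 2: commit the change on the local branch.
    git -C "$repo" commit -q -am "With parameter X set to $x"
    # Step 3: read the short hash of the commit just made.
    short=$(git -C "$repo" rev-parse --short HEAD)
    # Step 4: run the simulation (placeholder arithmetic) and log it.
    result=$((x * 2))
    printf '%s\t%s\t%s\n' "$short" "$x" "$result" >> "$logbook"
done

cat "$logbook"
```

In real use the placeholder arithmetic would be replaced by whatever command launches the simulation, and the logbook would live somewhere more permanent than a temporary file.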

By the end of this process, you should have in your logbook a list of experimental results along with the short hash of the entire project as it looked during that experiment. It might, for instance, look something like this:

    Hash     Parameter X  Parameter Y  Result
    66cea68  23           42           1024
    a4f683f  etc          etc          etc

As you can see there are at least two reasons why it’s important to record the short hash:

  1. It will let you go back in time and reproduce an experiment exactly as it was when you ran it first.
  2. It will force you to commit all changes before running the experiment, which is a good thing.
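Both points can be illustrated with a few Git commands. The sketch below is an assumption-laden demo, not a prescribed procedure: `require_clean_tree` is a hypothetical helper name, `params.m` is a made-up parameter file, and the repository is a scratch one created on the spot. The helper relies on `git status --porcelain`, which prints nothing when there are no uncommitted changes, and the last step uses `git checkout` with the logged hash to restore the project exactly as it was for that run.

```shell
# Sketch: refuse to run with uncommitted changes, then go back to a
# logged hash. require_clean_tree and params.m are hypothetical names.
set -e

require_clean_tree() {
    # "git status --porcelain" prints one line per changed or untracked file.
    if [ -n "$(git -C "$1" status --porcelain)" ]; then
        echo "uncommitted changes in $1; commit before running" >&2
        return 1
    fi
}

# Scratch repository standing in for the real project.
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.name "Demo User"
git -C "$repo" config user.email "demo@example.com"

echo "X = 23;" > "$repo/params.m"
git -C "$repo" add params.m
git -C "$repo" commit -q -m "With parameter X set to 23"
logged_hash=$(git -C "$repo" rev-parse --short HEAD)   # as written in the logbook

echo "X = 42;" > "$repo/params.m"
git -C "$repo" commit -q -am "With parameter X set to 42"

# A stray edit that was never committed: the guard should reject it.
echo "X = 99;" > "$repo/params.m"
if require_clean_tree "$repo"; then guard=passed; else guard=blocked; fi
git -C "$repo" checkout -q -- params.m   # discard the stray edit

# Restore the exact state recorded for the first experiment.
git -C "$repo" checkout -q "$logged_hash"
cat "$repo/params.m"
```

After the final checkout the working tree holds the parameters of the first commit again, so the experiment can be rerun as it was. Note that checking out a bare hash leaves the repository in "detached HEAD" state; switching back to the branch afterwards returns to the latest run.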

I’ve been running a series of simulations using a variation on this process, whereby I actually run several simulations in parallel on my 8-core machine. For this to work you need to clone your entire project, once per simulation. Then for each simulation you check out the right version of your project, and run the experiment.
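Here is a rough sketch of that parallel variation, under the same kind of assumptions as before: a scratch repository, a hypothetical `params.m`, and a file copy standing in for the real simulation. Each run gets its own clone, checks out its own commit, and is started in the background; `wait` blocks until all of them are done.

```shell
# Sketch: one clone per simulation, each at its own commit, run in parallel.
# params.m and the "simulation" (a file copy) are hypothetical stand-ins.
set -e

# Scratch repository with one commit per parameter set.
src=$(mktemp -d)
git -C "$src" init -q
git -C "$src" config user.name "Demo User"
git -C "$src" config user.email "demo@example.com"

hashes=""
for x in 23 42 64; do
    echo "X = $x;" > "$src/params.m"
    git -C "$src" add params.m
    git -C "$src" commit -q -m "With parameter X set to $x"
    hashes="$hashes $(git -C "$src" rev-parse --short HEAD)"
done

# One working copy per logged hash, so the runs cannot disturb each other.
work=$(mktemp -d)
for h in $hashes; do
    (
        git clone -q "$src" "$work/$h"
        git -C "$work/$h" checkout -q "$h"
        # Placeholder for the real simulation: save the parameters it saw.
        cp "$work/$h/params.m" "$work/$h.out"
    ) &
done
wait   # block until every background run has finished

ls "$work"
```

Naming each clone and each output file after the short hash keeps the link between a result and the exact code that produced it, which is the whole point of logging the hash in the first place.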

Quite seriously, I would never have been able to do anything remotely like this with a centralized version control system. The ability to create local branches and to commit to them is a truly awesome feature of distributed version control systems such as Git. I don’t suppose the Git developers had scientists and engineers in mind when they developed this system, but hey, here we are.

Are you a scientist or an engineer wishing to dramatically improve your way of working? Then run, do not walk, to read the best book on Git there is.

Thou shalt save energy

I’m not sure anyone else is saying this, so I will: I think **there is
a clear and unambiguous scriptural mandate to reduce our current
energy consumption**.

Now before you dismiss this post, this author and this blog as just
another bible-thumping fanatic, remember that in certain countries,
certain political parties profess strict adherence to Scripture while
being overtly skeptical about the whole climate warming problem. I
think they are wrong and here’s why.

Let’s first review why, from a scriptural point of
view, one could in principle argue that whether we take action or let things run
as they are makes no difference. I’ve heard some of these arguments from very good
(Christian) friends, and I hope I’m not going to offend anyone by
refuting them later in this post:

* Revelation 21:1 tells us that all of creation will eventually be
destroyed and replaced by a new one:

Then I saw a new heaven and a new earth; for the first heaven and the first earth passed away, and there is no longer any sea.

* God is sovereign, so no matter what we do, things will run according
to His will.

* It is highly arrogant for Man to believe that he can do anything
about the climate.

The last two arguments are probably the easiest to answer. God is
certainly sovereign, but that doesn’t remove our responsibility for
doing the right things and making the right choices in life. In fact,
God intends us to be co-creators with Him and to participate, so to
speak, in the creative act. This point has been persuasively argued
for by several authors such as C.S. Lewis and J.I. Packer.

The big problem with the first argument is that, even though the
current creation is indeed doomed in the long run, God asked us from
the beginning to take care of it, cf. Genesis 1:28:

God blessed them; and God said to them, “Be fruitful and
multiply, and fill the earth, and subdue it; and rule over the fish of
the sea and over the birds of the sky and over every living thing that
moves on the earth.”

See that? Commandment number 3 in the whole Bible: subdue the Earth and
rule over it. No mention here of letting things run their course simply
because the creation is about (in the biblical perspective) to be
replaced.

Ah but you might argue that this command was given *before* the Fall,
and that everything went downhill since then. You’re right about the
downhill part, but look at this, Gen 3:23:

therefore the Lord God sent him out from the garden of
Eden, to cultivate the ground from which he was taken.

Man is kicked out of the Garden of Eden, and what is he to do?
Essentially the same thing, i.e. rule over the Earth and cultivate it
and take care of it. The only difference being, of course, that now
it’s going to be painful to do so (Gen 3:19).

The mandate to take care of creation is repeated several times, for
instance right after Noah comes out of the Ark after the Flood, Gen
9:1-2:

And God blessed Noah and his sons and said to them, “Be
fruitful and multiply, and fill the earth. The fear of you and the
terror of you will be on every beast of the earth and on every bird of
the sky; with everything that creeps on the ground, and all the fish
of the sea, into your hand they are given.”

Or of course Psalm 8:5-6:

Yet You have made him a little lower than God,
And You crown him with glory and majesty! You make him to rule over
the works of Your hands; You have put all things under his
feet.

God clearly intends us to rule and manage His creation, no matter what
ultimate fate awaits it.

But the real surprise came to me while re-reading the following
(Deut. 22:6-7):

“If you happen to come upon a bird’s nest along the way,
in any tree or on the ground, with young ones or eggs, and the mother
sitting on the young or on the eggs, you shall not take the mother
with the young; you shall certainly let the mother go, but the young
you may take for yourself, in order that it may be well with you and
that you may prolong your days.”

This was one of the many commands given to Israel before entering the
promised land. The spirit of this passage, and of others like it, is
unambiguous: God is asking us simply to be **extremely careful in managing our
natural resources**. We are forbidden to view Earth as just a vast source of
riches to be exploited as quickly and efficiently as possible. We are
explicitly commanded to make sure that Earth can go on being such a
source of riches, indefinitely if needed.

(I’ve even read somewhere that the number of years Israel spent in
Babylonian captivity, 70, corresponds to the number of years the land
should have been allowed to rest since Israel took possession of it,
but didn’t. Here again, the importance of allowing natural resources
to replenish themselves is evident.)

The concept of “rest” is a potent one in Scripture. We are to rest
once a week. The land was to rest once every 7 years. We were supposed
to leave alone the corners of our fields that weren’t harvested. In
other words, **Scripture is full of passages mandating a careful
management of our natural resources**. Arguing that we can do whatever
we want with Earth simply because it is doomed to the eternal fire
anyway is not only lazy and criminal, it is also doctrinally false.