deploy codes fast

This post recaps a few deployment strategies necessary for staying sane when working with codes that depend heavily on other software. The methods I’m describing here are part of my standard operating procedure (SOP) for training new students and collaborators. Even though these lessons apply best to my biophysics research, they are also broadly applicable to anybody who uses linux, python, and other common programming tools. My advice is easy to summarize. You should only install new codes when absolutely necessary, only change your $PATH in predictable ways, and use a makefile trick for rapidly constructing a command-line interface to python for basic scripting or calls to more elaborate modules.

Begin at the terminal

Computer simulations rely on highly specialized tools thanks to a number of pressures that select against slick, simple, intuitive interfaces. Basic science research is typically conducted in small groups, and as a result, there is little room in the labor or dollar budget for making the tools easy to use. The winding course of a typical research project also necessitates many ad hoc tools with very low initial costs. Methods which are easy to change are therefore easy to fix and improve, whereas standardized tools come with constraints and inevitable information loss. Researchers linger far less on methodological questions when they have access to readymade tools, and in some cases, this can cause some methods to become entrenched even if they don’t work correctly. Developing expendable tools, at least at first, is necessary to manage the risks that come with trying something new.

The high cost of designing usable interfaces means that most of my day-to-day work happens at a terminal, specifically a BASH prompt on a linux machine. The worst you can say about a terminal is that it’s a stickler for correct grammar and basically incapable of making a mistake. This makes it a nearly perfect tool for communicating with a machine. Terminals are some of the oldest application programming interfaces (APIs) and benefit from a long history of standardization (e.g. the POSIX family of standards).

Evolving code

Terminals are both precise and simple, which makes them appealing interfaces for our codes since we don’t have to spend a lot of time building or troubleshooting the interface. But most of our codes are highly dynamic, changing from day to day as we develop them. We use a second tool, a version control program called git, to do the thankless work of remembering the complete history of a code. Before I discovered version control, I would make countless copies of a single code as I added features and tested them. There’s no judgment here – this method works fine for small projects! Git helps by automatically remembering changes to your code. It never discards old versions of code that you have “committed” to the historical record, and this allows you to go back in time to look at older versions. Even if you don’t spend time reading your old code, it liberates you from worrying about saving your half-baked ideas. Interested users should check out the tutorials.

Git also provides a less obvious, but extremely important benefit: it makes it easy to share codes. Useful codes can be easily distributed over the web with services like github. Git repositories are also extremely portable over ssh connections and the program has many tools that make it easy for many users to contribute to a single code. I publish working codes to github and exchange private repositories between many machines during development.

Installation madness

Using both a terminal and a version control program gives you a standard way to use, remember, and exchange codes. This standardization is most useful when distributing codes to new users. But even this is not quite enough. Each new user must be able to run the code as well. And yet, most of my codes depend on libraries and executables which are clumsy to distribute.

To make sure that new users can run my codes, I have helped develop a simple tool that uses Anaconda (or alternately virtualenv) to install the python dependencies necessary to analyze molecular simulations. You can install Anaconda easily on your own, but I keep a list of my favorite dependencies in the factory, which installs the virtual environment with a (lengthy) one-liner. Using a virtual environment to handle a complex set of python dependencies has three huge advantages. (1) Mistakes can be erased without destroying any system software. (2) It’s easy to ensure new users have all of the right software, thereby avoiding dependency hell. (3) Anaconda provides software that would otherwise require sudo to install, since it carefully manages a pseudo-sandbox for compiled codes.
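
The first advantage is easy to demonstrate with a minimal sketch. This is not how the factory builds its environment; it assumes only Python's standard venv module, and the helper names make_env and remove_env are hypothetical. The environment lives entirely in one directory, so erasing a mistake means deleting that directory.

```python
import shutil
import tempfile
import venv
from pathlib import Path


def make_env(env_dir):
    """Create a bare virtual environment and return its directory."""
    env_dir = Path(env_dir)
    # with_pip=False keeps this sketch fast and offline; a real setup would
    # enable pip and then install a saved list of dependencies
    venv.create(env_dir, with_pip=False)
    return env_dir


def remove_env(env_dir):
    """Erase the environment; nothing outside env_dir is touched."""
    shutil.rmtree(env_dir)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        env = make_env(Path(scratch) / "env")
        print("created:", (env / "pyvenv.cfg").is_file())
        remove_env(env)
        print("still there:", env.exists())
```

The same idea applies to an Anaconda environment, which also keeps its packages under a single prefix.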

There are certainly more powerful, properly sandboxed virtualization methods, most notably Docker. We use these to test code on different versions of linux. Anaconda is a happy medium, because it means I only have to “install” two programs: an integrator like GROMACS to run molecular dynamics simulations, and a copy of the factory to analyze them (and Visual Molecular Dynamics for good measure).

These austere methods help prevent the cascade of software failures that becomes increasingly likely when your codes are too dependent on fragile or evolving software packages.

Overloading the makefile

So, whenever possible, I use python packages installed via the factory to analyze my simulations. Thanks to Anaconda, these packages can link to other compiled codes which require more elaborate libraries. A good example of this is mayavi, which uses VTK for 3D visualization. Anaconda installs mayavi without superuser permissions, which is a major advantage for users who don’t wish to install system-wide packages.

Most of my codes find the simulator (e.g. GROMACS) in the path because I install it in a central place. Likewise, I always source the Anaconda environment before doing any heavy-duty coding that requires python or related codes. Everything else, in particular the relatively small codes that I develop, needs a command-line interface that is easy to find.
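
A code that relies on the path should fail loudly when the simulator is missing. The sketch below is one way to do that with the standard shutil.which function. The helper name find_tool is hypothetical, and gmx is assumed to be the name of the GROMACS executable on your machine.

```python
import os
import shutil


def find_tool(name, search_path=None):
    """Return the full path to an executable, or raise a readable error.

    search_path uses the same colon-separated format as $PATH and
    defaults to the current $PATH.
    """
    location = shutil.which(name, path=search_path)
    if location is None:
        if search_path is None:
            search_path = os.environ.get("PATH", "")
        raise FileNotFoundError(
            f"cannot find '{name}' in: {search_path}\n"
            "source the environment or install the program in a central place"
        )
    return location


if __name__ == "__main__":
    try:
        # assumption: the simulator is installed as "gmx"
        print("using simulator at", find_tool("gmx"))
    except FileNotFoundError as error:
        print(error)
```

Checking once at startup gives new users a clear instruction instead of a cryptic failure halfway through a calculation.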

The most typical solution to this problem is to develop an “argument parser” for each program. There are many libraries for doing this, from Python’s argparse to BASH scripts that read the command-line. However, since interacting with code takes so many different forms, I’ve started to use a simple makefile to route all commands to make into python libraries in a manner which mimics the POSIX-style syntax used by many commands that run from the terminal. This makefile is used in several of the codes that I contribute to, including factory, cassette, automacs, and omnicalc. Here’s how it works.

Whenever you run make, all of the subsequent arguments are parsed by python. You can use single arguments like make clean or keyword arguments like make upload path="/share/simulations". The first argument is the name of a python function which can be found in any number of target scripts, while everything else is parsed into arguments for that function.
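
The routing step can be sketched in a few lines of python. This is not the real makeface.py; it is a minimal illustration that assumes the makefile forwards everything after make as a list of tokens, that tokens containing an equals sign are keyword arguments, and that the functions clean and upload are hypothetical targets.

```python
def parse_tokens(tokens):
    """Split command-line tokens into a target name, args, and kwargs."""
    if not tokens:
        raise ValueError("no target given")
    target, rest = tokens[0], tokens[1:]
    args, kwargs = [], {}
    for token in rest:
        if "=" in token:
            # split on the first "=" only, so values may contain "="
            key, value = token.split("=", 1)
            kwargs[key] = value
        else:
            args.append(token)
    return target, args, kwargs


def dispatch(tokens, targets):
    """Look up the target function by name and call it."""
    target, args, kwargs = parse_tokens(tokens)
    if target not in targets:
        available = " ".join(sorted(targets))
        raise ValueError(f"unknown target '{target}'; available: {available}")
    return targets[target](*args, **kwargs)


# hypothetical target functions, standing in for functions in target scripts
def clean():
    return "cleaned"


def upload(path):
    return f"uploading {path}"


TARGETS = {"clean": clean, "upload": upload}

if __name__ == "__main__":
    # equivalent to `make clean` and `make upload path="/share/simulations"`
    print(dispatch(["clean"], TARGETS))
    print(dispatch(["upload", "path=/share/simulations"], TARGETS))
```

The shell strips the quotes around the value before python sees it, so the parser only has to split on the first equals sign.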

To use this trick you only need to include makefile and makeface.py from any of the repos listed earlier (e.g. factory and cassette). This trick has two useful features. First, it means that you can easily add a new user-facing function to the terminal by writing a python function. Second, since makeface.py routes your command-line arguments to python *args and **kwargs, you can check that user commands are well-formed and produce a helpful error message that makes mistakes easy to fix. The following output tells factory users which commands are available; each function can be located in a different file.

[STATUS] available make targets: locate ps shutdown nuke run help setup set init show_running_factories testcluster renew prepare_server connect template test testdocker unset

┏━━━━━━━━━━━━┓
┃make targets┃
┗━━━━━━━━━━━━┛
│
├──connect
├──help
├──init
├──locate
├──nuke
├──prepare_server
├──ps
├──renew
├──run
├──set
├──setup
├──show_running_factories
├──shutdown
├──template
├──test
├──testcluster
├──testdocker
└──unset

[USAGE] `make <target> <args> <kwarg>="<val>" ...`
[STATUS] done
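
Checking that a command is well-formed before running it could look something like the sketch below. It relies on the standard inspect module rather than on anything specific to makeface.py, and the names check_call and upload are hypothetical.

```python
import inspect


def upload(path, dry="no"):
    """Hypothetical target: pretend to upload a directory."""
    return f"uploading {path} (dry={dry})"


def check_call(func, args, kwargs):
    """Return None if the call is well-formed, otherwise a readable message."""
    signature = inspect.signature(func)
    try:
        # bind() applies python's own calling rules without running func
        signature.bind(*args, **kwargs)
    except TypeError as error:
        return (
            f"[ERROR] cannot call '{func.__name__}': {error}\n"
            f"[USAGE] expected {func.__name__}{signature}"
        )
    return None


if __name__ == "__main__":
    # the second call misspells the keyword on purpose
    for kwargs in ({"path": "/share/simulations"}, {"pth": "/share/simulations"}):
        problem = check_call(upload, [], kwargs)
        print(problem if problem else upload(**kwargs))
```

Because the check reuses each function's signature, adding a new target requires no extra argument-parsing code.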

Takeaways

In this post, I’ve listed the primary tools of my trade (namely BASH, git, Anaconda, and the makefile trick). My deployment strategy can be summarized in a short list.

I always try to write robust codes (obviously). I test them on several different machines, or in a minimal Docker container with an operating system like Debian in case they depend on stock packages in my favorite distribution (SuSE).
Each machine I use has a central copy of the factory with an environment which I alias so I can load it whenever I need a fully featured python environment.
Packages I use every day are not “installed” but simply cloned on demand. I use the make trick to access their underlying python functions and easily add new functions.
I only compile or install new codes when absolutely necessary. I save the installation instructions, and I test them out on new machines and new trainees.

This method maximizes the probability that my codes will deploy for other users on different machines. When I write small, custom codes, they stay current when I push changes to a central git repository; otherwise they are not “installed” in the usual way. Dependencies and the python environment are handled by Anaconda, and I access them with a single alias to a source command in my user profile.

Much of this strategy is either obvious to experienced programmers, or insane to experienced programmers. After all, you can’t teach taste! But you can avoid wasting hours and hours of labor by haphazardly installing and linking to other codes. If you want to make reproducible codes or easily train new researchers, you need a strategy like this one to ensure that you can deploy on new systems.