My research is centered on answering biology questions in molecular detail using theory and simulation. In this section I will outline three interconnected tools that make this process easier and more accessible to new scientists. BioPhysCode is the trade name for these codes, which were co-created with Joe Jordan when we were PhD students advised by Ravi Radhakrishnan.
  1. Automacs is a lightweight framework for building GROMACS simulations.
  2. Omnicalc is a comprehensive analysis platform based on Python and SciPy.
  3. Factory is a front-end for both automacs and omnicalc, which is built on Django.

These three codes are distributed under a copyleft license: GPLv2. The remainder of this page outlines these codes, with links to the source and documentation. As with all of my codes, please get in touch with me if you would like help deploying them.

Automatic simulation machines

AUTOmatic simulation MAChineS, or “automacs” for short, is a set of interface codes designed to systematically produce large batches of molecular dynamics simulations. It was designed for the popular GROMACS simulation package, but work is currently underway to link it to LAMMPS as well. GROMACS is already highly automated, and I do not wish to imply otherwise. Our codes wrap GROMACS to make our work more standardized and reproducible.

The typical workflow for GROMACS and many other molecular dynamics integrators is centered around running commands at a terminal and editing text files. Beginners typically start with one of many helpful tutorials to learn the myriad details necessary to run a simulation, and as their projects get larger and more complicated, they turn to more automatic tools and write their own programs and scripts to run more simulations.

Automacs is a natural extension of this strategy. It began as a set of BASH scripts for teaching trainees how to make new simulations, and for replicating existing simulations. In the intervening years, we have switched to Python for practical reasons and for access to mathematical libraries, but the underlying principle is the same. Automacs is largely composed of the following elements:

  1. Simulation scripts that run a particular task like simulating a protein in water, a coarse-grained bilayer, or a polymer melt.
  2. A modest framework for organizing variables associated with the models, for example the number of water molecules in your system, the type of bonds in a coarse-grained topology, and the multitude of log files from a normal simulation.
  3. Modular codes for performing common tasks like “equilibration”, arrangement of lipids in a bilayer, or placing a protein on a bilayer.

The point of automacs is to turn artisanal models into reproducible codes. We have designed many of the codes to generalize important parameters so that the various utility functions are reusable. For example, one equilibrate function can be used for many different simulation types, while the bilayer construction routines work equally well for coarse-grained and explicit lipids. Automacs supplies many core functions, but Joe and I have both written libraries specific to certain systems. We share these independently on GitHub. For example, I have written bilayer routines and shared them in the amx-bilayers module. Automacs will automatically import these codes for you. This is a partial solution to the question of permissions; any user can write and share new codes in the “automacs style” without my oversight or approval.

Automacs is also designed to be used in a single pipeline called “the factory” which allows users to simulate and analyze a system in a single graphical interface. Both the factory and automacs are documented on the BioPhysCode portal.

An example “experiment”

Automacs acts as an ad hoc application programming interface (API) and lets you run readable scripts, which we call “experiments” since they should function as computational analogs of a bench experiment. A canonical example is the following script for running an atomistic protein in water. The procedure should be familiar to anybody who runs these simulations, and it is designed to resemble a lab notebook.

#!/usr/bin/env python

"""
PROTEIN SIMULATION
Atomistic protein in water.
"""

from amx import *

init()
make_step(settings.step)
write_mdp()
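# fetch the starting structure (downloaded by PDB code or taken from a local file) and strip hetero atoms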
if state.pdb_source: get_pdb(state.pdb_source)
else: get_start_structure(state.start_structure)
remove_hetero_atoms(
	structure='start-structure.pdb',
	out='start-structure-trim.pdb')
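# generate the GROMACS topology with pdb2gmx and set the box dimensions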
gmx('pdb2gmx',
	base='vacuum',
	structure='start-structure-trim.pdb',
	gro='vacuum-alone',
	water=settings.water,
	ff=settings.force_field,
	log='pdb2gmx')
copy_file('system.top','vacuum.top')
extract_itp('vacuum.top')
write_top('vacuum.top')
gmx('editconf',
	structure='vacuum-alone',
	gro='vacuum',
	c=True,d='%.2f'%settings.water_buffer,
	log='editconf-vacuum-room')
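# minimize in vacuum, then solvate, add counterions, and minimize after each step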
minimize('vacuum',method='steep')
solvate_protein(
	structure='vacuum-minimized',
	top='vacuum.top')
minimize('solvate')
counterions(
	structure='solvate-minimized',
	top='solvate',
	ff_includes='ions')
minimize('counterions')
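# write the final structure, topology, and continuation script, then run the equilibration steps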
write_structure_pdb(
	pdb='start-structure.pdb',
	structure='counterions')
write_top('system.top')
write_continue_script()
equilibrate()

This code abides by the maxim of “convention over configuration” by abstracting much of the simulation bookkeeping while retaining the important details. We can identify the minimization, the solvation steps, the addition of counterions, and the files used in the production run. The corresponding settings block (part of the same experiment) is likewise easy to understand. We arrange experiments in files containing a single dictionary literal.

{
'protein':{
'tags':['aamd','protein','tested_2017.09.14'],
'script':'protein.py',
'params':'parameters.py',
'extensions':[],
'settings':"""

step: protein                       # name of the folder is s01-protein
force field: charmm27               # which gromacs-standard force-field to use (see pdb2gmx list)
water: tip3p                        # which water model (another question from pdb2gmx)
equilibration: nvt-short,nvt,npt    # which equilibration step to use (must have `input-name-in.mdp` below)
pdb source: 1yrf                    # PDB code for download. overrides the start structure
start structure: none               # path to PDB structure or None to use a single PDB in inputs
protein water gap: 3.0              # Angstroms distance around the protein to remove water
water buffer: 1.2                   # distance (nm) of solvent to the box wall 
solvent: spc216                     # starting solvent box (use spc216 from gromacs share)
ionic strength: 0.150               # desired molar ionic strength
cation: NA                          # name of the cation for neutralizing the system
anion: CL                           # name of the anion for neutralizing the system

# integrator parameters from ./parameters.py
mdp_specs:| {
	'group':'aamd',
	'mdps':{
		'input-em-steep-in.mdp':['minimize'],
		'input-em-cg-in.mdp':['minimize',{'integrator':'cg'}],
		'input-md-nvt-eq-in.mdp':['nvt-protein','nvt-protein',{'nsteps':10000}],
		'input-md-nvt-short-eq-in.mdp':['nvt-protein-short',{'nsteps':10000}],
		'input-md-npt-eq-in.mdp':['npt-protein',{'nsteps':10000}],
		'input-md-in.mdp':{'nsteps':100000},
		},
	}
"""},
}

Locating the simulation settings in a more durable, YAML-style format makes it easy to automate the creation of large batches of simulations, to extend your protocols with new options, and to control the logic of the simulation from outside of the underlying Python codes. These settings blocks also make it easier to write reusable codes that are free of hard-coded parameters.
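
To illustrate the batch idea, here is a rough sketch of how the settings text can be treated as a template and varied programmatically, for example by sweeping the PDB code to queue one experiment per protein. This is hypothetical code rather than part of automacs; the helper name write_experiment and the additional PDB codes are illustrative only.

# hypothetical sketch: generate a batch of experiments by sweeping one setting
# (the layout mirrors the dictionary shown above; `write_experiment` is not an
# automacs function, and the extra PDB codes are only examples)

settings_template = """
step: protein
force field: charmm27
water: tip3p
equilibration: nvt-short,nvt,npt
pdb source: {pdb_code}
"""

def write_experiment(name, pdb_code):
	"""Return an experiment dictionary with the PDB code substituted in."""
	return {name: {
		'tags': ['aamd', 'protein'],
		'script': 'protein.py',
		'params': 'parameters.py',
		'extensions': [],
		'settings': settings_template.format(pdb_code=pdb_code),
		}}

# one experiment per protein in the sweep
batch = {}
for code in ['1yrf', '2f4k', '1ubq']:
	batch.update(write_experiment('protein_%s' % code, pdb_code=code))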

Omnicalc

Omnicalc provides a highly extensible pipeline for analyzing, or “post-processing,” simulations generated by GROMACS. It uses a flexible naming scheme to read in large data sets, even if they were not created with automacs. Omnicalc can be used in several ways: you can analyze new simulations with it, use it to standardize your post-processing workflow so that others may extend or reproduce your calculations, or use it to directly share simulation data sets for further calculation. The code is distributed directly on GitHub, but we encourage new users to use the factory to manage multiple omnicalc projects and the associated dependencies.

As with automacs, this code is designed to abstract many of the experiment-specific parameters into readable text files. The analysis functions themselves are generally written without hard-coding any of these parameters so that they can be reused when necessary. The system also allows users to arrange analysis steps into a workflow which avoids redundant calculations. Omnicalc uses HDF5 to store post-processed data in compact binaries. Many of our example codes also include extensive parallelization and database options for developing intricate workflows. In much the same way that automacs “experiments” can be shared in separate modules, so too can users share calculations on GitHub so that others can import them into a copy of omnicalc. This approach is meant to be hands-off and free-form on our part. We want users to approach omnicalc as a tool for running calculations without a lot of overhead.
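
To give a sense of the storage format, the following minimal sketch shows how a calculation’s result arrays can be written to an HDF5 binary with its attributes attached as metadata, so that each result stays matched to the parameters that produced it. This sketch assumes the h5py package and is not omnicalc’s internal writer.

# minimal sketch of HDF5 storage with metadata, assuming h5py is installed;
# omnicalc's own writer differs in detail, but the idea is the same
import json
import h5py
import numpy as np

def save_result(path, result, attrs):
	"""Write arrays in `result` to an HDF5 file and attach `attrs` as metadata."""
	with h5py.File(path, 'w') as fp:
		for key, val in result.items():
			fp.create_dataset(key, data=np.asarray(val))
		# store the calculation attributes so the result is self-describing
		fp.attrs['meta'] = json.dumps(attrs)

def load_result(path):
	"""Recover the arrays and the metadata that describe them."""
	with h5py.File(path, 'r') as fp:
		result = {key: fp[key][...] for key in fp.keys()}
		attrs = json.loads(fp.attrs['meta'])
	return result, attrs

# example: save the output of the RMSD calculation shown below
# save_result('protein_rmsd.h5', {'rmsds': np.zeros(10)}, {'slice_name': 'example'})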

The following code block provides an example calculation which computes the RMSD of a protein backbone. We use the MDAnalysis package to read the GROMACS binary trajectory files and select the backbone alpha carbons, and SciPy to align the structures. Calculation attributes are specified in a separate YAML file, which makes it easy to perform parameter sweeps. Each result is saved with metadata so that it can be easily retrieved later.

# note: these imports are listed here so the example reads as a standalone
# function; inside omnicalc they (and the `status` progress helper used below)
# are typically provided by the calculation environment
import MDAnalysis
from numpy import mean, dot, identity, sign, sqrt, sum, array
from scipy import linalg

def protein_rmsd(grofile,trajfile,**kwargs):
	"""
	Compute the RMSD of a protein backbone (alpha carbons) over a trajectory.
	"""
	# unpack the simulation name and the workspace that locates the data
	sn = kwargs['sn']
	work = kwargs['workspace']
	# prepare universe	
	slice_name = kwargs['calc']['slice_name']
	group = kwargs['calc']['group']
	grofile,trajfile = [work.slice(sn)[slice_name][group][i] for i in ['gro','xtc']]
	uni = MDAnalysis.Universe(work.postdir+grofile,work.postdir+trajfile)
	nframes = len(uni.trajectory)
	protein = uni.select_atoms('protein and name CA')
	# reference frame
	uni.trajectory[0]
	r0 = protein.coordinates()
	r0 -= mean(r0,axis=0)
	# collect coordinates for every frame
	coords = []
	for fr in range(0,nframes):
		uni.trajectory[fr]
		r1 = protein.coordinates()
		coords.append(r1)
	# simple RMSD code
	rmsds = []
	for fr in range(nframes):
		status('RMSD',i=fr,looplen=nframes)
		r1 = coords[fr]
		r1 -= mean(r1,axis=0)
		# optimal rotation via SVD (Kabsch algorithm), with a sign correction to avoid reflections
		U,s,Vt = linalg.svd(dot(r0.T,r1))
		signer = identity(3)
		signer[2,2] = sign(linalg.det(dot(Vt.T,U)))
		RM = dot(dot(U,signer),Vt)
		rmsds.append(sqrt(mean(sum((r0.T-dot(RM,r1.T))**2,axis=0))))
	# pack
	attrs,result = {},{}
	result['rmsds'] = array(rmsds)
	return result,attrs	

The purpose of omnicalc is to separate the mundane details of analyzing many simulations from the crux of the calculation itself. We hope that the format above makes it easier to understand the methodological details for a particular analysis.

In addition to organizing the analysis, the codes also provide a framework for creating rich visualizations using matplotlib, some of which can be seen in my recent study of protein-induced membrane curvature. Omnicalc represents my attempt to include all of the features of a simulation analysis pipeline that molecular biophysics students might want. Send me a message if you are interested in extending the code, or would like to learn more.

Factory

The factory is the icing on the two-layer cake composed of automacs and omnicalc. Its sole purpose is to provide a front-end for users of both tools to generate and analyze new simulation data. To date, my collaborators and I have used the factory codes to generate hundreds of new simulations with an uptime of over ten months. The codes have been particularly useful for training new students to generate protein simulations.

The factory provides interfaces to both automacs and omnicalc. The interface tabulates the cumulative simulation time, provides visualizations, and organizes these simulations into batches. The following screenshot shows an example from a long-running analysis project that I’m working on. The factory displays thumbnails of my simulation results, each of which has been analyzed and rendered with a series of calculation and plot functions.

An example of a completed protein simulation

The factory codes are built entirely with Django and are intended to be deployed locally, on the same workstations used to simulate and analyze the resulting data sets. For full disclosure: I am not formally trained as an interface programmer. The GUI is really a bonus that was swiftly implemented by Joe and me while we built out the back end. Since automacs and omnicalc are easy to operate on their own, there is always an opportunity to build a better interface for them.

The factory interface is not necessary to use automacs and omnicalc, but it makes it easy to generate new data, train new investigators, and keep track of the progress of ongoing simulations. Documentation is currently underway, but I am happy to help you set up the codes if you send me a message.

Unit tests, validation, and deployment

In the last year, I have developed various unit tests for these codes, some of which are summarized on the validations list on the BioPhysCode portal. Many users experience difficulty in setting up the right operating system and software environment to run these codes. To some extent, the factory manages these problems by relying on Anaconda to provide Python dependencies. To ensure that anybody can deploy our codes, we have also tested them inside Docker containers running Debian. I have explained the unit testing procedure on the portal, and I hope to expand the collection of tests as I continue to develop and refine the codes.
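
As a simple illustration of the kind of check involved, the following sketch verifies that a fresh Anaconda or Docker environment provides the Python packages named throughout this page. This is a hypothetical smoke test rather than the BioPhysCode test suite, and h5py stands in as an assumed HDF5 interface.

# hypothetical environment smoke test (not the BioPhysCode test suite):
# fail early if the analysis and interface dependencies are missing
import importlib
import unittest

class TestEnvironment(unittest.TestCase):
	"""Check that the dependencies mentioned on this page import cleanly."""
	packages = ['numpy', 'scipy', 'h5py', 'MDAnalysis', 'matplotlib', 'django']

	def test_imports(self):
		for name in self.packages:
			with self.subTest(package=name):
				importlib.import_module(name)

if __name__ == '__main__':
	unittest.main()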