This is a repository containing some command-line tools for implementing the PopShift framework for docking. The purpose behind these tools is to allow users both to make alterations to these workflows for their own use, and also to apply them in similar fashion to the use they were put toward in the original PopShift manuscript. They're provided under the GPLv2 license, in the hope that others find them useful in their own projects, but also with the explicit intention of receiving improvements as others make them in their own contexts. Below you'll find a discussion of the general workflow from that paper, as well as some documentation of what each script does. Note that these command-line tools all also have argparse I/O, so if you're wondering how to use them in more detail, calling them with no arguments, or with the `-h` or `--help` flags, will print some help text that may be more rapidly useful than leafing through this document.

These Python scripts are designed to be used as command-line tools; some can also be imported into Python interpreter sessions or notebooks, if one is interested in calling their functions independently. Thus, getting Python environments set up with the correct modules present is the main installation task. Getting the parameterization program associated with the docking code you're interested in installed is the other major task.
```bash
# create the analysis, postprocessing, and frame-picking environment
mamba create -n popshift -c conda-forge openbabel loos cython mdtraj
mamba activate popshift
git clone https://github.com/bowman-lab/enspara
cd enspara
python setup.py install
# create the docking environment
mamba create -n smina -c conda-forge smina openbabel
```
Because of artificial conflicts in `boost` versions, `loos` and `smina` are currently mutually exclusive in the same env. If either were built from source this issue would not exist, as both are in fact compatible with a wide range of `boost` versions.
The two preparation programs we've used are for AutoDock-family programs, and generate PDBQT files. They are openbabel's `obabel` and the ADFR suite's `prepare_receptor` and `prepare_ligand`.

For openbabel, get it from conda-forge:

```bash
mamba install -c conda-forge openbabel
```

or get it from apt (for Debian-like systems):

```bash
sudo apt install openbabel
```
For the ADFR suite: because these programs require Python 2, and therefore don't play nicely with any other Python tools written any time recently, they are best installed from the tarball available from the ADFR website. You should follow the directions there to get them added to your `PATH`. Note that you can view the licenses these programs are provided under on that same site. Because we only need `prepare_receptor` and `prepare_ligand`, these are available under LGPLv2 at time of writing.
Right now our scripts are set up to use either Vina or smina. In both cases, the easiest way to get set up is to install into a conda env for docking:

```bash
mamba create -n docking -c conda-forge smina vina
```

We have run into an issue with using Vina from the conda-forge source; in particular, we found a bug that produced pretty significant variability in docking scores given the same pose, which is pretty bad for this application, where the scores themselves are being used. We also noticed another bug whereby sometimes, seemingly at random across thousands of docking runs, a handful of poses will be given an extra 10-15 kcal/mol of favorability. There are several other energy-function bugs on their issue tracker that this could be. Until the version on conda-forge is greater than 1.2.4, we'd recommend compiling the `develop` branch of Vina hosted on GitHub, following their install instructions for building from source, focusing on the Python bindings since that's the only part of the install used by `docking_parallel.py`.
Note: if you are trying to get the right compilers available, you can install them from conda-forge via the `compilers` metapackage. Here's an example install line:

```bash
mamba create -n built-vina -c conda-forge compilers boost-cpp swig numpy
```

Then you'd set up the Python bindings:

```bash
conda activate built-vina
cd path/to/cloned/autodock-vina/build/python
rm -rf build dist *.egg-info
python setup.py build install
```
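Once that finishes, the bindings should import cleanly; a minimal smoke test is below (the `Vina` class and its `sf_name` argument come from the upstream Python API):

```python
# quick check that the AutoDock Vina python bindings built and import
from vina import Vina

v = Vina(sf_name='vina')  # select the vina scoring function
print('vina bindings imported OK')
```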
To run the Python tools associated with trajectory manipulation and post-processing you'll need `loos`, a trajectory-handling library, and `enspara`, an MSM/utilities library. You'll also very likely need either `pyemma`, `deeptime`, or both. A tidy way to install all of these into the same environment is to grab the libraries that are available on conda-forge, along with the dependencies that are needed for enspara, then clone enspara and build it using its `setup.py` script.
```bash
mamba create -n popshift -c conda-forge loos deeptime pyemma mdtraj cython mpi4py
conda activate popshift
git clone https://github.com/bowman-lab/enspara
cd enspara
python setup.py install
```
- ADFR suite (for `prep_parallel.py` and `prepare_ligand.py`)
  - https://ccsb.scripps.edu/adfr/downloads/
  - `cp` `prepare_ligand` and `prepare_receptor` into `$PATH`
- AutoDock Vina Python bindings (for Vina docking)
  - `git clone https://github.com/ccsb-scripps/AutoDock-Vina`
  - `cd AutoDock-Vina/build/python`
  - `conda install -c conda-forge numpy boost-cpp swig`
  - `rm -rf build dist *.egg-info` (to clean previous installation)
  - `python setup.py build install`
- Smina
  - `mamba install -c conda-forge smina`
- enspara (for `popshift.py`, `pick_align_frames.py`, and `extract_scores.py`)
  - Follow the enspara install instructions with `conda`.
  - If using PyEMMA discretized trajectories, then it must also be installed:
    - `mamba install -c conda-forge pyemma`
- LOOS (for `pick_align_frames.py`, `draw_box.py`, `add_bonds_posthoc.py`, and `rmsd_receptor_ligand.py`)
  - `mamba install -c conda-forge loos`
  - Can also be compiled following the instructions on the LOOS GitHub.
The scripts provided here are supposed to be modular enough that you can adjust them to your needs, or swap one part out for another--we are certainly going to do this as our research on this framework evolves, so why shouldn't you? However, there are a few pathways that probably make general sense to understand, and the easiest way to begin to see what order things need to happen in is probably with some prescribed workflows. Find some below.

In the original PopShift MS, we considered how to apply the PopShift framework to the ensemble docking problem. The general workflow to actuate that study, with the tool(s) needed, follows:
1. Obtain a satisfactory MSM representing ligand-free simulations of your receptor.
2. Determine a selection that picks out your pocket of interest.
3. Pick and align frames from the MSM using `pick_align_frames.py`, to ensure that all the residues that should be in your box are nicely co-aligned. If working from the minimal set-up, use the `popshift` environment.
   - Check that things landed where you expected them to land using a visualizer like `pymol` and `draw_box.py`.
   - Note that the alignment will recenter the coordinate system of all the models at the centroid of the atoms selected as the 'pocket' atoms, so boxes centered at or close to `0,0,0` are probably good choices in general.
4. Prepare receptors and ligands (using either openbabel or the AutoDock tools `prepare_receptor` and `prepare_ligand`); here you have some alternatives:
   - Use `prepare_ligand` and `prepare_receptor` on the command line directly with your favorite shell concurrency tool, such as `xargs` or GNU `parallel`.
   - Alternatively, use `obabel` with the `-o pdbqt` flag. In the minimal set-up, use the `smina` environment.
5. Dock to each sample, saving the docking score in the file containing the ligand pose, using `docking_parallel.py`. In the minimal set-up, use the `smina` env.
6. Extract scores from these output files and collate them into arrays using `extract_scores.py`. In the minimal set-up, use the `popshift` env.
7. Use the score arrays to do PopShift calculations with `popshift.py`. Also use the `popshift` env for this.
8. Plot stuff, compare to experiments, and generally have too much fun, by reading the JSON that `popshift.py` outputs into your downstream scripts or notebooks.
Note that the docking here is done with either Vina or smina, so steps 4, 5, and 6 are somewhat particular to how those codes do that. You can of course color outside the lines here, but you'll probably be adding some functionality to the scripts we have in order to do so. The things `popshift.py` needs are an array of affinity assessments that correspond to conformational bins, and an array of estimates for the probability of those bins with no ligand present.
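To make those two inputs concrete, here is a toy illustration of the kinds of averages discussed later in this README (a schematic sketch, not the implementation in `popshift.py`; consult that script and the manuscript for the exact estimators, including ligand-concentration handling):

```python
import numpy as np

kT = 0.593  # kcal/mol at ~298 K

# toy inputs: one docking score (kcal/mol) per MSM bin,
# and the apo equilibrium probability of each bin
scores = np.array([-6.2, -5.1, -7.4])
eq_probs = np.array([0.5, 0.3, 0.2])

# 'weighted average' (boltzmann docking): expectation of
# scores over the apo populations
weighted_avg = np.sum(eq_probs * scores)

# a free-energy-style aggregation over bins, of the general form
# popshift-style estimates take (population-weighted sum of
# per-state Boltzmann factors)
popshift_like_dg = -kT * np.log(np.sum(eq_probs * np.exp(-scores / kT)))

print(weighted_avg, popshift_like_dg)
```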
For the docking below to work as completely independent jobs, we extract frames representing each state, then save them into subdirectories where each subdirectory name corresponds to the MSM state index. Each file name consists of a trajectory index, then a frame index, separated by a dash. Here is an example of frames picked from a 25-state MSM built from five longish trajectories (obtained by calling `tree -p '*.pdb' -v receptor`; output abbreviated for clarity):
```
receptor
├── 0
│   ├── 0-55257.pdb
│   ├── 0-56428.pdb
│   └── 1-43334.pdb
├── 1
│   ├── 2-105110.pdb
│   ├── 2-175513.pdb
│   └── 3-48068.pdb
├── 2
│   ├── 0-157652.pdb
│   ├── 0-158908.pdb
│   └── 0-191040.pdb
...
└── 24
    ├── 4-24060.pdb
    ├── 4-186351.pdb
    └── 4-189200.pdb
25 directories, 75 files
```
The command line I used to create this was:
```bash
frame_picker=path/to/popshift/pick_align_frames.py
python $frame_picker \
    --find-bonds 1.95,1.35 \
    --make-receptor-sel-chain-a \
    --assignments dtrajs.h5 \
    --align-resid-list pocket-resids.txt \
    --number-frames 3 \
    --mapping mapping.json \
    $system_name \
    prot_masses.pdb \
    eq-probs.npy \
    random \
    '!hydrogen' \
    traj_list.txt
```
Here are the options and arguments, in order:

- `--find-bonds`: find bonds using two different cutoffs, the first for heavy atoms and the second for hydrogens. Not necessary if bonds are present in the file used for 'model'. It's better to have bonds written into PDBs so that the prep program doesn't need to build them.
- `--make-receptor-sel-chain-a`: make the written PDBs have chain A identification. Some system-prep tools require this.
- `--assignments`: the assignments or discretized trajectory file, saved as a 'ragged array'. Alternatively, an array of center indices saved as a `.npy`. To subsample randomly, the full assignments file is obligatory, and the order of the trajectories in the assignments file and the coordinate trajectory paths must match.[^1]
- `--align-resid-list`: this expects a text file that `numpy.genfromtxt` can turn into a 1-D array of `int`s. These correspond to the `resid`s from the model that should be used in the iterative alignment.
- `--number-frames`: select this many frames from each MSM bin.
- `--mapping`: this corresponds to an enspara-style 'to' trim mapping; a dictionary containing the indexes of the original assignments trajectories as keys and the indexes they should be mapped to after ergodic trimming as values.
- `$system_name`: the name you'd like to give to this particular extraction. It will become the directory name within which the `receptor` dir containing the tree of selected frames is written.
- `prot_masses.pdb`: this is the model file. For this particular command I was using a `prot_masses.pdb` generated by gromacs, which means it doesn't have connectivity; that is why the example also contains `--find-bonds`.
- `eq-probs.npy`: a numpy array containing equilibrium probabilities. Could also be a pyemma MSM saved as a pickle file, in which case `--mapping` would not have been necessary. Used to communicate to the script how many states are active.
- `random`: the sampling mode; as requested here, it randomly picks frames from each MSM bin without replacement. There are other options; consult the help statement. `centers` will switch the tool to operating on the provided coordinate trajectory as a centers trajectory.
- `'!hydrogen'`: the LOOS selection string to apply to the residues selected as the pocket, for the purposes of alignment. This string means 'not hydrogen', or in other words 'all heavy atoms'.
- `traj_list.txt`: a text file containing paths to the coordinate trajectories frames should be extracted from. A sequence of one or more paths may also be provided here, but as previously mentioned, the order of the trajectories in the assignments file and the coordinate trajectory paths must match.[^1] Because of this, I much prefer to create a list of trajectories and save it as a text file when I cluster, then pass that list around to subsequent tools (one way to do this is sketched below). If operating in centers mode, this argument should instead be a path to the trajectory of centers.
Calling `pick_align_frames.py --help` will provide more information about alternative strategies for picking frames (for example, using just cluster centers, or picking particular numbers of frames from each bin). It's also worth noting that `pick_align_frames` can be imported into other Python sessions, so if you want to rip random frames into sub-trajectories for other reasons, this is possible. Finally, note that subsampled trajectories that are not PDBs can also be written to disk if desired. This can be nice if trying to use normal trajectory-analysis tools on the selected frames, because reading in piles of PDBs instead of one DCD per MSM bin can be annoying and slow.
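If you've gone that route, pulling a bin back in for analysis looks like ordinary trajectory loading; a sketch with `mdtraj`, assuming one DCD per bin (the `receptor/0.dcd` path is illustrative and depends on the output options you chose):

```python
import mdtraj as md

# load the subsampled frames for MSM bin 0 against the same model
# file used for frame picking (paths here are illustrative)
bin0 = md.load('system_name/receptor/0.dcd', top='prot_masses.pdb')
print(bin0.n_frames)
```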
We prepare each of the receptor conformations as a completely independent system, so each is represented by its own PDBQT file and can thus be run as an isolated job on whatever hardware is available. This leaves some things to be desired, but as a first pass it's worked for us.

In general these scripts are set up to play nicely with one another by assuming that the directory structure of the intermediate files will be maintained--this isn't a hard requirement, but it does keep things convenient. Because we are using other command-line tools to prepare ligands and receptors for docking, we really just expect that the prepared files will be in the same place as the files they were prepped from. There are a lot of ways to call a command-line tool on a file such that it writes a new file with the same path but a different file extension. Here are several I have used in the past:
Using GNU parallel:

```bash
# use find to get the paths rel. to cwd;
# write to file to prevent arg-list too long
find frame_picking_system_dir/receptor/ -name '*.pdb' > receptor_paths.txt
# don't use too many jobs; this i/o intensive task will be bound by disk perf.
parallel -j 8 obabel {} -opdbqt -O {}qt :::: receptor_paths.txt
```
Or with `prepare_receptor`:

```bash
parallel -j 8 prepare_receptor -r {} -o {}qt :::: receptor_paths.txt
```

What's going on here? Scrutinizing the docs for parallel shows that it runs the command line on each of the elements in the file when in `::::` input mode. It substitutes each line into the 'command line' you provide as an argument wherever the curly braces are, then runs that as its own subprocess.
Using xargs as an alternative to parallel:

```bash
find frame_picking_system_dir/receptor/ -name '*.pdb' | xargs -I % -P 8 obabel % -opdbqt -O %qt
```
If only dealing with a few frame-picking runs (or maybe just one) for a relatively parsimonious MSM (one with few states), we can afford to do this sequentially (nicer if you're getting errors, as well). In this case, by far the easiest thing to do is write a simple loop in the shell of your choosing. Here's a while loop that takes advantage of the redirect-to-file from `find` given in the `parallel` example:

```bash
while read receptor_path; do
    echo $receptor_path
    obabel $receptor_path -opdbqt -O ${receptor_path}qt
done < receptor_paths.txt
```
Ligands can also be prepped by command-line tools; the docking script `docking_parallel.py` expects you to provide the ligands as either a text file of ligand paths or as a series of command-line arguments. As such, you can organize these files how you see fit. I usually put them in their own directory, then prep 'everything' with a certain file extension within that directory. So, for example:

```bash
# assume there's a directory called ligands with .sdf or .mol2 files in it
parallel -j 8 obabel {} -h -isdf -opdbqt -O {.}.pdbqt ::: ligands/*.sdf
```

Note: the `-h` flag adds hydrogens. Don't throw it if you don't want that.
Alternatively, if you have the ADFR suite installed and would like to use `prepare_ligand` instead of openbabel (note you need `.mol2` or `.pdb` files for `prepare_ligand`, as of this writing):

```bash
parallel -j 8 prepare_ligand -l {} -A hydrogens -o {.}.pdbqt ::: ligands/*.mol2
```
Again, there are many ways to organize docking, so long as the docking code is available on the command line. We like to use jug, because it integrates well with our workload manager (slurm). We have also used it with LSF. Following is an example sbatch script, with comments about the various options for `docking_parallel.py`:
```bash
#!/bin/bash
#SBATCH --array=0-99
#SBATCH -n 1
#SBATCH -p all
#SBATCH -o dockouts/%A-%a.out
#SBATCH -e dockouts/%A-%a.out
receptor_dir=system_name_from_framepicking
jug execute -- path/to/popshift/docking_parallel.py \
    -d smina \
    -e 32 \
    $receptor_dir/receptor \
    $receptor_dir/'12xsmina' \
    '0,0,0' \
    '12,12,12' \
    all-ligands/*.pdbqt
```
The sbatch pragma lines are just asking for a 100-element array job where each job is a one-core job with default everything on the 'all' partition. Obviously you can adjust that if you'd like, but the tool expects to just run single-threaded docking jobs, so there's no point in increasing the per-job core count with the script as written. The command-line arguments are as follows:

- `jug execute --`: saying this invokes jug instead of the Python interpreter, issuing the `execute` command. The `--` tells the jug program that the part of the command line it should read is concluded; if this is omitted and the command line contains long-option flags that should be going to `docking_parallel.py`, jug will interpret them as being for it instead.
- `-d`: the 'docking-algorithm' flag; picks from the available docking options--at present either `smina` or `vina`.
- `-e`: the 'exhaustiveness' flag; for vina, this is the number of MC runs to perform.
- `$receptor_dir/receptor`: the directory containing the receptor conformations to dock to.
- `$receptor_dir/12xsmina`: the nickname to give the docking run. It's preferable to keep all the docking runs organized in the same directory as their `receptor` dir, for ease of later scripting.
- `0,0,0`: the coordinates of the box center, separated by commas. Note that if you used `pick_align_frames.py` to get receptor structures to dock to, `0,0,0` is a relatively good choice for a center, because the coordinates get zeroed by the frame-picking alignment.
- `12,12,12`: the edge lengths of the x, y, and z edges of the docking box.
- `all-ligands/*.pdbqt`: the last n arguments should be prepped ligand PDBQTs. I bunged the prepped PDBQTs of the ligands into a directory called `all-ligands`, then provided them to the command line as a glob. Note that the order in which the ligands get docked is not in general important.
The output from docking will be written into a directory named with the nickname given to the docking run, with sub-directories using the stems of the ligand PDBQT file names; the directory structure beneath that contains numbered directories for each MSM bin, and each docked pose (containing only the ligand coordinates) will have a file name matching the receptor structure to which it was docked. So for the above frame-picking example, supposing the ligand files docked were `benzene.pdbqt`, `toluene.pdbqt`, and `p-xylene.pdbqt`:
```
tree -d 12xsmina
12xsmina
├── benzene
├── p-xylene
└── toluene
3 directories
```

```
tree 12xsmina/benzene
benzene
├── 0
│   ├── 0-55257.pdbqt
│   ├── 0-56428.pdbqt
│   └── 1-43334.pdbqt
├── 1
│   ├── 2-105110.pdbqt
│   ├── 2-175513.pdbqt
│   └── 3-48068.pdbqt
├── 2
│   ├── 0-157652.pdbqt
│   ├── 0-158908.pdbqt
│   └── 0-191040.pdbqt
...
└── 24
    ├── 4-24060.pdbqt
    ├── 4-186351.pdbqt
    └── 4-189200.pdbqt
25 directories, 75 files
```
There are other jug commands. In order to issue them for a jug file that has command-line arguments, one must make sure that the same command-line arguments are provided. One must also be using the same versions of Python and jug, so it's important to have the environment the dock job was launched from active. Thus, often the easiest thing to do is make the sbatch file into a script that takes an argument, and inserts that argument in place of `execute`:
```bash
#!/bin/bash
receptor_dir=system_name_from_framepicking
jug $1 -- path/to/popshift/docking_parallel.py \
    -d smina \
    -e 32 \
    $receptor_dir/receptor \
    $receptor_dir/'12xsmina' \
    '0,0,0' \
    '12,12,12' \
    all-ligands/*.pdbqt
```
Call this script `issue-jug-cmds-dp.sh` or similar. I could then use this script to obtain jug status reports, or to invalidate the results of a calculation so that I could rerun it:

```bash
issue-jug-cmds-dp.sh status
issue-jug-cmds-dp.sh 'invalidate dock_smina'
issue-jug-cmds-dp.sh cleanup
```
Now we are ready to do something with the per-state affinity and pose estimates we've made. We'll extract the scores, then use them to make affinity estimates.
Extract scores with `extract_scores.py`. This will pull the values of docking scores out of the various pose files and save them to an `enspara` ragged array in h5 format, since it is not in general true that the number of frames extracted per bin is always the same. This is a pretty simple tool; here is an example command line:
```bash
python path/to/popshift/extract_scores.py -n 8 -t smina 12xsmina
```
The command line works as follows:

- `-n`: the number of cores to use; this job is pretty i/o bound, but I have found that using a handful of cores can speed it up.
- `-t`: the type of output file the script should read. Will match the `-d` option from `docking_parallel.py`.
- `12xsmina`: the docking run nickname we would like to run the extraction on.
By default, the resulting score extracts will be written to `extracted_scores`, in the same directory as the docking run: the script takes the path to the docking run directory, then goes up one level. Within `extracted_scores` there will be a dir for each docking run you've extracted, named to match the directory name of the docking run. Within that dir there will be an `.h5` file for each ligand file name, whose elements correspond to the scores extracted from the pose files.
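These arrays can be inspected directly with enspara's ragged-array reader; a minimal sketch (the path assumes the `12xsmina` and `benzene` examples above):

```python
from enspara import ra

# ragged array of docking scores: one row per MSM bin,
# one entry per frame sampled from that bin
scores = ra.load('extracted_scores/12xsmina/benzene.h5')
print(scores[0])  # scores for all frames picked from bin 0
```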
To obtain binding free energy estimates and reweighted state populations, use `popshift.py`. This program was designed either to be a command-line tool that performs these analyses on files arranged in the expected way, or to be a small module with functions to perform the desired operations, if working within a notebook is desired. Here's an example of command-line usage, but as with all the other tools, this one has broader summary documentation provided by the output of the `--help` flag.
```bash
extracts=extracted_scores/12xsmina
python path/to/popshift/popshift.py \
    -n 8 \
    --out $extracts/binding-calx \
    --reweighted-eq 1 \
    bin-samples \
    path/to/eq-probs.npy \
    $extracts/*.h5
```
The command line works as follows:

- `-n`: number of threads--this code uses Python `multiprocessing`, but unless one has huge numbers of ligands, tasks are so fast that using this parallelism is imperceptibly different.
- `--out`: the output path to write the affinity-calculation JSON and any reweighted free energy (ragged) arrays to.
- `--reweighted-eq`: the concentration of ligand you'd like to reweight your equilibrium probabilities by; here I've chosen 1 M to simulate 'saturating' ligand conditions. Note that reweighted probabilities will be per frame sampled, not per bin--they'll match the shape of the extracted scores arrays.
- `bin-samples`: the mode frame picking was performed in. Should match the argument to `pick_align_frames.py`.
- `path/to/eq-probs.npy`: the equilibrium probabilities for the apo simulations used throughout.
- `$extracts/*.h5`: the extracted scores, in no particular order. Assumes the stem of the path to each score `.h5` is the name of the ligand, for keying and naming its output.
Because the output is matched to the extracted scores, I like to write to a sub-directory of the extracted scores directory. If reweighted free energies were not requested, then only the scores post-processing will be written. Three types of files may be created inside the directory provided as an arg to `--out`: one `calx.json` with the results of the affinity analysis inside it; one `{ligand_name}-eq_probs.h5` file with reweighted equilibrium probabilities corresponding to each conformation analyzed (this array will have the same shape as the extracted scores arrays); and a difference-in-population array, saved as a change in free energy per state between the frame weights and the reweights--in other words, a `{ligand_name}-dg.h5`.
The output of the script will be a JSON with several different averages reported for each ligand, as well as a `"log"`. It's been indented to make it more human-readable, but the standard Python json module can turn it into a dictionary of dictionaries and lists with no special commands, so you can also inspect or plot its contents in a Python interpreter or notebook as you see fit.
Within the `"results"` object, ligand names are associated with sub-objects, each of which is a dictionary whose keys are the types of calculation PopShift can use to aggregate a binding site into a macroscopic dissociation constant. These are the `"popshift dG"` free energy with the provided kT; a `"popshift K_D"`, which is the same popshift estimate of binding affinity expressed as a dissociation constant in units of micromolar; the `"best score"`; the `"simple average"`; and the `"weighted average"`, which corresponds to the expectation over apo equilibrium probabilities we previously referred to as boltzmann docking.
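For downstream plotting or comparison, reading that JSON back in is one line; a sketch (the path assumes the `--out $extracts/binding-calx` example above, and the key names are as just described):

```python
import json

# read the affinity analysis written by popshift.py
with open('extracted_scores/12xsmina/binding-calx/calx.json') as f:
    calx = json.load(f)

for ligand, estimates in calx['results'].items():
    print(ligand, estimates['popshift dG'], estimates['popshift K_D'])
```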
[^1]: The order of a shell glob like `*.xtc`, Python's `glob.glob('*.xtc')`, Python's `pathlib.Path.glob('*.xtc')`, and `find . -name '*.xtc'` will not necessarily be consistent across runs. The first of these results in ASCII-betical ordering (meaning that `traj-11.xtc` will come before `traj-2.xtc` for bash and derivative shells). The other three, if unsorted, will come in the file system's order. Importantly, the file-system order can change if the contents of the file system change, which is why many references point out that the returns of such commands--when not sorted explicitly--are arbitrary. This is an easy way for one to shoot oneself in the foot.