Source code and input data files for the manuscript "The Impact of Curation Errors in the PDBBind Database on Machine Learning Predictions of Protein-Protein Binding Affinity" by Gans et al. If you run into any problems, please describe the problem on the GitHub BIORAD issues tab and we'll get back to you as soon as we can.
This repository contains the necessary computer code and input files to perform the following tasks:
- Download Open Access full-text publications from PubMed Central for protein heterodimer PDB records in the PDBBind.
- Download the PDB files for the protein heterodimer records in the PDBBind.
- Add hydrogens to the downloaded PDB files using the Open Babel program.
- Train and test a simple random forest-based machine learning algorithm for predicting the equilibrium dissociation constant (specifically, log10(KD)) from 3D protein structure coordinates.
- Downloading the full-text publications is only required for obtaining a local copy of the publications associated with the protein heterodimer interactions in the PDBBind. This information was used to manually curate the KD values extracted from the PubMed Central Open Access scientific literature.
- Run the Python script `scripts/download_oa.py` using the provided CSV file of PDB and PubMed Central accessions (`data/pdbbind_oa.csv`) to download all of the corresponding files from PubMed Central. Using the output directory specified by the user (`-d <output directory>`), this script will create a separate subdirectory (labeled by PDB accession) to store the full-text information associated with each PDB accession. Note that the same manuscript may be associated with more than one PDB accession and, as a result, will be downloaded multiple times.
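Because the same manuscript can be downloaded once per PDB accession, it may be useful to check how much of the output directory is redundant. The sketch below is not part of `scripts/download_oa.py`. `count_duplicate_downloads` is a hypothetical helper that assumes duplicated manuscripts are stored as byte-identical files, and it counts the redundant copies by comparing `cksum` checksums.

```shell
# count_duplicate_downloads DIR
# Hypothetical helper (not part of this repository): prints the number of
# files under DIR that are redundant copies of another file, judged by
# identical cksum checksum and size.
count_duplicate_downloads() {
    find "$1" -type f -exec cksum {} + |
        awk '{ seen[$1 " " $2]++ }
             END {
                 dup = 0
                 for (k in seen) if (seen[k] > 1) dup += seen[k] - 1
                 print dup
             }'
}

# Example usage, assuming the download directory was named "oa_fulltext":
# count_duplicate_downloads oa_fulltext
```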
- The PDB files for each protein heterodimer record in the PDBBind can be downloaded using the `pdb_batch_download.sh` shell script available from the RCSB PDB. This script accepts a comma-separated list of PDB accessions (using the `-f` option) to download. The file `data/pdbbind_heterodimer_pdb.csv` contains a comma-separated list of PDB accessions for the protein heterodimer records in the PDBBind.
  - After obtaining the `pdb_batch_download.sh` script, running the command `pdb_batch_download.sh -f data/pdbbind_heterodimer_pdb.csv -o heterodimer_pdb` (from the root directory of the BIORAD repository) will download all of the protein heterodimer PDB files to the directory `heterodimer_pdb`. The name of the destination directory for storing the PDB files, `heterodimer_pdb`, is suggested, but not required.
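As a rough sketch of this step, the snippet below repeats the download command as a comment and defines a small helper, `list_missing_pdb` (a hypothetical name, not part of this repository or the RCSB script), that reports accessions from the CSV file for which no file was found in the destination directory. It assumes that each downloaded file name begins with the PDB accession (in lower or upper case) followed by a period, for example `1abc.pdb` or `1abc.pdb.gz`. Adjust the pattern if your files are named differently.

```shell
# Download command from the text above (run from the repository root):
#   pdb_batch_download.sh -f data/pdbbind_heterodimer_pdb.csv -o heterodimer_pdb

# list_missing_pdb CSV DIR
# Hypothetical helper: prints every accession in the comma-separated CSV
# file that has no file named "<accession>.*" in DIR.
list_missing_pdb() {
    csv="$1"
    dir="$2"
    tr ',' '\n' < "$csv" | tr -d ' \r' |
    while read -r id || [ -n "$id" ]; do
        if [ -z "$id" ]; then continue; fi
        lower=$(printf '%s' "$id" | tr '[:upper:]' '[:lower:]')
        upper=$(printf '%s' "$id" | tr '[:lower:]' '[:upper:]')
        found=no
        for f in "$dir/$lower".* "$dir/$upper".*; do
            # An unmatched pattern stays literal and fails the -e test.
            if [ -e "$f" ]; then found=yes; break; fi
        done
        if [ "$found" = no ]; then echo "$id"; fi
    done
}

# Example usage:
# list_missing_pdb data/pdbbind_heterodimer_pdb.csv heterodimer_pdb
```

An empty result suggests that every accession in the list has at least one matching file.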
- After downloading all of the PDB files for the protein heterodimer records in the PDBBind, hydrogen atoms can be added using the Open Babel program.
  - After downloading and installing the Open Babel program, the `scripts/add_hydrogens.sh` script can be used to add hydrogen atoms. This script may be edited to specify the input directory of downloaded PDB files (e.g., `heterodimer_pdb`) and the desired output directory for storing the modified PDB files (e.g., `heterodimer_pdb_hydrogen`).
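For orientation, the loop below sketches how hydrogens might be added to a directory of PDB files with the Open Babel `obabel` command and its `-h` (add hydrogens) option. It is an illustration only and does not reproduce the contents of `scripts/add_hydrogens.sh`. The function name `add_hydrogens_dir` and the `OBABEL` override variable are hypothetical, and the sketch assumes the input files are uncompressed and end in `.pdb`.

```shell
# add_hydrogens_dir IN_DIR OUT_DIR
# Hypothetical helper: runs Open Babel on every *.pdb file in IN_DIR and
# writes a hydrogen-added copy with the same file name to OUT_DIR.
# Set OBABEL to use a different executable name or path (assumption).
add_hydrogens_dir() {
    in_dir="$1"
    out_dir="$2"
    mkdir -p "$out_dir"
    for pdb in "$in_dir"/*.pdb; do
        # Skip the unexpanded pattern when IN_DIR has no .pdb files.
        if [ ! -e "$pdb" ]; then continue; fi
        "${OBABEL:-obabel}" "$pdb" -O "$out_dir/$(basename "$pdb")" -h
    done
}

# Example usage with the directory names suggested in the text:
# add_hydrogens_dir heterodimer_pdb heterodimer_pdb_hydrogen
```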
- The `biorad` program is implemented in C++, can run on Linux or macOS, and has the following dependencies:
  - A C++ compiler that supports the C++17 standard (most modern C++ compilers do).
  - [optional] A C++ compiler that supports OpenMP for multi-threading. By default, multi-threading is disabled to maintain compatibility with the default macOS C++ compiler.
    - If your C++ compiler supports OpenMP (most compilers on Linux), please edit the provided `src/Makefile` to uncomment the OpenMP flag: `OPENMP = #-fopenmp` → `OPENMP = -fopenmp`
    - Please note that the default macOS compiler does not support OpenMP.
- After the dependencies have been satisfied, run the `make` command from the `src` directory. This will create the `biorad` executable.
Running the BIORAD software to train and test random forest models for predicting protein-protein binding affinity
- Running the script `scripts/batch_curration1.sh` in the root directory of the BIORAD repository will automatically run the `biorad` program on the different subsets of PDBBind and manually curated data.
  - This script assumes that the input PDB files are stored in the `heterodimer_pdb_hydrogen` directory. The script will need to be modified if this directory name is not being used.
  - This script will attempt to write output files (summarizing the machine learning test results) to a directory called `output`. The script creates the `output` directory if it does not already exist.
  - Please note that output files generated by `scripts/batch_curration1.sh` (with slightly different directory names) are already included in the `output` directory. If you don't want to overwrite these files, they should be renamed.
- Running the `biorad -h` command will display the set of allowed command line arguments for the `biorad` program.
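One way to keep the provided results before running `scripts/batch_curration1.sh` is to rename the existing `output` directory first, since the script recreates `output` if it is missing. The sketch below shows this under stated assumptions: `backup_dir` is a hypothetical helper and the `.bak` suffix is an arbitrary choice, not a convention used by this repository.

```shell
# backup_dir DIR
# Hypothetical helper: renames DIR to DIR.bak (or DIR.bak1, DIR.bak2, ...
# if that name is already taken) and prints the new name.
# Does nothing if DIR does not exist.
backup_dir() {
    src="${1%/}"
    if [ ! -d "$src" ]; then return 0; fi
    dest="$src.bak"
    n=1
    while [ -e "$dest" ]; do
        dest="$src.bak$n"
        n=$((n + 1))
    done
    mv "$src" "$dest" && echo "$dest"
}

# Example usage from the repository root:
# backup_dir output
# scripts/batch_curration1.sh
```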