BIORAD

Source code and input data files for the manuscript "The Impact of Curation Errors in the PDBBind Database on Machine Learning Predictions of Protein-Protein Binding Affinity" by Gans et al. If you run into any problems, please describe the problem on the GitHub BIORAD issues tab and we'll get back to you as soon as we can.

This repository contains the necessary computer code and input files to perform the following tasks:

- Download Open Access full-text publications from PubMed Central for the protein heterodimer records in the PDBBind.
- Download the PDB files for the protein heterodimer records in the PDBBind.
- Add hydrogen atoms to the downloaded PDB files using the Open Babel program.
- Train and test a simple random forest-based machine learning algorithm for predicting the equilibrium dissociation constant (more precisely, log₁₀(K_D)) from 3D protein structure coordinates.
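
Because the prediction target is log₁₀(K_D) rather than K_D itself, it can help to see how the two relate. The following is a minimal sketch, assuming K_D is expressed in mol/L (the units used internally by biorad are not stated here); the function name `log10_kd` is hypothetical and not part of this repository.

```python
import math

def log10_kd(kd_molar: float) -> float:
    """Return log10 of a dissociation constant given in mol/L (assumed units)."""
    if kd_molar <= 0:
        raise ValueError("K_D must be positive")
    return math.log10(kd_molar)

# A 1 nM binder and a 1 uM binder differ by three log units.
for label, kd in [("1 nM", 1e-9), ("1 uM", 1e-6)]:
    print(f"{label}: log10(K_D) = {log10_kd(kd):.2f}")
# prints:
# 1 nM: log10(K_D) = -9.00
# 1 uM: log10(K_D) = -6.00
```

Working on the log scale means that an error of one unit corresponds to a tenfold error in K_D.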

Downloading Open Access full-text publications from PubMed Central

This step is only required for obtaining a local copy of the full-text publications associated with the protein heterodimer interactions in the PDBBind. This information was used to manually curate the extraction of K_D values from the PubMed Central Open Access scientific literature.
Run the Python script scripts/download_oa.py using the provided CSV file of PDB and PubMed Central accessions (data/pdbbind_oa.csv) to download all of the corresponding files from PubMed Central. Within the output directory specified by the user (-d <output directory>), the script creates a separate subdirectory (named by PDB accession) to store the full-text information associated with each PDB accession. Note that the same manuscript may be associated with more than one PDB accession and, as a result, will be downloaded multiple times.
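
To get a sense of how many manuscripts will be downloaded more than once, the PDB-to-PubMed Central mapping can be grouped by manuscript. This is only a sketch: it assumes each row is a (PDB accession, PubMed Central accession) pair with no header, which may not match the actual layout of data/pdbbind_oa.csv. The function `shared_manuscripts` and the example accessions are made up for illustration.

```python
import csv
import io
from collections import defaultdict

def shared_manuscripts(rows):
    """Map each PMC accession to its PDB accessions, keeping only
    manuscripts linked to more than one PDB accession."""
    by_pmc = defaultdict(list)
    for pdb_id, pmc_id in rows:
        by_pmc[pmc_id.strip()].append(pdb_id.strip())
    return {pmc: pdbs for pmc, pdbs in by_pmc.items() if len(pdbs) > 1}

# Made-up rows standing in for the real CSV file (assumed two-column layout).
example = io.StringIO("1abc,PMC0000001\n2xyz,PMC0000001\n3def,PMC0000002\n")
print(shared_manuscripts(csv.reader(example)))
# prints: {'PMC0000001': ['1abc', '2xyz']}
```

If the real file has a header row or a different column order, the unpacking in the loop would need to be adjusted accordingly.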

Download the PDB files for the protein heterodimer PDBBind records

The PDB files for each protein heterodimer record in the PDBBind can be downloaded using the pdb_batch_download.sh shell script available from the RCSB PDB. This script reads a comma-separated list of PDB accessions from the file passed with the -f flag. The file data/pdbbind_heterodimer_pdb.csv contains a comma-separated list of PDB accessions for the protein heterodimer records in the PDBBind.
- After obtaining the pdb_batch_download.sh script, running the command pdb_batch_download.sh -f data/pdbbind_heterodimer_pdb.csv -o heterodimer_pdb (from the root directory of the BIORAD repository) will download all of the protein heterodimer PDB files to the directory heterodimer_pdb. The name of the destination directory for storing the PDB files, heterodimer_pdb, is suggested, but not required.
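
Before starting a long batch download, it may be worth checking that the accession list parses as expected. The sketch below assumes the file is a single comma-separated list of classic four-character PDB accessions (one digit followed by three alphanumeric characters); the helper `parse_accessions` is hypothetical and not part of this repository.

```python
import re

# Classic PDB accessions: one digit followed by three alphanumeric characters.
PDB_ID = re.compile(r"[0-9][A-Za-z0-9]{3}")

def parse_accessions(text):
    """Split a comma-separated string into (valid, invalid) accession lists."""
    tokens = [t.strip() for t in text.split(",") if t.strip()]
    valid = [t for t in tokens if PDB_ID.fullmatch(t)]
    invalid = [t for t in tokens if not PDB_ID.fullmatch(t)]
    return valid, invalid

# Made-up input; in practice the text would be read from the CSV file.
valid, invalid = parse_accessions("1ABC, 2xyz,bad_id,\n")
print(len(valid), "valid;", "invalid:", invalid)
# prints: 2 valid; invalid: ['bad_id']
```

To check the provided list, the contents of data/pdbbind_heterodimer_pdb.csv could be read with `open(...).read()` and passed to the same function.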

Add hydrogen atoms to the downloaded PDB files

After downloading all of the PDB files for the protein heterodimer records in the PDBBind, hydrogen atoms can be added using the Open Babel program.
After downloading and installing the Open Babel program, the scripts/add_hydrogens.sh script can be used to add hydrogen atoms. This script may be edited to specify the input directory of downloaded PDB files (e.g., heterodimer_pdb) and the desired output directory for storing the modified PDB files (e.g., heterodimer_pdb_hydrogen).
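
For readers who want to see what adding hydrogens with Open Babel looks like for a single file, here is a minimal sketch that builds the corresponding `obabel` command. It uses the standard `-O` (output file) and `-h` (add hydrogens) options; the options actually used by scripts/add_hydrogens.sh may differ, and the function name and example file name are hypothetical.

```python
from pathlib import Path

def obabel_command(pdb_path, output_dir):
    """Return the argument list that adds hydrogens to one PDB file."""
    pdb_path = Path(pdb_path)
    output_path = Path(output_dir) / pdb_path.name
    return ["obabel", str(pdb_path), "-O", str(output_path), "-h"]

# Hypothetical file name, used only to show the resulting command.
cmd = obabel_command("heterodimer_pdb/1abc.pdb", "heterodimer_pdb_hydrogen")
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once Open Babel is installed and the output directory exists.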

Build the `biorad` software for predicting protein-protein binding affinity

The biorad program is implemented in C++, runs on Linux or macOS, and has the following dependencies:
- A C++ compiler that supports the C++17 standard (most modern C++ compilers do).
- [optional] A C++ compiler that supports OpenMP for multi-threading. By default, multi-threading is disabled to maintain compatibility with the default macOS C++ compiler.
  - If your C++ compiler supports OpenMP (as most compilers on Linux do), please edit the provided src/Makefile to uncomment the OpenMP flag: OPENMP = #-fopenmp → OPENMP = -fopenmp
  - Please note that the default macOS compiler does not support OpenMP.
After the dependencies have been satisfied, run the make command from the src directory. This will create the biorad executable.

Running the BIORAD software to train and test random forest models for predicting protein-protein binding affinity

Running the script scripts/batch_curration1.sh from the root directory of the BIORAD repository will automatically run the biorad program on the different subsets of PDBBind and manually curated data.
- This script assumes that the input PDB files are stored in the heterodimer_pdb_hydrogen directory. The script will need to be modified if this directory name is not being used.
- This script will attempt to write output files (summarizing the machine learning test results) to a directory called output. The script creates the output directory if it does not already exist.
- Please note that output files generated by scripts/batch_curration1.sh (with slightly different directory names) are already included in the output directory. If you don't want to overwrite these files, they should be renamed.
Running the biorad -h command will display the set of allowed command line arguments for the biorad program.
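
One way to avoid overwriting the bundled results is to rename the existing output directory before running the batch script, which recreates the directory if it is missing. The following is a sketch under that assumption; the helper `backup_dir` and the backup naming scheme are hypothetical, and renaming individual files by hand works just as well.

```python
import time
from pathlib import Path

def backup_dir(path, stamp=None):
    """Rename `path` to `<path>_backup_<stamp>` if it is a directory.

    Returns the new path, or None when there is nothing to back up.
    """
    src = Path(path)
    if not src.is_dir():
        return None
    stamp = stamp or time.strftime("%Y%m%d_%H%M%S")
    dst = src.with_name(f"{src.name}_backup_{stamp}")
    src.rename(dst)
    return dst

if __name__ == "__main__":
    # Run from the repository root before scripts/batch_curration1.sh.
    print(backup_dir("output"))
```

The timestamp suffix keeps successive backups from colliding with each other.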
