This repository predicts protein-protein interactions (PPI) using multiple modalities: text, molecular structure (graph), and numerical features.
Transformer-based models and graph neural networks handle the text and graph modalities, respectively.
The core idea comes from *Multimodal Graph-based Transformer Framework for Biomedical Relation Extraction* [2].
Ours differs in that our model for the protein structural modality operates over residues rather than atoms; a minimal sketch of such an encoder is shown below.
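To make this concrete, here is a minimal sketch of what a residue-level graph encoder could look like, assuming PyTorch Geometric. The class name, the choice of `GCNConv`, and the defaults are illustrative, not the repository's actual implementation:

```python
# Illustrative sketch only: a GNN that treats each residue (not each atom) as a node.
import torch
from torch import nn
from torch_geometric.nn import GCNConv, global_mean_pool


class ResidueGraphEncoder(nn.Module):
    def __init__(self, num_residue_types: int = 20, node_dim: int = 128, num_layers: int = 2):
        super().__init__()
        # One embedding per amino-acid (residue) type.
        self.embed = nn.Embedding(num_residue_types, node_dim)
        self.convs = nn.ModuleList([GCNConv(node_dim, node_dim) for _ in range(num_layers)])

    def forward(self, residue_types, edge_index, batch):
        # residue_types: (num_residues,) integer residue-type ids
        # edge_index:    (2, num_edges) residue-residue contacts from the PDB structure
        # batch:         (num_residues,) graph assignment for batched proteins
        x = self.embed(residue_types)
        for conv in self.convs:
            x = torch.relu(conv(x, edge_index))
        return global_mean_pool(x, batch)  # one embedding per protein graph
```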
We list the accuracy, precision, recall, F1, and AUROC scores of each model in the following tables.
The following are the results of a single run. (For the tensor-fusion models, AUROC could not be computed and is marked "-"; see the note before the cross-validation results.)
Results on HPRD50
Model | val/f1 | test/acc | test/prec | test/rec | test/f1 | test/auroc |
---|---|---|---|---|---|---|
Dutta et al. [1] (Text) | - | - | 90.44 | 58.67 | 71.17 | - |
Dutta et al. [1] (Text & Graph) | - | - | 94.79 | 75.21 | 83.87 | - |
Pingali et al. [2] † | - | - | 95.47 | 94.69 | 95.06 | - |
Text | 81.82 | 97.31 | 93.33 | 70.00 | 80.00 | 97.56 |
Graph | 0.00 | 85.00 | 0.00 | 0.00 | 0.00 | 47.71 |
Num | 10.81 | 83.85 | 7.69 | 10.00 | 8.70 | 48.45 |
Text & Graph | 85.71 | 96.92 | 87.50 | 70.00 | 77.78 | 98.31 |
Text & Num | 85.71 | 94.62 | 71.43 | 50.00 | 58.82 | 95.75 |
Graph & Num | 18.18 | 82.31 | 3.57 | 5.00 | 4.17 | 47.96 |
Text & Graph & Num (Concat) | 90.91 | 96.54 | 82.35 | 70.00 | 75.68 | 98.00 |
Text & Graph & Num (TensorFusion) | 78.26 | 93.46 | 56.52 | 65.00 | 60.47 | - |
Text & Graph & Num (LowrankTensorFusion) | 69.57 | 93.85 | 60.00 | 60.00 | 60.00 | - |
Results on BioInfer
Model | val/f1 | test/acc | test/prec | test/rec | test/f1 | test/auroc |
---|---|---|---|---|---|---|
Dutta et al. [1] (Text) | - | - | 54.42 | 87.45 | 67.09 | - |
Dutta et al. [1] (Text & Graph) | - | - | 69.04 | 88.49 | 77.54 | - |
Pingali et al. [2] † | - | - | 78.49 | 79.78 | 80.86 | - |
Text | 83.99 | 93.79 | 79.30 | 86.38 | 82.69 | 95.93 |
Graph | 0.00 | 82.82 | 0.00 | 0.00 | 0.00 | 45.78 |
Num | 17.85 | 66.81 | 13.13 | 16.60 | 14.66 | 48.61 |
Text & Graph | 86.74 | 95.18 | 84.77 | 87.66 | 86.19 | 97.55 |
Text & Num | 86.71 | 94.30 | 80.54 | 88.09 | 84.15 | 96.60 |
Graph & Num | 23.92 | 66.74 | 19.27 | 29.36 | 23.27 | 48.58 |
Text & Graph & Num (Concat) | 85.59 | 94.74 | 82.73 | 87.66 | 85.12 | 97.63 |
Text & Graph & Num (TensorFusion) | 90.38 | 96.13 | 88.56 | 88.94 | 88.75 | - |
Text & Graph & Num (LowrankTensorFusion) | 86.46 | 95.03 | 81.51 | 91.91 | 86.40 | - |
†: The evaluation metrics in the authors' implementation appear unreliable: their text-only model is quite simple yet outperforms previous models, including a strong BioBERT-based one, and we found bugs in their implementation of the metrics.
[1] Pratik Dutta and Sriparna Saha. Amalgamation of Protein Sequence, Structure and Textual Information for Improving Protein-Protein Interaction Identification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020).
[2] Sriram Pingali, Shweta Yadav, Pratik Dutta, and Sriparna Saha. Multimodal Graph-based Transformer Framework for Biomedical Relation Extraction. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021.
The following are the results of 5-fold cross-validation, reported as mean (± standard deviation).
NOTE: The AUROC metrics of the models with tensor fusion networks could not be calculated due to a technical reason; the affected cells are marked "-".
Results on HPRD50
Model | val/f1 | test/acc | test/prec | test/rec | test/f1 | test/auroc |
---|---|---|---|---|---|---|
Text | 78.14 (±4.80) | 97.60 (±1.37) | 80.92 (±18.47) | 70.85 (±16.59) | 73.48 (±12.15) | 94.45 (±8.05) |
Graph | 10.60 (±5.43) | 67.05 (±20.25) | 2.57 (±2.28) | 27.13 (±22.83) | 4.68 (±4.14) | 51.37 (±7.86) |
Num | 18.07 (±8.53) | 69.69 (±19.87) | 7.67 (±2.70) | 37.82 (±18.00) | 11.45 (±1.98) | 55.81 (±9.57) |
Text & Graph | 74.68 (±4.99) | 97.36 (±0.75) | 75.39 (±10.89) | 67.78 (±19.24) | 68.99 (±9.81) | 92.31 (±7.11) |
Text & Num | 78.47 (±7.98) | 97.60 (±0.99) | 80.30 (±15.33) | 69.33 (±16.20) | 72.18 (±11.11) | 96.24 (±5.15) |
Graph & Num | 18.32 (±6.81) | 68.68 (±9.15) | 4.00 (±2.66) | 30.43 (±21.32) | 7.06 (±4.72) | 48.13 (±6.93) |
Text & Graph & Num (Concat) | 78.07 (±7.36) | 98.29 (±0.94) | 89.17 (±1.36) | 72.85 (±16.96) | 79.25 (±10.78) | 95.37 (±5.73) |
Text & Graph & Num (TensorFusion) | 79.45 (±5.86) | 97.29 (±0.78) | 77.69 (±20.12) | 71.50 (±17.29) | 70.19 (±6.82) | - |
Text & Graph & Num (LowrankTensorFusion) | 69.60 (±4.33) | 95.19 (±0.94) | 48.88 (±11.45) | 64.15 (±11.59) | 54.66 (±8.89) | - |
Results on BioInfer
Model | val/f1 | test/acc | test/prec | test/rec | test/f1 | test/auroc |
---|---|---|---|---|---|---|
Text | 85.85 (±2.00) | 94.66 (±1.08) | 84.73 (±3.75) | 85.69 (±3.32) | 85.14 (±2.71) | 97.31 (±0.85) |
Graph | 1.48 (±2.25) | 81.70 (±1.23) | 5.71 (±11.43) | 1.26 (±2.53) | 2.07 (±4.14) | 51.06 (±1.07) |
Num | 17.24 (±4.52) | 70.27 (±2.37) | 16.79 (±2.66) | 17.43 (±4.85) | 16.99 (±3.85) | 50.94 (±3.63) |
Text & Graph | 84.41 (±2.08) | 94.22 (±1.20) | 85.58 (±3.46) | 81.58 (±6.32) | 83.37 (±3.58) | 96.24 (±1.66) |
Text & Num | 86.54 (±2.92) | 94.72 (±1.13) | 85.49 (±3.51) | 84.73 (±4.09) | 85.06 (±3.27) | 96.57 (±1.31) |
Graph & Num | 21.81 (±1.20) | 63.75 (±0.92) | 16.43 (±1.16) | 25.61 (±4.68) | 19.94 (±2.13) | 49.84 (±1.16) |
Text & Graph & Num (Concat) | 86.48 (±3.49) | 94.82 (±1.04) | 86.93 (±2.92) | 83.63 (±2.99) | 85.23 (±2.74) | 96.22 (±1.52) |
Text & Graph & Num (TensorFusion) | 84.54 (±4.87) | 94.44 (±1.56) | 84.54 (±3.86) | 84.28 (±5.72) | 84.35 (±4.46) | - |
Text & Graph & Num (LowrankTensorFusion) | 85.89 (±0.96) | 94.43 (±1.00) | 85.00 (±3.89) | 83.58 (±3.59) | 84.21 (±2.90) | - |
Dependencies are managed with Poetry. Some dependencies (those related to PyTorch Geometric), however, cannot be installed via Poetry and need to be installed manually, as shown after the Poetry step below. We used library versions compatible with PyTorch 1.10.0.
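Assuming the standard Poetry workflow, the base dependencies can be installed first:

```sh
$ poetry install
```

Then install the PyTorch Geometric dependencies manually: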
```sh
$ CUDA=cu102  # cpu, cu102, or cu113
$ pip install torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+${CUDA}.html
$ pip install torch-sparse -f https://data.pyg.org/whl/torch-1.10.0+${CUDA}.html
$ pip install torch-geometric
```
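As an optional sanity check, you can verify that the installation succeeded:

```sh
$ python -c "import torch_geometric; print(torch_geometric.__version__)"
```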
- Download the PPI data annotated with gene names from here to `data/mm_data`.
- Convert the xlsm file to a csv file (suppose `HPRD50_multimodal_dataset.csv`).
- List the gene names in `HPRD50_multimodal_dataset.csv` with `preprocess/list_gene_names.py` (you need to specify the dataset name).
- Fetch the PDB ids and Ensembl ids corresponding to the gene names with `preprocess/fetch_pdb_ensemble_id.py`.
- Fetch the PDB files corresponding to the PDB ids with `preprocess/fetch_pdb_by_id.py`.
- Complement missing PDB ids with `preprocess/complement_pdb_id.py`.

The corresponding commands are:
```sh
$ python preprocess/list_gene_names.py data/mm_data/HPRD50_multimodal_dataset.csv data/mm_data/hprd50_gene_name.txt
$ python preprocess/fetch_pdb_ensemble_id.py data/mm_data/hprd50_gene_name.txt data/mm_data/genename2emsembl_pdb.json [hprd/bioinfer]
$ python preprocess/fetch_pdb_by_id.py data/mm_data/genename2emsembl_pdb.json data/pdb
$ python preprocess/complement_pdb_id.py data/mm_data/HPRD50_multimodal_dataset.csv data/[hprd/bioinfer]/all.csv data/mm_data/genename2emsembl_pdb.json
```
NOTE: The resultant csv file should be located at `data/[hprd/bioinfer]/all.csv`.
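As a quick, optional sanity check that the pipeline produced the expected file (shown here for the HPRD50 case):

```sh
$ head -n 3 data/hprd/all.csv
```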
The PDB files are translated into graphs on the fly. Because this processing takes some time, the results are cached in the directory specified by the `CACHE_ROOT` environment variable. You may set it in a `.env` file (see `.env.example`).
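For example, a minimal `.env` might look like this (the path is illustrative):

```sh
# .env — adapt the path to your machine; see .env.example
CACHE_ROOT=/path/to/graph_cache
```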
The method is evaluated with 5-fold cross-validation (you may change the number of folds; see the override example further below).
If needed, you may modify the configuration in `configs/train.yaml` and its dependents.
```sh
# default
$ python run.py

# train on CPU
$ python run.py trainer.gpus=0

# train on GPU
$ python run.py trainer.gpus=1
```
You may run all configuration combinations with the make command:

```sh
# Run all combinations
$ make all DATASET=hprd

# Run the text model
$ make text DATASET=hprd
```
You can override any parameter from the command line like this:

```sh
$ python run.py trainer.max_epochs=20 datamodule.batch_size=64
```
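The number of cross-validation folds can in principle be overridden the same way; the exact key depends on the config layout, so the name below is hypothetical (check `configs/train.yaml` and its dependents for the real one):

```sh
# "datamodule.num_folds" is a hypothetical key name — verify it in configs/train.yaml
$ python run.py datamodule.num_folds=10
```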
All results are tracked by MLflow. You can launch the MLflow UI with `mlflow ui`:

```sh
$ mlflow ui
```
Hyperparameters are also listed in the model configuration file; refer to it for more detail.
Option | Value |
---|---|
Optimizer | AdamW |
Batch size | 32 |
Maximum epochs | 20 |
Learning rate scheduler | Linear |
Learning rate | 5e-5 |
Warmup epochs | 5 |
Weight decay | 0.01 |
Node dimension of GNN | 128 |
Number of GNN layers | 2 |
Dimension of numerical feature | 180 |
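For reference, here is a minimal sketch of how these settings could be wired together, assuming `torch.optim.AdamW` and Hugging Face `transformers`' linear warmup schedule. This is illustrative; the actual wiring lives in the repository's model code and configuration:

```python
# Illustrative only: reproduces the hyperparameters in the table above.
import torch
from transformers import get_linear_schedule_with_warmup


def configure_optimization(model, steps_per_epoch, max_epochs=20, warmup_epochs=5):
    # AdamW with lr=5e-5 and weight decay 0.01, as in the table.
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
    # Linear schedule: warm up for 5 epochs, then decay linearly to zero.
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=warmup_epochs * steps_per_epoch,
        num_training_steps=max_epochs * steps_per_epoch,
    )
    return optimizer, scheduler
```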