This repository accompanies our paper *Understanding the Performance of Knowledge Graph Embeddings in Drug Discovery* and enables replication of the key results.
In this work we investigate the predictive performance of five KGE models on two public drug discovery-oriented KGs. Our goal is not to identify the best overall model or configuration; instead, we take a deeper look at how performance is affected by changes in the training setup, choice of hyperparameters, the random seed used to initialise model parameters, and different splits of the datasets. Our results highlight that these factors have a significant impact on performance and can even change the relative ranking of models. We argue that these factors should be reported alongside model architectures to ensure full reproducibility and fair comparison in future work, and that this is critical for the acceptance, use and impact of KGEs in a biomedical setting.
The models we investigate are:
- ComplEx
- DistMult
- RotatE
- TransE
- TransH
These are evaluated on two public, drug discovery-oriented knowledge graphs, both of which are provided through PyKEEN (Hetionet is the dataset used in the example commands below).
Part of this study was a detailed search over the hyperparameter space for all models and across both datasets. The best overall values for each dataset are reported below; a sketch of running one such configuration directly with PyKEEN follows the tables. Note that these values are taken from experiments with a fixed training setup consisting of the Adagrad optimiser and Margin Ranking Loss, with no inverse relations.
Model | Embedding Size | Num Epochs | Learning Rate | Num Negatives |
---|---|---|---|---|
ComplEx | 272 | 700 | 0.03 | 91 |
DistMult | 80 | 400 | 0.02 | 41 |
RotatE | 512 | 500 | 0.03 | 41 |
TransE | 304 | 500 | 0.02 | 61 |
TransH | 480 | 800 | 0.005 | 1 |
Model | Embedding Size | Num Epochs | Learning Rate | Num Negatives |
---|---|---|---|---|
ComplEx | 464 | 600 | 0.09 | 91 |
DistMult | 480 | 100 | 0.05 | 71 |
RotatE | 448 | 900 | 0.06 | 31 |
TransE | 448 | 600 | 0.1 | 91 |
TransH | 368 | 900 | 0.06 | 31 |
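For reference, the repository drives all experiments through the YAML configs described below, but a single configuration from the tables above can also be reproduced directly with PyKEEN's `pipeline` function. The snippet is a minimal sketch: the pairing of table to dataset, the output directory and the random seed are illustrative assumptions, not values taken from the repository configs.

```python
from pykeen.pipeline import pipeline

# Minimal sketch: train RotatE with one set of values from the tables above.
# Which table corresponds to which dataset is an assumption here.
result = pipeline(
    dataset="Hetionet",
    model="RotatE",
    model_kwargs=dict(embedding_dim=512),
    loss="MarginRankingLoss",
    optimizer="Adagrad",
    optimizer_kwargs=dict(lr=0.03),
    negative_sampler_kwargs=dict(num_negs_per_pos=41),
    training_kwargs=dict(num_epochs=500),
    random_seed=42,  # illustrative seed, not from the paper
)
result.save_to_directory("results/rotate_hetionet_example")
```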
The dependencies required to run the notebooks can be installed as follows:
$ pip install -r requirements.txt
The code relies primarily on the PyKEEN package, which uses PyTorch behind the scenes for gradient computation. If you want to run the experiments, it is advisable to install a GPU-enabled version of PyTorch first. Details on how to do this are provided here.
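A quick way to confirm that the installed PyTorch build can see a GPU before launching any experiments:

```python
import torch

# Prints True when a CUDA-enabled PyTorch build and a visible GPU are present.
print(torch.cuda.is_available())
```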
Note that all of the models and both of the datasets are provided as part of the PyKEEN package, so there is no need to download the datasets separately.
This repository contains code to replicate the experiments detailed in the accompanying manuscript. Each experiment is run using a combination of a Python script and an associated YAML configuration file. The general pattern is experiment.py plus the path to a config file of the form config/experiment/dataset/model.yaml.
We now provide examples of running each experiment. Please note that the results of these experiments will be saved in the results directory at the root of this repository.
The baseline experiments are run using sensible default hyperparameters and can be used as a point of comparison against more optimised values. The baseline experiments can each be run as follows:
$ python src/baseline.py -c config/baseline/hetionet/rotate.yaml
Where both the dataset and model can be chosen from those available.
The HPO experiments perform 100 trials of hyperparameter optimisation to find the best values for a given dataset and model combination. The HPO experiments can each be run as follows:
$ python src/hpo.py -c config/hpo/hetionet/rotate.yaml
Where both the dataset and model can be chosen from those available. Note that by default we use the TPE search method; however, a random search can also be used by changing the value of the sampler configuration option in the YAML files.
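For orientation, PyKEEN exposes the same functionality through its `hpo_pipeline` helper. The sketch below shows roughly what one HPO run configured this way looks like; the trial count and sampler mirror the description above, while the dataset, model and output directory are illustrative rather than the repository's exact configuration.

```python
from pykeen.hpo import hpo_pipeline

# Rough sketch of one HPO run: 100 trials with the TPE sampler
# (pass sampler="random" for a random search instead).
hpo_result = hpo_pipeline(
    n_trials=100,
    dataset="Hetionet",
    model="RotatE",
    sampler="tpe",
)
hpo_result.save_to_directory("results/hpo_rotate_hetionet_example")
```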
The previous experiment optimised the hyperparameters across all relation types. In these experiments, we optimise over only edges between gene and disease entities.
$ python src/hpo_relation.py -c config/hpo_relation/hetionet/rotate.yaml
Note that only the Hetionet dataset can be used for these experiments.
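One way to express this kind of relation restriction in PyKEEN is via `TriplesFactory.new_with_restriction`. The sketch below is an assumption about how the evaluation triples could be limited to a single relation type, not the repository's implementation, and the relation label shown is a placeholder.

```python
from pykeen.datasets import Hetionet
from pykeen.pipeline import pipeline

# Sketch only: restrict the test triples to a single relation type before evaluation.
# "DaG" is a placeholder label for Hetionet's disease-associates-gene relation;
# check dataset.relation_to_id for the exact identifier.
dataset = Hetionet()
restricted_testing = dataset.testing.new_with_restriction(relations=["DaG"])

result = pipeline(
    training=dataset.training,
    validation=dataset.validation,
    testing=restricted_testing,
    model="RotatE",
)
```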
The model seeds experiments are designed to assess how variations in the random seed used to initialise the model parameters affect predictive performance. As such, ten repeats over different random seeds are performed. The model random seed experiments can each be run as follows:
$ python src/model_seed_repeats.py -c config/model_repeats/hetionet/rotate.yaml
Where both the dataset and model can be chosen from those available.
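As a rough illustration of what each repeat amounts to (a sketch, not the contents of `src/model_seed_repeats.py`), every run simply varies PyKEEN's `random_seed` while keeping everything else fixed:

```python
from pykeen.pipeline import pipeline

# Sketch: repeat the same configuration over ten model-initialisation seeds.
for seed in range(10):
    result = pipeline(
        dataset="Hetionet",
        model="RotatE",
        random_seed=seed,
    )
    result.save_to_directory(f"results/model_seed_{seed}")
```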
These experiments are designed to assess how changes in the train/test split in the dataset can affect predictive performance. These experiments can each be run as follows:
$ python src/dataset_repeats.py -c config/datasplits/hetionet/rotate.yaml
Where both the dataset and model can be chosen from those available.
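Conceptually, each repeat derives a fresh train/test/validation split with a different random state before training. The following is a minimal sketch of that idea with PyKEEN; re-splitting only the original training factory, the 80/10/10 ratios and the number of repeats are simplifying assumptions, not the repository's exact procedure.

```python
from pykeen.datasets import Hetionet
from pykeen.pipeline import pipeline

# Sketch: re-split the triples with a different random state on each repeat.
base = Hetionet().training
for split_seed in range(5):
    training, testing, validation = base.split([0.8, 0.1, 0.1], random_state=split_seed)
    result = pipeline(
        training=training,
        testing=testing,
        validation=validation,
        model="RotatE",
    )
    result.save_to_directory(f"results/datasplit_{split_seed}")
```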
In these experiments, we investigate how the training setup (loss function, optimiser and use of inverse relations) can impact predictive performance. Please note that this experiment runs the repeats over all models and datasets from a single script, so there is no need to specify a particular model or dataset:
$ python src/ablation.py
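For intuition, the kind of grid this covers can be expressed as nested loops over the training-setup choices. The following is a sketch under assumed option lists and a single fixed model/dataset pair; it is not the contents of `src/ablation.py`.

```python
from itertools import product

from pykeen.pipeline import pipeline

# Sketch: sweep the training setup (loss, optimiser, inverse relations) for one
# model/dataset pair. The option lists below are assumptions for illustration.
losses = ["MarginRankingLoss", "SoftplusLoss"]
optimizers = ["Adagrad", "Adam"]
inverse_options = [False, True]

for loss, optimizer, inverse in product(losses, optimizers, inverse_options):
    result = pipeline(
        dataset="Hetionet",
        model="RotatE",
        loss=loss,
        optimizer=optimizer,
        dataset_kwargs=dict(create_inverse_triples=inverse),
    )
    result.save_to_directory(f"results/ablation_{loss}_{optimizer}_inverse_{inverse}")
```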
Please consider citing the paper if you find this repository useful:
@article{bonner2022understanding,
title={Understanding the performance of knowledge graph embeddings in drug discovery},
author={Bonner, Stephen and Barrett, Ian P and Ye, Cheng and Swiers, Rowan and Engkvist, Ola and Hoyt, Charles Tapley and Hamilton, William L},
journal={Artificial Intelligence in the Life Sciences},
pages={100036},
year={2022},
publisher={Elsevier}
}
License