aiDIVA is an analysis pipeline that combines a pathogenicity-based approach and an optional evidence-based approach with state-of-the-art large language models (LLMs) to identify potential disease causing variants in a given rare disease sample.
aiDIVA comprises the following steps:
-
A pathogenicity-based approach that utilizes the predictions of a random forest model trained on ClinVar data, supplemented with phenotype information given as HPO terms.
-
An optional evidence-based approach includes the ranks and scores that can be obtained from the VariantRanking tool included in ngs-bits.
-
State-of-the-art LLMs are utilized to refine the ranking results from the previous two approaches.
-
A final meta model is used to combine all preliminary results and create a final ranking of the variants. For this model we also use a random forest.
If you use aiDIVA in your work please cite our preprint:
aiDIVA - Diagnostics of Rare Genetic Diseases Using Large Language Models (link)
Please report any issues or questions to the aiDIVA Issue Tracker.
The program is written in Python 3
Latest used version: 3.12.3
The following additional libraries need to be installed in order to use the program (latest used version):
- networkx (v3.4.2)
- numpy (v1.26.4)
- openai (v1.60.2)
- pandas (v2.2.3)
- pysam (v0.22.1)
- pyyaml (v6.0.2)
- scipy (v1.15.1)
- scikit-learn (v1.3.2)
For easy package installation in your Python virtual environment we included a requirements.txt just run pip install -r requirements.txt to install all necessary packages (the versions in the requirements.txt match our own setup at that time).
If a newer scikit-learn version is used it is advised to create a new model with the newer scikit-learn version.
To run the aiDIVA software you need a TAB separated file containing the annotation information for every variant present in your sample file.
Detailed instructions on how to run the software and what columns need to be present in the input table can be found here.
If you don't have an annotated table with the necessary columns mentioned before. You can use the run_annotation script provided in the annotation folder to create a table with the necessary information. Before you use this annotation script make sure that the necessary database resources and tools are present on your system and the paths in the configuration file are set correctly (IMPORTANT: use the correct configuration file, it differs between the annotation and aiDIVA!).
Instructions on how to use the annotation script and prepare the annotation resources and tools can be found here.
The HPO resources required for the prioritization step need to be downloaded before using aiDIVA. See the instructions (found in the doc/aidiva folder) for the relevant download links. You can place the generated files in the data/hpo_resources folder. The path to the files is specified in the configuration file make sure that it leads to the correct location.
There is one random forest model that is used in aiDIVA to predict the pathogenicity of a given variant. It is a combined model for SNV and inframe indel variants. The training data of the model consists of variants from Clinvar.
The scripts used to train the model can be found in the following GitHub repository: aiDIVA-Training
Frameshift variants will get a default score of 0.9, whereas synonymous variants always get the lowest score 0.0
A pretrained random forest model (aidiva-rf) using our current feature set can be found here. The latest model was trained using scikit-learn v1.3.2. The trained models of scikit-learn are version dependent.
aiDIVA supports the use of the official OpenAI API to send the requests to GPT-4o or GPT-4.1 for example. To use the OpenAI API you need an account and an API-Key that needs to be specified in the configuration file. Alternatively it is possible to set up your own local LLM (eg., LLama-8b, Mistral-12b, ...) and provide it locally as a Webservice. For an easy deployment you could use the NVIDIA NIM Containers see here for more details on how to do that. These local LLMs use the same python package for inference you just have to specify the port and URL where to find the local model in the configuration file.
You can download the pretrained meta models (aidiva-meta & aidiva-meta-rf) here. For these two models we used a random forest model that takes as features the ranking position and scores from the initial rankings (pathogenicity-based and evidence-based) plus the ranking result from the LLMs and the inheritance mode used in the evidence-based model.
This software is provided for research and informational purposes only.
It is not intended to provide medical, clinical, diagnostic, or therapeutic advice, and it must not be used as a substitute for professional judgment.
The software has not been validated, certified, or approved by any regulatory authority (including but not limited to the FDA, EMA, or other healthcare agencies).
It is not designed or intended for use in real-world medical or clinical decision-making.
Always consult qualified healthcare professionals for medical advice, diagnosis, or treatment.
This software is released under the MIT License; however, it may reference or interoperate with external datasets or databases that are subject to their own license restrictions or terms of use.
Users are solely responsible for ensuring they have the legal right to access and use any external databases required by this project, and must comply with all applicable terms set by the data providers.