CARDBiomedBench 🧬

CARDBiomedBench is a benchmarking suite for evaluating Large Language Models (LLMs) on complex biomedical question-answering tasks. For detailed methodology and results, see our paper, "CARDBiomedBench: A Benchmark for Evaluating Large Language Model Performance in Biomedical Research" (full citation under Citing). The CARDBiomedBench dataset is hosted on Hugging Face 🤗.

Setup Environment

Create a Conda environment with the necessary dependencies:

source scripts/setup_conda_env.sh

Setup Benchmark

Prepare directories, configure environment variables, and download the dataset:

python scripts/setup_benchmark_files.py

Run Benchmark

Hands-Free Execution

Run the benchmark end-to-end:

python scripts/run_benchmark.py --run_responses --run_metrics --run_graphs
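
The three flags in the command above suggest that the pipeline runs in stages (responses, then metrics, then graphs). The following is a minimal sketch of a wrapper that builds the command for a subset of stages, for example to regenerate graphs without querying the models again. It assumes each flag can be passed on its own, which this README does not confirm; `build_command` and `STAGE_FLAGS` are hypothetical names used only for illustration.

```python
import sys

# Stage names mapped to the flags shown in the command above.
STAGE_FLAGS = {
    "responses": "--run_responses",
    "metrics": "--run_metrics",
    "graphs": "--run_graphs",
}


def build_command(stages, script="scripts/run_benchmark.py"):
    """Return the argv list for running the requested stages."""
    unknown = [s for s in stages if s not in STAGE_FLAGS]
    if unknown:
        raise ValueError(f"unknown stage(s): {unknown}")
    # Emit flags in the order used in the README, regardless of input order.
    flags = [flag for name, flag in STAGE_FLAGS.items() if name in stages]
    return [sys.executable, script, *flags]


if __name__ == "__main__":
    # Only print the command here; to execute it, pass the list to
    # subprocess.run(cmd, check=True) from the repository root.
    print(" ".join(build_command(["metrics", "graphs"])))
```

Building the argument list separately from executing it makes it easy to inspect or log the exact command before launching a long run.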

Running with Slurm Cluster

If you are using a Slurm cluster, submit one job per model using the example commands in slurm_commands.txt.

Customizing the Configuration

Modify configs/default_config.yaml to adjust settings:

Dataset Settings: Change dataset split or name.
Prompts: Customize system or grading prompts.
Model Parameters: Adjust max_tokens, temperature, etc.
Paths: Modify directories for data and outputs.
Models: Add or remove models.
Metrics: Enable or disable evaluation metrics.
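
For experiments that change only a few settings, it can be convenient to keep configs/default_config.yaml untouched and layer overrides on top of it. The sketch below shows one way to merge nested overrides into a configuration that has already been loaded as a Python dictionary (for instance with PyYAML's `yaml.safe_load`). The key names `model_params` and `models` are assumptions made for the example; the real keys are whatever configs/default_config.yaml defines. `apply_overrides` is a hypothetical helper, not part of this repository.

```python
import copy


def apply_overrides(config, overrides):
    """Return a copy of config with nested overrides merged in."""
    merged = copy.deepcopy(config)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so sibling keys in the nested section are preserved.
            merged[key] = apply_overrides(merged[key], value)
        else:
            # Scalars, lists, and new keys replace the old value outright.
            merged[key] = copy.deepcopy(value)
    return merged


if __name__ == "__main__":
    # Hypothetical structure; check configs/default_config.yaml for real keys.
    base = {
        "model_params": {"max_tokens": 1024, "temperature": 0.0},
        "models": ["model-a"],
    }
    tweaked = apply_overrides(base, {"model_params": {"temperature": 0.7}})
    print(tweaked)
```

Because the function returns a new dictionary, the default configuration stays unchanged and several variants can be derived from it in one session.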

Project Structure

configs/: Configuration files and environment variables.
data/: Dataset files.
results/: Output results.
logs/: Log files.
scripts/: Scripts for setup, execution, and utilities.
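
As a quick sanity check after setup, the layout above can be verified from the repository root. This is a minimal sketch that only tests whether the five listed directories exist; it assumes they all sit directly under the root, and `missing_dirs` is a hypothetical helper rather than a script shipped with the benchmark.

```python
from pathlib import Path

# Directory names taken from the project structure list above.
EXPECTED_DIRS = ("configs", "data", "results", "logs", "scripts")


def missing_dirs(root="."):
    """Return the expected directories that do not exist under root."""
    root_path = Path(root)
    return [name for name in EXPECTED_DIRS if not (root_path / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("All expected directories are present.")
```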

Citing

@article {Bianchi2025.01.15.633272,
	author = {Bianchi, Owen and Willey, Maya and Alvarado, Chelsea X and Danek, Benjamin and Khani, Marzieh and Kuznetsov, Nicole and Dadu, Anant and Shah, Syed and Koretsky, Mathew J and Makarious, Mary B and Weller, Cory and Levine, Kristin S and Kim, Sungwon and Jarreau, Paige and Vitale, Dan and Marsan, Elise and Iwaki, Hirotaka and Leonard, Hampton and Bandres-Ciga, Sara and Singleton, Andrew B and Nalls, Mike A. and Mokhtari, Shekoufeh and Khashabi, Daniel and Faghri, Faraz},
	title = {CARDBiomedBench: A Benchmark for Evaluating Large Language Model Performance in Biomedical Research},
	year = {2025},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2025/01/19/2025.01.15.633272},
	journal = {bioRxiv}
}
