This repository adapts and extends the upstream nomic-ai/contrastors to finetune and pretrain Nomic text embedding models on chemistry-specific scientific corpora. The resulting models are named ChEmbed and are optimized for chemical literature retrieval and related tasks.
- Upstream codebase: nomic-ai/contrastors
- Chemistry data curation and pair mining repo: HSILA/Chemistry-Data
- Chemistry datasets collection (query–passage pairs): Chemical Data
- Trained ChEmbed models: ChEmbed
- Paper: ChEmbed: Enhancing Chemical Literature Search Through Domain-Specific Text Embeddings (arXiv:2508.01643)
The training was executed on Compute Canada (Narval) using 4× A100 GPUs.
- Built on FlashAttention for efficient training
- Multi-GPU support (PyTorch Distributed / Deepspeed)
- GradCache for large batches
- Hugging Face Transformers integration (BERT-style encoders)
- CLIP-style contrastive objectives with paired and triplet data
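The last bullet refers to an InfoNCE-style loss over in-batch negatives. The snippet below is a generic sketch of that idea only, not the repository's exact loss implementation (which also handles triplet data and large batches via GradCache):

```python
# Generic sketch of a CLIP-style (InfoNCE) loss over query-passage pairs with
# in-batch negatives; illustration only, not the contrastors implementation.
import torch
import torch.nn.functional as F

def info_nce(q: torch.Tensor, p: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """q, p: (batch, dim) embeddings of matched query/passage pairs."""
    q = F.normalize(q, dim=-1)
    p = F.normalize(p, dim=-1)
    logits = q @ p.T / temperature                        # (batch, batch) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)    # diagonal entries are the positives
    return F.cross_entropy(logits, targets)

# toy usage
q, p = torch.randn(8, 768), torch.randn(8, 768)
print(info_nce(q, p))
```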
- Create and activate a virtualenv:

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  ```

- Install core build tools and dependencies:

  ```bash
  pip install wheel packaging ninja setuptools torch
  pip install -r requirements.txt
  pip install -e .
  ```

- Install FlashAttention from source if your environment requires it; refer to `wheels/README.md` for more details.
Use the provided batch script `training-script.sh`, which sets up the environment and launches training:

```bash
sbatch training-script.sh
```

What it does:
- Loads site modules (CUDA 12.2, GCC 12.3, Python 3.10, etc.)
- Configures offline caches for Hugging Face (`HF_HUB_OFFLINE=1`, `HF_DATASETS_OFFLINE=1`)
- Creates a venv on `$SLURM_TMPDIR`
- Installs wheels from the local `wheels/` directory (see `wheels/README.md`) and `requirements-cc.txt`
- Launches training with 4 GPUs via `torchrun`
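Because the job runs with the Hugging Face caches in offline mode, it helps to confirm that the base checkpoint and tokenizer resolve from the local cache before submitting. The following is a small verification sketch, not part of `training-script.sh`; it assumes the models were already downloaded on a node with internet access:

```python
# Quick offline-cache check before `sbatch`; assumes the checkpoints were pre-downloaded.
import os

os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["HF_DATASETS_OFFLINE"] = "1"

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
print("offline cache OK:", tokenizer.vocab_size, sum(p.numel() for p in model.parameters()))
```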
If you do not have access to the internet, you can build the environment manually:
- Place all required wheels under `wheels/` and follow `wheels/README.md` for downloading the correct dependencies.
- Point `setup.py` to use `requirements-cc.txt` instead of `requirements.txt` when installing on Narval.
with open("requirements.txt") as f:
requirements = f.read().splitlines()Change requirements.txt to requirements-cc.txt if you want to default to the Compute Canada set when running pip install -e . in an offline context.
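If you switch between the two environments often, one option is a small environment-variable toggle. This is only an illustrative sketch; the variable name `CONTRASTORS_REQS` is made up here, not something the repository defines:

```python
# Illustrative only: select the requirements file at install time, e.g.
#   CONTRASTORS_REQS=requirements-cc.txt pip install -e .
import os

requirements_file = os.environ.get("CONTRASTORS_REQS", "requirements.txt")
with open(requirements_file) as f:
    requirements = f.read().splitlines()
```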
We rewrote the dataloader to support both local file paths and S3-compatible endpoints. Dataset configurations live in:
- `src/contrastors/configs/data/chem_finetune_pairs.yaml`
- `src/contrastors/configs/data/chem_finetune_triplets.yaml`
You can use absolute local paths (e.g., /path/.../shard-{00000..00004}.jsonl.gz) or S3 URIs (with proper fsspec configuration). See the YAMLs above for concrete examples.
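For reference, this is roughly how a single shard can be streamed from either a local path or an S3-compatible endpoint with `fsspec`; the bucket, endpoint, and record fields below are placeholders, and the dataloader's actual reader differs in detail:

```python
# Sketch: stream one compressed JSONL shard via fsspec. Bucket, endpoint, and the
# record schema are placeholders; local paths work the same way without client_kwargs.
import json
import fsspec

path = "s3://my-bucket/chem_pairs/shard-00000.jsonl.gz"  # or a local /path/to/shard-00000.jsonl.gz

with fsspec.open(path, "rt", compression="gzip",
                 client_kwargs={"endpoint_url": "https://s3.example.com"}) as f:
    for line in f:
        record = json.loads(line)  # e.g. a query-passage pair, depending on the dataset
        break
```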
- Finetune (starting from `nomic-ai/nomic-embed-text-v1`): `src/contrastors/configs/train/chem_contrastive_finetune.yaml`
- Pretrain (starting from `nomic-ai/nomic-embed-text-v1-unsupervised`): `src/contrastors/configs/train/chem_contrastive_pretrain.yaml`
Example commands (from `src/contrastors`):

```bash
# finetune
torchrun --nproc-per-node=4 train.py \
    --config=configs/train/chem_contrastive_finetune.yaml \
    --dtype=bf16

# pretrain
torchrun --nproc-per-node=4 train.py \
    --config=configs/train/chem_contrastive_pretrain.yaml \
    --dtype=bf16
```

Key fields in `model_args` within these configs:

- `model_name`: base checkpoint (finetune: `nomic-ai/nomic-embed-text-v1`, pretrain: `nomic-ai/nomic-embed-text-v1-unsupervised`)
- `tokenizer_name`: `bert-base-uncased` or the chemistry-adapted `BASF-AI/ChemVocab`
- `trainable_params`: controls which parameters update (see below)
To better represent chemical entities (e.g., IUPAC names), we adapted the tokenizer by inserting chemistry-specific tokens into previously unused slots and released it as BASF-AI/ChemVocab. In the paper, we describe adding roughly 900 specialized tokens and maintaining long-context support, which improves retrieval over general-purpose tokenizers. See the paper for details: arXiv:2508.01643.
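To get a feel for what the vocabulary swap changes, you can compare how the two tokenizers split an IUPAC name. The example molecule is illustrative, and the exact splits you see will depend on the released vocabulary:

```python
# Compare tokenization of an IUPAC name under the original and adapted vocabularies.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("bert-base-uncased")
chem = AutoTokenizer.from_pretrained("BASF-AI/ChemVocab")

name = "2-acetyloxybenzoic acid"  # aspirin's IUPAC name
print(base.tokenize(name))  # typically fragments the name into many generic sub-word pieces
print(chem.tokenize(name))  # the adapted vocabulary should produce fewer, chemistry-aware pieces
```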
Training schemes used with the adapted tokenizer:
- Vanilla: use the original tokenizer and finetune as usual.
- Full: switch to `BASF-AI/ChemVocab` and train the entire model end-to-end.
- Plug: switch to `BASF-AI/ChemVocab` and only learn the newly added “unused” token embeddings, keeping the rest fixed.
- Progressive: stage-wise training; start with only the new token embeddings, then open up the rest of the model (and finally everything) for stable adaptation.
How to set this via `model_args.trainable_params`:

- `all`: train everything (use for Full; final stage of Progressive; also Vanilla when keeping the original tokenizer).
- `unused_only`: freeze the whole model except the word-embedding matrix; within the embedding layer, only gradients for previously unused token IDs are kept (illustrated in the sketch below). Use for Plug and stage 1 of Progressive.
- `unused_and_rest`: train the whole model but freeze gradients for special tokens and already-used tokens in the embedding layer; the rest of the model updates normally. Useful as stage 2 in Progressive.
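The `unused_only` behaviour can be pictured as a gradient mask over embedding rows. The snippet below is only a conceptual sketch with placeholder token IDs and a plain `nn.Embedding`, not the repository's actual implementation:

```python
# Conceptual sketch of unused_only: freeze everything except the word-embedding
# matrix, then mask out gradients for all rows except the repurposed token IDs.
import torch
import torch.nn as nn

vocab_size, hidden_dim = 30522, 768        # BERT-sized vocabulary (illustrative)
embedding = nn.Embedding(vocab_size, hidden_dim)
new_token_ids = torch.tensor([1, 2, 3])    # placeholder IDs of the filled [unused##] slots

mask = torch.zeros(vocab_size, 1)
mask[new_token_ids] = 1.0

# During backprop, multiply the embedding gradient by the mask so only the new rows update.
embedding.weight.register_hook(lambda grad: grad * mask)

# All other model parameters would have requires_grad=False under this scheme.
```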
You can switch schemes by editing model_args.tokenizer_name and model_args.trainable_params in the YAML configs.
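For example, to prepare a Plug-style run without editing the YAML by hand, a small script can clone the finetune config and swap those two fields. This is only a sketch, run from `src/contrastors`; it assumes `model_args` is a top-level mapping in the config, and the output filename is arbitrary:

```python
# Sketch: derive a Plug variant of the finetune config (assumes model_args is a
# top-level mapping in the YAML; the output filename is arbitrary).
import yaml

with open("configs/train/chem_contrastive_finetune.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["model_args"]["tokenizer_name"] = "BASF-AI/ChemVocab"
cfg["model_args"]["trainable_params"] = "unused_only"

with open("configs/train/chem_contrastive_finetune_plug.yaml", "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```

Launch it with the same `torchrun` command shown above, pointing `--config` at the new file.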
- Chemistry data collection and pair mining code: HSILA/Chemistry-Data
- Curated datasets (pairs/triplets): Chemical Data
- Trained ChEmbed models: ChEmbed
This code is licensed under the Apache 2.0 License. See model cards for individual model licenses.
We extend thanks to the authors and maintainers of nomic-ai/contrastors, FlashAttention, GradCache, and the Hugging Face ecosystem. Additionally, we would like to thank BASF and the Digital Research Alliance of Canada for providing the computing resources for this project.
If you find ChEmbed or this codebase useful, please cite:
@misc{kasmaee2025chembed,
title = {ChEmbed: Enhancing Chemical Literature Search Through Domain-Specific Text Embeddings},
author = {Ali Shiraee Kasmaee and Mohammad Khodadad and Mahdi Astaraki and Mohammad Arshi Saloot and Nicholas Sherck and Hamidreza Mahyar and Soheila Samiee},
year = {2025},
eprint = {2508.01643},
archivePrefix= {arXiv},
primaryClass = {cs.IR},
doi = {10.48550/arXiv.2508.01643},
url = {https://www.arxiv.org/abs/2508.01643}
}

You may also find the upstream Nomic Embed papers helpful:
@misc{nussbaum2024nomic,
title = {Nomic Embed: Training a Reproducible Long Context Text Embedder},
author = {Zach Nussbaum and John X. Morris and Brandon Duderstadt and Andriy Mulyar},
year = {2024},
eprint = {2402.01613},
archivePrefix= {arXiv},
primaryClass = {cs.CL}
}
@misc{nussbaum2024nomicembedvisionexpanding,
title = {Nomic Embed Vision: Expanding the Latent Space},
author = {Zach Nussbaum and Brandon Duderstadt and Andriy Mulyar},
year = {2024},
eprint = {2406.18587},
archivePrefix= {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2406.18587}
}