Tokenizer and evaluation utilities for turning molecular structures into token sequences using Residual Vector Quantization (RVQ) codebooks. This repo focuses on:
- sampling OMol25 into CSV
- building RVQ codebooks for positions, forces, and energy
- serializing molecules into token sequences
- pushing tokenized shards and the tokenizer to the Hugging Face Hub
- evaluating trained models on OMol25 benchmarks (S2EF + 7 chemistry tasks)
- `omol_sample_1m_to_csv.py`: stream a 1M sample from `colabfit/OMol25_train` into CSV.
- `build_rvq_codebooks.py`: build RVQ codebooks (quantile or k-means) and evaluate reconstruction.
- `serialize_molecules.py`: `MoleculeTokenizer` + CSV serialization to token sequences.
- `push_to_hf.py`: tokenize parquet shards and upload dataset shards to HF.
- `push_tokenizer_to_hf.py`: upload the tokenizer as a HF model repo.
- `test_roundtrip.py`: round-trip test through the HF tokenizer and decode back to molecules.
- `evaluate_omol25.py`: evaluate trained models on OMol25 S2EF and chemistry benchmarks.
- `omol25_train_sample_1k.csv`: small sample for quick tests.
- `codebook_mol_1m.pkl`: prebuilt codebook used by the scripts.
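The residual-quantization idea behind the codebooks can be sketched in a few lines. This is a minimal NumPy illustration, not the repo's API: the function names and the greedy nearest-codeword search are assumptions about how an RVQ encode/decode pair typically works.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy residual VQ: at each level pick the codeword nearest the
    current residual, then subtract it before moving to the next level."""
    residual = np.asarray(x, dtype=float)
    codes = []
    for cb in codebooks:  # cb has shape (codebook_size, dim)
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the chosen codeword from every level."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))
```

Each extra level quantizes what the previous levels left over, so reconstruction error shrinks as levels are added.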
Python 3.12+ is required.
```
# Option 1: uv (recommended)
uv sync

# Option 2: pip
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
```

For GPU support, install PyTorch with CUDA first:

```
pip install torch --index-url https://download.pytorch.org/whl/cu121
```

Tokens are organized into sections so they can be shuffled for flexible conditioning:
```
[BOS]
[ATOMS] [Z=6] [Z=1] ... [ATOMS_END]
[POS] [P0:...] ... [NL] ... [POS_END]
[FORCE] [FX0:...] [FY0:...] [FZ0:...] [NL] ... [FORCE_END]
[ENERGY] [E0:...] [NL] [ENERGY_END]
[EOS]
```
`[NL]` is a newline token that separates per-atom rows in the position and force sections.
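The section layout above is mechanical enough to sketch. The real logic lives in `serialize_molecules.py`; this illustration's function name, argument shapes (one list of RVQ codes per atom for positions, one list of per-level `(fx, fy, fz)` code triples per atom for forces), and the meaning of the numeric suffix as the RVQ level are all assumptions.

```python
def serialize_tokens(zs, pos_codes, force_codes, energy_codes):
    """Lay one molecule out in the section-delimited token format,
    with [NL] closing each per-atom row (illustrative sketch)."""
    toks = ["[BOS]", "[ATOMS]"]
    toks += [f"[Z={z}]" for z in zs]
    toks += ["[ATOMS_END]", "[POS]"]
    for codes in pos_codes:  # one row of RVQ codes per atom
        toks += [f"[P{lvl}:{c}]" for lvl, c in enumerate(codes)]
        toks.append("[NL]")
    toks += ["[POS_END]", "[FORCE]"]
    for triples in force_codes:  # per atom: (fx, fy, fz) codes per level
        for lvl, (fx, fy, fz) in enumerate(triples):
            toks += [f"[FX{lvl}:{fx}]", f"[FY{lvl}:{fy}]", f"[FZ{lvl}:{fz}]"]
        toks.append("[NL]")
    toks += ["[FORCE_END]", "[ENERGY]"]
    toks += [f"[E{lvl}:{c}]" for lvl, c in enumerate(energy_codes)]
    toks += ["[NL]", "[ENERGY_END]", "[EOS]"]
    return toks
```

Because each section is bracketed by its own start/end tokens, whole sections can be reordered at training time without ambiguity.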
```
python3 serialize_molecules.py omol25_train_sample_1k.csv \
  --codebook codebook_mol_1m.pkl \
  --output tokenized_molecules.pkl \
  --show-examples 1
```

```
python3 omol_sample_1m_to_csv.py
```

This streams a 1M-row sample from HF `colabfit/OMol25_train` and writes `omol25_train_sample_1m.csv`.
If you are authenticated with HF, the script will pick up HF_TOKEN automatically.
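The sampling itself is done by `omol_sample_1m_to_csv.py`; the streaming pattern — consume only the first N rows of an arbitrarily large iterator and write them out incrementally — can be sketched with a stand-in iterator (a simple take-first-N here, which may differ from the sampling the script actually performs):

```python
import csv
import itertools

def stream_head_to_csv(rows, path, n):
    """Write the first n rows of a (possibly endless) row iterator to CSV
    without ever materializing the full stream in memory."""
    head = itertools.islice(iter(rows), n)
    first = next(head)  # peek one row to learn the column names
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(first))
        writer.writeheader()
        writer.writerow(first)
        writer.writerows(head)  # remaining n - 1 rows, lazily
```

With a streaming HF dataset the `rows` argument would be the dataset iterable itself, so nothing beyond the sampled rows is ever downloaded into memory at once.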
```
python3 build_rvq_codebooks.py omol25_train_sample_1m.csv \
  --output codebook_mol_1m.pkl \
  --method quantile \
  --pos-levels 8 \
  --n-levels 4 \
  --codebook-size 256
```

Quantile mode uses Morton (Z-order) binning for 3D positions. Use `--method kmeans` for k-means.
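A Morton (Z-order) key interleaves the bits of the three per-axis bin indices, so spatially nearby bins tend to receive nearby 1D codes. A minimal sketch of the interleave (illustrative only, not the repo's exact binning code):

```python
def morton3d(ix, iy, iz, bits=8):
    """Interleave the bits of three bin indices into one Z-order key.
    Bit b of ix lands at position 3b, of iy at 3b+1, of iz at 3b+2."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code
```

Sorting 3D bins by this key gives the 1D ordering over which quantile boundaries can then be taken.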
```
python3 serialize_molecules.py omol25_train_sample_1m.csv \
  --codebook codebook_mol_1m.pkl \
  --output tokenized_molecules.pkl
```

`push_to_hf.py` expects local OMol25 parquet shards (downloaded with `huggingface-cli download facebook/OMol25`) and writes train/validation shards to a dataset repo.
```
python3 push_to_hf.py /path/to/omol25_parquet_dir WillHeld/Tomol25 \
  --codebook codebook_mol_1m.pkl
```

Dataset repo: https://huggingface.co/datasets/WillHeld/Tomol25
```
python3 push_tokenizer_to_hf.py --repo-id WillHeld/marin-tomol
```

Tokenizer repo: https://huggingface.co/WillHeld/marin-tomol
```
python3 test_roundtrip.py
```

This uses `omol25_train_sample_1k.csv`, `codebook_mol_1m.pkl`, and the HF tokenizer repo `WillHeld/marin-tomol`.
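The shape of such a round-trip check — serialize, then parse tokens back and compare — is roughly the following for the atoms section alone (an illustrative sketch, not the script's actual code):

```python
import re

def parse_atoms(tokens):
    """Recover atomic numbers from the [ATOMS] ... [ATOMS_END] section
    of a decoded token list."""
    start = tokens.index("[ATOMS]") + 1
    end = tokens.index("[ATOMS_END]")
    return [int(re.fullmatch(r"\[Z=(\d+)\]", t).group(1))
            for t in tokens[start:end]]
```

A round trip then asserts that the parsed atomic numbers (and, with the codebook, the dequantized positions/forces) match the originals to within quantization error.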
```
# S2EF evaluation on the validation set (with metrics)
python3 evaluate_omol25.py \
  --model WillHeld/qwen3-omol \
  --codebook codebook_mol_1m.pkl \
  --split val \
  --output predictions_val.npz

# Run all 7 chemistry evaluation tasks
python3 evaluate_omol25.py \
  --model WillHeld/qwen3-omol \
  --codebook codebook_mol_1m.pkl \
  --run-evals \
  --eval-output-dir eval_results

# Quick test with limited samples
python3 evaluate_omol25.py \
  --model WillHeld/qwen3-omol \
  --codebook codebook_mol_1m.pkl \
  --split val \
  --max-samples 100
```

| Task | Description | Metric |
|---|---|---|
| S2EF | Structure to Energy and Forces | Energy MAE (meV/atom), Force MAE (meV/Å) |
| conformers | Identify lowest energy conformer | Accuracy (%) |
| distance_scaling | Energy vs intermolecular distance | MAE (meV) |
| ligand_strain | Bound vs relaxed ligand energy | MAE (meV) |
| ligand_pocket | Protein-ligand interaction energy | MAE (meV) |
| protonation | Protonated vs deprotonated energy (pKa proxy) | MAE (meV) |
| ie_ea | Ionization energy / electron affinity | MAE (meV) |
| spin_gap | Singlet-triplet energy gap | MAE (meV) |
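The S2EF metrics are straightforward to compute from saved predictions. A sketch of the per-atom normalization, assuming (an assumption here, not stated by the scripts) that energies are in eV and forces in eV/Å:

```python
import numpy as np

def s2ef_mae(e_pred, e_true, n_atoms, f_pred, f_true):
    """Energy MAE normalized per atom (meV/atom) and force MAE (meV/Å),
    converting from eV by the factor 1000."""
    e_err = np.abs(np.asarray(e_pred) - np.asarray(e_true)) / np.asarray(n_atoms)
    e_mae = float(np.mean(e_err)) * 1000.0
    f_mae = float(np.mean(np.abs(np.asarray(f_pred) - np.asarray(f_true)))) * 1000.0
    return e_mae, f_mae
```

Normalizing the energy error by atom count keeps the metric comparable across molecules of very different sizes.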
Note: This tokenization scheme does not include charge or spin conditioning. The model predicts based on geometry alone, which works well for conformers, distance_scaling, ligand_strain, and ligand_pocket tasks. For ie_ea and spin_gap, the model relies on geometric differences between charge/spin states.