A simple-by-design Python library that works in two modes: (1) cheap, lightweight retrieval, and (2) heavy, GPU-accelerated indexing with ModernColBERT (prithivida/modern_colbert_base_en_v1, the 2nd-best ColBERT in the world) into vector DBs that offer native multi-vector support, such as Qdrant, Vespa, and more.
- Based on ModernBERT, and efficient.
- The top 2 ColBERTs in the world are ModernBERT-based.
- Supports long context (8K tokens).
| Dataset / Model | GTE-ModernColBERT (LightOn AI) | moderncolbert (ours) | ColBERT-small (Answer AI, reproduced by LightOn) | jina-colbert-v2 | ColBERTv2.0 (Stanford) |
|---|---|---|---|---|---|
| BEIR Average | 54.75 (🥇) | 54.19 (🥈) | 53.14 | 52.30 | 49.48 |
PS: Jina and Stanford did not evaluate on CQADupstack and MSMARCO, so we excluded those datasets to keep the comparison fair.
- Dual Backend Architecture: ONNX for fast retrieval, PyTorch for GPU indexing
- Native Multi-Vector Support: Optimized for Qdrant's MaxSim comparator
- Smart Installation: Lightweight retrieval or heavy indexing based on your needs
- Production Ready: Separate deployment targets for different workloads
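The MaxSim comparator mentioned above is simple to state: for each query token embedding, take its highest similarity against all document token embeddings, then sum over query tokens. A minimal NumPy sketch (illustrative only, not the library's implementation; the toy 2-dimensional embeddings are made up):

```python
import numpy as np

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction MaxSim: for each query token, take its best-matching
    document token similarity, then sum over query tokens."""
    sim = query_emb @ doc_emb.T          # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())  # best doc token per query token

# Toy unit-norm token embeddings (dim=2, purely illustrative)
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.7071, 0.7071]])
print(round(maxsim(q, d), 4))  # → 1.7071
```

Because the interaction happens at the token level, documents can be pre-encoded and stored as multi-vectors; only the cheap max-and-sum runs at query time.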
```shell
# Lightweight retrieval (ONNX + Qdrant)
pip install lateness

# Heavy indexing (PyTorch + Transformers + ONNX + Qdrant)
pip install lateness[index]
```

Default Installation (ONNX Backend):
```python
# pip install lateness
from lateness import ModernColBERT

colbert = ModernColBERT("prithivida/modern_colbert_base_en_v1",
                        max_query_len=32,
                        max_doc_len=300)

# Output:
# 🚀 Using ONNX backend (default; for GPU-accelerated indexing, install lateness[index] and set LATENESS_USE_TORCH=true)
# 🔄 Downloading model: prithivida/modern_colbert_base_en_v1
# ✅ ONNX ColBERT loaded with providers: ['CPUExecutionProvider']
# Query max length: 32, Document max length: 300
```
```python
from lateness import ModernColBERT

documents = [
    "PyTorch is an open-source machine learning framework that provides tensor computations with GPU acceleration and deep neural networks built on tape-based autograd system.",
    "Kubernetes is a container orchestration platform that automates deployment, scaling, and management of containerized applications across clusters of machines.",
    "REST APIs follow representational state transfer architectural style using HTTP methods like GET, POST, PUT, DELETE for stateless client-server communication.",
]

queries = [
    "How to build real-time data pipelines?",
    "What are the benefits of microservices?",
    "How to implement efficient web APIs?",
]

colbert = ModernColBERT("prithivida/modern_colbert_base_en_v1",
                        max_query_len=32,
                        max_doc_len=300)

query_embeddings = colbert.encode_queries(queries)
doc_embeddings = colbert.encode_documents(documents)

scores = ModernColBERT.compute_similarity(query_embeddings, doc_embeddings)
print(scores)
```

Index Installation (PyTorch Backend):
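compute_similarity returns a score matrix, presumably one row per query and one column per document, so ranking documents is a per-row sort. A short sketch (the score values below are invented for illustration):

```python
import numpy as np

# Invented scores matrix, shaped (num_queries, num_docs) like the
# compute_similarity output above (values are made up for illustration).
scores = np.array([
    [12.3,  8.1,  9.4],
    [ 7.2, 11.8,  8.0],
    [ 8.5,  9.0, 13.1],
])

best = scores.argmax(axis=1)           # top document index per query
ranking = np.argsort(-scores, axis=1)  # full descending ranking per query
print(best.tolist())                   # → [0, 1, 2]
print(ranking[0].tolist())             # → [0, 2, 1]
```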
```python
# pip install lateness[index]
import os
os.environ['LATENESS_USE_TORCH'] = 'true'

from lateness import ModernColBERT

colbert = ModernColBERT("prithivida/modern_colbert_base_en_v1")

# Output:
# 🚀 Using PyTorch backend (LATENESS_USE_TORCH=true)
# 🔄 Downloading model: prithivida/modern_colbert_base_en_v1
# Loading model from: /root/.cache/huggingface/hub/models--prithivida--modern_colbert_base_en_v1/...
# ✅ PyTorch ColBERT loaded on cuda
# Query max length: 256, Document max length: 300
```

Complete Example with Qdrant:
For a complete working example with Qdrant integration, environment setup, and testing instructions, see the examples/qdrant folder.
The examples include:
- Environment setup and testing
- Local Qdrant server management
- Complete indexing and retrieval workflows
- Both ONNX and PyTorch backend examples
Retrieval Service (Lightweight)

```shell
pip install lateness
```

- ONNX backend (fast CPU inference)
- Qdrant integration
- ~50MB total dependencies
- Perfect for user-facing search APIs
Indexing Service (Heavy)

```shell
pip install lateness[index]
```

- PyTorch backend (GPU acceleration)
- Full Transformers support
- ~2GB+ dependencies
- Perfect for batch document processing
The package uses environment variables for backend control:
- Default behavior → ONNX backend (CPU retrieval)
- `LATENESS_USE_TORCH=true` → PyTorch backend (GPU indexing)

Note: the PyTorch backend requires `pip install lateness[index]` to install the PyTorch dependencies.
Apache License 2.0
Contributions welcome! Please check our contributing guidelines.