Skip to content

A ModernColBERT lib that allows you to do 1.) Cheap and lightweight retrieval and 2.) heavy GPU accelerated indexing into vectorDBs that offers native multi-vector support like Qdrant, Vespa and more..

License

Notifications You must be signed in to change notification settings

PrithivirajDamodaran/lateness

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

lateness

A simple-by-design python lib that works in two modes: 1.) allows you to do cheap and lightweight retrieval and 2.) heavy GPU accelerated indexing using ModernColBERT - prithivida/modern_colbert_base_en_v1, (the 2nd best ColBERT in the world) into vectorDBs that offers native multi-vector support like Qdrant, Vespa and more..

Why Modern-Colbert Models ?

  • They are based on ModernBERT and efficient.
  • Top 2 ColBERTs in the world are ModernBERT based.
  • They support long context, 8K.
Dataset / Model GTE-ModernColBERT
(Lighton AI)
moderncolbert (Ours) ColBERT-small
(Answer AI, reproduced by Lighton)
jina-colbert-v2 ColBERTv2.0
Stanford
BEIR Average 54.75 (🥇) 54.19 (🥈) 53.14 52.30 49.48

PS: Jina and Stanford did not run eval on CQADupstack and MSMARCO hence we skipped to make it fair.

Features

  • Dual Backend Architecture: ONNX for fast retrieval, PyTorch for GPU indexing
  • Native Multi-Vector Support: Optimized for Qdrant's MaxSim comparator
  • Smart Installation: Lightweight retrieval or heavy indexing based on your needs
  • Production Ready: Separate deployment targets for different workloads

Quick Start

Installation

# Lightweight retrieval (ONNX + Qdrant)
pip install lateness

# Heavy indexing (PyTorch + Transformers + ONNX + Qdrant)
pip install lateness[index]

Backend Selection

Basic Usage

Default Installation (ONNX Backend):

# pip install lateness
from lateness import ModernColBERT
colbert = ModernColBERT("prithivida/modern_colbert_base_en_v1",
                        max_query_len = 32,
                        max_doc_len = 300)
# Output:
# 🚀 Using ONNX backend Using ONNX backend (default, for GPU accelerated indexing, install lateness[index] and set LATENESS_USE_TORCH=true)
# 🔄 Downloading model: prithivida/modern_colbert_base_en_v1
# ✅ ONNX ColBERT loaded with providers: ['CPUExecutionProvider']
# Query max length: 256, Document max length: 300
from lateness import ModernColBERT

documents = [
    "PyTorch is an open-source machine learning framework that provides tensor computations with GPU acceleration and deep neural networks built on tape-based autograd system.",
    "Kubernetes is a container orchestration platform that automates deployment, scaling, and management of containerized applications across clusters of machines.",
    "REST APIs follow representational state transfer architectural style using HTTP methods like GET, POST, PUT, DELETE for stateless client-server communication.",
]

queries = [
    "How to build real-time data pipelines?",
    "What are the benefits of microservices?",
    "How to implement efficient web APIs?"
]

colbert = ModernColBERT("prithivida/modern_colbert_base_en_v1",
                        max_query_len = 32,
                        max_doc_len = 300)


query_embeddings = colbert.encode_queries(queries)
doc_embeddings = colbert.encode_documents(documents)
scores = ModernColBERT.compute_similarity(query_embeddings, doc_embeddings)
print(scores)

Index Installation (PyTorch Backend):

# pip install lateness[index]
import os
os.environ['LATENESS_USE_TORCH'] = 'true'
from lateness import ModernColBERT

colbert = ModernColBERT("prithivida/modern_colbert_base_en_v1")
# Output:
# 🚀 Using PyTorch backend (LATENESS_USE_TORCH=true)
# 🔄 Downloading model: prithivida/modern_colbert_base_en_v1
# Loading model from: /root/.cache/huggingface/hub/models--prithivida--modern_colbert_base_en_v1/...
# ✅ PyTorch ColBERT loaded on cuda
# Query max length: 256, Document max length: 300

Complete Example with Qdrant:

For a complete working example with Qdrant integration, environment setup, and testing instructions, see the examples/qdrant folder.

The examples include:

  • Environment setup and testing
  • Local Qdrant server management
  • Complete indexing and retrieval workflows
  • Both ONNX and PyTorch backend examples

Architecture

Two Deployment Models

Retrieval Service (Lightweight)

pip install lateness
  • ONNX backend (fast CPU inference)
  • Qdrant integration
  • ~50MB total dependencies
  • Perfect for user-facing search APIs

Indexing Service (Heavy)

pip install lateness[index]
  • PyTorch backend (GPU acceleration)
  • Full Transformers support
  • ~2GB+ dependencies
  • Perfect for batch document processing

Backend Selection

The package uses environment variables for backend control:

  • Default behavior → ONNX backend (CPU retrieval)
  • LATENESS_USE_TORCH=true → PyTorch backend (GPU indexing)

Note: PyTorch backend requires pip install lateness[index] to install PyTorch dependencies.

License

Apache License 2.0

Contributing

Contributions welcome! Please check our contributing guidelines.

About

A ModernColBERT lib that allows you to do 1.) Cheap and lightweight retrieval and 2.) heavy GPU accelerated indexing into vectorDBs that offers native multi-vector support like Qdrant, Vespa and more..

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages