Official Repository for our paper "Hybrid-Vector Retrieval for Visually Rich Documents: Combining Single-Vector Efficiency and Multi-Vector Accuracy"
- [2025/11] ViMDoc is now available on Hugging Face 🤗!
ViMDoc (Visually-rich Long Multi-Document Retrieval Benchmark) is a benchmark for evaluating visual document retrieval in both multi-document and long-document settings.
from datasets import load_dataset
dataset = load_dataset("kaistdata/ViMDoc", split="ViMDoc")

Sample datasets are available in benchmark/{ViMDoc,OpenDocVQA,ViDoSeek,M3DocVQA}.
Each contains sample_query.json with queries and ground truth document IDs:
{
"id": "<query_id>",
"query": "<query_text>",
"doc_ids": ["<document_id>"]
}

Sample document pages are stored in sample_pages/.
Note: Full datasets for other benchmarks are available from their original sources: OpenDocVQA | ViDoSeek | M3DocVQA
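As a minimal sketch of how a sample query file could be consumed, the snippet below reads the records and checks a ranked result list against the ground-truth document IDs. It assumes sample_query.json is a JSON array of objects in the shape shown above; that top-level layout is an assumption. `load_queries` and `recall_at_k` are hypothetical helpers, not part of this repository, and the repository's own evaluation may compute its metrics differently.

```python
import json
from pathlib import Path


def load_queries(path):
    """Map query id -> (query text, set of ground-truth doc ids).

    Assumes the file is a JSON array of {"id", "query", "doc_ids"} objects.
    """
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    return {r["id"]: (r["query"], set(r["doc_ids"])) for r in records}


def recall_at_k(ranked_doc_ids, gold_doc_ids, k):
    """Fraction of ground-truth documents found in the top-k ranked results."""
    if not gold_doc_ids:
        return 0.0
    hits = gold_doc_ids.intersection(ranked_doc_ids[:k])
    return len(hits) / len(gold_doc_ids)
```

For example, `load_queries("benchmark/ViMDoc/sample_query.json")` would return the sample queries keyed by ID, provided the file follows the assumed layout.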
cd indexing/encode
# Visual encoder
python encoder.py --encoder_type dse --folder ViMDoc
python encoder.py --encoder_type colqwen25 --folder ViMDoc
# Textual encoder
python ocr.py --device 0 --folder ViMDoc
python encoder.py --encoder_type nvembedv2 --folder ViMDoc
python encoder.py --encoder_type bge_m3_multi --folder ViMDoc

Available Encoders
| Encoder | Modality | Type | HF Checkpoint |
|---|---|---|---|
| colpali | Visual | Multi-Vector | vidore/colpali-v1.3 |
| colqwen2 | Visual | Multi-Vector | vidore/colqwen2-v1.0 |
| colqwen25 | Visual | Multi-Vector | vidore/colqwen2.5-v0.2 |
| gme | Visual | Single-Vector | Alibaba-NLP/gme-Qwen2-VL-2B-Instruct |
| dse | Visual | Single-Vector | MrLight/dse-qwen2-2b-mrl-v1 |
| visret | Visual | Single-Vector | openbmb/VisRAG-Ret |
| bge_m3_multi | Textual (OCR) | Multi-Vector | BAAI/bge-m3 |
| bge_m3 | Textual (OCR) | Single-Vector | BAAI/bge-m3 |
| nvembedv2 | Textual (OCR) | Single-Vector | nvidia/NV-Embed-v2 |
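
The Type column separates encoders that emit one embedding per query or page (single-vector) from those that emit one embedding per token or patch (multi-vector). As a rough illustration of why the two are scored differently, the sketch below compares a plain dot product with ColBERT-style late-interaction (MaxSim) scoring, the scheme commonly used with ColPali-family models, and shows a generic shortlist-then-rerank pattern. It is a toy in pure Python with made-up 2-D vectors; all function names are hypothetical and this is not the repository's implementation.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def single_vector_score(query_vec, page_vec):
    """Single-vector relevance: one similarity between two embeddings."""
    return dot(query_vec, page_vec)


def maxsim_score(query_vecs, page_vecs):
    """Multi-vector relevance (late interaction): for each query vector,
    take its best match among the page vectors, then sum those maxima."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def retrieve_then_rerank(query_vec, query_vecs, pages, k):
    """Generic two-stage pattern: shortlist k pages with the cheap
    single-vector score, then reorder the shortlist with MaxSim.

    `pages` maps page id -> (single_vector, list_of_token_vectors).
    """
    shortlist = sorted(
        pages,
        key=lambda pid: single_vector_score(query_vec, pages[pid][0]),
        reverse=True,
    )[:k]
    return sorted(
        shortlist,
        key=lambda pid: maxsim_score(query_vecs, pages[pid][1]),
        reverse=True,
    )


# Toy data: made-up 2-D embeddings, for illustration only.
pages = {
    "p1": ([1.0, 0.0], [[1.0, 0.0], [0.0, 0.2]]),
    "p2": ([0.8, 0.6], [[0.9, 0.1], [0.1, 0.9]]),
    "p3": ([0.0, 1.0], [[0.0, 1.0], [0.0, 0.9]]),
}
query_vec = [1.0, 0.1]
query_vecs = [[1.0, 0.0], [0.0, 1.0]]

ranking = retrieve_then_rerank(query_vec, query_vecs, pages, k=2)
print(ranking)  # ['p2', 'p1']
```

In this toy case the single-vector pass ranks p1 first, while MaxSim promotes p2 because both query vectors find a close match on that page. The HEAVEN pipeline has additional components (for example the `--alpha` and `--filter_ratio` options) that this sketch does not model.
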
cd indexing/vs-page
# Step 1: Document Layout Analysis
python DLA.py --dataset ViMDoc --device 0
# Step 2: Assemble & VS-page Encoding
python assemble.py \
--dataset ViMDoc \
--encoder_type dse \
--reduction_factor 15 \
--device 0

Run the complete HEAVEN pipeline (Stage 1 + Stage 2):
cd retrieval/heaven
python heaven.py \
--folder ViMDoc \
--stage1_model dse \
--stage2_model colqwen25 \
--device 0 \
--preprocess

Stage 1 Only:
python stage1.py --folder ViMDoc --model dse --alpha 0.1 --filter_ratio 0.5

Stage 2 Only:
# Preprocess queries first
python preprocess.py --folder ViMDoc --model colqwen25
# Run Stage 2
python stage2.py --folder ViMDoc --model colqwen25 --stage1_model dse --k 200 --filter_ratio 0.25

HEAVEN/
│
├── benchmark/
│   ├── ViMDoc/
│   ├── OpenDocVQA/
│   ├── ViDoSeek/
│   └── M3DocVQA/
│
├── indexing/
│   ├── encode/
│   └── vs-page/
│
├── retrieval/
│   ├── baeline/
│   └── heaven/
│
└── run.sh

@article{kim2025hybrid,
title={Hybrid-Vector Retrieval for Visually Rich Documents: Combining Single-Vector Efficiency and Multi-Vector Accuracy},
author={Kim, Juyeon and Lee, Geon and Choi, Dongwon and Kim, Taeuk and Shin, Kijung},
journal={arXiv preprint arXiv:2510.22215},
year={2025}
}