Official Repository of "Hybrid-Vector Retrieval for Visually Rich Documents: Combining Single-Vector Efficiency and Multi-Vector Accuracy"

juyeonnn/HEAVEN

HEAVEN: Hybrid-Vector Retrieval for Visually Rich Documents

Official Repository for our paper "Hybrid-Vector Retrieval for Visually Rich Documents: Combining Single-Vector Efficiency and Multi-Vector Accuracy"

🔥 News

  • [2025/11] ViMDoc is now available on Hugging Face 🤗!

ViMDoc Benchmark

ViMDoc (Visually-rich Long Multi-Document Retrieval Benchmark) is a benchmark for evaluating visual document retrieval under both multi-document and long-document settings.

from datasets import load_dataset
dataset = load_dataset("kaistdata/ViMDoc", split="ViMDoc")

Format

Sample datasets are available in benchmark/{ViMDoc,OpenDocVQA,ViDoSeek,M3DocVQA}. Each folder contains a sample_query.json file with queries and their ground-truth document IDs:

{
    "id": "<query_id>",
    "query": "<query_text>",
    "doc_ids": ["<document_id>"]
}

Sample document pages are stored in sample_pages/.
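
As an illustration of the query format above, here is a minimal sketch of reading a sample_query.json file into a dictionary keyed by query ID. The helper name `load_queries` is hypothetical and not part of this repository, and the sketch assumes the file holds either a single record or a JSON list of such records.

```python
import json
from pathlib import Path


def load_queries(path):
    """Return {query_id: {"query": str, "doc_ids": [str, ...]}}.

    Assumption: the file holds one record or a JSON list of records
    with the "id", "query", and "doc_ids" fields shown in the README.
    """
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(records, dict):  # tolerate a file holding a single record
        records = [records]

    queries = {}
    for record in records:
        missing = {"id", "query", "doc_ids"} - record.keys()
        if missing:
            raise ValueError(f"record is missing fields: {sorted(missing)}")
        queries[record["id"]] = {
            "query": record["query"],
            "doc_ids": list(record["doc_ids"]),
        }
    return queries
```

For example, `load_queries("benchmark/ViMDoc/sample_query.json")` would return one entry per query if the file follows the assumed layout.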

Note: Full datasets for other benchmarks are available from their original sources: OpenDocVQA | ViDoSeek | M3DocVQA

Indexing

(1) Encoding (Query/Document)

cd indexing/encode

# Visual encoder
python encoder.py --encoder_type dse --folder ViMDoc
python encoder.py --encoder_type colqwen25 --folder ViMDoc

# Textual encoder
python ocr.py --device 0 --folder ViMDoc
python encoder.py --encoder_type nvembedv2 --folder ViMDoc
python encoder.py --encoder_type bge_m3_multi --folder ViMDoc

Available Encoders

| Encoder | Modality | Type | HF Checkpoint |
|---|---|---|---|
| colpali | Visual | Multi-Vector | vidore/colpali-v1.3 |
| colqwen2 | Visual | Multi-Vector | vidore/colqwen2-v1.0 |
| colqwen25 | Visual | Multi-Vector | vidore/colqwen2.5-v0.2 |
| gme | Visual | Single-Vector | Alibaba-NLP/gme-Qwen2-VL-2B-Instruct |
| dse | Visual | Single-Vector | MrLight/dse-qwen2-2b-mrl-v1 |
| visret | Visual | Single-Vector | openbmb/VisRAG-Ret |
| bge_m3_multi | Textual (OCR) | Multi-Vector | BAAI/bge-m3 |
| bge_m3 | Textual (OCR) | Single-Vector | BAAI/bge-m3 |
| nvembedv2 | Textual (OCR) | Single-Vector | nvidia/NV-Embed-v2 |
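
To make the Type column concrete, the sketch below contrasts how the two encoder families are commonly scored. Single-vector models compare one query embedding with one page embedding, here by cosine similarity. Multi-vector models compare every query token embedding with every page embedding and sum the best matches (ColBERT-style late interaction, often called MaxSim). This is a generic pure-Python illustration; the function names are hypothetical and the scoring actually used by encoder.py or a given checkpoint may differ.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)


def single_vector_score(query_vec, page_vec):
    """Cosine similarity between one query vector and one page vector."""
    return dot(normalize(query_vec), normalize(page_vec))


def multi_vector_score(query_vecs, page_vecs):
    """Late-interaction (MaxSim) score.

    For each query token vector, take its highest cosine similarity
    against all page vectors, then sum over the query tokens.
    """
    page_unit = [normalize(p) for p in page_vecs]
    return sum(
        max(dot(normalize(q), p) for p in page_unit)
        for q in query_vecs
    )
```

The multi-vector score costs one comparison per query-token and page-vector pair, which is why it tends to be more accurate but far more expensive than a single dot product per page.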

(2) VS-Page Construction

cd indexing/vs-page

# Step 1: Document Layout Analysis
python DLA.py --dataset ViMDoc --device 0

# Step 2: Assemble & VS-page Encoding
python assemble.py \
    --dataset ViMDoc \
    --encoder_type dse \
    --reduction_factor 15 \
    --device 0

Retrieval - HEAVEN

Run the complete HEAVEN pipeline (Stage 1 + Stage 2):

cd retrieval/heaven

python heaven.py \
    --folder ViMDoc \
    --stage1_model dse \
    --stage2_model colqwen25 \
    --device 0 \
    --preprocess

Stage 1 only:

python stage1.py --folder ViMDoc --model dse --alpha 0.1 --filter_ratio 0.5

Stage 2 only:

# Preprocess queries first
python preprocess.py --folder ViMDoc --model colqwen25

# Run Stage 2
python stage2.py --folder ViMDoc --model colqwen25 --stage1_model dse --k 200 --filter_ratio 0.25
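
The two-stage commands above follow a retrieve-then-rerank pattern: a cheap scorer narrows the full page pool to a shortlist, and a more expensive scorer reorders only that shortlist. The sketch below shows this generic pattern only. The function `two_stage_retrieve` and its arguments are hypothetical, and it does not model HEAVEN-specific components such as VS-pages or the `--alpha` and `--filter_ratio` options, whose semantics are defined by the paper and the scripts.

```python
def two_stage_retrieve(query, pages, cheap_score, precise_score, k):
    """Generic retrieve-then-rerank.

    pages:         dict mapping page_id -> page representation
    cheap_score:   fast scorer, called on every page (stage 1)
    precise_score: slow scorer, called only on the k shortlisted pages (stage 2)
    Returns the shortlisted page IDs ordered by precise_score, best first.
    """
    # Stage 1: rank the whole pool with the cheap scorer and keep the top k.
    shortlist = sorted(
        pages, key=lambda pid: cheap_score(query, pages[pid]), reverse=True
    )[:k]

    # Stage 2: rerank only the shortlist with the precise scorer.
    return sorted(
        shortlist, key=lambda pid: precise_score(query, pages[pid]), reverse=True
    )


# Toy data: each page is a (cheap_feature, precise_feature) pair.
toy_pages = {"a": (0.9, 0.1), "b": (0.8, 0.7), "c": (0.1, 0.99)}
ranking = two_stage_retrieve(
    query=None,
    pages=toy_pages,
    cheap_score=lambda q, p: p[0],
    precise_score=lambda q, p: p[1],
    k=2,
)
```

In the toy data, page "c" would score highest under the precise scorer but is dropped in stage 1, which shows the trade-off: a smaller shortlist is faster but risks filtering out relevant pages before reranking.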

Structure

HEAVEN/
│
├── benchmark/
│   ├── ViMDoc/
│   ├── OpenDocVQA/
│   ├── ViDoSeek/
│   └── M3DocVQA/
│
├── indexing/
│   ├── encode/
│   └── vs-page/
│
├── retrieval/
│   ├── baeline/
│   └── heaven/
│
└── run.sh

Citation

@article{kim2025hybrid,
  title={Hybrid-Vector Retrieval for Visually Rich Documents: Combining Single-Vector Efficiency and Multi-Vector Accuracy},
  author={Kim, Juyeon and Lee, Geon and Choi, Dongwon and Kim, Taeuk and Shin, Kijung},
  journal={arXiv preprint arXiv:2510.22215},
  year={2025}
}
