# DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories

DISBench Dataset | Paper | Leaderboard
- [Feb 12, 2026]: Our paper is now available on arXiv.
- [Feb 12, 2026]: Full codebase, DISBench dataset, and Leaderboard released.
DeepImageSearch represents a paradigm shift in image retrieval, advancing from independent image matching to corpus-level contextual reasoning over visual histories. People capture thousands of photos over the years, forming rich episodic memories in which information is distributed across temporal sequences rather than confined to single snapshots. Many real-world queries over such episodic memories cannot be resolved by evaluating each image independently: the target images can only be identified by exploring and reasoning over the entire image corpus. This corpus-level contextual reasoning makes agentic capabilities essential rather than auxiliary.
DISBench is the first benchmark designed for this task. Given a user's photo collection and a natural language query, agents must autonomously plan search trajectories, discover latent cross-image associations, and chain scattered visual evidence through multi-step exploration to return the exact set of qualifying images. The benchmark covers two reasoning patterns: Intra-Event queries that require locating a target event via contextual clues and then filtering within it, and Inter-Event queries that demand scanning across multiple events to find recurring elements under temporal or spatial constraints.
ImageSeeker is our modular agent framework that equips multimodal models with fine-grained tools and a dual-memory system for long-horizon navigation over visual histories.
| Paradigm | Visual Semantic Alignment | Identify after Thinking | Corpus Context Awareness |
|---|---|---|---|
| (a) Direct Retrieval | ✓ | ✗ | ✗ |
| (b) Reasoning-intensive Retrieval | ✓ | ✓ | ✗ |
| (c) DeepImageSearch (Ours) | ✓ | ✓ | ✓ |
DISBench is constructed via a human-model collaborative pipeline that employs vision-language models to mine latent spatiotemporal associations, followed by rigorous human verification (6.1% retention rate from 2,000 candidates). See our paper for full details.
| Statistic | Value |
|---|---|
| Total Queries | 122 |
| Intra-Event / Inter-Event | 46.7% / 53.3% |
| Total Users | 57 |
| Total Photos | 109,467 |
| Avg. Targets per Query | 3.84 |
| Avg. History Span per User | 3.4 years |
ImageSeeker equips multimodal agents with capabilities tailored for visual history exploration:
- Tools for Navigation: `ImageSearch` (multimodal retrieval), `GetMetadata`/`FilterMetadata` (spatiotemporal constraints), `ViewPhotos` (visual verification), `WebSearch` (external knowledge resolution)
- Explicit State Memory: Named photo subsets that persist across reasoning steps, enabling iterative narrowing of candidates
- Compressed Context Memory: Session memory + working memory for maintaining reasoning state under context length limits
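A minimal sketch of the explicit-state-memory idea, assuming a simple dictionary of named photo-id sets; the class and method names here are illustrative, not the actual ImageSeeker API:

```python
# Illustrative sketch of an explicit state memory: named photo subsets that
# persist across reasoning steps so the agent can iteratively narrow candidates.
# Class and method names are hypothetical, not the actual ImageSeeker API.
class StateMemory:
    def __init__(self):
        self.subsets: dict[str, set[str]] = {}

    def save(self, name: str, photo_ids: set[str]) -> None:
        """Store a named subset produced by a tool call."""
        self.subsets[name] = set(photo_ids)

    def intersect(self, name_a: str, name_b: str, out: str) -> set[str]:
        """Narrow candidates by combining two earlier results under a new name."""
        result = self.subsets[name_a] & self.subsets[name_b]
        self.subsets[out] = result
        return result


memory = StateMemory()
memory.save("beach_photos", {"p1", "p2", "p3"})
memory.save("summer_2019", {"p2", "p3", "p4"})
print(sorted(memory.intersect("beach_photos", "summer_2019", "beach_summer_2019")))
# prints ['p2', 'p3']
```

Persisting subsets by name lets later reasoning steps refer back to earlier tool results without re-running the tools.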
```bash
git clone https://github.com/RUC-NLPIR/DeepImageSearch.git
cd DeepImageSearch
pip install -r requirements.txt
```

Option A: Hugging Face (Recommended)

```bash
huggingface-cli download RUC-NLPIR/DISBench --local-dir DISBench
```

Option B: Manual Download

```bash
cd DISBench
python download_images.py --output_dir images/
```

After downloading, the dataset directory should look like:
```
DISBench/
├── queries.jsonl            # 122 annotated queries
├── metadata/
│   └── {user_id}.jsonl      # Per-user photo metadata
├── images/
│   └── {user_id}/
│       └── {photo_id}.jpg   # Photo files
└── evaluate.py              # Evaluation script
```
For detailed data format documentation, see DISBench/README.md.
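As a convenience, the JSONL files can be read with a few lines of Python. This is a generic sketch: the commented-out field names below (e.g. `user_id` inside a query record) are assumptions, so consult `DISBench/README.md` for the actual schema:

```python
import json

def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Example usage (field names are assumptions; see DISBench/README.md):
# queries = load_jsonl("DISBench/queries.jsonl")
# metadata = load_jsonl(f"DISBench/metadata/{queries[0]['user_id']}.jsonl")
```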
- Serve the backbone model. For proprietary models, configure your API key. For open-source models, serve with vLLM:

```bash
# Example: serve Qwen3-VL-32B with vLLM
vllm serve Qwen/Qwen3-VL-32B-Instruct --tensor-parallel-size 4
```

- Run the agent on DISBench:
```bash
python ImageSeeker/run_agent.py \
    --dataset_path DISBench \
    --model_name "claude-opus-4-5-20251101" \
    --api_base_url "YOUR_API_BASE_URL" \
    --api_key "YOUR_API_KEY" \
    --embedding_model "qwen3-vl-embedding-8b" \
    --max_turns 30 \
    --output_dir results/claude-opus-4.5/
```

- Run direct retrieval baseline:
```bash
python ImageSeeker/run_retriever.py \
    --dataset_path DISBench \
    --embedding_model "qwen3-vl-embedding-8b" \
    --output_dir results/retriever/qwen3-vl-8b/
```

Parameters:

- `--dataset_path`: Path to the DISBench dataset directory.
- `--model_name`: Backbone model name for the agent.
- `--api_base_url` / `--api_key`: API endpoint and key for proprietary models.
- `--embedding_model`: Vision-language embedding model for the `ImageSearch` tool.
- `--max_turns`: Maximum interaction turns per query (default: 30).
- `--output_dir`: Directory to save predictions and evaluation results.
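Before launching the agent, it can help to verify that the endpoint is reachable. A small stdlib sketch, assuming the server exposes the OpenAI-compatible `/v1/models` route (vLLM's server does); the helper name is ours, not part of this repo:

```python
import json
import urllib.request

def check_endpoint(api_base_url: str, api_key: str = "EMPTY") -> list[str]:
    """List model ids served at an OpenAI-compatible endpoint (e.g. local vLLM)."""
    req = urllib.request.Request(
        f"{api_base_url.rstrip('/')}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        payload = json.load(resp)
    return [m["id"] for m in payload["data"]]

# Example (assumes vLLM's default port):
# check_endpoint("http://localhost:8000/v1")
```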
Agent Evaluation: computes set-level Exact Match, F1, Precision, Recall, and IoU:

```bash
python DISBench/evaluate.py \
    --mode agent \
    --dataset_path DISBench \
    --prediction_path results/claude-opus-4.5/predictions.jsonl \
    --output_dir results/claude-opus-4.5/
```

This generates `eval_samples.jsonl` (per-query metrics) and `eval_summary.json` (aggregated results).
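For reference, the set-level metrics can be sketched as below. This is a simplified re-implementation for illustration; `DISBench/evaluate.py` remains authoritative:

```python
def set_metrics(pred: set[str], gold: set[str]) -> dict[str, float]:
    """Set-level Exact Match, Precision, Recall, F1, and IoU over photo-id sets."""
    tp = len(pred & gold)                      # correctly retrieved photos
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    union = len(pred | gold)
    return {
        "em": float(pred == gold),             # 1.0 only if the sets match exactly
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "iou": tp / union if union else 1.0,   # intersection over union
    }

# Example: 2 of 3 predictions correct, 1 gold photo missed
metrics = set_metrics({"p1", "p2", "p3"}, {"p2", "p3", "p4"})
```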
Retriever Baseline Evaluation: computes MAP@k, Recall@k, and NDCG@k for k ∈ {1, 3, 5, 10}:

```bash
python DISBench/evaluate.py \
    --mode retriever \
    --dataset_path DISBench \
    --prediction_path results/retriever/qwen3-vl-8b/predictions.jsonl \
    --output_dir results/retriever/qwen3-vl-8b/
```

```
DeepImageSearch/
├── README.md
├── assets/                  # Figures for README
├── DISBench/                # Benchmark dataset and evaluation
│   ├── README.md            # Data format documentation
│   ├── queries.jsonl        # 122 annotated queries
│   ├── metadata/            # Per-user photo metadata
│   ├── images/              # Photo files (download separately)
│   ├── evaluate.py          # Evaluation script (agent + retriever)
│   └── download_images.py   # Image download script
├── ImageSeeker/             # Agent framework
│   ├── README.md
│   ├── run_agent.py         # Run agent on DISBench
│   ├── run_retriever.py     # Run retrieval baseline
│   ├── tools/               # Tool implementations
│   ├── memory/              # Memory mechanisms
│   └── prompts/             # System prompts
└── requirements.txt
```
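For reference, binary-relevance versions of the retriever metrics (Recall@k, NDCG@k) can be sketched as follows; again, `DISBench/evaluate.py` is authoritative and may handle relevance grading differently:

```python
import math

def recall_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold photos that appear in the top-k ranked results."""
    return len(set(ranked[:k]) & gold) / len(gold) if gold else 0.0

def ndcg_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """Binary-relevance NDCG@k: DCG over the top-k, normalized by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2) for i, p in enumerate(ranked[:k]) if p in gold)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(gold), k)))
    return dcg / ideal if ideal else 0.0

# Example: the single gold photo is ranked second
# recall_at_k(["p1", "p2", "p3"], {"p2"}, 1) -> 0.0
# recall_at_k(["p1", "p2", "p3"], {"p2"}, 3) -> 1.0
```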
If you find this work helpful, please cite our paper:
```bibtex
@misc{deng2026deepimagesearchbenchmarkingmultimodalagents,
  title={DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories},
  author={Chenlong Deng and Mengjie Deng and Junjie Wu and Dun Zeng and Teng Wang and Qingsong Xie and Jiadeng Huang and Shengjie Ma and Changwang Zhang and Zhaoxiang Wang and Jun Wang and Yutao Zhu and Zhicheng Dou},
  year={2026},
  eprint={2602.10809},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.10809}
}
```

This project is released under the Apache 2.0 License. The DISBench dataset is constructed from YFCC100M and follows its Creative Commons licensing terms.
For any questions or feedback, please reach out to us at dengchenlong@ruc.edu.cn.



