# DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories

DISBench Dataset | Paper | Leaderboard
- [Feb 12, 2026]: Our paper is now available on arXiv.
- [Feb 12, 2026]: Full codebase, DISBench dataset, and Leaderboard released.
DeepImageSearch represents a paradigm shift in image retrieval, advancing from independent image matching to corpus-level contextual reasoning over visual histories. People capture thousands of photos over the years, forming rich episodic memories in which information is distributed across temporal sequences rather than confined to single snapshots. Many real-world queries over such episodic memories cannot be resolved by evaluating each image independently: the target images can only be identified by exploring and reasoning over the entire image corpus. This corpus-level contextual reasoning makes agentic capabilities essential rather than auxiliary.
DISBench is the first benchmark designed for this task. Given a user's photo collection and a natural language query, agents must autonomously plan search trajectories, discover latent cross-image associations, and chain scattered visual evidence through multi-step exploration to return the exact set of qualifying images. The benchmark covers two reasoning patterns: Intra-Event queries that require locating a target event via contextual clues and then filtering within it, and Inter-Event queries that demand scanning across multiple events to find recurring elements under temporal or spatial constraints.
ImageSeeker is our modular agent framework that equips multimodal models with fine-grained tools and a dual-memory system for long-horizon navigation over visual histories.
| Paradigm | Visual Semantic Alignment | Identify after Thinking | Corpus Context Awareness |
|---|---|---|---|
| (a) Direct Retrieval | ✓ | ✗ | ✗ |
| (b) Reasoning-intensive Retrieval | ✓ | ✓ | ✗ |
| (c) DeepImageSearch (Ours) | ✓ | ✓ | ✓ |
DISBench is constructed via a human-model collaborative pipeline that employs vision-language models to mine latent spatiotemporal associations, followed by rigorous human verification (6.1% retention rate from 2,000 candidates). See our paper for full details.
| Statistic | Value |
|---|---|
| Total Queries | 122 |
| Intra-Event / Inter-Event | 46.7% / 53.3% |
| Total Users | 57 |
| Total Photos | 109,467 |
| Avg. Targets per Query | 3.84 |
| Avg. History Span per User | 3.4 years |
ImageSeeker equips multimodal agents with capabilities tailored for visual history exploration:
- Tools for Navigation: `ImageSearch` (multimodal retrieval), `GetMetadata`/`FilterMetadata` (spatiotemporal constraints), `ViewPhotos` (visual verification), `WebSearch` (external knowledge resolution)
- Explicit State Memory: Named photo subsets that persist across reasoning steps, enabling iterative narrowing of candidates
- Compressed Context Memory: Session memory + working memory for maintaining reasoning state under context length limits
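A minimal sketch of the explicit-state-memory idea, assuming a simple dictionary of named photo-id sets; the class and method names here are illustrative, not the actual ImageSeeker API:

```python
# Illustrative sketch of an explicit state memory: named photo subsets that
# persist across reasoning steps so the agent can iteratively narrow candidates.
# Class and method names are hypothetical, not the actual ImageSeeker API.
class StateMemory:
    def __init__(self):
        self.subsets: dict[str, set[str]] = {}

    def save(self, name: str, photo_ids: set[str]) -> None:
        """Store a named subset produced by a tool call."""
        self.subsets[name] = set(photo_ids)

    def intersect(self, name_a: str, name_b: str, out: str) -> set[str]:
        """Narrow candidates by combining two earlier results under a new name."""
        result = self.subsets[name_a] & self.subsets[name_b]
        self.subsets[out] = result
        return result


memory = StateMemory()
memory.save("beach_photos", {"p1", "p2", "p3"})
memory.save("summer_2019", {"p2", "p3", "p4"})
print(sorted(memory.intersect("beach_photos", "summer_2019", "beach_summer_2019")))
# prints ['p2', 'p3']
```

Persisting subsets by name lets later reasoning steps refer back to earlier tool results without re-running the tools.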
```bash
git clone https://github.com/RUC-NLPIR/DeepImageSearch.git
cd DeepImageSearch
pip install -r requirements.txt
```

Option A: Hugging Face (Recommended)

```bash
huggingface-cli download RUC-NLPIR/DISBench --local-dir DISBench
```

Option B: Manual Download

```bash
cd DISBench
python download_images.py --output_dir images/
```

After downloading, the dataset directory should look like:
```
DISBench/
├── queries.jsonl            # 122 annotated queries
├── metadata/
│   └── {user_id}.jsonl      # Per-user photo metadata
├── images/
│   └── {user_id}/
│       └── {photo_id}.jpg   # Photo files
└── evaluate.py              # Evaluation script
```
For detailed data format documentation, see DISBench/README.md.
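As a convenience, the JSONL files can be read with a few lines of Python. This is a generic sketch: the commented-out field names below (e.g. `user_id` inside a query record) are assumptions, so consult `DISBench/README.md` for the actual schema:

```python
import json

def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Example usage (field names are assumptions; see DISBench/README.md):
# queries = load_jsonl("DISBench/queries.jsonl")
# metadata = load_jsonl(f"DISBench/metadata/{queries[0]['user_id']}.jsonl")
```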
- Serve the backbone model. For proprietary models, configure your API key. For open-source models, serve with vLLM:

```bash
# Example: serve Qwen3-VL-32B with vLLM
vllm serve Qwen/Qwen3-VL-32B-Instruct --tensor-parallel-size 4
```

- Run the agent on DISBench:
```bash
python ImageSeeker/run_agent.py \
    --dataset_path DISBench \
    --model_name "claude-opus-4-5-20251101" \
    --api_base_url "YOUR_API_BASE_URL" \
    --api_key "YOUR_API_KEY" \
    --embedding_model "qwen3-vl-embedding-8b" \
    --max_turns 30 \
    --output_dir results/claude-opus-4.5/
```

- Run direct retrieval baseline:
```bash
python ImageSeeker/run_retriever.py \
    --dataset_path DISBench \
    --embedding_model "qwen3-vl-embedding-8b" \
    --output_dir results/retriever/qwen3-vl-8b/
```

Parameters:

- `--dataset_path`: Path to the DISBench dataset directory.
- `--model_name`: Backbone model name for the agent.
- `--api_base_url` / `--api_key`: API endpoint and key for proprietary models.
- `--embedding_model`: Vision-language embedding model for the `ImageSearch` tool.
- `--max_turns`: Maximum interaction turns per query (default: 30).
- `--output_dir`: Directory to save predictions and evaluation results.
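Before launching the agent, it can help to verify that the endpoint is reachable. A small stdlib sketch, assuming the server exposes the OpenAI-compatible `/v1/models` route (vLLM's server does); the helper name is ours, not part of this repo:

```python
import json
import urllib.request

def check_endpoint(api_base_url: str, api_key: str = "EMPTY") -> list[str]:
    """List model ids served at an OpenAI-compatible endpoint (e.g. local vLLM)."""
    req = urllib.request.Request(
        f"{api_base_url.rstrip('/')}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        payload = json.load(resp)
    return [m["id"] for m in payload["data"]]

# Example (assumes vLLM's default port):
# check_endpoint("http://localhost:8000/v1")
```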
Agent Evaluation: computes set-level Exact Match, F1, Precision, Recall, and IoU:

```bash
python DISBench/evaluate.py \
    --mode agent \
    --dataset_path DISBench \
    --prediction_path results/claude-opus-4.5/predictions.jsonl \
    --output_dir results/claude-opus-4.5/
```

This generates `eval_samples.jsonl` (per-query metrics) and `eval_summary.json` (aggregated results).
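For reference, the set-level metrics can be sketched as below. This is a simplified re-implementation for illustration; `DISBench/evaluate.py` remains authoritative:

```python
def set_metrics(pred: set[str], gold: set[str]) -> dict[str, float]:
    """Set-level Exact Match, Precision, Recall, F1, and IoU over photo-id sets."""
    tp = len(pred & gold)                      # correctly retrieved photos
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    union = len(pred | gold)
    return {
        "em": float(pred == gold),             # 1.0 only if the sets match exactly
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "iou": tp / union if union else 1.0,   # intersection over union
    }

# Example: 2 of 3 predictions correct, 1 gold photo missed
metrics = set_metrics({"p1", "p2", "p3"}, {"p2", "p3", "p4"})
```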
Retriever Baseline Evaluation: computes MAP@k, Recall@k, and NDCG@k for k ∈ {1, 3, 5, 10}:

```bash
python DISBench/evaluate.py \
    --mode retriever \
    --dataset_path DISBench \
    --prediction_path results/retriever/qwen3-vl-8b/predictions.jsonl \
    --output_dir results/retriever/qwen3-vl-8b/
```

```
DeepImageSearch/
├── README.md
├── assets/                  # Figures for README
├── DISBench/                # Benchmark dataset and evaluation
│   ├── README.md            # Data format documentation
│   ├── queries.jsonl        # 122 annotated queries
│   ├── metadata/            # Per-user photo metadata
│   ├── images/              # Photo files (download separately)
│   ├── evaluate.py          # Evaluation script (agent + retriever)
│   └── download_images.py   # Image download script
├── ImageSeeker/             # Agent framework
│   ├── README.md
│   ├── run_agent.py         # Run agent on DISBench
│   ├── run_retriever.py     # Run retrieval baseline
│   ├── tools/               # Tool implementations
│   ├── memory/              # Memory mechanisms
│   └── prompts/             # System prompts
└── requirements.txt
```
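For reference, binary-relevance versions of the retriever metrics (Recall@k, NDCG@k) can be sketched as follows; again, `DISBench/evaluate.py` is authoritative and may handle relevance grading differently:

```python
import math

def recall_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold photos that appear in the top-k ranked results."""
    return len(set(ranked[:k]) & gold) / len(gold) if gold else 0.0

def ndcg_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """Binary-relevance NDCG@k: DCG over the top-k, normalized by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2) for i, p in enumerate(ranked[:k]) if p in gold)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(gold), k)))
    return dcg / ideal if ideal else 0.0

# Example: the single gold photo is ranked second
# recall_at_k(["p1", "p2", "p3"], {"p2"}, 1) -> 0.0
# recall_at_k(["p1", "p2", "p3"], {"p2"}, 3) -> 1.0
```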
If you find this work helpful, please cite our paper:
```bibtex
@misc{deng2026deepimagesearchbenchmarkingmultimodalagents,
  title={DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories},
  author={Chenlong Deng and Mengjie Deng and Junjie Wu and Dun Zeng and Teng Wang and Qingsong Xie and Jiadeng Huang and Shengjie Ma and Changwang Zhang and Zhaoxiang Wang and Jun Wang and Yutao Zhu and Zhicheng Dou},
  year={2026},
  eprint={2602.10809},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.10809}
}
```

This project is released under the Apache 2.0 License. The DISBench dataset is constructed from YFCC100M and follows its Creative Commons licensing terms.
For any questions or feedback, please reach out to us at dengchenlong@ruc.edu.cn.



