Wenzheng Zeng, Difei Gao, Mike Zheng Shou, Hwee Tou Ng

National University of Singapore
This repository contains the official implementation of the ICCV 2025 paper "Factorized Learning for Temporally Grounded Video-Language Models".
- Model: We propose a new framework $D^2\mathrm{VLM}$, which decomposes the generation objective into a "grounding then answering with evidence referencing" paradigm and introduces evidence tokens to emphasize explicit event-level visual semantic capture.
- Training Algorithm: We introduce Factorized Preference Optimization (FPO), which explicitly addresses both temporal grounding and textual response (see the sketch after this list). A factorized data synthesis approach is also designed to support FPO.
- Performance: Our method consistently outperforms state-of-the-art methods across various tasks.
- Open Source: The camera-ready paper and the source code will be released soon.
If you find our work useful in your research, please consider citing our paper:
```bibtex
@inproceedings{d2vlm,
  title={Factorized Learning for Temporally Grounded Video-Language Models},
  author={Zeng, Wenzheng and Gao, Difei and Shou, Mike Zheng and Ng, Hwee Tou},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2025}
}
```