VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏

💡 Some other multimodal-LLM projects from our team may interest you ✨.

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Zesen Cheng*, Sicong Leng*, Hang Zhang*, Yifei Xin*, Xin Li*, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, Jianke Zhu, Lidong Bing

VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
Sicong Leng*, Hang Zhang*, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, Lidong Bing

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
Sicong Leng*, Yun Xing*, Zesen Cheng*, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing

Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss
Zesen Cheng*, Hang Zhang*, Kehan Li*, Sicong Leng, Zhiqiang Hu, Fei Wu, Deli Zhao, Xin Li, Lidong Bing

📰 News

  • [2025.02.07] 🔥🔥 Released our re-captioned, high-quality image-text dataset VL3-Syn7M.
  • [2025.01.26] 🔥🔥 As of Jan 26, VideoLLaMA3-7B is the best 7B-sized model on the LVBench leaderboard.
  • [2025.01.24] 🔥🔥 As of Jan 24, VideoLLaMA3-7B is the best 7B-sized model on the VideoMME leaderboard.
  • [2025.01.22] 👋👋 Released the technical report of VideoLLaMA 3. If you have work closely related to VideoLLaMA 3 that is not mentioned in the paper, feel free to let us know.
  • [2025.01.21] Released the models and inference code of VideoLLaMA 3.

🌟 Introduction

VideoLLaMA 3 is a series of multimodal foundation models with frontier image and video understanding capabilities.

💡 Click here to show detailed performance on video benchmarks
💡 Click here to show detailed performance on image benchmarks

🛠️ Requirements and Installation

Basic Dependencies:

  • Python >= 3.10
  • PyTorch >= 2.4.0
  • CUDA Version >= 11.8
  • transformers >= 4.46.3

Install required packages:

[Inference-only]

pip install torch==2.4.0 torchvision==0.19.0 --extra-index-url https://download.pytorch.org/whl/cu118

pip install flash-attn --no-build-isolation
pip install transformers==4.46.3 accelerate==1.0.1
pip install decord ffmpeg-python imageio opencv-python

[Training]

git clone https://github.com/DAMO-NLP-SG/VideoLLaMA3
cd VideoLLaMA3
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

🌎 Model Zoo

| Model | Base Model | HF Link |
|:------|:-----------|:--------|
| VideoLLaMA3-7B | Qwen2.5-7B | DAMO-NLP-SG/VideoLLaMA3-7B |
| VideoLLaMA3-2B | Qwen2.5-1.5B | DAMO-NLP-SG/VideoLLaMA3-2B |
| VideoLLaMA3-7B-Image | Qwen2.5-7B | DAMO-NLP-SG/VideoLLaMA3-7B-Image |
| VideoLLaMA3-2B-Image | Qwen2.5-1.5B | DAMO-NLP-SG/VideoLLaMA3-2B-Image |

We also release the fine-tuned vision encoder of VideoLLaMA3-7B for wider applications:

| Model | Base Model | HF Link |
|:------|:-----------|:--------|
| VideoLLaMA3-7B Vision Encoder | siglip-so400m-patch14-384 | DAMO-NLP-SG/VL3-SigLIP-NaViT |
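
For example, the released encoder can be loaded on its own to extract visual features. Below is a minimal sketch; the exact entry points and preprocessing interface are assumptions based on standard Hugging Face trust_remote_code usage, so please check the model card for the official loading code.

import torch
from transformers import AutoModel, AutoImageProcessor

# Hypothetical standalone loading of the released vision encoder.
# The AutoModel/AutoImageProcessor entry points are assumed; consult the
# model card of DAMO-NLP-SG/VL3-SigLIP-NaViT for the exact interface.
encoder_path = "DAMO-NLP-SG/VL3-SigLIP-NaViT"
encoder = AutoModel.from_pretrained(
    encoder_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
image_processor = AutoImageProcessor.from_pretrained(encoder_path, trust_remote_code=True)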

🤖 Inference

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda:0"
model_path = "DAMO-NLP-SG/VideoLLaMA3-7B"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map={"": device},
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

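# Build a multimodal conversation; the video below is sampled at 1 fps with at most 180 frames.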
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "video", "video": {"video_path": "./assets/cat_and_chicken.mp4", "fps": 1, "max_frames": 180}},
            {"type": "text", "text": "What is the cat doing?"},
        ]
    },
]

inputs = processor(
    conversation=conversation,
    add_system_prompt=True,
    add_generation_prompt=True,
    return_tensors="pt"
)
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
output_ids = model.generate(**inputs, max_new_tokens=1024)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(response)
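
Image inputs follow the same pattern. The sketch below reuses the model and processor loaded above and assumes the processor accepts an image entry analogous to the video entry (the image path is a placeholder; see the repository examples for the exact schema):

# Hypothetical single-image conversation; replace the path with a real image.
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": {"image_path": "./assets/example_image.jpg"}},
            {"type": "text", "text": "Describe this image in detail."},
        ]
    },
]

inputs = processor(
    conversation=conversation,
    add_system_prompt=True,
    add_generation_prompt=True,
    return_tensors="pt"
)
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
output_ids = model.generate(**inputs, max_new_tokens=1024)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())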

For more cases, please refer to examples.

CookBook

Check out the inference notebooks that demonstrate how to use VideoLLaMA 3 in various applications, such as single-image understanding, multi-image understanding, visual referring and grounding, and video understanding.

| Notebooks | Description |
|:----------|:------------|
| Image Understanding | Demonstrations of using VideoLLaMA 3 for general image understanding, chart analysis, table understanding, document recognition, and visual code analysis |
| Multi-image Understanding | Demonstrations of using VideoLLaMA 3 for multi-image comparison and understanding |
| Fine-grained Image Recognition & Understanding | Demonstrations of using VideoLLaMA 3 for visual referring & grounding |
| Video Understanding | Demonstrations of using VideoLLaMA 3 for general video understanding, long video understanding, and temporal grounding |

🤗 Demo

We highly recommend trying our online demo first.

Alternatively, you can launch a Gradio app locally:

python inference/launch_gradio_demo.py --model-path DAMO-NLP-SG/VideoLLaMA3-7B

options:
  --model-path MODEL_PATH, --model_path MODEL_PATH
  	Path or Hugging Face model ID of the model to load.
  --server-port SERVER_PORT, --server_port SERVER_PORT
  	Optional. Port of the model server.
  --interface-port INTERFACE_PORT, --interface_port INTERFACE_PORT
  	Optional. Port of the gradio interface.
  --nproc NPROC
  	Optional. Number of model processes.

🗝️ Training

Step 1: Prepare training data

To use our training code, please organize your image and video data under data_root in any layout you like, and then use one or more annotation files to record each conversation and the corresponding image/video paths. For example:

data_root
├── LLaVA-Video-178K
│   ├── video_1.mp4
│   └── ...
├── LLaVA-OneVision-Data
│   ├── image_1.jpg
│   └── ...
├── annotations_video.jsonl
├── annotations_image.jsonl
└── ...

Each annotation file consists of a list of dictionaries, where each item follows this format:

[
    {
        "image": ["images/xxx.jpg"],
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nWhat are the colors of the bus in the image?"
            },
            {
                "from": "gpt",
                "value": "The bus in the image is white and red."
            },
            ...
        ]
    },
    {
        "video": ["videos/xxx.mp4"],
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nWhat are the main activities that take place in the video?"
            },
            {
                "from": "gpt",
                "value": "The main activities that take place in the video are the preparation of camera equipment by a man, a group of men riding a helicopter, and a man sailing a boat through the water."
            },
            ...
        ]
    },
    ...
]

For loading and memory efficiency, we recommend using .jsonl files in the Hugging Face datasets format.
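
As a reference, an annotation list like the one shown above can be written as one JSON object per line; the entries, file name, and folder below are placeholders matching the earlier example layout:

import json

# Hypothetical annotation entries following the format described above.
annotations = [
    {
        "image": ["images/xxx.jpg"],
        "conversations": [
            {"from": "human", "value": "<image>\nWhat are the colors of the bus in the image?"},
            {"from": "gpt", "value": "The bus in the image is white and red."},
        ],
    },
]

# Write one JSON object per line (.jsonl); such files can be loaded with the
# Hugging Face datasets library, e.g. load_dataset("json", data_files=...).
with open("data_root/annotations_image.jsonl", "w", encoding="utf-8") as f:
    for item in annotations:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")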

Step 2: (Optional) Convert HF checkpoint

If you want to fine-tune VideoLLaMA3 on your own data using this codebase, please first convert the checkpoints from the Hugging Face format to the local format. For example:

python scripts/convert_hf_checkpoint.py --model_path DAMO-NLP-SG/VideoLLaMA3-7B --save_path weights/videollama3_7b_local

Step 3: Prepare training script

We provide templates in scripts/train for all stages. You can modify the variables in these templates to fit your data and model settings. For example:

  --data_folder ./datasets \
  --data_path ./datasets/annotations_video.jsonl ./datasets/annotations_image.jsonl \
  --model_path Qwen/Qwen2.5-1.5B-Instruct \
  --vision_encoder DAMO-NLP-SG/SigLIP-NaViT \

For fine-tuning, --model_path is the path to the converted checkpoint described in Step 2.

Step 4: Start training

Now you can start training with your training scripts:

# VideoLLaMA3 Stage 1
bash scripts/train/stage1_2b.sh
# VideoLLaMA3 Stage 2
bash scripts/train/stage2_2b.sh

Some tips for avoiding CUDA OOM errors:

  • Please try the latest main branch, where memory consumption was optimized in this commit.
  • Try DeepSpeed ZeRO-2/3 by passing --deepspeed scripts/zero2.json / zero3.json.
  • Reduce the maximum number of visual tokens (high-resolution images and videos will be automatically downsampled to fit this length) and the maximum sequence length (longer sequences will be truncated) by setting --mm_max_length and --model_max_length, respectively.
  • Reduce the local batch size, i.e., LOCAL_BATCH_SIZE in the training script. You can adjust the above hyperparameters according to the available GPU memory and number of GPUs to make training fit your hardware.

✅ Evaluation

Step 1: Prepare evaluation data

First, please download the corresponding data according to the official instructions and organize it into the following format:

Click here to view the dataset directory organization
benchmarks
└── video
│   ├── activitynet_qa
│   │   ├── all_test
│   │   ├── test_a.json
│   │   └── test_q.json
│   ├── charades
│   │   ├── Charades_v1
│   │   └── charades_annotations_test-random_prompt.json
│   ├── egoschema
│   │   ├── good_clips_git
│   │   └── questions.json
│   ├── longvideobench
│   │   ├── lvb_val.json
│   │   ├── subtitles
│   │   └── videos
│   ├── lvbench
│   │   ├── video
│   │   └── video_info.meta.jsonl
│   ├── mlvu
│   │   ├── json
│   │   └── video
│   ├── mvbench
│   │   ├── json
│   │   └── video
│   ├── nextqa
│   │   ├── map_vid_vidorID.json
│   │   ├── NExTVideo
│   │   └── test.csv
│   ├── perception_test
│   │   ├── mc_question_test.json
│   │   └── videos
│   ├── tempcompass
│   │   ├── captioning
│   │   ├── caption_matching
│   │   ├── multi-choice
│   │   ├── videos
│   │   └── yes_no
│   ├── videomme
│   │   ├── subtitles
│   │   ├── test-00000-of-00001.parquet
│   │   └── videos

Step 2: Start evaluation

bash scripts/eval/eval_video.sh ${MODEL_PATH} ${BENCHMARKS} ${NUM_NODES} ${NUM_GPUS}

You can change the directories for benchmarks and outputs via DATA_ROOT and SAVE_DIR in the evaluation script. Please check the scripts for more detailed usage.

Step 3: Add new benchmark

Coming soon...

📑 Citation

If you find VideoLLaMA useful for your research and applications, please cite using this BibTeX:

@article{damonlpsg2025videollama3,
  title={VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding},
  author={Zhang, Boqiang and Li, Kehan and Cheng, Zesen and Hu, Zhiqiang and Yuan, Yuqian and Chen, Guanzheng and Leng, Sicong and Jiang, Yuming and Zhang, Hang and Li, Xin and Jin, Peng and Zhang, Wenqi and Wang, Fan and Bing, Lidong and Zhao, Deli},
  journal={arXiv preprint arXiv:2501.13106},
  year={2025},
  url = {https://arxiv.org/abs/2501.13106}
}

@article{damonlpsg2024videollama2,
  title={VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author={Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal={arXiv preprint arXiv:2406.07476},
  year={2024},
  url = {https://arxiv.org/abs/2406.07476}
}

@article{damonlpsg2023videollama,
  title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author = {Zhang, Hang and Li, Xin and Bing, Lidong},
  journal = {arXiv preprint arXiv:2306.02858},
  year = {2023},
  url = {https://arxiv.org/abs/2306.02858}
}

👍 Acknowledgement

Our VideoLLaMA3 is built on top of SigLIP and Qwen2.5. We also learned a lot from the implementations of LLaVA-OneVision, InternVL2, and Qwen2VL. In addition, VideoLLaMA3 benefits from many open-source efforts. We sincerely appreciate these efforts and compile a list in ACKNOWLEDGEMENT.md to express our gratitude. If your work is used in VideoLLaMA3 but not mentioned in either this repo or the technical report, feel free to let us know ❤️.

🔒 License

This project is released under the Apache 2.0 license as found in the LICENSE file. The service is a research preview intended for non-commercial use ONLY, subject to the model Licenses of Qwen, Terms of Use of the data generated by OpenAI and Gemini, and Privacy Practices of ShareGPT. Please get in touch with us if you find any potential violations.
