
[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs

This is the official implementation of VTC-CLS, a state-of-the-art training-free method for visual token compression in Multimodal Large Language Models (MLLMs).

(Figure: visualization of VTC-CLS)

VTC-CLS is simple and can serve as a plug-and-play method to accelerate MLLM inference without any training, making it highly practical.
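For intuition, the core idea can be sketched as a top-k selection over image patches driven by the [CLS] token's attention in the vision encoder. The PyTorch snippet below is only an illustrative sketch with made-up names (prune_visual_tokens, cls_attn); it is not the repository's actual API.

import torch

def prune_visual_tokens(visual_tokens: torch.Tensor,
                        cls_attn: torch.Tensor,
                        num_keep: int) -> torch.Tensor:
    """Keep only the visual tokens the [CLS] token attends to most (illustrative sketch).

    visual_tokens: (B, N, D) patch embeddings passed to the LLM (excluding [CLS]).
    cls_attn:      (B, N) attention weights from the [CLS] query to the N patches,
                   e.g. averaged over heads of a late vision-encoder layer (one possible choice).
    num_keep:      number of visual tokens to keep.
    """
    # Indices of the num_keep patches with the highest [CLS] attention.
    keep_idx = cls_attn.topk(num_keep, dim=1).indices            # (B, num_keep)
    keep_idx, _ = keep_idx.sort(dim=1)                            # preserve spatial order
    # Gather the selected tokens; the rest are simply dropped, so no retraining is needed.
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1))
    return visual_tokens.gather(1, keep_idx)                      # (B, num_keep, D)

# Toy usage with random tensors (LLaVA-1.5's CLIP ViT-L/14-336 encoder yields 576 patch tokens).
tokens = torch.randn(1, 576, 1024)
attn = torch.rand(1, 576)
print(prune_visual_tokens(tokens, attn, 144).shape)               # torch.Size([1, 144, 1024])

In this view, the number of kept tokens corresponds to the token_num argument passed to the evaluation scripts below, which in turn determines the compression ratio.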

News

  • [2024.12.10] We open-sourced our code!

Environmental Setup

conda create -n VTC-CLS python=3.10
conda activate VTC-CLS
pip install -r requirements.txt

Performance

We evaluate VTC-CLS on various models with different compression ratios and report LLaVA results here. Compared with existing training-free methods, including FastV and LLaVA-PruMerge, our method achieves state-of-the-art performance.

Efficiency

We measure evaluation time to show that our method effectively speeds up MLLM inference. We report the inference time of LLaVA-v1.5-7B on several test datasets before and after applying VTC-CLS.

Evaluation

You can simply run the scripts under ./scripts/v1_5/eval. You should specify the start layer and the number of tokens to keep on the command line (except for the reproduce scripts).

GQA

  1. Download the data and evaluation scripts following the official instructions and put them under ../data/gqa/data. You may need to modify eval.py due to the missing assets in the GQA v1.2 release.
  2. Run single-GPU or multi-GPU inference and evaluation:
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/gqa.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/gqa.sh

ScienceQA

  1. Under ../data/scienceqa, download images, pid_splits.json, problems.json from the data/scienceqa folder of the ScienceQA repo.
  2. Run single-GPU or multi-GPU inference and evaluation:
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/sqa.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/sqa.sh

TextVQA

  1. Download TextVQA_0.5.1_val.json and the images, and extract them to ../data/textvqa.
  2. Run single-GPU or multi-GPU inference and evaluation:
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/textvqa.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/textvqa.sh

POPE

  1. Download coco from POPE and put it under ../data.
  2. Run single-GPU or multi-GPU inference and evaluation:
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/pope.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/pope.sh

MMBench

  1. Download mmbench_dev_20230712.tsv and put it under ../data/mmbench.
  2. Run single-GPU or multi-GPU inference and evaluation:
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/mmbench.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/mmbench.sh
  3. Submit the results to the evaluation server: ../data/eval/mmbench/answers_upload/mmbench_dev_20230712.

MMBench-CN

  1. Download mmbench_dev_cn_20231003.tsv and put it under ../data/mmbench.
  2. Run single-GPU or multi-GPU inference and evaluation:
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/mmbench_cn.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/mmbench_cn.sh
  3. Submit the results to the evaluation server: ../data/eval/mmbench/answers_upload/mmbench_dev_cn_20231003.

SEED-Bench

  1. Follow the official instructions to download the images and the videos, and put the images under ../data/seed_bench/SEED-Bench-image. Note that we only use the image subset for evaluation.
  2. Run single-GPU or multi-GPU inference and evaluation:
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/seed.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/seed.sh

MM-Vet

  1. Extract mm-vet.zip to ../data/mmvet.
  2. Run single-GPU or multi-GPU inference and evaluation:
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/mmvet.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/mmvet.sh
  3. Evaluate the predictions in ../data/eval/mmvet/results using the official Jupyter notebook.

Acknowledgement

Our codebase is partly built with LLaVolta and LLaVA-PruMerge.

Thanks for the great implementations!

Citation

If our code or models help your work, please cite our paper:

@article{wang2024cls,
  title={[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs},
  author={Wang, Ao and Sun, Fengyuan and Chen, Hui and Lin, Zijia and Han, Jungong and Ding, Guiguang},
  journal={arXiv preprint arXiv:2412.05819},
  year={2024}
}
