This is the official implementation of VTC-CLS, a state-of-the-art method for training-free visual token compression in Multimodal Large Language Models (MLLMs). VTC-CLS is simple and serves as a plug-and-play method to accelerate MLLM inference in a training-free manner, making it highly practical.
- [2024.12.10] We open-sourced our code!
conda create -n VTC-CLS python=3.10
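# Activate the new environment before installing the requirements (standard conda step, added here for completeness)
conda activate VTC-CLS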
pip install -r requirements.txt
- Download LLaVA-1.5-7B and put it at `../models/`.
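If you do not already have the checkpoint locally, one possible way to fetch it is with the Hugging Face CLI. This is only a sketch: it assumes the `huggingface_hub` CLI is installed, that the public `liuhaotian/llava-v1.5-7b` checkpoint is the one you want, and that the target folder name below is just an example.

huggingface-cli download liuhaotian/llava-v1.5-7b --local-dir ../models/llava-v1.5-7b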
We tested VTC-CLS on various models with different compression ratios; the LLaVA results are shown here. Compared with existing methods such as FastV and LLaVA-PruMerge, our method achieves state-of-the-art performance among training-free approaches.
We also measure evaluation time to show that our method effectively speeds up MLLM inference. We report the inference time of LLaVA-v1.5-7B on several test datasets before and after applying VTC-CLS.
You can simply run the scripts under `./scripts/v1_5/eval`. Specify the start layer and the number of visual tokens to keep on the command line (except for the `reproduce` scripts, which are run without arguments).
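For example, a hypothetical GQA run that starts pruning at layer 2 and keeps 64 visual tokens (both values are illustrative, not recommended settings) would look like this; prefixing the command with the shell's `time` builtin is a simple way to record the wall-clock runtime if you want to compare against a baseline run.

layer=2
token_num=64
time bash scripts/v1_5/eval/VTC-CLS/gqa.sh $layer $token_num  # `time` is optional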
- Download the data and evaluation scripts following the official instructions and put them under `../data/gqa/data`. You may need to modify `eval.py` because of missing assets in the GQA v1.2 release.
- Run single-GPU or multi-GPU inference and evaluation (a multi-GPU example is sketched after the commands below).
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/gqa.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/gqa.sh
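Multi-GPU runs follow the usual LLaVA evaluation convention of chunked inference across the visible devices. Restricting a run to specific GPUs can be done with `CUDA_VISIBLE_DEVICES`; this is an assumption about how the underlying scripts pick devices, so adjust it if your setup differs.

CUDA_VISIBLE_DEVICES=0,1 bash scripts/v1_5/eval/VTC-CLS/gqa.sh $layer $token_num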
- Under `../data/scienceqa`, download `images`, `pid_splits.json`, and `problems.json` from the `data/scienceqa` folder of the ScienceQA repo.
- Run single-GPU or multi-GPU inference and evaluation.
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/sqa.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/sqa.sh
- Download `TextVQA_0.5.1_val.json` and the images, and extract them to `../data/textvqa`.
- Run single-GPU or multi-GPU inference and evaluation.
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/textvqa.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/textvqa.sh
- Download `coco` from POPE and put it under `../data`.
- Run single-GPU or multi-GPU inference and evaluation.
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/pope.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/pope.sh
- Download `mmbench_dev_20230712.tsv` and put it under `../data/mmbench`.
- Run single-GPU or multi-GPU inference and evaluation.
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/mmbench.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/mmbench.sh
- Submit the results to the evaluation server: `../data/eval/mmbench/answers_upload/mmbench_dev_20230712`.
- Download `mmbench_dev_cn_20231003.tsv` and put it under `../data/mmbench`.
- Run single-GPU or multi-GPU inference and evaluation.
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/mmbench_cn.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/mmbench_cn.sh
- Submit the results to the evaluation server: `../data/eval/mmbench/answers_upload/mmbench_dev_cn_20231003`.
- Follow the official instructions to download the images and videos, and put the images under `../data/seed_bench/SEED-Bench-image`. Note that we only evaluate on the image subset.
- Run single-GPU or multi-GPU inference and evaluation.
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/seed.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/seed.sh
- Extract `mm-vet.zip` to `../data/mmvet`.
- Run single-GPU or multi-GPU inference and evaluation.
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/mmvet.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/mmvet.sh
- Evaluate the predictions in `../data/eval/mmvet/results` using the official Jupyter notebook.
Our codebase is partly built upon LLaVolta and LLaVA-PruMerge. Thanks for their great implementations!
If our code or models help your work, please cite our paper:
@article{wang2024cls,
title={[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs},
author={Wang, Ao and Sun, Fengyuan and Chen, Hui and Lin, Zijia and Han, Jungong and Ding, Guiguang},
journal={arXiv preprint arXiv:2412.05819},
year={2024}
}