This is the official implementation of VTC-CLS, a state-of-the-art method for training-free visual token compression in Multimodal Large Language Models (MLLMs). VTC-CLS is simple and serves as a plug-and-play method to accelerate MLLM inference in a training-free manner, making it highly practical.
- [2024.12.10] We open-sourced our code!
conda create -n VTC-CLS python=3.10
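# Activate the new environment before installing the requirements (standard conda step, added here for completeness)
conda activate VTC-CLS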
pip install -r requirements.txt
- Download LLaVA-1.5-7B and put it at `../models/`.
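If you do not already have the checkpoint locally, one possible way to fetch it is with the Hugging Face CLI. This is only a sketch: it assumes the `huggingface_hub` CLI is installed, that the public `liuhaotian/llava-v1.5-7b` checkpoint is the one you want, and that the target folder name below is just an example.

huggingface-cli download liuhaotian/llava-v1.5-7b --local-dir ../models/llava-v1.5-7b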
We tested VTC-CLS on various models with different compression ratios; the LLaVA results are shown here. Compared with existing methods such as FastV and LLaVA-PruMerge, our method achieves state-of-the-art performance among training-free approaches.
We also measure evaluation time to show that our method effectively speeds up MLLM inference. We report the inference time of LLaVA-v1.5-7B on several test datasets before and after applying VTC-CLS.
You can simply run the scripts under `./scripts/v1_5/eval`. Specify the start layer and the number of visual tokens to keep on the command line (except for the `reproduce` scripts, which are run without arguments).
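For example, a hypothetical GQA run that starts pruning at layer 2 and keeps 64 visual tokens (both values are illustrative, not recommended settings) would look like this; prefixing the command with the shell's `time` builtin is a simple way to record the wall-clock runtime if you want to compare against a baseline run.

layer=2
token_num=64
time bash scripts/v1_5/eval/VTC-CLS/gqa.sh $layer $token_num  # `time` is optional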
- Download the data and evaluation scripts following the official instructions and put them under `../data/gqa/data`. You may need to modify `eval.py` because of missing assets in the GQA v1.2 release.
- Run single-GPU or multi-GPU inference and evaluation (a multi-GPU example is sketched after the commands below).
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/gqa.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/gqa.sh
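Multi-GPU runs follow the usual LLaVA evaluation convention of chunked inference across the visible devices. Restricting a run to specific GPUs can be done with `CUDA_VISIBLE_DEVICES`; this is an assumption about how the underlying scripts pick devices, so adjust it if your setup differs.

CUDA_VISIBLE_DEVICES=0,1 bash scripts/v1_5/eval/VTC-CLS/gqa.sh $layer $token_num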
- Under `../data/scienceqa`, download `images`, `pid_splits.json`, and `problems.json` from the `data/scienceqa` folder of the ScienceQA repo.
- Run single-GPU or multi-GPU inference and evaluation.
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/sqa.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/sqa.sh
- Download `TextVQA_0.5.1_val.json` and the images, and extract them to `../data/textvqa`.
- Run single-GPU or multi-GPU inference and evaluation.
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/textvqa.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/textvqa.sh
- Download `coco` from POPE and put it under `../data`.
- Run single-GPU or multi-GPU inference and evaluation.
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/pope.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/pope.sh
- Download `mmbench_dev_20230712.tsv` and put it under `../data/mmbench`.
- Run single-GPU or multi-GPU inference and evaluation.
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/mmbench.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/mmbench.sh
- Submit the results to the evaluation server: `../data/eval/mmbench/answers_upload/mmbench_dev_20230712`.
- Download `mmbench_dev_cn_20231003.tsv` and put it under `../data/mmbench`.
- Run single-GPU or multi-GPU inference and evaluation.
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/mmbench_cn.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/mmbench_cn.sh
- Submit the results to the evaluation server: `../data/eval/mmbench/answers_upload/mmbench_dev_cn_20231003`.
- Follow the official instructions to download the images and videos, and put the images under `../data/seed_bench/SEED-Bench-image`. Note that we only evaluate on the image subset.
- Run single-GPU or multi-GPU inference and evaluation.
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/seed.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/seed.sh
- Extract `mm-vet.zip` to `../data/mmvet`.
- Run single-GPU or multi-GPU inference and evaluation.
method=VTC-CLS # Option: {FastV, llava_prumerge}
bash scripts/v1_5/eval/$method/mmvet.sh $layer $token_num
bash scripts/v1_5/eval/reproduce/mmvet.sh
- Evaluate the predictions in `../data/eval/mmvet/results` using the official Jupyter notebook.
Our codebase is partly built upon LLaVolta and LLaVA-PruMerge. Thanks for their great implementations!
If our code or models help your work, please cite our paper:
@article{wang2024cls,
title={[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs},
author={Wang, Ao and Sun, Fengyuan and Chen, Hui and Lin, Zijia and Han, Jungong and Ding, Guiguang},
journal={arXiv preprint arXiv:2412.05819},
year={2024}
}