Releases: haotian-liu/LLaVA
Release v1.2.0 (LLaVA-1.6)
LLaVA-1.6 is out! With additional scaling over LLaVA-1.5, LLaVA-1.6-34B outperforms Gemini Pro on some benchmarks. It can now process 4x more pixels and handle more tasks and applications than before. Check out the blog post, and explore the demo! Models are available in the Model Zoo. Training/eval data and scripts are coming soon.
Release v1.1.3 (Bring your own data, LoRA training)
Updates
- Support LoRA for the instruction tuning stage of LLaVA-1.5 -- comparable performance to full-model finetuning, with reduced GPU VRAM requirements. (ckpts/logs, script)
- Bring your own data and finetune LLaVA-1.5 to your own task. (instruction)
- Basic support for Windows. (instruction)
- Fix: training with gradient accumulation now behaves the same as large-batch training (see the sketch below).
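For context, here is a minimal PyTorch sketch of the equivalence this fix restores, using a toy model rather than LLaVA's actual training loop: when each micro-batch loss is scaled by the number of accumulation steps, the accumulated gradient matches what a single large batch of the same samples would produce.

```python
import torch
from torch import nn

# Toy setup for illustration only -- this is not LLaVA code.
torch.manual_seed(0)
model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
micro_batches = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(4)]

accum_steps = len(micro_batches)  # 4 micro-batches of 4 samples ~ one batch of 16

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches):
    loss = nn.functional.mse_loss(model(x), y)
    # Scaling each micro-batch loss by 1/accum_steps makes the summed
    # gradients equal the gradient of the mean loss over all 16 samples,
    # i.e. the same update a single large batch would produce.
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```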
Notes
- A new LoRA schedule is used for LLaVA-1.5 (see the config sketch after this list):
- rank: 128
- alpha: 256
- lr (LoRA): 2e-4
- lr (projector): 2e-5
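For illustration, the schedule above might be expressed with Hugging Face PEFT's `LoraConfig` roughly as follows. This is a hedged sketch, not the repository's training script; `lora_dropout` and `target_modules` are placeholder assumptions not taken from the release notes.

```python
from peft import LoraConfig

# Sketch only: the LoRA schedule above expressed as a PEFT config.
lora_config = LoraConfig(
    r=128,              # LoRA rank from the schedule above
    lora_alpha=256,     # LoRA alpha from the schedule above
    lora_dropout=0.05,  # assumed value, not stated in the release notes
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # placeholder assumption
    task_type="CAUSAL_LM",
)

# The two learning rates (2e-4 for the LoRA weights, 2e-5 for the projector)
# would typically be handled with separate optimizer parameter groups.
```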
Release v1.1.1
In this version, we release the training scripts, data, and benchmark evaluation scripts for LLaVA-1.5. Bake your LLaVA today!
LLaVA-1.5 achieves SoTA on 11 benchmarks with just simple modifications to the original LLaVA: it uses only publicly available data, completes training in ~1 day on a single 8-A100 node, and surpasses methods like Qwen-VL-Chat that use billion-scale data. Check out the technical report, and explore the demo! Models are available in the Model Zoo!
Release v1.1.0
🔥 LLaVA-1.5 is out! This release supports LLaVA-1.5 model inference and serving.
We will release the training scripts, data, and benchmark evaluation scripts in the coming week.
LLaVA-1.5 achieves SoTA on 11 benchmarks with just simple modifications to the original LLaVA: it uses only publicly available data, completes training in ~1 day on a single 8-A100 node, and surpasses methods like Qwen-VL-Chat that use billion-scale data. Check out the technical report, and explore the demo! Models are available in the Model Zoo, with training and evaluation scripts coming in the next week!
Release v1.0.2
- Added model zoo
- Improved support for ScienceQA with the latest training configurations
- Improved docs
We are continuing to improve the documentation. Please let us know if you find anything unclear, thanks!
Release v1.0.1
- Added LLaMA-2 support
- Full LoRA support. To make model training more accessible, we release a set of LoRA-based model weights, which support training on academic resources (e.g. 4x A6000s or 8x 3090s) without the need for CPU offloading
- A more versatile design for training large multimodal models, including swapping in different language models and vision encoders, with more coming soon
- Support higher-resolution input using CLIP-ViT-L-336px as the vision encoder, for more detailed visual understanding
- Ablate and clean up some design choices to make the training simpler and smoother
- Full DeepSpeed support
- Improved model checkpoint saving during the pretraining stage to save disk space
- Improved WebUI interface
- Improved support for inference with multiple GPUs
- Support inference with 4-bit and 8-bit quantization (see the sketch after this list)
- Support interactive CLI inference
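As an illustration of 4-bit/8-bit inference, here is a generic Hugging Face Transformers sketch using `BitsAndBytesConfig`. The model id is a placeholder, and this is not the repository's own serving CLI or loader.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Placeholder model id for illustration; not an actual checkpoint name.
model_id = "your-org/your-llava-checkpoint"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # or load_in_8bit=True for 8-bit
    bnb_4bit_compute_dtype=torch.float16,  # compute dtype for 4-bit matmuls
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # spread layers across the available GPUs
)
```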
We train all models in this release using LLaVA-LCS-558K for pretraining and LLaVA-Instruct-80K for instruction tuning, to maintain an efficient and affordable training budget. The full training (including both pretraining and finetuning) can be completed within 6 hours on 8x 3090s.
We hope this release further benefits the community and makes large multimodal models more accessible.
Detailed Changes
- Tokenization. We remove the dependency on the additional tokens (`<IM_START>`, `<IM_END>`, `<IM_PATCH>`), so that during the pretraining stage the tokenizer does not change at all and we only update the linear projector weights (see the sketch after this list).
- Prompt.
  - Pretraining. We simplified the pretraining prompts by removing additional instructions like `Describe the image details`, which we find still allows zero-shot inference and can slightly improve the training speed.
  - We keep the train/test prompt consistent, which we find slightly improves the model's performance during inference.
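To illustrate the projector-only pretraining described above, here is a minimal PyTorch sketch with toy stand-in modules. The names (`vision_tower`, `mm_projector`, `language_model`) only loosely echo the real components; the model itself is not LLaVA code.

```python
import torch
from torch import nn

# Toy stand-ins for illustration; not the actual LLaVA modules.
class ToyMultimodalModel(nn.Module):
    def __init__(self, vision_dim=1024, hidden_dim=4096):
        super().__init__()
        self.vision_tower = nn.Linear(vision_dim, vision_dim)    # stands in for the vision encoder
        self.mm_projector = nn.Linear(vision_dim, hidden_dim)    # the linear projector
        self.language_model = nn.Linear(hidden_dim, hidden_dim)  # stands in for the LLM

model = ToyMultimodalModel()

# Pretraining: freeze everything except the projector. The tokenizer (and
# therefore the LLM's embedding table) is left untouched, since no extra
# image tokens are added to the vocabulary.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("mm_projector")

print([n for n, p in model.named_parameters() if p.requires_grad])
# -> ['mm_projector.weight', 'mm_projector.bias']
```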