Commit fbe20e7: update to Ovis2
1 parent d248e34 · commit fbe20e7

23 files changed: +899, -92 lines

README.md (29 additions, 37 deletions)
@@ -7,26 +7,25 @@ Ovis (Open VISion) is a novel Multimodal Large Language Model (MLLM) architectur
 </div>
 
 ## Release
-- [11/26] 🔥 Announcing [Ovis1.6-Gemma2-27B](https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-27B)!
-- [11/04] 🔥 Announcing quantized versions of Ovis1.6: [Ovis1.6-Gemma2-9B-GPTQ-Int4](https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-9B-GPTQ-Int4) and [Ovis1.6-Llama3.2-3B-GPTQ-Int4](https://huggingface.co/AIDC-AI/Ovis1.6-Llama3.2-3B-GPTQ-Int4)!
-- [10/22] 🔥 Announcing Ovis1.6-Llama3.2-3B ([Model](https://huggingface.co/AIDC-AI/Ovis1.6-Llama3.2-3B), [Demo](https://huggingface.co/spaces/AIDC-AI/Ovis1.6-Llama3.2-3B))!
-- [09/19] 🔥 Announcing Ovis1.6-Gemma2-9B ([Model](https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-9B), [Demo](https://huggingface.co/spaces/AIDC-AI/Ovis1.6-Gemma2-9B))! This latest release further enhances high-resolution image processing, is trained on a larger, more diverse, and higher-quality dataset, and refines the training process with DPO training following instruction-tuning.
-- [07/24] 🔥 Introducing Ovis1.5, featuring improved high-resolution image processing and optimized training data for enhanced performance.
-- [06/14] 🔥 Launch of Ovis1.0, the inaugural version of the Ovis model.
+- [25/01/26] 🔥 Launch of [Ovis2-1/2/4/8/16/34B](https://huggingface.co/AIDC-AI/Ovis2-34B), the latest version of Ovis models, featuring breakthrough small-model performance, enhanced reasoning capabilities, advanced video and multi-image processing, expanded multilingual OCR support, and improved high-resolution image handling.
+- [24/11/26] 🔥 Announcing [Ovis1.6-Gemma2-27B](https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-27B)!
+- [24/11/04] 🔥 Announcing quantized versions of Ovis1.6: [Ovis1.6-Gemma2-9B-GPTQ-Int4](https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-9B-GPTQ-Int4) and [Ovis1.6-Llama3.2-3B-GPTQ-Int4](https://huggingface.co/AIDC-AI/Ovis1.6-Llama3.2-3B-GPTQ-Int4)!
+- [24/10/22] 🔥 Announcing Ovis1.6-Llama3.2-3B ([Model](https://huggingface.co/AIDC-AI/Ovis1.6-Llama3.2-3B), [Demo](https://huggingface.co/spaces/AIDC-AI/Ovis1.6-Llama3.2-3B))!
+- [24/09/19] 🔥 Announcing Ovis1.6-Gemma2-9B ([Model](https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-9B), [Demo](https://huggingface.co/spaces/AIDC-AI/Ovis1.6-Gemma2-9B))! This release further enhances high-resolution image processing, is trained on a larger, more diverse, and higher-quality dataset, and refines the training process with DPO training following instruction-tuning.
+- [24/07/24] 🔥 Introducing Ovis1.5, featuring improved high-resolution image processing and optimized training data for enhanced performance.
+- [24/06/14] 🔥 Launch of Ovis1.0, the inaugural version of the Ovis model.
 
 ## Contents
 - [Install](#install)
 - [Model](#model)
 - [Performance](#performance)
-- [Finetune](#finetune)
 - [Inference](#inference)
-- [Quantization](#quantization)
 - [Citation](#citation)
 - [Team](#team)
 - [License](#license)
 
 ## Install
-Ovis has been tested with Python 3.10, Torch 2.4.0, Transformers 4.46.2, and DeepSpeed 0.15.4. For a comprehensive list of package dependencies, please consult the `requirements.txt` file. Before finetuning or inference, please install Ovis as follows.
+Ovis has been tested with Python 3.10, Torch 2.4.0, Transformers 4.46.2, and DeepSpeed 0.15.4. For a comprehensive list of package dependencies, please consult the `requirements.txt` file.
 ```bash
 git clone git@github.com:AIDC-AI/Ovis.git
 conda create -n ovis python=3.10 -y
@@ -39,27 +38,30 @@ pip install -e .
 ## Model
 Ovis can be instantiated with popular LLMs. We provide the following Ovis MLLMs:
 
-| Ovis MLLMs | ViT | LLM | Model Weights | Demo |
-|:------------------|:-----------:|:------------------:|:---------------------------------------------------------------:|:----------------------------------------------------------------:|
-| Ovis1.6-Gemma2-27B | Siglip-400M | Gemma2-27B-It | [Huggingface](https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-27B) | - |
-| Ovis1.6-Gemma2-9B | Siglip-400M | Gemma2-9B-It | [Huggingface](https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-9B) | [Space](https://huggingface.co/spaces/AIDC-AI/Ovis1.6-Gemma2-9B) |
-| Ovis1.6-Llama3.2-3B | Siglip-400M | Llama-3.2-3B-Instruct | [Huggingface](https://huggingface.co/AIDC-AI/Ovis1.6-Llama3.2-3B) | [Space](https://huggingface.co/spaces/AIDC-AI/Ovis1.6-Llama3.2-3B) |
+| Ovis MLLMs | ViT | LLM | Model Weights | Demo |
+|:-----------|:-----------------------:|:---------------------:|:-------------------------------------------------------:|:--------------------------------------------------------:|
+| Ovis2-1B | aimv2-large-patch14-448 | Qwen2.5-0.5B-Instruct | [Huggingface](https://huggingface.co/AIDC-AI/Ovis2-1B) | [Space](https://huggingface.co/spaces/AIDC-AI/Ovis2-1B) |
+| Ovis2-2B | aimv2-large-patch14-448 | Qwen2.5-1.5B-Instruct | [Huggingface](https://huggingface.co/AIDC-AI/Ovis2-2B) | [Space](https://huggingface.co/spaces/AIDC-AI/Ovis2-2B) |
+| Ovis2-4B | aimv2-huge-patch14-448 | Qwen2.5-3B-Instruct | [Huggingface](https://huggingface.co/AIDC-AI/Ovis2-4B) | [Space](https://huggingface.co/spaces/AIDC-AI/Ovis2-4B) |
+| Ovis2-8B | aimv2-huge-patch14-448 | Qwen2.5-7B-Instruct | [Huggingface](https://huggingface.co/AIDC-AI/Ovis2-8B) | [Space](https://huggingface.co/spaces/AIDC-AI/Ovis2-8B) |
+| Ovis2-16B | aimv2-huge-patch14-448 | Qwen2.5-14B-Instruct | [Huggingface](https://huggingface.co/AIDC-AI/Ovis2-16B) | [Space](https://huggingface.co/spaces/AIDC-AI/Ovis2-16B) |
+| Ovis2-34B | aimv2-1B-patch14-448 | Qwen2.5-32B-Instruct | [Huggingface](https://huggingface.co/AIDC-AI/Ovis2-34B) | - |
 
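For orientation while reading the new Model table above: loading an Ovis2 checkpoint typically goes through the standard `trust_remote_code` path. The snippet below is a hedged sketch based on the public Hugging Face model cards rather than on this commit; the checkpoint name, dtype, and the `multimodal_max_length` keyword are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM

# Hedged sketch: identifiers follow the public Ovis2 model cards, not this diff.
model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis2-8B",              # any checkpoint from the table above
    torch_dtype=torch.bfloat16,
    multimodal_max_length=32768,     # matches the new default in configuration_ovis.py below
    trust_remote_code=True,
).cuda()
text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()
```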
 ## Performance
-With **29B** parameters, **Ovis1.6-Gemma2-27B** achieves exceptional performance in the [OpenCompass](https://github.com/open-compass/VLMEvalKit) benchmark, ranking among the top-tier open-source MLLMs.
 
-![performance-Ovis1_6-Gemma2-27B](docs/performance/Ovis1_6-Gemma2-27B.png)
+![performance-Ovis2](docs/performance/Ovis2.png)
 
-With just **10B** parameters, **Ovis1.6-Gemma2-9B** leads the [OpenCompass](https://github.com/open-compass/VLMEvalKit) benchmark among open-source MLLMs within **30B** parameters.
-
-![performance-Ovis1_6-Gemma2-9B](docs/performance/Ovis1_6-Gemma2-9B.png)
-
-**Ovis1.6-Llama3.2-3B** leads the [OpenCompass](https://github.com/open-compass/VLMEvalKit) benchmark among open-source MLLMs under **4B** parameters, even surpassing Llama-3.2-11B-Vision-Instruct.
-
-![performance-Ovis1_6-Llama3_2-3B](docs/performance/Ovis1_6-Llama3_2-3B.png)
-
-## Finetune
-Finetuning Ovis1.6-Gemma2-9B is supported in [ms-swift](https://github.com/modelscope/ms-swift).
+|Benchmark|Ovis2-1B|Ovis2-2B|Ovis2-4B|Ovis2-8B|Ovis2-16B|Ovis2-34B|
+|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
+|MMBench-V1.1<sub>test</sub>|68.5|77.2|81.4|83.3|85.2|86.2|
+|MMStar|52.0|59.0|61.7|64.4|66.9|69.4|
+|MMMU<sub>val</sub>|36.0|45.3|48.0|59.0|59.6|65.6|
+|MathVista<sub>testmini</sub>|59.5|64.4|69.1|71.4|74.9|77.0|
+|HallBench<sub>avg</sub>|44.5|50.2|54.0|56.0|55.9|58.8|
+|AI2D<sub>test</sub>|76.8|82.6|85.5|86.8|86.1|88.4|
+|OCRBench|88.7|87.5|91.0|89.3|88.2|89.8|
+|MMVet|50.3|58.6|65.5|68.5|68.4|75.5|
+|Average|59.5|65.6|69.5|72.3|73.1|76.3|
 
 ## Inference
 We provide an inference wrapper in `ovis/serve/runner.py`, which can be used as:
@@ -77,16 +79,6 @@ Based on [Gradio](https://github.com/gradio-app/gradio), Ovis can also be access
 python ovis/serve/server.py --model_path MODEL_PATH --port PORT
 ```
 
-## Quantization
-We quantized Ovis1.6 using AutoGPTQ. For detailed information on running and creating your own quantized version, please refer to the respective Huggingface model cards: [Ovis1.6-Gemma2-9B-GPTQ-Int4](https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-9B-GPTQ-Int4) and [Ovis1.6-Llama3.2-3B-GPTQ-Int4](https://huggingface.co/AIDC-AI/Ovis1.6-Llama3.2-3B-GPTQ-Int4). Quantized Ovis1.6 maintains performance comparable to its non-quantized counterpart while requiring less GPU memory:
-
-- Benchmark performance:
-![performance-Ovis1_6-Gemma2-9B-GPTQ-Int4](docs/performance/Ovis1_6-Gemma2-9B-GPTQ-Int4.png)
-![performance-Ovis1_6-Llama3_2-3B-GPTQ-Int4](docs/performance/Ovis1_6-Llama3_2-3B-GPTQ-Int4.png)
-
-- GPU memory usage (max_partition=9):
-![performance-Ovis1_6-VRAM-Comparison](docs/performance/Ovis1_6-VRAM-Comparison.png)
-
 ## Citation
 If you find Ovis useful, please cite the paper
 ```
@@ -99,7 +91,7 @@ If you find Ovis useful, please cite the paper
 ```
 
 ## Team
-This work is a collaborative effort by the MarcoVL team. We would also like to provide links to the following MLLM papers from our team:
+This work is a collaborative effort by the Alibaba Ovis team. We would also like to provide links to the following MLLM papers from our team:
 - [Parrot: Multilingual Visual Instruction Tuning](https://arxiv.org/abs/2406.02539)
 - [Wings: Learning Multimodal LLMs without Text-only Forgetting](https://arxiv.org/abs/2406.03496)
 
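The runner usage referenced under the "## Inference" heading falls outside the displayed hunks. As orientation only, a wrapper of this kind is usually driven as sketched below; the `RunnerArguments`/`OvisRunner` names and the `run([...])` signature are assumptions inferred from the `ovis/serve/runner.py` path, not something this commit shows.

```python
from PIL import Image

# Hedged sketch: class names and call signature are assumptions, not taken from this commit.
from ovis.serve.runner import RunnerArguments, OvisRunner

runner = OvisRunner(RunnerArguments(model_path='AIDC-AI/Ovis2-8B'))
image = Image.open('example.png')
generation = runner.run([image, 'Describe this image.'])
print(generation)
```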
docs/performance/Ovis2.png (binary image, 584 KB)

ovis/model/__init__.py (7 additions, 0 deletions)

@@ -1,2 +1,9 @@
+from transformers import AutoConfig, AutoModel
+from .visual_tokenizer.configuration_aimv2 import AIMv2Config
+from .visual_tokenizer.modeling_aimv2 import AIMv2Model
 from .visual_tokenizer.clip_visual_tokenizer import ClipVisualTokenizerConfig, ClipVisualTokenizer
 from .visual_tokenizer.siglip_visual_tokenizer import SiglipVisualTokenizerConfig, SiglipVisualTokenizer
+from .visual_tokenizer.aimv2_visual_tokenizer import Aimv2VisualTokenizerConfig, Aimv2VisualTokenizer
+
+AutoConfig.register('aimv2', AIMv2Config)
+AutoModel.register(AIMv2Config, AIMv2Model)
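This registration is what lets the rest of the code resolve the new AIMv2 backbone through the generic `Auto*` classes. The snippet below shows the same `transformers` registration pattern in isolation; the toy `model_type` and classes are invented for illustration and are not part of Ovis.

```python
import torch
from transformers import AutoConfig, AutoModel, PretrainedConfig, PreTrainedModel


class ToyConfig(PretrainedConfig):
    # Illustrative model_type; any string not already claimed by transformers works.
    model_type = "toy_backbone"

    def __init__(self, hidden_size=32, **kwargs):
        super().__init__(**kwargs)
        self.hidden_size = hidden_size


class ToyModel(PreTrainedModel):
    config_class = ToyConfig

    def __init__(self, config):
        super().__init__(config)
        self.proj = torch.nn.Linear(config.hidden_size, config.hidden_size)

    def forward(self, x):
        return self.proj(x)


# Same two-step pattern as in the __init__.py diff above: map the model_type string
# to the config class, then map the config class to the model class.
AutoConfig.register("toy_backbone", ToyConfig)
AutoModel.register(ToyConfig, ToyModel)

config = AutoConfig.for_model("toy_backbone", hidden_size=16)
model = AutoModel.from_config(config)  # resolves to ToyModel through the registry
```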

ovis/model/configuration_ovis.py (1 addition, 1 deletion)

@@ -10,7 +10,7 @@ def __init__(
         self,
         llm_config: Optional[Union[PretrainedConfig, dict]] = None,
         visual_tokenizer_config: Optional[Union[PretrainedConfig, dict]] = None,
-        multimodal_max_length=8192,
+        multimodal_max_length=32768,
         hidden_size=None,
         conversation_formatter_class=None,
         llm_attn_implementation=None,

ovis/model/conversation_formatter.py (1 addition, 0 deletions)

@@ -15,6 +15,7 @@ def __init__(self, tokenizer):
         self.image_token = IMAGE_TOKEN
         self.image_token_id = IMAGE_TOKEN_ID
         self.ignore_id = IGNORE_ID
+        self.im_end = None
 
     def _tokenize_with_image_symbol(self, text):
         text_chunks = [self.tokenizer(chunk, add_special_tokens=False).input_ids for chunk in

ovis/model/modeling_ovis.py (29 additions, 22 deletions)

@@ -83,8 +83,7 @@ def _merge_modules(modules_list: tuple):
         self.is_parallelizable = all((self.llm.is_parallelizable, self.visual_tokenizer.is_parallelizable))
         self.supports_gradient_checkpointing = all(
             (self.llm.supports_gradient_checkpointing, self.visual_tokenizer.supports_gradient_checkpointing))
-        self._supports_flash_attn_2 = all(
-            (self.llm._supports_flash_attn_2, self.visual_tokenizer._supports_flash_attn_2))
+        self._supports_flash_attn_2 = True
         self._supports_sdpa = all((self.llm._supports_sdpa, self.visual_tokenizer._supports_sdpa))
 
     def get_text_tokenizer(self):
@@ -147,7 +146,7 @@ def forward(
         pixel_values: List[Optional[torch.Tensor]],
         **kwargs
     ):
-        assert self.training, "`forward` can only be used in training. For inference, use `generate`."
+        # assert self.training, "`forward` can only be used in training. For inference, use `generate`."
         _, inputs_embeds, labels, attention_mask = self.merge_multimodal(
             text_input_ids=input_ids,
             text_attention_masks=attention_mask,
@@ -161,7 +160,8 @@ def merge_multimodal(
         text_input_ids: torch.Tensor,
         text_attention_masks: torch.Tensor,
         text_labels: Optional[torch.Tensor],
-        pixel_values: List[Optional[torch.Tensor]]
+        pixel_values: List[Optional[torch.Tensor]],
+        left_padding: bool = False
     ):
         input_device = text_input_ids.device
         visual_vocab_szie = self.get_visual_tokenizer().config.vocab_size
@@ -202,7 +202,8 @@ def merge_multimodal(
             visual_input_ids = [None] * len(num_images)
             visual_labels = [None] * len(num_images)
             # just placeholders
-            text_labels = torch.full(text_input_ids.shape, IGNORE_ID, dtype=torch.long, device=input_device)
+            if text_labels is None:
+                text_labels = torch.full(text_input_ids.shape, IGNORE_ID, dtype=torch.long, device=input_device)
 
         input_embeds = []
         attention_masks = []
@@ -254,29 +255,30 @@ def merge_multimodal(
             attention_masks.append(attention_mask)
             labels.append(label)
 
-        if self.training: # padding to self.config.multimodal_max_length for increased training speed
-            padding_size = max(0, self.config.multimodal_max_length - len(input_embeds[0]))
-            input_embeds[0] = torch.nn.ConstantPad2d((0, 0, 0, padding_size), 0.0)(input_embeds[0])
-            attention_masks[0] = torch.nn.ConstantPad1d((0, padding_size), False)(attention_masks[0])
-            labels[0] = torch.nn.ConstantPad1d((0, padding_size), IGNORE_ID)(labels[0])
-        batch_input_embeds = torch.nn.utils.rnn.pad_sequence(input_embeds, batch_first=True, padding_value=0.0)[:,
-                             :self.config.multimodal_max_length, :]
-        batch_attention_mask = torch.nn.utils.rnn.pad_sequence(attention_masks, batch_first=True, padding_value=False)[
-                               :,
-                               :self.config.multimodal_max_length]
-        batch_labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=IGNORE_ID)[:,
-                       :self.config.multimodal_max_length]
+        batch_input_embeds = self.pad_truncate_sequence(input_embeds, batch_first=True, padding_value=0.0, left_padding=left_padding)
+        batch_attention_mask = self.pad_truncate_sequence(attention_masks, batch_first=True, padding_value=False, left_padding=left_padding)
+        batch_labels = self.pad_truncate_sequence(labels, batch_first=True, padding_value=IGNORE_ID, left_padding=left_padding)
 
         return visual_input_ids, batch_input_embeds, batch_labels, batch_attention_mask
 
+    def pad_truncate_sequence(self, sequences: List[torch.Tensor], batch_first: bool = True, padding_value: float = 0.0, left_padding: bool = False) -> torch.Tensor:
+        if not left_padding:
+            pad_sequence = torch.nn.utils.rnn.pad_sequence(sequences, batch_first=batch_first, padding_value=padding_value)
+            return pad_sequence[:, :self.config.multimodal_max_length]
+        else:
+            pad_sequence = torch.nn.utils.rnn.pad_sequence([i.flip(dims=[0]) for i in sequences], batch_first=True, padding_value=padding_value).flip(dims=[1])
+            return pad_sequence[:, -self.config.multimodal_max_length:]
+
     def preprocess_inputs(
         self,
         text_or_conversations: Union[List[Dict], str],
         images: Optional[List[PIL.Image.Image]],
         max_partition=9,
         generation_preface='',
         return_labels=False,
-        propagate_exception=True
+        propagate_exception=True,
+        frame_selector=None,
+        frame_selector_kwargs=None
     ):
         # convert text to conversations
         if isinstance(text_or_conversations, str):
290292
raise ValueError(f'Invalid type of `text_or_conversations`, expected `List[Dict]` or `str`,'
291293
f' but got {type(text_or_conversations)}')
292294

295+
if frame_selector is not None:
296+
frame_selector_kwargs = frame_selector_kwargs or {}
297+
conversations, images = frame_selector(conversations=conversations, frames=images, **frame_selector_kwargs)
298+
293299
# format conversations
294300
prompt, raw_input_ids, raw_labels = self.get_conversation_formatter().format(
295301
conversations, generation_preface=generation_preface)
@@ -408,22 +414,23 @@ def _get_hybrid_cache_for_llm(self, batch_size: int, max_cache_len: int):
         llm._cache.reset()
         return llm._cache
 
-    # TODO: support batch generation
     def generate(
         self,
         inputs: Optional[torch.Tensor] = None,
         **kwargs
     ) -> Union[GenerateOutput, torch.LongTensor]:
-        assert inputs.shape[0] == 1, 'Currently, only support `batch_size=1`'
         _, inputs_embeds, labels, attention_mask = self.merge_multimodal(
             text_input_ids=inputs,
             text_attention_masks=kwargs.pop('attention_mask'),
             text_labels=None,
-            pixel_values=kwargs.pop('pixel_values')
+            pixel_values=kwargs.pop('pixel_values'),
+            left_padding=True
         )
+        inputs_embeds = inputs_embeds.detach()
+        torch.cuda.empty_cache()
         if getattr(self.generation_config, 'cache_implementation') == 'hybrid': # mainly for Gemma2
             kwargs['past_key_values'] = self._get_hybrid_cache_for_llm(
-                getattr(kwargs, "num_beams", 1), kwargs['max_new_tokens'] + inputs_embeds.shape[-2])
+                getattr(kwargs, "num_beams", inputs_embeds.shape[0]), kwargs['max_new_tokens'] + inputs_embeds.shape[-2])
             self.get_llm()._supports_cache_class = True
             kwargs['cache_implementation'] = None
 
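With the `batch_size=1` assertion gone and inputs now left-padded, `generate` can be driven with batched inputs. The snippet below is a hedged, single-image sketch of the surrounding call pattern based on the public Ovis model cards, not on this commit; the `preprocess_inputs` return values and the generation kwargs are assumptions, and `model`, `text_tokenizer`, and `visual_tokenizer` are the objects from the loading sketch earlier.

```python
import torch
from PIL import Image

image = Image.open('example.png')
query = '<image>\nDescribe this image.'

# Assumed return signature: (prompt, input_ids, pixel_values).
prompt, input_ids, pixel_values = model.preprocess_inputs(query, [image], max_partition=9)
attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)
input_ids = input_ids.unsqueeze(0).to(device=model.device)
attention_mask = attention_mask.unsqueeze(0).to(device=model.device)
pixel_values = [pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device)]

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        pixel_values=pixel_values,
        attention_mask=attention_mask,
        max_new_tokens=1024,
        do_sample=False,
    )[0]
    print(text_tokenizer.decode(output_ids, skip_special_tokens=True))
```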
ovis/model/visual_tokenizer/aimv2_visual_tokenizer.py (new file, 43 additions, 0 deletions)

@@ -0,0 +1,43 @@
+from transformers import AutoConfig, AutoModel
+from transformers import CLIPImageProcessor
+from .modeling_aimv2 import AIMv2Model
+from .base_visual_tokenizer import BaseVisualTokenizerConfig, BaseVisualTokenizer
+
+MODEL_TYPE = "aimv2_visual_tokenizer"
+
+
+class Aimv2VisualTokenizerConfig(BaseVisualTokenizerConfig):
+    model_type = MODEL_TYPE
+
+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+        if self.drop_cls_token:
+            self.drop_cls_token = False
+        if self.depths:
+            assert len(self.depths) == 1
+            self.backbone_kwargs['num_hidden_layers'] = self.depths[0]
+
+
+class Aimv2VisualTokenizer(BaseVisualTokenizer):
+    config_class = Aimv2VisualTokenizerConfig
+    supports_gradient_checkpointing = True
+    _no_split_modules = ["AIMv2ViTPreprocessor", "AIMv2Block"]
+    _image_processor_class = CLIPImageProcessor
+    _image_processor_kwargs = dict(do_center_crop=False)
+    _backbone_class = AIMv2Model
+
+    def get_monitor_tensors(self):
+        return dict(
+            backbone_bottom=self.backbone.trunk.blocks[0].attn.qkv.weight,
+            backbone_top=self.backbone.trunk.blocks[-1].attn.qkv.weight,
+            head=self.head[0].weight
+        )
+
+    def get_image_size(self):
+        height = self.image_processor.crop_size["height"]
+        width = self.image_processor.crop_size["width"]
+        return height, width
+
+
+AutoConfig.register(MODEL_TYPE, Aimv2VisualTokenizerConfig)
+AutoModel.register(Aimv2VisualTokenizerConfig, Aimv2VisualTokenizer)
