A curated list of awesome Multimodal studies.

## Omni-modal Models

Title | Venue | Date | Code | Supplement |
---|---|---|---|---|
[Survey] From Specific-MLLM to Omni-MLLM: A Survey about the MLLMs aligned with Multi-Modality (HIT, Peng Cheng Lab) | arXiv | 2024-12-16 | - | - |
OMCAT: Omni Context Aware Transformer (OCTAV, OMCAT) (NVIDIA) | arXiv | 2024-10-15 | - | - |
Baichuan-Omni Technical Report | arXiv | 2024-10-11 | - | - |
OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces | arXiv | 2024-07-16 | - | - |
Explore the Limits of Omni-modal Pretraining at Scale (MiCo) | arXiv | 2024-06-13 | - | - |
ViT-Lens: Towards Omni-modal Representations (TencentARC) | CVPR 2024 | 2023-08-20 | - | - |
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | NeurIPS 2023 | 2023-05-29 | - | - |
ImageBind: One Embedding Space To Bind Them All | CVPR 2023 | 2023-05-09 | - | - |
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | TPAMI 2024 | 2023-04-17 | - | - |
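
Several entries above (ImageBind, OmniBind, ViT-Lens) share one core idea: train a projection for each new modality against a frozen anchor embedding space with a contrastive objective, so modality pairs never seen together during training become directly comparable. Below is a minimal sketch of that binding step; the encoder widths, the `audio_proj` head, and the symmetric InfoNCE loss are illustrative assumptions, not any single paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def infonce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings (illustrative)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / tau                      # cosine-similarity logits
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Hypothetical setup: a frozen anchor space of width 512 and a trainable
# projection that "binds" a new modality (audio) to it with paired data.
anchor_dim = 512
audio_proj = torch.nn.Linear(128, anchor_dim)     # trainable new-modality head

audio_feats = torch.randn(8, 128)                 # stand-in for an audio encoder's output
image_embeds = torch.randn(8, anchor_dim)         # stand-in for frozen anchor embeddings

loss = infonce(audio_proj(audio_feats), image_embeds.detach())
loss.backward()                                   # only the audio head receives gradients
```

Once bound, audio embeddings can be compared against anything already living in the anchor space (e.g. text queries), which is what enables the zero-shot cross-modal retrieval these papers report.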

## Benchmarks

Title | Venue | Date | Code | Supplement |
---|---|---|---|---|
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning | arXiv | 2024-06-18 | - | - |
LOVA3: Learning to Visual Question Answering, Asking and Assessment | arXiv | 2024-05-23 | - | - |
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI | arXiv | 2024-04-24 | - | - |
BLINK: Multimodal Large Language Models Can See but Not Perceive | arXiv | 2024-04-18 | - | - |
Ferret: Refer and Ground Anything Anywhere at Any Granularity (Ferret-Bench) | ICLR 2024 | 2023-10-11 | - | - |
Aligning Large Multimodal Models with Factually Augmented RLHF (LLaVA-RLHF, MMHal-Bench (hallucination)) | arXiv | 2023-09-25 | - | - |
Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations (AffectVisDial) | ECCV 2024 | 2023-08-30 | - | - |
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | CVPR 2024 | 2023-07-30 | - | - |
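
Most benchmarks in this table share the same evaluation harness: pose an image-question pair, collect the model's answer, and aggregate accuracy per skill category. The sketch below shows that loop, assuming a hypothetical `model.generate(image, question)` interface, an assumed sample schema, and a naive exact-match scorer; actual suites typically use stricter answer extraction or multiple-choice protocols.

```python
from collections import defaultdict

def exact_match(prediction: str, reference: str) -> bool:
    """Naive scorer: normalized string equality (real benchmarks parse answers more carefully)."""
    return prediction.strip().lower() == reference.strip().lower()

def evaluate(model, samples):
    """samples: iterable of dicts with 'image', 'question', 'answer', 'skill' keys (assumed schema)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for s in samples:
        pred = model.generate(s["image"], s["question"])   # hypothetical MLLM interface
        totals[s["skill"]] += 1
        hits[s["skill"]] += exact_match(pred, s["answer"])
    # Per-skill accuracy, as reported by benchmarks like MMT-Bench or BLINK
    return {skill: hits[skill] / totals[skill] for skill in totals}
```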

## Speech

Title | Venue | Date | Code | Supplement |
---|---|---|---|---|
SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities | EMNLP 2023 (Findings) | 2023-05-18 | - | - |