Awesome-Multimodal-Papers

A curated list of awesome Multimodal studies.

Awesome-Multimodal-Papers
- Multimodal Papers
- Paper Notes

Multimodal Papers

Visual Understanding

Title	Venue	Date	Code	Supplement
✨ Apollo: An Exploration of Video Understanding in Large Multimodal Models (Exploration) (Meta)	arXiv	2024-12-13
CompCap: Improving Multimodal Large Language Models with Composite Captions (Meta)	arXiv	2024-12-09	-	-
✨ Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling (InternVL 2.5)	arXiv	2024-12-06
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs	arXiv	2024-10-21	-
✨ Video Instruction Tuning With Synthetic Data (LLaVA-Video, LLaVA-NeXT Series)	arXiv	2024-10-03
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions	arXiv	2024-09-26	-
Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model (MGLMM, Alibaba)	arXiv	2024-09-20
POINTS: Improving Your Vision-language Model with Affordable Strategies (WeChat)	arXiv	2024-09-07		-
✨ xGen-MM (BLIP-3): A Family of Open Large Multimodal Models	arXiv	2024-08-16
✨ LLaVA-OneVision: Easy Visual Task Transfer (LLaVA-NeXT Series)	arXiv	2024-08-06
Tarsier: Recipes for Training and Evaluating Large Video Description Models (Tarsier, Dream1k, by ByteDance)	arXiv	2024-07-30
✨ InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output	arXiv	2024-07-03		-
TokenPacker: Efficient Visual Projector for Multimodal LLM	arXiv	2024-07-02		-
✨ Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs (Cambrian, Data Rationing)	arXiv	2024-06-24
✨ Long Context Transfer from Language to Vision (LongVA, by Ziwei Liu, Chunyuan Li)	arXiv	2024-06-24
Generative Visual Instruction Tuning	arXiv	2024-06-17		-
✨ VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding	arXiv	2024-06-13
✨ 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (Apple)	arXiv	2024-06-13
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	arXiv	2024-06-11		-
Wings: Learning Multimodal LLMs without Text-only Forgetting	arXiv	2024-06-05	-	-
Enhancing Multimodal Large Language Models with Multi-instance Visual Prompt Generator for Visual Representation Enrichment (MIVPG)	arXiv	2024-06-05	-	-
PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM	arXiv	2024-06-05		-
OLIVE: Object Level In-Context Visual Embeddings	ACL 2024	2024-06-02		-
X-VILA: Cross-Modality Alignment for Large Language Model (by NVIDIA)	arXiv	2024-05-29	-
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models	arXiv	2024-05-24		-
Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models	arXiv	2024-05-24	-	-
LOVA3: Learning to Visual Question Answering, Asking and Assessment	arXiv	2024-05-23		-
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability	arXiv	2024-05-23
Chameleon: Mixed-Modal Early-Fusion Foundation Models (Meta)	arXiv	2024-05-16
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts	arXiv	2024-05-09
ImageInWords: Unlocking Hyper-Detailed Image Descriptions (Google)	arXiv	2024-05-05
✨ What matters when building vision-language models? (Idefics2)	arXiv	2024-05-03
MANTIS: Interleaved Multi-Image Instruction Tuning	arXiv	2024-05-02
Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs	CVPR 2024 Workshop	2024-04-23	-
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models	arXiv	2024-04-19
MoVA: Adapting Mixture of Vision Experts to Multimodal Context	arXiv	2024-04-19		-
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models	arXiv	2024-04-18	-
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? (LaDiC)	NAACL 2024	2024-04-16		-
AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception (AesExpert, AesMMIT Dataset)	arXiv	2024-04-15		-
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models (Ferret-v2)	arXiv	2024-04-11	-	-
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies (MiniCPM series)	arXiv	2024-04-09
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (Ferret-UI)	arXiv	2024-04-08	-	-
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding	CVPR 2024	2024-04-08
Koala: Key frame-conditioned long video-LLM	CVPR 2024	2024-04-05
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens	arXiv	2024-04-04
LongVLM: Efficient Long Video Understanding via Large Language Models	arXiv	2024-04-04		-
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	ECCV 2024	2024-03-22		-
VideoAgent: Long-form Video Understanding with Large Language Model as Agent (key frame)	arXiv	2024-03-15	-	-
✨ MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training (Apple)	arXiv	2024-03-14	-	-
UniCode: Learning a Unified Codebook for Multimodal Large Language Models	arXiv	2024-03-14	-	-
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context	arXiv	2024-03-08	-
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models	arXiv	2023-03-05		-
RegionGPT: Towards Region Understanding Vision Language Model	CVPR 2024	2024-03-04	-
All in an Aggregated Image for In-Image Learning	arXiv	2024-02-28		-
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners	CVPR 2024	2024-02-27
TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages	arXiv	2024-02-25	-	-
LLMBind: A Unified Modality-Task Integration Framework	arXiv	2024-02-22	-	-
✨ ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model (ALLaVA)	arXiv	2024-02-18
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model	arXiv	2024-02-06		-
MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices	arXiv	2023-12-28		-
Gemini: A Family of Highly Capable Multimodal Models	arXiv	2023-12-19	-
✨ Osprey: Pixel Understanding with Visual Instruction Tuning	CVPR 2024	2023-12-15		-
✨ VILA: On Pre-training for Visual Language Models (NVIDIA, MIT)	CVPR 2024	2023-12-12		-
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models	arXiv	2023-12-11
Prompt Highlighter: Interactive Control for Multi-Modal LLMs	CVPR 2024	2023-12-07
APoLLo : Unified Adapter and Prompt Learning for Vision Language Models	EMNLP 2023	2023-12-04
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models	arXiv	2023-11-28
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models	arXiv	2023-11-22		-
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions	arXiv	2023-11-21
LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge	CVPR 2024	2023-11-20
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection	arXiv	2023-11-16		-
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration	arXiv	2023-11-07		-
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning	arXiv	2023-10-14
Ferret: Refer and Ground Anything Anywhere at Any Granularity (Ferret)	ICLR 2024	2023-10-11		-
✨ Improved Baselines with Visual Instruction Tuning (LLaVA-1.5)	arXiv	2023-10-05
Aligning Large Multimodal Models with Factually Augmented RLHF (LLaVA-RLHF, MMHal-Bench (hallucination))	arXiv	2023-09-25
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning	ICLR 2024	2023-09-14		-
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond	arXiv	2023-08-24
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages (VisCPM-Chat/Paint)	ICLR 2024	2023-08-23		-
SVIT: Scaling up Visual Instruction Tuning	arXiv	2023-07-09
Kosmos-2: Grounding Multimodal Large Language Models to the World (Kosmos-2, GrIT Dataset)	arXiv	2023-06-26
M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning	arXiv	2023-06-07	-
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning	NeurIPS 2023	2023-05-11		-
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans	arXiv	2023-05-08		-
VPGTrans: Transfer Visual Prompt Generator across LLMs	NeurIPS 2023	2023-05-02
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality	arXiv	2023-04-27		-
✨ MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models	ICLR 2024	2023-04-20
✨ Visual Instruction Tuning (LLaVA)	NeurIPS 2023	2023-04-17
Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1)	NeurIPS 2023	2023-02-27		-
Multimodal Chain-of-Thought Reasoning in Language Models	arXiv	2023-02-02		-
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models	ICML 2023	2023-01-30		-
Flamingo: a Visual Language Model for Few-Shot Learning	NeurIPS 2022	2022-04-29		-

Omni Understanding

Title	Venue	Date	Code	Supplement
[Survey] From Specific-MLLM to Omni-MLLM: A Survey about the MLLMs alligned with Multi-Modality (HIT, Peng Cheng Lab)	arXiv	2024-12-16	-	-
OMCAT: Omni Context Aware Transformer (OCTAV, OMCAT) (NVIDIA)	arXiv	2024-10-15	-
Baichuan-Omni Technical Report	arXiv	2024-10-11		-
OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces	arXiv	2024-07-16
Explore the Limits of Omni-modal Pretraining at Scale (MiCo)	arXiv	2024-06-13
ViT-Lens: Towards Omni-modal Representations (TencentARC)	CVPR 2024	2023-08-20
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	NeurIPS 2023	2023-05-29		-
ImageBind: One Embedding Space To Bind Them All	CVPR 2023	2023-05-09
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset	TPAMI 2024	2023-04-17

Unified Understanding and Generation

Title	Venue	Date	Code	Supplement
LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation (Meta)	arXiv	2024-12-19	-	-
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding	arXiv	2024-12-12	coming soon	-
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation (TencentARC)	arXiv	2024-12-05		-
Liquid: Language Models are Scalable Multi-modal Generators (Bytedance)	arXiv	2024-12-05		arXiv
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation (ByteDance)	arXiv	2024-12-04
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation (by deepseek)	-	2024-10-17		-
✨ Emu3: Next-Token Prediction is All You Need	arXiv	2024-09-27
✨ Show-o: One Single Transformer to Unify Multimodal Understanding and Generation	arXiv	2024-08-22
An Image is Worth 32 Tokens for Reconstruction and Generation (TiTok, by ByteDance)	arXiv	2024-06-11
✨ Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models	arXiv	2024-05-27
VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing	-	2024-04-25
✨ SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation	arXiv	2024-04-22		-
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling	arXiv	2024-02-19
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization	arXiv	2024-02-05
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action	arXiv	2023-12-28
Generative Multimodal Models are In-Context Learners (Emu2)	CVPR 2024	2023-12-20
CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation	arXiv	2023-11-30
LLMGA: Multimodal Large Language Model based Generation Assistant	arXiv	2023-11-27
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation	arXiv	2023-12-14		-
Kosmos-G: Generating Images in Context with Multimodal Large Language Models	ICLR 2024	2023-10-04
✨ MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens	arXiv	2023-10-03
DreamLLM: Synergistic Multimodal Comprehension and Creation	ICLR 2024	2023-09-20
NExT-GPT: Any-to-Any Multimodal LLM	arXiv	2023-09-11
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization (LaVIT)	ICLR 2024	2023-09-09		-
Planting a SEED of Vision in Large Language Model	ICLR 2024	2023-07-16
Generative Pretraining in Multimodality (Emu1)	ICLR 2024	2023-07-11		-
Generating Images with Multimodal Language Models (GILL)	NeurIPS 2023	2023-05-26
Any-to-Any Generation via Composable Diffusion (CoDi-1)	NeurIPS 2023	2023-05-19
Grounding Language Models to Images for Multimodal Inputs and Outputs (FROMAGe)	ICML 2023	2023-01-31

Multimodal Embedding/Retrieval

Image Understanding Benchmark

Title	Venue	Date	Code	Supplement
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning	arXiv	2024-06-18	-
LOVA3: Learning to Visual Question Answering, Asking and Assessment	arXiv	2024-05-23		-
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI	arXiv	2024-04-24
BLINK: Multimodal Large Language Models Can See but Not Perceive	arXiv	2024-04-18
Ferret: Refer and Ground Anything Anywhere at Any Granularity (Ferret-Bench)	ICLR 2024	2023-10-11		-
Aligning Large Multimodal Models with Factually Augmented RLHF (LLaVA-RLHF, MMHal-Bench (hallucination))	arXiv	2023-09-25
Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations (AffectVisDial)	ECCV 2024	2023-08-30
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension	CVPR 2024	2023-07-30		-

Video Understanding Benchmark

Title	Venue	Date
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models	arXiv	2024-10-30
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark (AuroraCap, VDC)	arXiv	2024-10-24
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models	arXiv	2024-10-14
E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding	NeurIPS 2024	2024-09-26
Tarsier: Recipes for Training and Evaluating Large Video Description Models (Tarsier, Dream1k) (ByteDance)	arXiv	2024-07-30
VideoVista: A Versatile Benchmark for Video Understanding and Reasoning (VideoVista)	arXiv	2024-06-17
VELOCITI: Can Video-Language Models Bind Semantic Concepts through Time?	arXiv	2024-06-16
MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding	arXiv	2024-06-06
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis (Video-MME)	arXiv	2024-05-31
TempCompass: Do Video LLMs Really Understand Videos?	arXiv	2024-03-01
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark (MVBench, VideoChat2)	CVPR 2024 Highlight	2023-11-28
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding	NeurIPS 2023	2023-08-17
Perception Test: A Diagnostic Benchmark for Multimodal Video Models (Perception Test, by Google DeepMind)	NeurIPS 2023	2023-05-23

Audio

Title	Venue	Date	Code	Supplement
SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities	EMNLP 2023 (Findings)	2023-05-18

Multimodal Dialogue

Title	Venue	Date	Code	Supplement
DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation	arXiv	2024-03-13		-
STICKERCONV: Generating Multimodal Empathetic Responses from Scratch	ACL 2024 Main	2024-01-20
VDialogUE: A Unified Evaluation Benchmark for Visually-grounded Dialogue	arXiv	2023-09-14	-	-
PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts	ACL 2023	2023-05-24
TikTalk: A Multi-Modal Dialogue Dataset for Real-World Chitchat	ACM MM 2023	2023-01-14		Dataset
MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation	ACL 2023	2022-11-10		Dataset
Multimodal Dialogue Response Generation (Divter)	ACL 2022	2021-10-16	-	-
Maria: A Visual Experience Powered Conversational Agent	ACL 2021	2021-05-27		-
Multi-Modal Open-Domain Dialogue	EMNLP 2021	2020-10-02	-	-
Open Domain Dialogue Generation with Latent Images	AAAI 2021	2020-04-04	-	-
Learning to Respond with Stickers: A Framework of Unifying Multi-Modality in Multi-Turn Dialog	WWW 2020	2020-03-10		-

Multimodal Learning

Title	Venue	Date	Code	Supplement
Video as the New Language for Real-World Decision Making	arXiv	2024-02-27	-	-
Tokenize Anything via Prompting	arXiv	2023-12-14		-
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment	ICLR 2024	2023-10-03		-
ImageBind: One Embedding Space To Bind Them All	CVPR 2023	2023-05-09
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks	CVPR 2023	2022-11-17		-
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks (BEiT-3)	CVPR 2023	2022-08-22		-
BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers	arXiv	2022-08-12		-
BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning	AAAI 2023	2022-06-17
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework	ICML 2022	2022-02-07		-
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation	ICML 2022	2022-01-28		-
Uni-Perceiver: Pre-Training Unified Architecture for Generic Perception for Zero-Shot and Few-Shot Tasks	CVPR 2022	2021-12-02		-
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation (ALBEF)	NeurIPS 2021	2021-07-16
BEiT: BERT Pre-Training of Image Transformers	ICLR 2022	2021-06-15		-
Learning Transferable Visual Models From Natural Language Supervision	ICML 2021	2021-02-26
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision	ICML 2021	2021-02-05		-

Image Generation

Title	Venue	Date	Code	Supplement
✨ OmniGen: Unified Image Generation	arXiv	2024-09-17
✨ Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens (by Kaiming He, DeepMind, MIT)	arXiv	2024-10-17	-	-
Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers (Lumina-T2X, Flag-DiT) (Text2Any)	arXiv	2024-05-09
FreeU: Free Lunch in Diffusion U-Net (FreeU, by Ziwei Liu)	CVPR 2024 Oral	2023-09-20
Lazy Diffusion Transformer for Interactive Image Editing	arXiv	2024-04-18	-
Salient Object-Aware Background Generation using Text-Guided Diffusion Models	CVPR 2024 Workshop	2024-04-15		-
HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing	arXiv	2024-04-15
UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark (UNIAA-LLaVA, UNIAA-Bench)	arXiv	2024-04-15	-	-
PMG: Personalized Multimodal Generation with Large Language Models	WWW 2024	2024-04-07	-	-
Identity Decoupling for Multi-Subject Personalization of Text-to-Image Models	arXiv	2024-04-05
Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models	CVPR 2024	2024-04-05	-	-
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction (VAR)	arXiv	2024-04-03
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation (HuaWei, Enze Xie)	arXiv	2024-03-07
Multi-LoRA Composition for Image Generation	arXiv	2024-02-26
PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models (HuaWei, Enze Xie)	arXiv	2024-01-10
Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model	AAAI 2024	2023-12-19		-
SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models (Tencent Xintao Wang)	arXiv	2023-12-11
InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following	arXiv	2023-12-11
Emu Edit: Precise Image Editing via Recognition and Generation Tasks	arXiv	2023-11-16	-
BeautifulPrompt: Towards Automatic Prompt Engineering for Text-to-Image Synthesis	EMNLP 2023	2023-11-12
AnyText: Multilingual Visual Text Generation And Editing	ICLR 2024	2023-11-06		-
EasyGen: Easing Multimodal Generation with a Bidirectional Conditional Diffusion Model and LLMs	arXiv	2023-10-13		-
Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models	arXiv	2023-10-11
PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis (HuaWei, Enze Xie)	ICLR 2024 Spotlight	2023-09-30
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models	arXiv	2023-08-13
Kosmos-G: Generating Images in Context with Multimodal Large Language Models	arXiv	2023-10-04
Improving Image Generation with Better Captions (DALL-E 3)	OpenAI	2023	-	-
Scaling up GANs for Text-to-Image Synthesis (GigaGAN)	CVPR 2023	2023-05-09
Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet)	ICCV 2023	2023-02-10		-
Scalable Diffusion Models with Transformers (DiT)	ICCV 2023	2022-12-19
InstructPix2Pix: Learning to Follow Image Editing Instructions	CVPR 2023	2022-11-17
All are Worth Words: A ViT Backbone for Diffusion Models (U-ViT, first Diffsuion Transformer) (RUC, Chongxuan Li)	CVPR 2023	2022-09-25		-
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation	CVPR 2023	2022-08-25
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen)	NeurIPS 2022	2022-05-23
Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2)	OpenAI	2022-04-13		-
High-Resolution Image Synthesis with Latent Diffusion Models (LDM, Stable Diffusion)	CVPR 2022	2021-12-20		-
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models	ICML 2022	2021-12-20		-
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion	ECCV 2022	2021-11-24		-
SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations	ICLR 2022	2021-08-02
CogView: Mastering Text-to-Image Generation via Transformers	NeurIPS 2021	2021-05-26		-
Zero-Shot Text-to-Image Generation (DALL-E 1)	ICML 2021	2021-02-24
Taming Transformers for High-Resolution Image Synthesis (VQ-GAN)	CVPR 2021	2020-12-17

Video Generation

Title	Venue	Date	Code	Supplement
LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment	arXiv	2024-12-06
MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions (Mira)	arXiv	2024-07-08		-
VIMI: Grounding Video Generation through Multi-modal Instruction	arXiv	2024-07-08
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation	arXiv	2024-07-02	-	-
Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers (Lumina-T2X, Flag-DiT) (Text2Any)		2024-05-09
StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text (Long Video Generation)	arXiv	2024-03-21
AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks	arXiv	2024-03-21
FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation (FRESCO) (NTU, Ziwei Liu)	CVPR 2024	2024-03-19
Latte: Latent Diffusion Transformer for Video Generation (Latte) (NTU, Ziwei Liu)	arXiv	2024-01-05
FreeInit: Bridging Initialization Gap in Video Diffusion Models (FreeInit) (NTU, Ziwei Liu)	arXiv	2023-12-12
VideoBooth: Diffusion-based Video Generation with Image Prompts (VideoBooth) (NTU, Ziwei Liu)	arXiv	2023-12-01
VBench: Comprehensive Benchmark Suite for Video Generative Models [Benchmark] (VBench) (NTU, Ziwei Liu)	CVPR 2024	2023-11-29
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets (SVD)	arXiv	2023-11-25
SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction (NTU, Ziwei Liu)	ICLR 2024	2023-10-31
FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling (FreeNoise) (NTU, Ziwei Liu)	ICLR 2024	2023-10-23
LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models (LaVie) (NTU, Ziwei Liu)		2023-09-26

Multimodal Dataset

Title	Venue	Date	Annotation	Source
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data	arXiv	2024-10-24	-
LVD-2M: A Long-take Video Dataset with Temporally Dense Captions	NeurIPS 2024	2024-10-14
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens	arXiv	2024-06-17
MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions (Mira)	arXiv	2024-07-08	Video Generation
GUIDE: A Guideline-Guided Dataset for Instructional Video Comprehension	IJCAI 2024	2024-06-26	-
CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation	arXiv	2024-06-15		-
TextSquare: Scaling up Text-Centric Visual Instruction Tuning	arXiv	2024-04-19	Visual Instruction Tuning	-
HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing	arXiv	2024-04-15	Instruction Image Editing
AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception (AesExpert, AesMMIT Dataset)	arXiv	2024-04-15	Aesthetic Multi-Modality Instruction Tuning
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers	CVPR 2024	2024-02-29	video-caption
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model	arXiv	2024-02-18	GPT4V-synthesized Data
STICKERCONV: Generating Multimodal Empathetic Responses from Scratch	ACL 2024 Main	2024-01-20	Multimodal Empathetic Dialogue
SVIT: Scaling up Visual Instruction Tuning	arXiv	2023-07-09	Instruction Tuning
Kosmos-2: Grounding Multimodal Large Language Models to the World (Kosmos-2, GrIT Dataset)	arXiv	2023-06-26	Grounded image-text pairs
M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning	arXiv	2023-06-07	Instruction Tuning
Visual Instruction Tuning (LLaVA)	NeurIPS 2023	2023-04-17	Instruction Tuning
Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text	NeurIPS D&B 2023	2023-04-14	Interleaved Image-Text
TikTalk: A Multi-Modal Dialogue Dataset for Real-World Chitchat	ACM MM 2023	2023-01-14	Multimodal Dialogue
MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation	ACL 2023	2022-11-10	Multimodal Dialogue
LAION-5B: An open large-scale dataset for training next generation image-text models	NeurIPS 2022	2022-10-16	Image-Text Pairs
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs	NeurIPS Workshop 2021	2021-11-03	Image-Text Pairs
MMConv: An Environment for Multimodal Conversational Search across Multiple Domains	ACM SIGIR 2021	2021-07	Multimodal Dialogue
PhotoChat: A Human-Human Dialogue Dataset With Photo Sharing Behavior For Joint Image-Text Modeling	ACL 2021	2021-07-06	Open-domain Multimodal Dialogue
Image-Chat: Engaging Grounded Conversations	ACL 2020	2018-11-02	Multimodal Dialogue

Multimodal Summary

Title	Venue	Date	Latest Update
From Specific-MLLM to Omni-MLLM: A Survey about the MLLMs alligned with Multi-Modality (HIT, Peng Cheng Lab)	arXiv	2024-12-16	-
From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding (TikTok)	arXiv	2024-09-27	-
Video Diffusion Models: A Survey	arXiv	2024-05-06	-
Theoretical research on generative diffusion models: an overview	arXiv	2024-04-13	-
A Review of Multi-Modal Large Language and Vision Models	arXiv	2024-03-28	-
The (R)Evolution of Multimodal Large Language Models: A Survey	arXiv	2024-02-19	-
MM-LLMs: Recent Advances in MultiModal Large Language Models	arXiv	2024-01-24	2024-02-20
Multimodal Large Language Models: A Survey	IEEE BigData 2023	2023-11-22	-
Multimodal Foundation Models: From Specialists to General-Purpose Assistants	CVPR 2023	2023-09-18	-
Understanding Deep Learning	-	2023	-
Large Multimodal Models: Notes on CVPR 2023 Tutorial	CVPR 2023	2023-06-26	-
A Survey on Multimodal Large Language Models	arXiv	2023-06-23	2024-04-01
Multimodal Deep Learning	arXiv	2023-01-12	-
Diffusion Models: A Comprehensive Survey of Methods and Applications	ACM Computing Surveys	2022-09-02	2024-02-06
Multimodal Learning with Transformers: A Survey	IEEE TPAMI 2023	2022-01-13	2023-05-10
Multimodal Machine Learning: A Survey and Taxonomy	IEEE PAMI 2019	2017-05-26	2017-08-01

Paper Notes

here

Name		Name	Last commit message	Last commit date
Latest commit History 62 Commits
notes		notes
LICENSE		LICENSE
Notes_zh.md		Notes_zh.md
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Awesome-Multimodal-Papers

Multimodal Papers

Visual Understanding

Omni Understanding

Unified Understanding and Generation

Multimodal Embedding/Retrieval

Image Understanding Benchmark

Video Understanding Benchmark

Audio

Multimodal Dialogue

Multimodal Learning

Image Generation

Video Generation

Multimodal Dataset

Multimodal Summary

Paper Notes

About

Releases

Packages

Languages

License

friedrichor/Awesome-Multimodal-Papers

Folders and files

Latest commit

History

Repository files navigation

Awesome-Multimodal-Papers

Multimodal Papers

Visual Understanding

Omni Understanding

Unified Understanding and Generation

Multimodal Embedding/Retrieval

Image Understanding Benchmark

Video Understanding Benchmark

Audio

Multimodal Dialogue

Multimodal Learning

Image Generation

Video Generation

Multimodal Dataset

Multimodal Summary

Paper Notes

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages