Colorful Multimodal Research

Welcome to our curated anthology of multimodal research, spanning Vision, Audio, Agent, Robotics, Fundamental Sciences, and Ominous (a catch-all for any other modality you like). The collection focuses primarily on advances driven by large language models (LLMs), and each section also links to related collections.

👀 Vision

🖼 Image

Collection of works about Image + LLMs, Diffusion, see Image for details

Image Understanding

Reading List

Datasets & Benchmarks

Image Generation

Reading List

Open-source Projects

Related Collections (Understanding)

VLM_survey , This is the repository of "Vision Language Models for Vision Tasks: a Survey", a systematic survey of VLM studies in various visual recognition tasks including image classification, object detection, semantic segmentation, etc.
Awesome-Multimodal-Large-Language-Models , A curated list of Multimodal Large Language Models (MLLMs), including datasets, multimodal instruction tuning, multimodal in-context learning, multimodal chain-of-thought, llm-aided visual reasoning, foundation models, and others. This list will be updated in real time.
LLM-in-Vision , Recent LLM (Large Language Models)-based CV and multi-modal works
Awesome-Transformer-Attention , This repo contains a comprehensive paper list of Vision Transformer & Attention, including papers, codes, and related websites
Multimodal-AND-Large-Language-Models , Paper list about multimodal and large language models, only used to record papers I read in the daily arxiv for personal needs.
Efficient_Foundation_Model_Survey , This repo contains the paper list and figures for A Survey of Resource-efficient LLM and Multimodal Foundation Models.
CVinW_Readings , A collection of papers on the topic of Computer Vision in the Wild (CVinW)
Awesome-Vision-and-Language , A curated list of awesome vision and language resources
Awesome-Multimodal-Research , This repo is reorganized from Awesome-Multimodal-ML
Awesome-Multimodal-ML , Reading list for research topics in multimodal machine learning
Awesome-Referring-Image-Segmentation , A collection of referring image (video, 3D) segmentation papers and datasets.
Awesome-Prompting-on-Vision-Language-Model , This repo lists relevant papers summarized in our survey paper: A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models.
Mamba-in-CV , A paper list of some recent Mamba-based CV works. If you find some ignored papers, please open issues or pull requests.
Efficient-Multimodal-LLMs-Survey , Efficient Multimodal Large Language Models: A Survey

Related Collections (Evaluation)

Awesome-MLLM-Hallucination , A curated list of resources dedicated to hallucination of multimodal large language models (MLLM)
awesome-Large-MultiModal-Hallucination

Related Collections (Generation)

Awesome-VQVAE , A collection of resources and papers on Vector Quantized Variational Autoencoder (VQ-VAE) and its application
Awesome-Diffusion-Models , This repository contains a collection of resources and papers on Diffusion Models
Awesome-Controllable-Diffusion , Collection of papers and resources on Controllable Generation using Diffusion Models, including ControlNet, DreamBooth, and others.
Awesome-LLMs-meet-Multimodal-Generation , A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio).

Tutorials

[CVPR2024 Tutorial] Recent Advances in Vision Foundation Models
- Large Multimodal Models: Towards Building General-Purpose Multimodal Assistant, Chunyuan Li
- Methods, Analysis & Insights from Multimodal LLM Pre-training, Zhe Gan
- LMMs with Fine-Grained Grounding Capabilities, Haotian Zhang
- A Close Look at Vision in Large Multimodal Models, Jianwei Yang
- Multimodal Agents, Linjie Li
- Recent Advances in Image Generative Foundation Models, Zhengyuan Yang
- Video and 3D Generation, Kevin Lin
[CVPR2023 Tutorial] Recent Advances in Vision Foundation Models
- Opening Remarks & Visual and Vision-Language Pre-training, Zhe Gan
- From Representation to Interface: The Evolution of Foundation for Vision Understanding, Jianwei Yang
- Alignments in Text-to-Image Generation, Zhengyuan Yang
- Large Multimodal Models, Chunyuan Li
- Multimodal Agents: Chaining Multimodal Experts with LLMs, Linjie Li
[CVPR2022 Tutorial] Recent Advances in Vision-and-Language Pre-training
[CVPR2021 Tutorial] From VQA to VLN: Recent Advances in Vision-and-Language Research
[CVPR2020 Tutorial] Recent Advances in Vision-and-Language Research

📺 Video

Collection of works about Video-Language Pretraining, Video + LLMs, see Video for details

Video Understanding

Reading List

Pretraining Tasks

Datasets

Pretraining Corpora

Video Instructions

Benchmarks

Common Downstream Tasks

Advanced Downstream Tasks

Task-Specific Benchmarks

Multifaceted Benchmarks

Metrics

Projects & Tools

Video Generation

Reading List

Metrics

Projects

Related Collections (datasets)

Awesome-Video-Datasets

Related Collections (understanding)

Awesome-LLMs-for-Video-Understanding , Latest Papers, Codes and Datasets on Vid-LLMs.
Awesome Long-Term Video Understanding , Awesome papers & datasets specifically focused on long-term videos.

Related Collections (generation)

i2vgen-xl , VGen is an open-source video synthesis codebase developed by the Tongyi Lab of Alibaba Group, featuring state-of-the-art video generative models.

📷 3D

Collection of works about 3D+LLM, see 3D for details

Reading List

Related Collections

awesome-3D-gaussian-splatting , A curated list of papers and open-source resources focused on 3D Gaussian Splatting, intended to keep pace with the anticipated surge of research in the coming months
Awesome-LLM-3D , a curated list of Multi-modal Large Language Model in 3D world Resources
Awesome-3D-Vision-and-Language , A curated list of research papers in 3D visual grounding
awesome-scene-understanding , A list of awesome scene understanding papers.

📰 Document

Related Collections

Awesome Document Understanding , A curated list of resources for Document Understanding (DU), a topic related to Intelligent Document Processing (IDP), which in turn relates to Robotic Process Automation (RPA) from unstructured data, especially from Visually Rich Documents (VRDs).

Vision Encoder

Collection of popular existing vision encoders, see Vision Encoder for details

Image Encoder

Video Encoder

Audio Encoder

👂 Audio

Collection of works about audio+LLM, see Audio for details

Reading List

Related Collections

awesome-large-audio-models , Collection of resources on the applications of Large Language Models (LLMs) in Audio AI.
speech-trident , Awesome speech/audio LLMs, representation learning, and codec models
Audio-AI-Timeline , Here we will keep track of the latest AI models for waveform based audio generation, starting in 2023!

🔧 Agent

Collection of works about agent learning, see Agent for details

Reading List

Datasets & Benchmarks

Projects

Applications

Related Collections

LLM-Agent-Paper-Digest , To benefit the research community and promote the LLM-powered agent direction, we organize papers on LLM-powered agents recently published at top conferences
LLMAgentPapers , Must-read Papers on Large Language Model Agents.
LLM-Agent-Paper-List , In this repository, we provide a systematic and comprehensive survey on LLM-based agents, and list some must-read papers.
XLang Paper Reading , Paper collection on building and evaluating language model agents via executable language grounding
Awesome-LLMOps , An awesome & curated list of best LLMOps tools for developers
Awesome LLM-Powered Agent , Awesome things about LLM-powered agents. Papers / Repos / Blogs / ...
Awesome LMs with Tools , Language models (LMs) are powerful yet mostly for text-generation tasks. Tools have substantially enhanced their performance for tasks that require complex skills.
ToolLearningPapers , Must-read papers on tool learning with foundation models
Awesome-ALM , This repo collects research papers about leveraging the capabilities of language models, which can be a good reference for building upper-layer applications
LLM-powered Autonomous Agents , Lil'Log, Overview: planning, memory, tool use
World Model Papers , Paper collections of the continuous effort starting from World Models

🤖 Robotic

Collection of works about robotics+LLM, see Robotic for details

Reading List

Related Collections (Robotics)

Awesome-Robotics-Foundation-Models , This is the partner repository for the survey paper "Foundation Models in Robotics: Applications, Challenges, and the Future". The authors hope this repository can act as a quick reference for roboticists who wish to read the relevant papers and implement the associated methods.
Awesome-LLM-Robotics , This repo contains a curated list of papers using Large Language/Multi-Modal Models for Robotics/RL
Simulately , A website where we gather useful information about physics simulators for cutting-edge robot learning research. It is still under active development, so stay tuned!
Awesome-Temporal-Action-Detection-Temporal-Action-Proposal-Generation , Temporal Action Detection & Weakly Supervised & Semi Supervised Temporal Action Detection & Temporal Action Proposal Generation & Open-Vocabulary Temporal Action Detection.
Awesome-TimeSeries-SpatioTemporal-LM-LLM , A professionally curated list of Large (Language) Models and Foundation Models (LLM, LM, FM) for Temporal Data (Time Series, Spatio-temporal, and Event Data) with awesome resources (paper, code, data, etc.), which aims to comprehensively and systematically summarize the recent advances to the best of our knowledge.
PromptCraft-Robotics , The PromptCraft-Robotics repository serves as a community for people to test and share interesting prompting examples for large language models (LLMs) within the robotics domain
Awesome-Robotics , A curated list of awesome links and software libraries that are useful for robots

Related Collections (embodied)

Embodied_AI_Paper_List , Awesome Paper list for Embodied AI and its related projects and applications
Awesome-Embodied-AI , A curated list of awesome papers on Embodied AI and related research/industry-driven resources
awesome-embodied-vision , Reading list for research topics in embodied vision

Related Collections (autonomous driving)

Awesome-LLM4AD , A curated list of awesome LLM for Autonomous Driving resources (continually updated)

🔬 Science

♾️ AI for Math

Collection of works about Mathematics + LLMs, see AI4Math for details

Reading List

Related Collections

Awesome-Scientific-Language-Models , A curated list of pre-trained language models in scientific domains (e.g., mathematics, physics, chemistry, biology, medicine, materials science, and geoscience), covering different model sizes (from <100M to 70B parameters) and modalities (e.g., language, vision, molecule, protein, graph, and table)

🌏 Ominous

Collection of works about LLM + ominous modality, see Ominous for details

Related Collections

Reading List

Dataset

Benchmark

Awesome-Unified-Multimodal-Models , This is a repository for organizing papers, codes and other resources related to unified multimodal models.

Contributing

Feel free to open a pull request or email me at flagwyx@gmail.com

