Welcome to our curated collection of multimodal research, covering Vision, Audio, Agent, Robotics, Fundamental Sciences, and Ominous (any other modality you want). The collection focuses primarily on advances driven by large language models (LLMs), complemented by a set of related collections.
Collection of works about Image + LLMs and Diffusion; see Image for details
- Image Understanding
- Reading List
- Datasets & Benchmarks
- Image Generation
- Reading List
- Open-source Projects
Related Collections (Understanding)
- VLM_survey , This is the repository of "Vision Language Models for Vision Tasks: a Survey", a systematic survey of VLM studies in various visual recognition tasks including image classification, object detection, semantic segmentation, etc.
- Awesome-Multimodal-Large-Language-Models , A curated list of Multimodal Large Language Models (MLLMs), including datasets, multimodal instruction tuning, multimodal in-context learning, multimodal chain-of-thought, LLM-aided visual reasoning, foundation models, and others. This list will be updated in real time.
- LLM-in-Vision , Recent LLM (Large Language Models)-based CV and multi-modal works
- Awesome-Transformer-Attention , This repo contains a comprehensive paper list of Vision Transformer & Attention, including papers, codes, and related websites
- Multimodal-AND-Large-Language-Models , Paper list about multimodal and large language models, only used to record papers I read in the daily arxiv for personal needs.
- Efficient_Foundation_Model_Survey , This repo contains the paper list and figures for A Survey of Resource-efficient LLM and Multimodal Foundation Models.
- CVinW_Readings , A collection of papers on the topic of Computer Vision in the Wild (CVinW)
- Awesome-Vision-and-Language , A curated list of awesome vision and language resources
- Awesome-Multimodal-Research , This repo is reorganized from Awesome-Multimodal-ML
- Awesome-Multimodal-ML , Reading list for research topics in multimodal machine learning
- Awesome-Referring-Image-Segmentation , A collection of referring image (video, 3D) segmentation papers and datasets.
- Awesome-Prompting-on-Vision-Language-Model , This repo lists relevant papers summarized in our survey paper: A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models.
- Mamba-in-CV , A paper list of recent Mamba-based CV works. If you find any missing papers, please open an issue or a pull request.
- Efficient-Multimodal-LLMs-Survey , Efficient Multimodal Large Language Models: A Survey
Related Collections (Evaluation)
- Awesome-MLLM-Hallucination , A curated list of resources dedicated to hallucination of multimodal large language models (MLLM)
- awesome-Large-MultiModal-Hallucination
Related Collections (Generation)
- Awesome-VQVAE , A collection of resources and papers on Vector Quantized Variational Autoencoder (VQ-VAE) and its application
- Awesome-Diffusion-Models , This repository contains a collection of resources and papers on Diffusion Models
- Awesome-Controllable-Diffusion , Collection of papers and resources on Controllable Generation using Diffusion Models, including ControlNet, DreamBooth, and others.
- Awesome-LLMs-meet-Multimodal-Generation , A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio).
Tutorials
- [CVPR2024 Tutorial] Recent Advances in Vision Foundation Models
- Large Multimodal Models: Towards Building General-Purpose Multimodal Assistant, Chunyuan Li
- Methods, Analysis & Insights from Multimodal LLM Pre-training, Zhe Gan
- LMMs with Fine-Grained Grounding Capabilities, Haotian Zhang
- A Close Look at Vision in Large Multimodal Models, Jianwei Yang
- Multimodal Agents, Linjie Li
- Recent Advances in Image Generative Foundation Models, Zhengyuan Yang
- Video and 3D Generation, Kevin Lin
- [CVPR2023 Tutorial] Recent Advances in Vision Foundation Models
- Opening Remarks & Visual and Vision-Language Pre-training, Zhe Gan
- From Representation to Interface: The Evolution of Foundation for Vision Understanding, Jianwei Yang
- Alignments in Text-to-Image Generation, Zhengyuan Yang
- Large Multimodal Models, Chunyuan Li
- Multimodal Agents: Chaining Multimodal Experts with LLMs, Linjie Li
- [CVPR2022 Tutorial] Recent Advances in Vision-and-Language Pre-training
- [CVPR2021 Tutorial] From VQA to VLN: Recent Advances in Vision-and-Language Research
- [CVPR2020 Tutorial] Recent Advances in Vision-and-Language Research
Collection of works about Video-Language Pretraining, Video + LLMs, see Video for details
- Video Understanding
- Reading List
- Pretraining Tasks
- Datasets
- Pretraining Corpora
- Video Instructions
- Benchmarks
- Common Downstream Tasks
- Advanced Downstream Tasks
- Task-Specific Benchmarks
- Multifaceted Benchmarks
- Metrics
- Projects & Tools
- Video Generation
- Reading List
- Metrics
- Projects
Related Collections (datasets)
Related Collections (understanding)
- Awesome-LLMs-for-Video-Understanding , Latest Papers, Codes and Datasets on Vid-LLMs.
- Awesome Long-Term Video Understanding, Awesome papers & datasets specifically focused on long-term videos.
Related Collections (generation)
- i2vgen-xl , VGen is an open-source video synthesis codebase developed by the Tongyi Lab of Alibaba Group, featuring state-of-the-art video generative models.
Collection of works about 3D+LLM, see 3D for details
- Reading List
Related Collections
- awesome-3D-gaussian-splatting , A curated list of papers and open-source resources focused on 3D Gaussian Splatting, intended to keep pace with the anticipated surge of research in the coming months
- Awesome-LLM-3D , A curated list of resources on Multi-modal Large Language Models in the 3D world
- Awesome-3D-Vision-and-Language , A curated list of research papers in 3D visual grounding
- awesome-scene-understanding , A list of awesome scene understanding papers.
Related Collections
- Awesome Document Understanding , A curated list of resources for Document Understanding (DU), a topic related to Intelligent Document Processing (IDP), which in turn relates to Robotic Process Automation (RPA) from unstructured data, especially from Visually Rich Documents (VRDs).
Collection of popular existing vision encoders, see Vision Encoder for details
- Image Encoder
- Video Encoder
- Audio Encoder
Collection of works about audio+LLM, see Audio for details
- Reading List
Related Collections
- awesome-large-audio-models , Collection of resources on the applications of Large Language Models (LLMs) in Audio AI.
- speech-trident , Awesome speech/audio LLMs, representation learning, and codec models
- Audio-AI-Timeline , Here we will keep track of the latest AI models for waveform based audio generation, starting in 2023!
Collection of works about agent learning, see Agent for details
- Reading List
- Datasets & Benchmarks
- Projects
- Applications
Related Collections
- LLM-Agent-Paper-Digest , To benefit the research community and promote the LLM-powered agent direction, we organize papers on LLM-powered agents recently published at top conferences
- LLMAgentPapers , Must-read Papers on Large Language Model Agents.
- LLM-Agent-Paper-List , In this repository, we provide a systematic and comprehensive survey on LLM-based agents, and list some must-read papers.
- XLang Paper Reading , Paper collection on building and evaluating language model agents via executable language grounding
- Awesome-LLMOps , An awesome & curated list of best LLMOps tools for developers
- Awesome LLM-Powered Agent , Awesome things about LLM-powered agents. Papers / Repos / Blogs / ...
- Awesome LMs with Tools , Language models (LMs) are powerful yet mostly for text-generation tasks. Tools have substantially enhanced their performance for tasks that require complex skills.
- ToolLearningPapers , Must-read papers on tool learning with foundation models
- Awesome-ALM , This repo collects research papers about leveraging the capabilities of language models, which can be a good reference for building upper-layer applications
- LLM-powered Autonomous Agents, Lil'Log, Overview: planning, memory, tool use
- World Model Papers , Paper collection tracking the continuing line of work that started from World Models
Collection of works about robotics+LLM, see Robotic for details
- Reading List
Related Collections (Robotics)
- Awesome-Robotics-Foundation-Models , This is the partner repository for the survey paper "Foundation Models in Robotics: Applications, Challenges, and the Future". The authors hope this repository can act as a quick reference for roboticists who wish to read the relevant papers and implement the associated methods.
- Awesome-LLM-Robotics , This repo contains a curated list of papers using Large Language/Multi-Modal Models for Robotics/RL
- Simulately , A website where we gather useful information about physics simulators for cutting-edge robot learning research. It is still under active development, so stay tuned!
- Awesome-Temporal-Action-Detection-Temporal-Action-Proposal-Generation , Temporal Action Detection & Weakly Supervised & Semi Supervised Temporal Action Detection & Temporal Action Proposal Generation & Open-Vocabulary Temporal Action Detection.
- Awesome-TimeSeries-SpatioTemporal-LM-LLM , A professionally curated list of Large (Language) Models and Foundation Models (LLM, LM, FM) for Temporal Data (Time Series, Spatio-temporal, and Event Data) with awesome resources (paper, code, data, etc.), which aims to comprehensively and systematically summarize the recent advances to the best of our knowledge.
- PromptCraft-Robotics , The PromptCraft-Robotics repository serves as a community for people to test and share interesting prompting examples for large language models (LLMs) within the robotics domain
- Awesome-Robotics , A curated list of awesome links and software libraries that are useful for robots
Related Collections (embodied)
- Embodied_AI_Paper_List , Awesome Paper list for Embodied AI and its related projects and applications
- Awesome-Embodied-AI , A curated list of awesome papers on Embodied AI and related research/industry-driven resources
- awesome-embodied-vision , Reading list for research topics in embodied vision
Related Collections (autonomous driving)
- Awesome-LLM4AD , A curated list of awesome LLM for Autonomous Driving resources (continually updated)
Collection of works about Mathematics + LLMs, see AI4Math for details
- Reading List
Related Collections
- Awesome-Scientific-Language-Models , A curated list of pre-trained language models in scientific domains (e.g., mathematics, physics, chemistry, biology, medicine, materials science, and geoscience), covering different model sizes (from <100M to 70B parameters) and modalities (e.g., language, vision, molecule, protein, graph, and table)
Collection of works about LLM + ominous modality, see Ominous for details
- Reading List
- Dataset
- Benchmark
Related Collections
- Awesome-Unified-Multimodal-Models , This is a repository for organizing papers, codes and other resources related to unified multimodal models.
Feel free to open a pull request or drop me an email: [email protected]