Topics: Text-to-Seq-Image, Text-to-Video.
This project is curated and maintained by Rui Sun and Yumin Zhang.
- LivePhoto: Real Image Animation with Text-guided Motion Control
Team: HKU, Alibaba Group, Ant Group.
Xi Chen, Zhiheng Liu, Mengting Chen, et al., Hengshuang Zhao
arXiv, 2023.12 [Paper], [PDF], [Code], [Demo (Video)], [Home Page] - Scalable Diffusion Models with Transformers
Sequential Images
Team: UC Berkeley, NYU.
William Peebles, Saining Xie
ICCV'23 (Oral), arXiv, 2022.12 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
- Video generation models as world simulators
Team: OpenAI (Sora).
Tim Brooks, Bill Peebles, Connor Holmes, et al., Aditya Ramesh
Blog post, 2024.02 [Paper], [Home Page] - ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation
Team: University of Waterloo.
Weiming Ren, Harry Yang, Ge Zhang, et al., Wenhu Chen
arXiv, 2024.02 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] - World Model on Million-Length Video And Language With RingAttention
Long Video
Team: UC Berkeley.
Hao Liu, Wilson Yan, Matei Zaharia, Pieter Abbeel
arXiv, 2024.02 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] - 360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model
Team: Peking University.
Qian Wang, Weiqi Li, Chong Mou, et al., Jian Zhang
arXiv, 2024.01 [Paper], [PDF], [Code], [Home Page] - MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation
Team: Bytedance Inc.
Weimin Wang, Jiawei Liu, Zhijie Lin, et al., Jiashi Feng
arXiv, 2024.01 [Paper], [PDF], [Home Page] - UniVG: Towards UNIfied-modal Video Generation
Team: Baidu Inc.
Ludan Ruan, Lei Tian, Chuanwei Huang, et al., Xinyan Xiao
arXiv, 2024.01 [Paper], [PDF], [Home Page] - VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM
Team: HiDream.ai Inc.
Fuchen Long, Zhaofan Qiu, Ting Yao and Tao Mei
arXiv, 2024.01 [Paper], [PDF], [Home Page] - VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
Team: Tencent AI Lab.
Haoxin Chen, Yong Zhang, Xiaodong Cun, et al., Ying Shan
arXiv, 2024.01 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] - Lumiere: A Space-Time Diffusion Model for Video Generation
Team: Google Research, Weizmann Institute, Tel-Aviv University, Technion.
Omer Bar-Tal, Hila Chefer, Omer Tov, et al., Inbar Mosseri
arXiv, 2024.01 [Paper], [PDF], [Home Page] - DreamVideo: Composing Your Dream Videos with Customized Subject and Motion
Team: Fudan University, Alibaba Group, HUST, Zhejiang University.
Yujie Wei, Shiwei Zhang, Zhiwu Qing, et al., Hongming Shan
arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page] - VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation
Team: Peking University, Microsoft Research.
Wenjing Wang, Huan Yang, Zixi Tuo, et al., Jiaying Liu
arXiv, 2023.12 [Paper], [PDF] - TrailBlazer: Trajectory Control for Diffusion-Based Video Generation
Training-free
Team: Victoria University of Wellington, NVIDIA
Wan-Duo Kurt Ma, J.P. Lewis, W. Bastiaan Kleijn
arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page], [Demo(video)] - FreeInit: Bridging Initialization Gap in Video Diffusion Models
Training-free
Team: Nanyang Technological University
Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, Ziwei Liu
arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page], [Demo(live)], [Demo(video)] - MTVG: Multi-text Video Generation with Text-to-Video Models
Training-free
Team: Korea University, NVIDIA
Gyeongrok Oh, Jaehwan Jeong, Sieun Kim, et al., Sangpil Kim
arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page], [Demo(video)] - A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
Team: HUST, Alibaba Group, Zhejiang University, Ant Group
Xiang Wang, Shiwei Zhang, Hangjie Yuan, et al., Nong Sang
arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page] - InstructVideo: Instructing Video Diffusion Models with Human Feedback
Team: Zhejiang University, Alibaba Group, Tsinghua University
Hangjie Yuan, Shiwei Zhang, Xiang Wang, et al., Dong Ni
arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page] - VideoLCM: Video Latent Consistency Model
Team: HUST, Alibaba Group, SJTU
Xiang Wang, Shiwei Zhang, Han Zhang, et al., Nong Sang
arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page] - Photorealistic Video Generation with Diffusion Models
Team: Stanford University (Fei-Fei Li), Google.
Agrim Gupta, Lijun Yu, Kihyuk Sohn, et al., José Lezama
arXiv, 2023.12 [Paper], [PDF], [Home Page] - Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation
Team: HUST, Alibaba Group, Fudan University.
Zhiwu Qing, Shiwei Zhang, Jiayu Wang, et al., Nong Sang
arXiv, 2023.12 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] - GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation
Team: HKU, Meta.
Shoufa Chen, Mengmeng Xu, Jiawei Ren, et al., Juan-Manuel Perez-Rua
arXiv, 2023.12 [Paper], [PDF], [Home Page] - StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter
Team: Tsinghua University, Tencent AI Lab, CUHK.
Gongye Liu, Menghan Xia, Yong Zhang, et al., Ying Shan
arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page], [Demo(live)] - GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation
Multimodal
Team: Tencent.
Zhanyu Wang, Longyue Wang, Zhen Zhao, et al., Zhaopeng Tu
arXiv, 2023.11 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] - F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis
Training-free
Team: University of Electronic Science and Technology of China.
Sitong Su, Jianzhi Liu, Lianli Gao, Jingkuan Song
arXiv, 2023.11 [Paper], [PDF] - AdaDiff: Adaptive Step Selection for Fast Diffusion
Training-free
Team: Fudan University.
Hui Zhang, Zuxuan Wu, Zhen Xing, Jie Shao, Yu-Gang Jiang
arXiv, 2023.11 [Paper], [PDF] - FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax
Training-free
Team: University of Technology Sydney.
Yu Lu, Linchao Zhu, Hehe Fan, Yi Yang
arXiv, 2023.11 [Paper], [PDF], [Code(coming)], [Home Page] - GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning
Training-free
Team: Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences.
Jiaxi Lv, Yi Huang, Mingfu Yan, Jiancheng Huang, et al., Shifeng Chen
arXiv, 2023.11 [Paper], [PDF], [Code(coming)], [Home Page] - MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation
Team: University of Science and Technology of China, MSRA, Xi'an Jiaotong University.
Yanhui Wang, Jianmin Bao, Wenming Weng, et al., Baining Guo
arXiv, 2023.11 [Paper], [PDF], [Home Page], [Demo(video)] - FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation
Team: University of Science and Technology of China, MSRA, Xi'an Jiaotong University.
Yuanxin Liu, Lei Li, Shuhuai Ren, et al., Lu Hou
arXiv, 2023.11 [Paper], [PDF], [Code], [Dataset] - ART⋅V: Auto-Regressive Text-to-Video Generation with Diffusion Models
Team: University of Science and Technology of China, Microsoft.
Wenming Weng, Ruoyu Feng, Yanhui Wang, et al., Zhiwei Xiong
arXiv, 2023.11 [Paper], [PDF], [Code(coming)], [Home Page], [Demo(video)] - Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Team: Stability AI.
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, et al., Robin Rombach
arXiv, 2023.11 [Paper], [PDF], [Code] - FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline
Team: Sber AI.
Vladimir Arkhipkin, Zein Shaheen, Viacheslav Vasilev, et al., Denis Dimitrov
arXiv, 2023.11 [Paper], [PDF], [Code], [Home Page], [Demo(live)] - MoVideo: Motion-Aware Video Generation with Diffusion Models
Team: ETH, Meta.
Jingyun Liang, Yuchen Fan, Kai Zhang, et al., Rakesh Ranjan
arXiv, 2023.11 [Paper], [PDF], [Home Page] - Optimal Noise pursuit for Augmenting Text-to-Video Generation
Team: Zhejiang Lab.
Shijie Ma, Huayi Xu, Mengjian Li, et al., Yaxiong Wang
arXiv, 2023.11 [Paper], [PDF] - Make Pixels Dance: High-Dynamic Video Generation
Team: ByteDance.
Yan Zeng, Guoqiang Wei, Jiani Zheng, et al., Hang Li
arXiv, 2023.11 [Paper], [PDF], [Home Page], [Demo(video)] - Learning Universal Policies via Text-Guided Video Generation
Team: MIT, Google DeepMind, UC Berkeley.
Yilun Du, Mengjiao Yang, Bo Dai, et al., Pieter Abbeel
NeurIPS'23 (Spotlight), arXiv, 2023.11 [Paper], [PDF], [Code], [Home Page] - Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning
Team: Meta.
Rohit Girdhar, Mannat Singh, Andrew Brown, et al., Ishan Misra
arXiv, 2023.11 [Paper], [PDF], [Home Page], [Demo(live)] - FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling
Training-free
Team: Nanyang Technological University.
Haonan Qiu, Menghan Xia, Yong Zhang, et al., Ziwei Liu
ICLR'24 arXiv, 2023.10 [Paper], [PDF], [Code], [Home Page] - ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation
Training-free
Team: Shanghai Artificial Intelligence Laboratory.
Bo Peng, Xinyuan Chen, Yaohui Wang, Chaochao Lu, Yu Qiao
arXiv, 2023.10 [Paper], [PDF], [Code], [Home Page] - VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
Team: Tencent AI Lab.
Haoxin Chen, Menghan Xia, Yingqing He, et al., Ying Shan
arXiv, 2023.10 [Paper], [PDF], [Code], [Home Page] - SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction
Team: Shanghai Artificial Intelligence Laboratory.
Xinyuan Chen, Yaohui Wang, Lingjun Zhang, et al., Ziwei Liu
arXiv, 2023.10 [Paper], [PDF], [Code], [Home Page] - DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
Team: The Chinese University of Hong Kong.
Jinbo Xing, Menghan Xia, Yong Zhang, et al., Ying Shan
arXiv, 2023.10 [Paper], [PDF], [Code], [Pretrained Model], [Home Page], [Demo(live)], [Demo(video)] - LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation
Team: Nankai University, MEGVII Technology.
Ruiqi Wu, Liangyu Chen, Tong Yang, et al., Xiangyu Zhang
arXiv, 2023.10 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] - LLM-grounded Video Diffusion Models
Training-free
Team: UC Berkeley.
Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, Boyi Li
arXiv, 2023.09 [Paper], [PDF], [Code(coming)], [Home Page] - VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning
Team: UNC Chapel Hill.
Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal
arXiv, 2023.09 [Paper], [PDF], [Code] - VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation
Team: Baidu Inc.
Xin Li, Wenqing Chu, Ye Wu, et al., Jingdong Wang
arXiv, 2023.09 [Paper], [PDF], [Home Page] - LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models
Team: Shanghai Artificial Intelligence Laboratory.
Yaohui Wang, Xinyuan Chen, Xin Ma, et al., Ziwei Liu
arXiv, 2023.09 [Paper], [PDF], [Code], [Home Page] - Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation
Team: Huawei.
Jiaxi Gu, Shicong Wang, Haoyu Zhao, et al., Hang Xu
arXiv, 2023.09 [Paper], [PDF], [Code], [Home Page] - Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator
Training-free
Team: School of Information Science and Technology, ShanghaiTech University.
Hanzhuo Huang, Yufan Feng, Cheng Shi, et al., Sibei Yang
NeurIPS'24, arXiv, 2023.09 [Paper], [PDF], [Home Page] - Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
Team: Show Lab, National University of Singapore.
David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, et al., Mike Zheng Shou
arXiv, 2023.09 [Paper], [PDF], [Home Page], [Code], [Pretrained Model] - GLOBER: Coherent Non-autoregressive Video Generation via GLOBal Guided Video DecodER
Team: Institute of Automation, Chinese Academy of Sciences (CASIA).
Mingzhen Sun, Weining Wang, Zihan Qin, et al., Jing Liu
NeurIPS'23, arXiv, 2023.09 [Paper], [PDF], [Code], [Home Page], [Demo(video)] - DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis
Training-free
Team: East China Normal University.
Zhongjie Duan, Lizhou You, Chengyu Wang, et al., Jun Huang
arXiv, 2023.08 [Paper], [PDF], [Home Page] - SimDA: Simple Diffusion Adapter for Efficient Video Generation
Team: Fudan University, Microsoft.
Zhen Xing, Qi Dai, Han Hu, Zuxuan Wu, Yu-Gang Jiang
arXiv, 2023.08 [Paper], [PDF], [Code (Coming)], [Home Page] - Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models
Team: National University of Singapore.
Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Tat-Seng Chua
arXiv, 2023.08 [Paper], [PDF], [Code] - ModelScope Text-to-Video Technical Report
Team: Alibaba Group.
Jiuniu Wang, Hangjie Yuan, Dayou Chen, et al., Shiwei Zhang
arXiv, 2023.08 [Paper], [PDF], [Code], [Home Page], [Demo(live)] - Dual-Stream Diffusion Net for Text-to-Video Generation
Team: Nanjing University of Science and Technology.
Binhui Liu, Xin Liu, Anbo Dai, et al., Jian Yang
arXiv, 2023.08 [Paper], [PDF] - AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Team: The Chinese University of Hong Kong.
Yuwei Guo, Ceyuan Yang, Anyi Rao, et al., Bo Dai
ICLR'24 (spotlight), arXiv, 2023.07 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] - Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation
Team: HKUST.
Yingqing He, Menghan Xia, Haoxin Chen, et al., Qifeng Chen
arXiv, 2023.07 [Paper], [PDF], [Code], [Home Page], [Demo(video)] - Probabilistic Adaptation of Text-to-Video Models
Team: Google, UC Berkeley.
Mengjiao Yang, Yilun Du, Bo Dai, et al., Pieter Abbeel
arXiv, 2023.06 [Paper], [PDF], [Home Page] - ED-T2V: An Efficient Training Framework for Diffusion-based Text-to-Video Generation
Team: School of Artificial Intelligence, University of Chinese Academy of Sciences.
Jiawei Liu, Weining Wang, Wei Liu, Qian He, Jing Liu
IJCNN'23, 2023.06 [Paper], [PDF] - Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance
Team: CUHK.
Jinbo Xing, Menghan Xia, Yuxin Liu, et al., Tien-Tsin Wong
arXiv, 2023.06 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] - VideoComposer: Compositional Video Synthesis with Motion Controllability
Team: Alibaba Group.
Xiang Wang, Hangjie Yuan, Shiwei Zhang, et al., Jingren Zhou
NeurIPS'23, arXiv, 2023.06 [Paper], [PDF], [Code], [Pretrained Model], [Home Page] - VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation
Team: University of Chinese Academy of Sciences (UCAS), Alibaba Group.
Zhengxiong Luo, Dayou Chen, Yingya Zhang, et al., Tieniu Tan
CVPR'23, arXiv, 2023.06 [Paper], [PDF] - DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation
Training-free
Team: Korea University.
Susung Hong, Junyoung Seo, Heeseong Shin, Sunghwan Hong, Seungryong Kim
arXiv, 2023.05 [Paper], [PDF] - Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models
Team: Carnegie Mellon University.
Rohan Dhesikan, Vignesh Rajmohan
arXiv, 2023.05 [Paper], [PDF], [Code(coming)] - Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models
Team: University of Maryland.
Songwei Ge, Seungjun Nah, Guilin Liu, et al., Yogesh Balaji
ICCV'23, arXiv, 2023.05 [Paper], [PDF], [Home Page] - Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity
Team: NUS, CUHK.
Zijiao Chen, Jiaxin Qing, Juan Helen Zhou
NeurIPS'24, arXiv, 2023.05 [Paper], [PDF], [Code], [Home Page] - VideoPoet: A Large Language Model for Zero-Shot Video Generation
Team: Google Research
Dan Kondratyuk, Lijun Yu, Xiuye Gu, et al., Lu Jiang
arXiv, 2023.12 [Paper], [PDF], [Home Page], [Blog] - VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning
Team: Tsinghua University, Beijing Film Academy
Hong Chen, Xin Wang, Guanning Zeng, et al., Wenwu Zhu
arXiv, 2023.05 [Paper], [PDF], [Code], [Home Page] - Text2Performer: Text-Driven Human Video Generation
Team: Nanyang Technological University
Yuming Jiang, Shuai Yang, Tong Liang Koh, et al., Ziwei Liu
arXiv, 2023.04 [Paper], [PDF], [Code], [Home Page], [Demo(video)] - Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation
Team: University of Rochester, Meta.
Jie An, Songyang Zhang, Harry Yang, et al., Xi Yin
arXiv, 2023.04 [Paper], [PDF], [Home Page] - Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
Team: NVIDIA.
Andreas Blattmann, Robin Rombach, Huan Ling, et al., Karsten Kreis
CVPR'23, arXiv, 2023.04 [Paper], [PDF], [Home Page] - NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation
Team: University of Science and Technology of China, Microsoft.
Shengming Yin, Chenfei Wu, Huan Yang, et al., Nan Duan
arXiv, 2023.03 [Paper], [PDF], [Home Page] - Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
Team: Picsart AI Research (PAIR).
Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, et al., Humphrey Shi
arXiv, 2023.03 [Paper], [PDF], [Code], [Home Page], [Demo(live)], [Demo(video)] - Structure and Content-Guided Video Synthesis with Diffusion Models
Team: Runway
Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, Anastasis Germanidis
ICCV'23, arXiv, 2023.02 [Paper], [PDF], [Home Page] - SceneScape: Text-Driven Consistent Scene Generation
Team: Weizmann Institute of Science, NVIDIA Research
Rafail Fridman, Amit Abecasis, Yoni Kasten, Tali Dekel
NeurIPS'23, arXiv, 2023.02 [Paper], [PDF], [Code], [Home Page] - MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
Team: Renmin University of China, Peking University, Microsoft Research
Ludan Ruan, Yiyang Ma, Huan Yang, et al., Baining Guo
CVPR'23, arXiv, 2022.12 [Paper], [PDF], [Code] - Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
Team: Show Lab, National University of Singapore.
Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, et al., Mike Zheng Shou
ICCV'23, arXiv, 2022.12 [Paper], [PDF], [Code], [Pretrained Model] - MagicVideo: Efficient Video Generation With Latent Diffusion Models
Team: ByteDance Inc.
Daquan Zhou, Weimin Wang, Hanshu Yan, et al., Jiashi Feng
arXiv, 2022.11 [Paper], [PDF], [Home Page] - Latent Video Diffusion Models for High-Fidelity Long Video Generation
Long Video
Team: HKUST, Tencent AI Lab.
Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, Qifeng Chen
arXiv, 2022.10 [Paper], [PDF], [Code], [Home Page] - Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation
Team: UC Santa Barbara, Meta.
Tsu-Jui Fu, Licheng Yu, Ning Zhang, et al., Sean Bell
CVPR'23, arXiv, 2022.11 [Paper], [PDF] - Phenaki: Variable Length Video Generation From Open Domain Textual Description
Team: Google.
Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, et al., Dumitru Erhan
ICLR'23, arXiv, 2022.10 [Paper], [PDF], [Home Page] - Imagen Video: High Definition Video Generation with Diffusion Models
Team: Google.
Jonathan Ho, William Chan, Chitwan Saharia, et al., Tim Salimans
arXiv, 2022.10 [Paper], [PDF], [Home Page] - StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation
Story Visualization
Team: UNC Chapel Hill.
Adyasha Maharana, Darryl Hannan, Mohit Bansal
ECCV'22, arXiv, 2022.09 [Paper], [PDF], [Code], [Demo(live)] - Make-A-Video: Text-to-Video Generation without Text-Video Data
Team: Meta AI.
Uriel Singer, Adam Polyak, Thomas Hayes, et al., Yaniv Taigman
ICLR'23, arXiv, 2022.09 [Paper], [PDF], [Code] - MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model
Team: S-Lab, SenseTime.
Mingyuan Zhang, Zhongang Cai, Liang Pan, et al., Ziwei Liu
TPAMI'24, arXiv, 2022.08 [Paper], [PDF], [Code], [Home Page], [Demo] - Word-Level Fine-Grained Story Visualization
Story Visualization
Team: University of Oxford.
Bowen Li, Thomas Lukasiewicz
ECCV'22, arXiv, 2022.08 [Paper], [PDF], [Code], [Pretrained Model] - CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
Team: Tsinghua University.
Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, Jie Tang
ICLR'23, arXiv, 2022.05 [Paper], [PDF], [Code], [Home Page], [Demo(video)] - CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers
Team: Tsinghua University.
Ming Ding, Wendi Zheng, Wenyi Hong, Jie Tang
NeurIPS'22, arXiv, 2022.04 [Paper], [PDF], [Code], [Home Page] - Long video generation with time-agnostic vqgan and time-sensitive transformer
Team: Meta AI.
Songwei Ge, Thomas Hayes, Harry Yang, et al., Devi Parikh
ECCV'22, arXiv, 2022.04 [Paper], [PDF], [Home Page], [Code] - Video Diffusion Models
Text-conditioned
Team: Google.
Jonathan Ho, Tim Salimans, Alexey Gritsenko, et al., David J. Fleet
arXiv, 2022.04 [Paper], [PDF], [Home Page] - NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis
Long Video
Team: Microsoft.
Chenfei Wu, Jian Liang, Xiaowei Hu, et al., Nan Duan
NeurIPS'22, arXiv, 2022.02 [Paper], [PDF], [Code], [Home Page] - NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion
Team: Microsoft.
Chenfei Wu, Jian Liang, Lei Ji, et al., Nan Duan
ECCV'22, arXiv, 2021.11 [Paper], [PDF], [Code] - GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions
Team: Microsoft, Duke University.
Chenfei Wu, Lun Huang, Qianxi Zhang, et al., Nan Duan
arXiv, 2021.04 [Paper], [PDF] - Cross-Modal Dual Learning for Sentence-to-Video Generation
Team: Tsinghua University.
Yue Liu, Xin Wang, Yitian Yuan, Wenwu Zhu
ACM MM'19 [Paper], [PDF] - IRC-GAN: introspective recurrent convolutional GAN for text-to-video generation
Team: Peking University.
Kangle Deng, Tianyi Fei, Xin Huang, Yuxin Peng
IJCAI'19 [Paper], [PDF] - Imagine this! scripts to compositions to videos
Team: University of Illinois Urbana-Champaign, AI2, University of Washington.
Tanmay Gupta, Dustin Schwenk, Ali Farhadi, et al., Aniruddha Kembhavi
ECCV'18, arXiv, 2018.04 [Paper], [PDF] - To Create What You Tell: Generating Videos from Captions
Team: USTC, Microsoft Research.
Yingwei Pan, Zhaofan Qiu, Ting Yao, et al., Tao Mei
ACM MM'17, arXiv, 2018.04 [Paper], [PDF] - Neural Discrete Representation Learning
Team: DeepMind.
Aaron van den Oord, Oriol Vinyals, Dinghan Shen, Koray Kavukcuoglu
NeurIPS'17, arXiv, 2017.11 [Paper], [PDF] - Video Generation From Text
Team: Duke University, NEC Labs America.
Yitong Li, Martin Renqiang Min, Dinghan Shen, et al., Lawrence Carin
AAAI'18, arXiv, 2017.10 [Paper], [PDF] - Attentive Semantic Video Generation Using Captions
Team: IIT Hyderabad.
Tanya Marwah, Gaurav Mittal, Vineeth N. Balasubramanian
ICCV'17, arXiv, 2017.08 [Paper], [PDF] - Sync-DRAW: Automatic Video Generation using Deep Recurrent Attentive Architectures
VAE
Team: IIT Hyderabad.
Gaurav Mittal, Tanya Marwah, Vineeth N. Balasubramanian
ACM MM'17, arXiv, 2016.11 [Paper], [PDF]
Datasets are divided according to their collected domains: Face, Open, Movie, Action, Instruct, Cooking.
Metrics are divided into image-level and video-level.
- (CV-Text) CelebV-Text: A Large-Scale Facial Text-Video Dataset
Dataset (Domain:Face)
Team: University of Sydney, SenseTime Research.
Jianhui Yu, Hao Zhu, Liming Jiang, et al., Wayne Wu
CVPR'23, arXiv, 2023.03 [Paper], [PDF], [Code], [Demo], [Home Page] -
(MSR-VTT) MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
Dataset (Domain:Open)
Team: Microsoft Research.
Jun Xu, Tao Mei, Ting Yao, Yong Rui
CVPR'16 [Paper], [PDF] -
(DiDeMo) Localizing moments in video with natural language
Dataset (Domain:Open)
Team: UC Berkeley, Adobe
Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, et al., Bryan Russell
ICCV'17, arXiv, 2017.08 [Paper], [PDF] -
(YT-Temporal-180M) MERLOT: Multimodal Neural Script Knowledge Models
Dataset (Domain:Open)
Team: University of Washington
Rowan Zellers, Ximing Lu, Jack Hessel, et al., Yejin Choi
NeurIPS'21, arXiv, 2021.06 [Paper], [PDF], [Code], [Home Page] -
(WebVid2M) Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Dataset (Domain:Open)
Team: University of Oxford, CNRS.
Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman
ICCV'21, arXiv, 2021.04 [Paper], [PDF],[Dataset], [Code],[Demo], [Home Page] -
(HD-VILA-100M) Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
Dataset (Domain:Open)
Team: Microsoft Research Asia.
Hongwei Xue, Tiankai Hang, Yanhong Zeng, et al., Baining Guo
CVPR'22, arXiv, 2021.11 [Paper], [PDF], [Code] -
(InternVid) InternVid: A Large-Scale Video-Text Dataset for Multimodal Understanding and Generation
Dataset (Domain:Open)
Team: Shanghai AI Laboratory.
Yi Wang, Yinan He, Yizhuo Li, et al., Yu Qiao
arXiv, 2023.07 [Paper], [PDF], [Code] -
(HD-VG-130M) VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation
Dataset (Domain:Open)
Team: Peking University, Microsoft Research.
Wenjing Wang, Huan Yang, Zixi Tuo, et al., Jiaying Liu
arXiv, 2023.05 [Paper], [PDF] -
(Youku-mPLUG) Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks
Dataset (Domain:Open)
Team: DAMO Academy, Alibaba Group.
Haiyang Xu, Qinghao Ye, Xuan Wu, et al., Fei Huang
arXiv, 2023.06 [Paper], [PDF] -
(VAST-27M) Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset
Dataset (Domain:Open)
Team: UCAS, CAS
Sihan Chen, Handong Li, Qunbo Wang, et al., Jing Liu
NeurIPS'23, arXiv, 2023.05 [Paper], [PDF] -
(Panda-70M) Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Dataset (Domain:Open)
Team: Snap Inc., University of California, University of Trento.
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Sergey Tulyakov
arXiv, 2024.02 [Paper], [PDF], [Code], [Home Page] -
(LSMDC) Movie description
Dataset (Domain:Movie)
Team: Max Planck Institute for Informatics.
Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, et al., Bernt Schiele
IJCV'17, arXiv, 2016.05 [Paper], [PDF], [Home Page] -
(MAD) MAD: A scalable dataset for language grounding in videos from movie audio descriptions
Dataset (Domain:Movie)
Team: KAUST, Adobe Research.
Mattia Soldan, Alejandro Pardo, Juan León Alcázar, et al., Bernard Ghanem
CVPR'22, arXiv, 2021.12 [Paper], [PDF], [Code] -
(UCF-101) UCF101: A dataset of 101 human actions classes from videos in the wild
Dataset (Domain:Action)
Team: University of Central Florida.
Khurram Soomro, Amir Roshan Zamir, Mubarak Shah
arXiv, 2012.12 [Paper], [PDF], [Data] -
(ActNet-200) ActivityNet: A large-scale video benchmark for human activity understanding
Dataset (Domain:Action)
Team: Universidad del Norte, KAUST
Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, Juan Carlos Niebles
CVPR'15, [Paper], [PDF], [Home Page] -
(Charades) Hollywood in homes: Crowdsourcing data collection for activity understanding
Dataset (Domain:Action)
Team: Carnegie Mellon University
Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, et al., Abhinav Gupta
ECCV'16, arXiv, 2016.04, [Paper], [PDF], [Home Page] -
(Kinetics) The kinetics human action video dataset
Dataset (Domain:Action)
Team: Google
Will Kay, Joao Carreira, Karen Simonyan, et al., Andrew Zisserman
arXiv, 2017.05, [Paper], [PDF], [Home Page] -
(ActivityNet) Dense-captioning events in videos
Dataset (Domain:Action)
Team: Stanford University
Ranjay Krishna, Kenji Hata, Frederic Ren, et al., Juan Carlos Niebles
ICCV'17, arXiv, 2017.05, [Paper], [PDF], [Home Page] -
(Charades-Ego) Charades-ego: A large-scale dataset of paired third and first person videos
Dataset (Domain:Action)
Team: Carnegie Mellon University
Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, et al., Karteek Alahari
arXiv, 2018.04, [Paper], [PDF], [Home Page] -
(SS-V2) The "something something" video database for learning and evaluating visual common sense
Dataset (Domain:Action)
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, et al., Roland Memisevic
ICCV'17, arXiv, 2017.06 [Paper], [PDF], [Home Page] -
(How2) How2: a large-scale dataset for multimodal language understanding
Dataset (Domain:Instruct)
Team: Carnegie Mellon University.
Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, et al., Florian Metze
arXiv, 2018.11 [Paper], [PDF] -
(HowTo100M) HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Dataset (Domain:Instruct)
Team: Ecole Normale Superieure, Inria, CIIRC.
Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, et al., Josef Sivic
arXiv, 2019.06 [Paper], [PDF], [Home Page] -
(YouCook2) Towards automatic learning of procedures from web instructional video
Dataset (Domain:Cooking)
Team: University of Michigan, University of Rochester
Luowei Zhou, Chenliang Xu, Jason J. Corso
AAAI'18, arXiv, 2017.03 , [Paper], [PDF],[Home Page] -
(Epic-Kitchens) Scaling egocentric vision: The EPIC-KITCHENS dataset
Dataset (Domain:Cooking)
Team: University of Bristol, University of Catania, University of Toronto.
Dima Damen, Hazel Doughty, Giovanni Maria Farinella, et al., Michael Wray
ECCV'18, arXiv, 2018.04, [Paper], [PDF], [Home Page] -
(PSNR/SSIM) Image quality assessment: from error visibility to structural similarity
Metric (image-level)
Team: New York University.
Zhou Wang, Alan Conrad Bovik, Hamid Rahim Sheikh, E.P. Simoncelli
IEEE TIP, 2004.04. [Paper], [PDF] -
(IS) Improved techniques for training gans
Metric (image-level)
Team: OpenAI
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, et al., Xi Chen
NeurIPS'16, arXiv, 2016.06, [Paper], [PDF], [Code] -
(FID) Gans trained by a two time-scale update rule converge to a local nash equilibrium
Metric (image-level)
Team: Johannes Kepler University Linz
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, et al., Sepp Hochreiter
NeurIPS'17, arXiv, 2017.06 [Paper], [PDF] -
(CLIP Score) Learning transferable visual models from natural language supervision
Metric (image-level)
Team: OpenAI.
Alec Radford, Jong Wook Kim, Chris Hallacy, et al., Ilya Sutskever
ICML'21, arXiv, 2021.02 [Paper], [PDF], [Code] -
(Video IS) Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal gan
Metric (video-level)
Masaki Saito, Shunta Saito, Masanori Koyama, Sosuke Kobayashi
IJCV'20, arXiv, 2018.11 [Paper], [PDF], [Code] -
(FVD/KVD) FVD: A new metric for video generation
Metric (video-level)
Team: Johannes Kepler University, Google
Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, et al., Sylvain Gelly
ICLR'19, arXiv, 2018.12 [Paper], [PDF], [Code] -
(FCS) Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
Metric (video-level)
Team: Show Lab, National University of Singapore.
Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, et al., Mike Zheng Shou
ICCV'23, arXiv, 2022.12 [Paper], [PDF], [Code], [Pretrained Model]
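Several of the image-level metrics listed above reduce to a few lines of array math. As a minimal sketch (the function name and toy frames below are illustrative, not taken from any of the listed papers), PSNR between a reference frame and a generated frame can be computed with NumPy, assuming 8-bit pixel values:

```python
import numpy as np

def psnr(ref: np.ndarray, gen: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB (higher means closer to the reference)."""
    mse = np.mean((ref.astype(np.float64) - gen.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val**2 / mse)

# Two toy 4x4 grayscale frames differing by a constant offset of 10.
a = np.zeros((4, 4), dtype=np.uint8)
b = np.full((4, 4), 10, dtype=np.uint8)
print(f"{psnr(a, b):.2f} dB")  # ≈ 28.13 dB
```

Video-level metrics such as FVD instead compare feature statistics from a pretrained video network across many clips, so a self-contained sketch is less practical; the official implementations linked above are the reference.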
If you find this repository useful, please consider citing this list:
@misc{rui2024t2vgenerationlist,
  title        = {Awesome-Text-to-Video-Generation},
  author       = {Rui Sun and Yumin Zhang},
  howpublished = {GitHub repository},
  url          = {https://github.com/soraw-ai/Awesome-Text-to-Video-Generation},
  year         = {2024},
}