Awesome-Text-to-Video-Generation

Topics covered: Text-to-Seq-Image, Text-to-Video

This project is curated and maintained by Rui Sun and Yumin Zhang.

Table of Contents

Text-to-Seq-Image

  • LivePhoto: Real Image Animation with Text-guided Motion Control
    Team: HKU, Alibaba Group, Ant Group.
    Xi Chen, Zhiheng Liu, Mengting Chen, et al., Hengshuang Zhao
    arXiv, 2023.12 [Paper], [PDF], [Code], [Demo (Video)], [Home Page]
  • Scalable Diffusion Models with Transformers Sequential Images
    Team: UC Berkeley, NYU.
    William Peebles, Saining Xie
    ICCV'23 (Oral), arXiv, 2022.12 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]

Text-to-Video

  • Video generation models as world simulators
    Team: OpenAI (Sora).
    Tim Brooks, Bill Peebles, Connor Holmes, et al., Aditya Ramesh
    Online technical report, 2024.02 [Paper], [Home Page]
  • ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation
    Team: University of Waterloo.
    Weiming Ren, Harry Yang, Ge Zhang, et al., Wenhu Chen
    arXiv, 2024.02 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
  • World Model on Million-Length Video And Language With RingAttention Long Video
    Team: UC Berkeley.
    Hao Liu, Wilson Yan, Matei Zaharia, Pieter Abbeel
    arXiv, 2024.02 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
  • 360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model
    Team: Peking University.
    Qian Wang, Weiqi Li, Chong Mou, et al., Jian Zhang
    arXiv, 2024.01 [Paper], [PDF], [Code], [Home Page]
  • MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation
    Team: Bytedance Inc.
    Weimin Wang, Jiawei Liu, Zhijie Lin, et al., Jiashi Feng
    arXiv, 2024.01 [Paper], [PDF], [Home Page]
  • UniVG: Towards UNIfied-modal Video Generation
    Team: Baidu Inc.
    Ludan Ruan, Lei Tian, Chuanwei Huang, et al., Xinyan Xiao
    arXiv, 2024.01 [Paper], [PDF], [Home Page]
  • VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM
    Team: HiDream.ai Inc.
    Fuchen Long, Zhaofan Qiu, Ting Yao, Tao Mei
    arXiv, 2024.01 [Paper], [PDF], [Home Page]
  • VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
    Team: Tencent AI Lab.
    Haoxin Chen, Yong Zhang, Xiaodong Cun, et al., Ying Shan
    arXiv, 2024.01 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
  • Lumiere: A Space-Time Diffusion Model for Video Generation
    Team: Google Research, Weizmann Institute, Tel-Aviv University, Technion.
    Omer Bar-Tal, Hila Chefer, Omer Tov, et al., Inbar Mosseri
    arXiv, 2024.01 [Paper], [PDF], [Home Page]
  • DreamVideo: Composing Your Dream Videos with Customized Subject and Motion
    Team: Fudan University, Alibaba Group, HUST, Zhejiang University.
    Yujie Wei, Shiwei Zhang, Zhiwu Qing, et al., Hongming Shan
    arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page]
  • VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation
    Team: Peking University, Microsoft Research.
    Wenjing Wang, Huan Yang, Zixi Tuo, et al., Jiaying Liu
    arXiv, 2023.12 [Paper], [PDF]
  • TrailBlazer: Trajectory Control for Diffusion-Based Video Generation Training-free
    Team: Victoria University of Wellington, NVIDIA
    Wan-Duo Kurt Ma, J.P. Lewis, W. Bastiaan Kleijn
    arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page], [Demo(video)]
  • FreeInit: Bridging Initialization Gap in Video Diffusion Models Training-free
    Team: Nanyang Technological University
    Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, Ziwei Liu
    arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page], [Demo(live)], [Demo(video)]
  • MTVG : Multi-text Video Generation with Text-to-Video Models Training-free
    Team: Korea University, NVIDIA
    Gyeongrok Oh, Jaehwan Jeong, Sieun Kim, et al., Sangpil Kim
    arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page], [Demo(video)]
  • A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
    Team: HUST, Alibaba Group, Zhejiang University, Ant Group
    Xiang Wang, Shiwei Zhang, Hangjie Yuan, et al., Nong Sang
    arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page]
  • InstructVideo: Instructing Video Diffusion Models with Human Feedback
    Team: Zhejiang University, Alibaba Group, Tsinghua University
    Hangjie Yuan, Shiwei Zhang, Xiang Wang, et al., Dong Ni
    arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page]
  • VideoLCM: Video Latent Consistency Model
    Team: HUST, Alibaba Group, SJTU
    Xiang Wang, Shiwei Zhang, Han Zhang, et al., Nong Sang
    arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page]
  • Photorealistic Video Generation with Diffusion Models
    Team: Stanford University, Google.
    Agrim Gupta, Lijun Yu, Kihyuk Sohn, et al., José Lezama
    arXiv, 2023.12 [Paper], [PDF], [Home Page]
  • Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation
    Team: HUST, Alibaba Group, Fudan University.
    Zhiwu Qing, Shiwei Zhang, Jiayu Wang, et al., Nong Sang
    arXiv, 2023.12 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
  • GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation
    Team: HKU, Meta.
    Shoufa Chen, Mengmeng Xu, Jiawei Ren, et al., Juan-Manuel Perez-Rua
    arXiv, 2023.12 [Paper], [PDF], [Home Page]
  • StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter
    Team: Tsinghua University, Tencent AI Lab, CUHK.
    Gongye Liu, Menghan Xia, Yong Zhang, et al., Ying Shan
    arXiv, 2023.12 [Paper], [PDF], [Code], [Home Page], [Demo(live)]
  • GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation Multimodal
    Team: Tencent.
    Zhanyu Wang, Longyue Wang, Zhen Zhao, et al., Zhaopeng Tu
    arXiv, 2023.11 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
  • F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis Training-free
    Team: University of Electronic Science and Technology of China.
    Sitong Su, Jianzhi Liu, Lianli Gao, Jingkuan Song
    arXiv, 2023.11 [Paper], [PDF]
  • AdaDiff: Adaptive Step Selection for Fast Diffusion Training-free
    Team: Fudan University.
    Hui Zhang, Zuxuan Wu, Zhen Xing, Jie Shao, Yu-Gang Jiang
    arXiv, 2023.11 [Paper], [PDF]
  • FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax Training-free
    Team: University of Technology Sydney.
    Yu Lu, Linchao Zhu, Hehe Fan, Yi Yang
    arXiv, 2023.11 [Paper], [PDF], [Code(coming)], [Home Page]
  • GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning Training-free
    Team: Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences.
    Jiaxi Lv, Yi Huang, Mingfu Yan, Jiancheng Huang, et al., Shifeng Chen
    arXiv, 2023.11 [Paper], [PDF], [Code(coming)], [Home Page]
  • MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation
    Team: University of Science and Technology of China, MSRA, Xi'an Jiaotong University.
    Yanhui Wang, Jianmin Bao, Wenming Weng, et al., Baining Guo
    arXiv, 2023.11 [Paper], [PDF], [Home Page], [Demo(video)]
  • FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation
    Team: University of Science and Technology of China, MSRA, Xi'an Jiaotong University.
    Yuanxin Liu, Lei Li, Shuhuai Ren, et al., Lu Hou
    arXiv, 2023.11 [Paper], [PDF], [Code], [Dataset]
  • ART⋅V: Auto-Regressive Text-to-Video Generation with Diffusion Models
    Team: University of Science and Technology of China, Microsoft.
    Wenming Weng, Ruoyu Feng, Yanhui Wang, et al., Zhiwei Xiong
    arXiv, 2023.11 [Paper], [PDF], [Code(coming)], [Home Page], [Demo(video)]
  • Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
    Team: Stability AI.
    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, et al., Robin Rombach
    arXiv, 2023.11 [Paper], [PDF], [Code]
  • FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline
    Team: Sber AI.
    Vladimir Arkhipkin, Zein Shaheen, Viacheslav Vasilev, et al., Denis Dimitrov
    arXiv, 2023.11 [Paper], [PDF], [Code], [Home Page], [Demo(live)]
  • MoVideo: Motion-Aware Video Generation with Diffusion Models
    Team: ETH, Meta.
    Jingyun Liang, Yuchen Fan, Kai Zhang, et al., Rakesh Ranjan
    arXiv, 2023.11 [Paper], [PDF], [Home Page]
  • Optimal Noise pursuit for Augmenting Text-to-Video Generation
    Team: Zhejiang Lab.
    Shijie Ma, Huayi Xu, Mengjian Li, et al., Yaxiong Wang
    arXiv, 2023.11 [Paper], [PDF]
  • Make Pixels Dance: High-Dynamic Video Generation
    Team: ByteDance.
    Yan Zeng, Guoqiang Wei, Jiani Zheng, et al., Hang Li
    arXiv, 2023.11 [Paper], [PDF], [Home Page], [Demo(video)]
  • Learning Universal Policies via Text-Guided Video Generation
    Team: MIT, Google DeepMind, UC Berkeley.
    Yilun Du, Mengjiao Yang, Bo Dai, et al., Pieter Abbeel
    NeurIPS'23 (Spotlight), arXiv, 2023.11 [Paper], [PDF], [Code], [Home Page]
  • Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning
    Team: Meta.
    Rohit Girdhar, Mannat Singh, Andrew Brown, et al., Ishan Misra
    arXiv, 2023.11 [Paper], [PDF], [Home Page], [Demo(live)]
  • FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling Training-free
    Team: Nanyang Technological University.
    Haonan Qiu, Menghan Xia, Yong Zhang, et al., Ziwei Liu
    ICLR'24, arXiv, 2023.10 [Paper], [PDF], [Code], [Home Page]
  • ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation Training-free
    Team: Shanghai Artificial Intelligence Laboratory.
    Bo Peng, Xinyuan Chen, Yaohui Wang, Chaochao Lu, Yu Qiao
    arXiv, 2023.10 [Paper], [PDF], [Code], [Home Page]
  • VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
    Team: Tencent AI Lab.
    Haoxin Chen, Menghan Xia, Yingqing He, et al., Ying Shan
    arXiv, 2023.10 [Paper], [PDF], [Code], [Home Page]
  • SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction
    Team: Shanghai Artificial Intelligence Laboratory.
    Xinyuan Chen, Yaohui Wang, Lingjun Zhang, et al., Ziwei Liu
    arXiv, 2023.10 [Paper], [PDF], [Code], [Home Page]
  • DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
    Team: The Chinese University of Hong Kong.
    Jinbo Xing, Menghan Xia, Yong Zhang, et al., Ying Shan
    arXiv, 2023.10 [Paper], [PDF], [Code], [Pretrained Model], [Home Page], [Demo(live)], [Demo(video)]
  • LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation
    Team: Nankai University, MEGVII Technology.
    Ruiqi Wu, Liangyu Chen, Tong Yang, et al., Xiangyu Zhang
    arXiv, 2023.10 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
  • LLM-grounded Video Diffusion Models Training-free
    Team: UC Berkeley.
    Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, Boyi Li
    arXiv, 2023.09 [Paper], [PDF], [Code(coming)], [Home Page]
  • VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning
    Team: UNC Chapel Hill.
    Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal
    arXiv, 2023.09 [Paper], [PDF], [Code]
  • VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation
    Team: Baidu Inc.
    Xin Li, Wenqing Chu, Ye Wu, et al., Jingdong Wang
    arXiv, 2023.09 [Paper], [PDF], [Home Page]
  • LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models
    Team: Shanghai Artificial Intelligence Laboratory.
    Yaohui Wang, Xinyuan Chen, Xin Ma, et al., Ziwei Liu
    arXiv, 2023.09 [Paper], [PDF], [Code], [Home Page]
  • Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation
    Team: Huawei.
    Jiaxi Gu, Shicong Wang, Haoyu Zhao, et al., Hang Xu
    arXiv, 2023.09 [Paper], [PDF], [Code], [Home Page]
  • Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator Training-free
    Team: School of Information Science and Technology, ShanghaiTech University.
    Hanzhuo Huang, Yufan Feng, Cheng Shi, et al., Sibei Yang
    NeurIPS'23, arXiv, 2023.09 [Paper], [PDF], [Home Page]
  • Show-1: Marrying pixel and latent diffusion models for text-to-video generation.
    Team: Show Lab, National University of Singapore.
    David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, et al., Mike Zheng Shou
    arXiv, 2023.09 [Paper], [PDF], [Home Page], [Code], [Pretrained Model]
  • GLOBER: Coherent Non-autoregressive Video Generation via GLOBal Guided Video DecodER
    Team: Institute of Automation, Chinese Academy of Sciences (CASIA).
    Mingzhen Sun, Weining Wang, Zihan Qin, et al., Jing Liu
    NeurIPS'23, arXiv, 2023.09 [Paper], [PDF], [Code], [Home Page], [Demo(video)]
  • DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis Training-free
    Team: East China Normal University.
    Zhongjie Duan, Lizhou You, Chengyu Wang, et al., Jun Huang
    arXiv, 2023.08 [Paper], [PDF], [Home Page]
  • SimDA: Simple Diffusion Adapter for Efficient Video Generation
    Team: Fudan University, Microsoft.
    Zhen Xing, Qi Dai, Han Hu, Zuxuan Wu, Yu-Gang Jiang
    arXiv, 2023.08 [Paper], [PDF], [Code (Coming)], [Home Page]
  • Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models
    Team: National University of Singapore.
    Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Tat-Seng Chua
    arXiv, 2023.08 [Paper], [PDF], [Code]
  • ModelScope Text-to-Video Technical Report
    Team: Alibaba Group.
    Jiuniu Wang, Hangjie Yuan, Dayou Chen, et al., Shiwei Zhang
    arXiv, 2023.08 [Paper], [PDF], [Code], [Home Page], [Demo(live)]
  • Dual-Stream Diffusion Net for Text-to-Video Generation
    Team: Nanjing University of Science and Technology.
    Binhui Liu, Xin Liu, Anbo Dai, et al., Jian Yang
    arXiv, 2023.08 [Paper], [PDF]
  • AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
    Team: The Chinese University of Hong Kong.
    Yuwei Guo, Ceyuan Yang, Anyi Rao, et al., Bo Dai
    ICLR'24 (spotlight), arXiv, 2023.07 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
  • Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation
    Team: HKUST.
    Yingqing He, Menghan Xia, Haoxin Chen, et al., Qifeng Chen
    arXiv, 2023.07 [Paper], [PDF], [Code], [Home Page], [Demo(video)]
  • Probabilistic Adaptation of Text-to-Video Models
    Team: Google, UC Berkeley.
    Mengjiao Yang, Yilun Du, Bo Dai, et al., Pieter Abbeel
    arXiv, 2023.06 [Paper], [PDF], [Home Page]
  • ED-T2V: An Efficient Training Framework for Diffusion-based Text-to-Video Generation
    Team: School of Artificial Intelligence, University of Chinese Academy of Sciences.
    Jiawei Liu, Weining Wang, Wei Liu, Qian He, Jing Liu
    IJCNN'23, 2023.06 [Paper], [PDF]
  • Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance
    Team: CUHK.
    Jinbo Xing, Menghan Xia, Yuxin Liu, et al., Tien-Tsin Wong
    arXiv, 2023.06 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
  • VideoComposer: Compositional Video Synthesis with Motion Controllability
    Team: Alibaba Group.
    Xiang Wang, Hangjie Yuan, Shiwei Zhang, et al., Jingren Zhou
    NeurIPS'23, arXiv, 2023.06 [Paper], [PDF], [Code], [Pretrained Model], [Home Page]
  • VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation
    Team: University of Chinese Academy of Sciences (UCAS), Alibaba Group.
    Zhengxiong Luo, Dayou Chen, Yingya Zhang, et al., Tieniu Tan
    CVPR'23, arXiv, 2023.06 [Paper], [PDF]
  • DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation Training-free
    Team: Korea University.
    Susung Hong, Junyoung Seo, Heeseong Shin, Sunghwan Hong, Seungryong Kim
    arXiv, 2023.05 [Paper], [PDF]
  • Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models
    Team: Carnegie Mellon University.
    Rohan Dhesikan, Vignesh Rajmohan
    arXiv, 2023.05 [Paper], [PDF], [Code(coming)]
  • Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models
    Team: University of Maryland.
    Songwei Ge, Seungjun Nah, Guilin Liu, et al., Yogesh Balaji
    ICCV'23, arXiv, 2023.05 [Paper], [PDF], [Home Page]
  • Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity
    Team: NUS, CUHK.
    Zijiao Chen, Jiaxin Qing, Juan Helen Zhou
    NeurIPS'23, arXiv, 2023.05 [Paper], [PDF], [Code], [Home Page]
  • VideoPoet: A Large Language Model for Zero-Shot Video Generation
    Team: Google Research
    Dan Kondratyuk, Lijun Yu, Xiuye Gu, et al., Lu Jiang
    arXiv, 2023.12 [Paper], [PDF], [Home Page], [Blog]
  • VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning
    Team: Tsinghua University, Beijing Film Academy
    Hong Chen, Xin Wang, Guanning Zeng, et al., Wenwu Zhu
    arXiv, 2023.05 [Paper], [PDF], [Code], [Home Page]
  • Text2Performer: Text-Driven Human Video Generation
    Team: Nanyang Technological University
    Yuming Jiang, Shuai Yang, Tong Liang Koh, et al., Ziwei Liu
    arXiv, 2023.04 [Paper], [PDF], [Code], [Home Page], [Demo(video)]
  • Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation
    Team: University of Rochester, Meta.
    Jie An, Songyang Zhang, Harry Yang, et al., Xi Yin
    arXiv, 2023.04 [Paper], [PDF], [Home Page]
  • Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
    Team: NVIDIA.
    Andreas Blattmann, Robin Rombach, Huan Ling, et al., Karsten Kreis
    CVPR'23, arXiv, 2023.04 [Paper], [PDF], [Home Page]
  • NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation
    Team: University of Science and Technology of China, Microsoft.
    Shengming Yin, Chenfei Wu, Huan Yang, et al., Nan Duan
    arXiv, 2023.03 [Paper], [PDF], [Home Page]
  • Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
    Team: Picsart AI Research (PAIR).
    Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, et al., Humphrey Shi
    arXiv, 2023.03 [Paper], [PDF], [Code], [Home Page], [Demo(live)], [Demo(video)]
  • Structure and Content-Guided Video Synthesis with Diffusion Models
    Team: Runway
    Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, Anastasis Germanidis
    ICCV'23, arXiv, 2023.02 [Paper], [PDF], [Home Page]
  • SceneScape: Text-Driven Consistent Scene Generation
    Team: Weizmann Institute of Science, NVIDIA Research
    Rafail Fridman, Amit Abecasis, Yoni Kasten, Tali Dekel
    NeurIPS'23, arXiv, 2023.02 [Paper], [PDF], [Code], [Home Page]
  • MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
    Team: Renmin University of China, Peking University, Microsoft Research
    Ludan Ruan, Yiyang Ma, Huan Yang, et al., Baining Guo
    CVPR'23, arXiv, 2022.12 [Paper], [PDF], [Code]
  • Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
    Team: Show Lab, National University of Singapore.
    Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, et al., Mike Zheng Shou
    ICCV'23, arXiv, 2022.12 [Paper], [PDF], [Code], [Pretrained Model]
  • MagicVideo: Efficient Video Generation With Latent Diffusion Models
    Team: ByteDance Inc.
    Daquan Zhou, Weimin Wang, Hanshu Yan, et al., Jiashi Feng
    arXiv, 2022.11 [Paper], [PDF], [Home Page]
  • Latent Video Diffusion Models for High-Fidelity Long Video Generation Long Video
    Team: HKUST, Tencent AI Lab.
    Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, Qifeng Chen
    arXiv, 2022.10 [Paper], [PDF], [Code], [Home Page]
  • Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation
    Team: UC Santa Barbara, Meta.
    Tsu-Jui Fu, Licheng Yu, Ning Zhang, et al., Sean Bell
    CVPR'23, arXiv, 2022.11 [Paper], [PDF]
  • Phenaki: Variable Length Video Generation From Open Domain Textual Description
    Team: Google.
    Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, et al., Dumitru Erhan
    ICLR'23, arXiv, 2022.10 [Paper], [PDF], [Home Page]
  • Imagen Video: High Definition Video Generation with Diffusion Models
    Team: Google.
    Jonathan Ho, William Chan, Chitwan Saharia, et al., Tim Salimans
    arXiv, 2022.10 [Paper], [PDF], [Home Page]
  • StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation Story Visualization
    Team: UNC Chapel Hill.
    Adyasha Maharana, Darryl Hannan, Mohit Bansal
    ECCV'22, arXiv, 2022.09 [Paper], [PDF], [Code], [Demo(live)]
  • Make-A-Video: Text-to-Video Generation without Text-Video Data
    Team: Meta AI.
    Uriel Singer, Adam Polyak, Thomas Hayes, et al., Yaniv Taigman
    ICLR'23, arXiv, 2022.09 [Paper], [PDF], [Code]
  • MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model
    Team: S-Lab, SenseTime.
    Mingyuan Zhang, Zhongang Cai, Liang Pan, et al., Ziwei Liu
    TPAMI'24, arXiv, 2022.08 [Paper], [PDF], [Code], [Home Page], [Demo]
  • Word-Level Fine-Grained Story Visualization Story Visualization
    Team: University of Oxford.
    Bowen Li, Thomas Lukasiewicz
    ECCV'22, arXiv, 2022.08 [Paper], [PDF], [Code], [Pretrained Model]
  • CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
    Team: Tsinghua University.
    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, Jie Tang
    ICLR'23, arXiv, 2022.05 [Paper], [PDF], [Code], [Home Page], [Demo(video)]
  • CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers
    Team: Tsinghua University.
    Ming Ding, Wendi Zheng, Wenyi Hong, Jie Tang
    NeurIPS'22, arXiv, 2022.04 [Paper], [PDF], [Code], [Home Page]
  • Long video generation with time-agnostic vqgan and time-sensitive transformer
    Team: Meta AI.
    Songwei Ge, Thomas Hayes, Harry Yang, et al., Devi Parikh
    ECCV'22, arXiv, 2022.04 [Paper], [PDF], [Home Page], [Code]
  • Video Diffusion Models text-conditioned
    Team: Google.
    Jonathan Ho, Tim Salimans, Alexey Gritsenko, et al., David J. Fleet
    arXiv, 2022.04 [Paper], [PDF], [Home Page]
  • NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis Long Video
    Team: Microsoft.
    Chenfei Wu, Jian Liang, Xiaowei Hu, et al., Nan Duan
    NeurIPS'22, arXiv, 2022.02 [Paper], [PDF], [Code], [Home Page]
  • NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion
    Team: Microsoft.
    Chenfei Wu, Jian Liang, Lei Ji, et al., Nan Duan
    ECCV'22, arXiv, 2021.11 [Paper], [PDF], [Code]
  • GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions
    Team: Microsoft, Duke University.
    Chenfei Wu, Lun Huang, Qianxi Zhang, et al., Nan Duan
    arXiv, 2021.04 [Paper], [PDF]
  • Cross-Modal Dual Learning for Sentence-to-Video Generation
    Team: Tsinghua University.
    Yue Liu, Xin Wang, Yitian Yuan, Wenwu Zhu
    ACM MM'19 [Paper], [PDF]
  • IRC-GAN: introspective recurrent convolutional GAN for text-to-video generation
    Team: Peking University.
    Kangle Deng, Tianyi Fei, Xin Huang, Yuxin Peng
    IJCAI'19 [Paper], [PDF]
  • Imagine this! scripts to compositions to videos
    Team: University of Illinois Urbana-Champaign, AI2, University of Washington.
    Tanmay Gupta, Dustin Schwenk, Ali Farhadi, et al., Aniruddha Kembhavi
    ECCV'18, arXiv, 2018.04 [Paper], [PDF]
  • To Create What You Tell: Generating Videos from Captions
    Team: USTC, Microsoft Research.
    Yingwei Pan, Zhaofan Qiu, Ting Yao, et al., Tao Mei
    ACM MM'17, arXiv, 2018.04 [Paper], [PDF]
  • Neural Discrete Representation Learning
    Team: DeepMind.
    Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu
    NeurIPS'17, arXiv, 2017.11 [Paper], [PDF]
  • Video Generation From Text
    Team: Duke University, NEC Labs America.
    Yitong Li, Martin Renqiang Min, Dinghan Shen, et al., Lawrence Carin
    AAAI'18, arXiv, 2017.10 [Paper], [PDF]
  • Attentive semantic video generation using captions
    Team: IIT Hyderabad.
    Tanya Marwah, Gaurav Mittal, Vineeth N. Balasubramanian
    ICCV'17, arXiv, 2017.08 [Paper], [PDF]
  • Sync-DRAW: Automatic Video Generation using Deep Recurrent Attentive Architectures VAE
    Team: IIT Hyderabad.
    Gaurav Mittal, Tanya Marwah, Vineeth N. Balasubramanian
    ACM MM'17, arXiv, 2016.11 [Paper], [PDF]

Datasets & Metrics

Datasets are grouped by their collection domain: Face, Open, Movie, Action, Instruct, Cooking.
Metrics are grouped as image-level and video-level.

  • (CelebV-Text) CelebV-Text: A large-scale facial text-video dataset Dataset (Domain:Face)
    Team: University of Sydney, SenseTime Research.
    Jianhui Yu, Hao Zhu, Liming Jiang, et al., Wayne Wu
    CVPR'23, arXiv, 2023.03 [Paper], [PDF], [Code], [Demo], [Home Page]

  • (MSR-VTT) Msr-vtt: A large video description dataset for bridging video and language Dataset (Domain:Open)
    Team: Microsoft Research.
    Jun Xu, Tao Mei, Ting Yao, Yong Rui
    CVPR'16 [Paper], [PDF]

  • (DideMo) Localizing moments in video with natural language Dataset (Domain:Open)
    Team: UC Berkeley, Adobe
    Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, et al., Bryan Russell
    ICCV'17, arXiv, 2017.08 [Paper], [PDF]

  • (YT-Temporal-180M) Merlot: Multimodal neural script knowledge models Dataset (Domain:Open)
    Team: University of Washington
    Rowan Zellers, Ximing Lu, Jack Hessel, et al., Yejin Choi
    NeurIPS'21, arXiv, 2021.06 [Paper], [PDF], [Code], [Home Page]

  • (WebVid2M) Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval Dataset (Domain:Open)
    Team: University of Oxford, CNRS.
    Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman
    ICCV'21, arXiv, 2021.04 [Paper], [PDF], [Dataset], [Code], [Demo], [Home Page]

  • (HD-VILA-100M) Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions Dataset (Domain:Open)
    Team: Microsoft Research Asia.
    Hongwei Xue, Tiankai Hang, Yanhong Zeng, et al., Baining Guo
    CVPR'22, arXiv, 2021.11 [Paper], [PDF], [Code]

  • (InternVid) InternVid: A large-scale video-text dataset for multimodal understanding and generation Dataset (Domain:Open)
    Team: Shanghai AI Laboratory.
    Yi Wang, Yinan He, Yizhuo Li, et al., Yu Qiao
    arXiv, 2023.07 [Paper], [PDF], [Code]

  • (HD-VG-130M) VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation Dataset (Domain:Open)
    Team: Peking University, Microsoft Research.
    Wenjing Wang, Huan Yang, Zixi Tuo, et al., Jiaying Liu
    arXiv, 2023.05 [Paper], [PDF]

  • (Youku-mPLUG) Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks Dataset (Domain:Open)
    Team: DAMO Academy, Alibaba Group.
    Haiyang Xu, Qinghao Ye, Xuan Wu, et al., Fei Huang
    arXiv, 2023.06 [Paper], [PDF]

  • (VAST-27M) Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset Dataset (Domain:Open)
    Team: UCAS, CAS
    Sihan Chen, Handong Li, Qunbo Wang, et al., Jing Liu
    NeurIPS'23, arXiv, 2023.05 [Paper], [PDF]

  • (Panda-70M) Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers Dataset (Domain:Open)
    Team: Snap Inc., University of California, University of Trento.
    Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Sergey Tulyakov
    arXiv, 2024.02 [Paper], [PDF], [Code], [Home Page]

  • (LSMDC) Movie description Dataset (Domain:Movie)
    Team: Max Planck Institute for Informatics.
    Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, et al., Bernt Schiele
    IJCV'17, arXiv, 2016.05 [Paper], [PDF], [Home Page]

  • (MAD) Mad: A scalable dataset for language grounding in videos from movie audio descriptions Dataset (Domain:Movie)
    Team: KAUST, Adobe Research.
    Mattia Soldan, Alejandro Pardo, Juan León Alcázar, et al., Bernard Ghanem
    CVPR'22, arXiv, 2021.12 [Paper], [PDF], [Code]

  • (UCF-101) UCF101: A dataset of 101 human actions classes from videos in the wild Dataset (Domain:Action)
    Team: University of Central Florida.
    Khurram Soomro, Amir Roshan Zamir, Mubarak Shah
    arXiv, 2012.12 [Paper], [PDF], [Data]

  • (ActNet-200) Activitynet: A large-scale video benchmark for human activity understanding Dataset (Domain:Action)
    Team: Universidad del Norte, KAUST
    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, Juan Carlos Niebles
    CVPR'15, [Paper], [PDF], [Home Page]

  • (Charades) Hollywood in homes: Crowdsourcing data collection for activity understanding Dataset (Domain:Action)
    Team: Carnegie Mellon University
    Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, et al., Abhinav Gupta
    ECCV'16, arXiv, 2016.04, [Paper], [PDF], [Home Page]

  • (Kinetics) The kinetics human action video dataset Dataset (Domain:Action)
    Team: Google
    Will Kay, Joao Carreira, Karen Simonyan, et al., Andrew Zisserman
    arXiv, 2017.05, [Paper], [PDF], [Home Page]

  • (ActivityNet Captions) Dense-captioning events in videos Dataset (Domain:Action)
    Team: Stanford University
    Ranjay Krishna, Kenji Hata, Frederic Ren, et al., Juan Carlos Niebles
    ICCV'17, arXiv, 2017.05, [Paper], [PDF], [Home Page]

  • (Charades-Ego) Charades-ego: A large-scale dataset of paired third and first person videos Dataset (Domain:Action)
    Team: Carnegie Mellon University
    Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, et al., Karteek Alahari
    arXiv, 2018.04, [Paper], [PDF], [Home Page]

  • (SS-V2) The "something something" video database for learning and evaluating visual common sense Dataset (Domain:Action)
    Team: TwentyBN.
    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, et al., Roland Memisevic
    ICCV'17, arXiv, 2017.06 [Paper], [PDF], [Home Page]

  • (How2) How2: a large-scale dataset for multimodal language understanding Dataset (Domain:Instruct)
    Team: Carnegie Mellon University.
    Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, et al., Florian Metze
    arXiv, 2018.11 [Paper], [PDF]

  • (HowTo100M) HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips Dataset (Domain:Instruct)
    Team: Ecole Normale Superieure, Inria, CIIRC.
    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, et al., Josef Sivic
    ICCV'19, arXiv, 2019.06 [Paper], [PDF], [Home Page]

  • (YouCook2) Towards automatic learning of procedures from web instructional video Dataset (Domain:Cooking)
    Team: University of Michigan, University of Rochester
    Luowei Zhou, Chenliang Xu, Jason J. Corso
    AAAI'18, arXiv, 2017.03 [Paper], [PDF], [Home Page]

  • (EPIC-Kitchens) Scaling egocentric vision: The epic-kitchens dataset Dataset (Domain:Cooking)
    Team: University of Bristol, University of Catania, University of Toronto.
    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, et al., Michael Wray
    ECCV'18, arXiv, 2018.04, [Paper], [PDF], [Home Page]

  • (PSNR/SSIM) Image quality assessment: from error visibility to structural similarity Metric (image-level)
    Team: New York University.
    Zhou Wang, Alan Conrad Bovik, Hamid Rahim Sheikh, Eero P. Simoncelli
    IEEE TIP, 2004.04 [Paper], [PDF]

  • (IS) Improved techniques for training gans Metric (image-level)
    Team: OpenAI
    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, et al., Xi Chen
    NeurIPS'16, arXiv, 2016.06, [Paper], [PDF], [Code]

  • (FID) Gans trained by a two time-scale update rule converge to a local nash equilibrium Metric (image-level)
    Team: Johannes Kepler University Linz
    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, et al., Sepp Hochreiter
    NeurIPS'17, arXiv, 2017.06 [Paper], [PDF]

  • (CLIP Score) Learning transferable visual models from natural language supervision Metric (image-level)
    Team: OpenAI.
    Alec Radford, Jong Wook Kim, Chris Hallacy, et al., Ilya Sutskever
    ICML'21, arXiv, 2021.02 [Paper], [PDF], [Code]

  • (Video IS) Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal gan Metric (video-level)
    Team: Preferred Networks.
    Masaki Saito, Shunta Saito, Masanori Koyama, Sosuke Kobayashi
    IJCV'20, arXiv, 2018.11 [Paper], [PDF], [Code]

  • (FVD/KVD) FVD: A new metric for video generation Metric (video-level)
    Team: Johannes Kepler University, Google
    Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, et al., Sylvain Gelly
    ICLR'19, arXiv, 2018.12 [Paper], [PDF], [Code]

  • (FCS) Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation Metric (video-level)
    Team: Show Lab, National University of Singapore.
    Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, et al., Mike Zheng Shou
    ICCV'23, arXiv, 2022.12 [Paper], [PDF], [Code], [Pretrained Model]
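Of the metrics listed above, PSNR is the only one that reduces to a closed-form formula with no learned model (SSIM needs windowed statistics, FID/IS an Inception network, CLIP Score a CLIP encoder, FVD an I3D feature extractor). As a minimal illustration only, not code from any of the listed papers, here is a pure-Python PSNR sketch; the 2×2 example frames are hypothetical.

```python
import math

def psnr(frame_a, frame_b, max_val=255.0):
    """Peak Signal-to-Noise Ratio between two equally sized frames.

    frame_a / frame_b: nested lists of pixel intensities in [0, max_val].
    PSNR = 10 * log10(max_val^2 / MSE); identical frames give +inf.
    """
    flat_a = [p for row in frame_a for p in row]
    flat_b = [p for row in frame_b for p in row]
    mse = sum((a - b) ** 2 for a, b in zip(flat_a, flat_b)) / len(flat_a)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)

# Hypothetical frames: a reference and a copy with one pixel off by 5,
# so MSE = 25 / 4 = 6.25 and PSNR = 10 * log10(255^2 / 6.25).
ref = [[52, 55], [61, 59]]
gen = [[52, 50], [61, 59]]
print(round(psnr(ref, gen), 2))  # → 40.17
```

Video-level use typically averages such frame scores over a clip, which is why purpose-built video metrics like FVD (above) were introduced.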


Acknowledgement

Citation

If you find this repository useful, please consider citing this list:

@misc{rui2024t2vgenerationlist,
    title = {Awesome-Text-to-Video-Generation},
    author = {Rui Sun and Yumin Zhang},
    howpublished = {GitHub repository},
    url = {https://github.com/soraw-ai/Awesome-Text-to-Video-Generation},
    year = {2024},
}

References

About

A list for Text-to-Video, Image-to-Video works
