FullDiT: Multi-Task Video Generative Foundation Model with Full Attention

Kuaishou Technology,  The Chinese University of Hong Kong

TL;DR: We propose FullDiT, a unified foundation model for video generation that seamlessly integrates multiple conditions via unified full-attention mechanisms.



Abstract

Current video generative foundation models primarily focus on text-to-video tasks, providing limited control for fine-grained video content creation. Although adapter-based approaches (e.g., ControlNet) enable additional controls with minimal fine-tuning, they encounter challenges when integrating multiple conditions, including branch conflicts between independently trained adapters, parameter redundancy leading to increased computational cost, and suboptimal performance compared to full fine-tuning. To address these challenges, we introduce FullDiT, a unified foundation model for video generation that seamlessly integrates multiple conditions via unified full-attention mechanisms. By fusing multi-task conditions into a unified sequence representation and leveraging the long-context learning ability of full self-attention to capture condition dynamics, FullDiT reduces parameter overhead, avoids condition conflicts, and exhibits scalability and emergent capabilities. We further introduce FullBench for multi-task video generation evaluation. Experiments demonstrate that FullDiT achieves state-of-the-art results, highlighting the efficacy of full attention in complex multi-task video generation.

Method

Overview of the FullDiT architecture and comparison with adapter-based models. The diffusion process of the multi-task video generative model is shown on the left. For research purposes, this paper considers input conditions consisting of temporal-only camera trajectories, spatial-only identities, and temporal-spatial depth video; additional conditions can be incorporated into this architecture for broader applications. As shown in (a), FullDiT unifies the various inputs with the following procedure: (1) patchify and tokenize each input condition into a unified sequence representation, (2) concatenate all sequences into a single longer sequence, and (3) learn condition dynamics with full self-attention. By comparison, earlier adapter-based approaches (shown in (b)) use distinct adapter designs that operate independently to process the various inputs. The subscript of each block signifies its layer index.
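The following is a minimal PyTorch sketch of this three-step fusion, not the released implementation: the FullAttentionFusionBlock and fuse_conditions names, the token lengths, and the hidden dimension are illustrative assumptions.

import torch
import torch.nn as nn

class FullAttentionFusionBlock(nn.Module):
    """One transformer block applying full self-attention over the
    concatenated sequence of noisy video tokens and condition tokens."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Full self-attention: every token (video, camera, identity, depth)
        # attends to every other token, so condition dynamics are learned
        # jointly rather than through separate adapter branches.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x

def fuse_conditions(video_tokens, camera_tokens, identity_tokens, depth_tokens, block):
    """(1) each condition is assumed already patchified/tokenized to (B, L_i, D);
    (2) concatenate all sequences into one longer sequence;
    (3) learn condition dynamics with full self-attention."""
    seq = torch.cat([video_tokens, camera_tokens, identity_tokens, depth_tokens], dim=1)
    seq = block(seq)
    # Only the video part of the sequence is carried on to predict the denoised latent.
    return seq[:, : video_tokens.shape[1]]

if __name__ == "__main__":
    B, D = 2, 512
    block = FullAttentionFusionBlock(dim=D)
    video = torch.randn(B, 256, D)      # patchified noisy video latents
    camera = torch.randn(B, 16, D)      # temporal-only camera tokens
    identity = torch.randn(B, 32, D)    # spatial-only identity tokens
    depth = torch.randn(B, 256, D)      # temporal-spatial depth tokens
    out = fuse_conditions(video, camera, identity, depth, block)
    print(out.shape)  # torch.Size([2, 256, 512])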

Advantages:
(1) Long-context learning ability ⮕ Better performance and controllability
(2) Unified representation ⮕ Scalable extension to additional modalities or conditions without major architectural modifications
(3) Strong scalability ⮕ More efficient utilization of training data
(4) Emergent capabilities ⮕ Generalization to previously unseen combinations of conditions (see the sketch below)
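As a hypothetical illustration of point (4), the short sketch below (names and shapes are assumptions, not the released code) shows how any subset of condition token sequences can be concatenated onto the same unified sequence, so a combination never trained jointly still maps onto the same interface.

import torch

def build_unified_sequence(video_tokens, conditions):
    """Concatenate whichever condition token sequences are present.

    `conditions` is a dict such as {"camera": cam_tokens, "depth": depth_tokens};
    absent keys are simply skipped, so any combination of conditions maps
    onto the same unified sequence interface."""
    parts = [video_tokens]
    for name in ("camera", "identity", "depth"):  # fixed order keeps token positions consistent
        if conditions.get(name) is not None:
            parts.append(conditions[name])
    return torch.cat(parts, dim=1)

# Example: a Camera+Depth combination, even if never trained jointly.
video = torch.randn(1, 256, 512)
seq = build_unified_sequence(video, {"camera": torch.randn(1, 16, 512),
                                     "depth": torch.randn(1, 256, 512)})
print(seq.shape)  # torch.Size([1, 528, 512])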


Comparisons


Demos

Camera+Identities+Depth+Text ⮕ Video

(Results of a small model with around 1B parameters)

Camera+Identities+Text ⮕ Video

(Results of a small model with around 1B parameters)

Identities+Depth+Text ⮕ Video

(Results of a small model with around 1B parameters)

Identities+Text ⮕ Video

(Results of a small model with around 1B parameters)

Depth+Text ⮕ Video

(Results of a small model with around 1B parameters)

Camera+Text ⮕ Video

(Results of a small model with around 1B parameters)



Acknowledgments:
We thank Yawen Luo, Qinghe Wang, Yuzhou Huang, Ziyang Yuan, Xiaoyu Shi, Menghan Xia, Jianhong Bai, and Zhixue Fang from Kuaishou Technology for their invaluable assistance in constructing the training dataset, as well as for their insightful suggestions and discussions.