
Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation

TL;DR

We present Any2Caption, a novel framework for controllable video generation under any condition. The key idea is to decouple condition interpretation from video synthesis. By leveraging modern multimodal large language models (MLLMs), Any2Caption interprets diverse inputs—text, images, videos, and specialized cues such as region, motion, and camera poses—into dense, structured captions that provide backbone video generators with richer guidance.

Figure: overview of the Any2Caption framework.
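The decoupled two-stage design above can be sketched as follows. This is a minimal illustrative mock-up, not the released implementation (the official code is not yet available): the function names, the caption schema, and the stub logic are all assumptions.

```python
# Hypothetical sketch of Any2Caption's two-stage pipeline.
# Stage 1 (condition interpretation) and Stage 2 (video synthesis)
# are deliberately decoupled, per the framework description.
# All names and the caption fields below are illustrative assumptions.

def interpret_conditions(conditions: dict) -> dict:
    """Stage 1: an MLLM would interpret heterogeneous conditions
    (text, images, motion cues, camera poses, ...) into a dense,
    structured caption. Here we simply merge them into named fields."""
    caption = {
        "subject": conditions.get("text", ""),
        "motion": conditions.get("motion", "unspecified"),
        "camera": conditions.get("camera", "fixed"),
    }
    caption["dense"] = (
        f"{caption['subject']}; motion: {caption['motion']}; "
        f"camera: {caption['camera']}"
    )
    return caption


def generate_video(structured_caption: dict) -> str:
    """Stage 2: a backbone video generator conditioned only on the
    structured caption (placeholder returning a description string)."""
    return f"video <- {structured_caption['dense']}"


result = generate_video(interpret_conditions(
    {"text": "a corgi running on the beach", "camera": "pan left"}
))
print(result)
```

The point of the sketch is the interface boundary: the generator never sees the raw conditions, only the structured caption produced in stage 1, so new condition types can be added without retraining the backbone.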

Code

Stay Tuned.

Citation

If you find Any2Caption useful and use it in your project, please kindly cite:

@inproceedings{wu2025Any2Caption,
    title={Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation},
    author={Shengqiong Wu and Weicai Ye and Jiahao Wang and Quande Liu and Xintao Wang and Pengfei Wan and Di Zhang and Kun Gai and Shuicheng Yan and Hao Fei and Tat-Seng Chua},
    booktitle={arxiv},
    year={2025}
}