| Blog | Tech Report | 🤗HF Models |
We introduce Video-XL-2, a new suite of multimodal models that achieves state-of-the-art (SOTA) performance and superior efficiency in long video understanding.
Video-XL-2 achieves SOTA performance on mainstream long video understanding benchmarks and leading performance on temporal grounding tasks when compared to open-source lightweight models. Furthermore, it offers significant advantages over existing models in both memory consumption and inference speed.
| Model name | HF Weight |
|---|---|
| Video-XL-2/Stage1 | 🤗 HF link |
| Video-XL-2/Stage2 | 🤗 HF link |
| Video-XL-2/Stage3 | 🤗 HF link |
| Video-XL-2/Stage4 | 🤗 HF link |
Clone this repository and install the required packages:

```bash
git clone https://github.com/VectorSpaceLab/Video-XL
cd Video-XL-2
pip install -r requirements.txt
```

The training code and scripts can be found in `./train`.
The evaluation code and scripts can be found in `./eval`.
We thank the authors of the Video-XL series, LongVA, lmms-eval, Qwen, and VideoChat-Flash for their great work.
If you find Video-XL-2 useful for your research and applications, please consider starring this repository and citing:
@article{qin2025video,
title={Video-XL-2: Towards Very Long-Video Understanding Through Task-Aware KV Sparsification},
author={Qin, Minghao and Liu, Xiangrui and Liang, Zhengyang and Shu, Yan and Yuan, Huaying and Zhou, Junjie and Xiao, Shitao and Zhao, Bo and Liu, Zheng},
journal={arXiv preprint arXiv:2506.19225},
year={2025}
}
@article{shu2024video,
title={Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding},
author={Shu, Yan and Zhang, Peitian and Liu, Zheng and Qin, Minghao and Zhou, Junjie and Huang, Tiejun and Zhao, Bo},
journal={arXiv preprint arXiv:2409.14485},
year={2024}
}
@article{liu2025video,
title={Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding},
author={Liu, Xiangrui and Shu, Yan and Liu, Zheng and Li, Ao and Tian, Yang and Zhao, Bo},
journal={arXiv preprint arXiv:2503.18478},
year={2025}
}