Video-XL-2: Towards Very Long-Video Understanding Through Task-Aware KV Sparsification

| Blog | Tech Report | 🤗HF Models |

We introduce Video-XL-2, a new suite of multimodal models that achieves state-of-the-art (SOTA) performance and superior efficiency in long video understanding.

Video-XL-2: SOTA Performance and Unrivaled Efficiency

Video-XL-2 achieves SOTA performance on mainstream long video understanding benchmarks and leading performance on temporal grounding tasks among open-source lightweight models. It also offers significant advantages over existing models in both memory consumption and inference speed.

Model Weights

| Model name | HF Weight |
| --- | --- |
| Video-XL-2/Stage1 | 🤗 HF link |
| Video-XL-2/Stage2 | 🤗 HF link |
| Video-XL-2/Stage3 | 🤗 HF link |
| Video-XL-2/Stage4 | 🤗 HF link |

Setup

Clone this repository and install required packages:

git clone https://github.com/VectorSpaceLab/Video-XL
cd Video-XL/Video-XL-2
pip install -r requirements.txt
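
After installation, a quick sanity check can confirm that the listed packages are actually present in the active environment. The following is a minimal sketch, not part of the repository: the helper names `parse_requirement_names` and `missing_packages` are hypothetical, and it assumes `requirements.txt` contains plain `name` or `name==version` style lines (pip options and URL-based requirements are skipped rather than checked).

```python
import re
from importlib import metadata
from pathlib import Path


def parse_requirement_names(text):
    """Extract distribution names from requirements-style text.

    Skips blank lines, comments, pip options (lines starting with '-'),
    and URL-based requirements, which this sketch does not try to resolve.
    """
    names = []
    for raw in text.splitlines():
        if "://" in raw:
            continue
        line = raw.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue
        # The name is the leading run of letters, digits, '.', '_' or '-'.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names):
    """Return the names that are not installed in the current environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    req = Path("requirements.txt")
    if req.exists():
        absent = missing_packages(parse_requirement_names(req.read_text()))
        print("missing:", absent if absent else "none")
    else:
        print("requirements.txt not found; run this from the directory that contains it")
```

The check only confirms that each distribution is installed; it does not compare installed versions against any version pins in the file.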

Training

The training code and scripts are in ./train.

Evaluation

The evaluation code and scripts are in ./eval.

Acknowledgement

We thank the authors of the Video-XL Series, LongVA, lmms-eval, Qwen, and VideoChat-Flash for their great work.

Citation

If you find Video-XL-2 useful for your research and applications, please consider starring this repository and citing:

@article{qin2025video,
  title={Video-XL-2: Towards Very Long-Video Understanding Through Task-Aware KV Sparsification},
  author={Qin, Minghao and Liu, Xiangrui and Liang, Zhengyang and Shu, Yan and Yuan, Huaying and Zhou, Juenjie and Xiao, Shitao and Zhao, Bo and Liu, Zheng},
  journal={arXiv preprint arXiv:2506.19225},
  year={2025}
}

@article{shu2024video,
  title={Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding},
  author={Shu, Yan and Zhang, Peitian and Liu, Zheng and Qin, Minghao and Zhou, Junjie and Huang, Tiejun and Zhao, Bo},
  journal={arXiv preprint arXiv:2409.14485},
  year={2024}
}

@article{liu2025video,
  title={Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding},
  author={Liu, Xiangrui and Shu, Yan and Liu, Zheng and Li, Ao and Tian, Yang and Zhao, Bo},
  journal={arXiv preprint arXiv:2503.18478},
  year={2025}
}