Xueyi Chen1,2, Keda Tao1,3, Kele Shao1,4,3, Huan Wang1,*
1Westlake University 2The Chinese University of Hong Kong 3Zhejiang University 4SII
*Corresponding author
Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to the future frames that offline methods exploit, while accumulation causes visual tokens to grow without bound, creating efficiency bottlenecks. Existing approaches, however, regulate only the post-LLM KV cache, leaving the costly pre-LLM prefill untouched. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both the pre-LLM and post-LLM bottlenecks. Causal Temporal Reduction imposes a fixed per-frame token budget, selecting tokens based on adjacent-frame changes and token saliency; by processing only a compact subset of visual tokens, it drastically reduces per-frame prefill cost and ensures predictable latency. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active KV cache bounded regardless of stream length. Experiments show a 15.7x KV-cache compression ratio; compared with the prior state of the art (LiveVLM), StreamingTOM delivers 1.2x lower peak memory and 2x faster time-to-first-token (TTFT). It achieves state-of-the-art accuracy among training-free methods, averaging 63.8% on offline benchmarks, and reaches 55.8% accuracy with a 3.7 score on RVS.
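The Causal Temporal Reduction idea above can be sketched in a few lines. This is a simplified illustration, not the repository's implementation: it scores each token only by its change relative to the previous frame (1 - cosine similarity) and keeps a fixed budget of the most-changed tokens, whereas the actual method also incorporates token saliency and DPC clustering. `ctr_select` is a hypothetical helper name.

```python
import numpy as np

def ctr_select(prev_frame, curr_frame, retain_tokens=50):
    """Keep a fixed per-frame budget of visual tokens, scored by how much
    each token changed versus the previous frame (1 - cosine similarity).

    prev_frame, curr_frame: (num_tokens, dim) arrays of visual tokens.
    Returns the kept token indices (in frame order) and the kept tokens.
    """
    a = prev_frame / np.linalg.norm(prev_frame, axis=1, keepdims=True)
    b = curr_frame / np.linalg.norm(curr_frame, axis=1, keepdims=True)
    change = 1.0 - (a * b).sum(axis=1)           # per-token dissimilarity
    keep = np.argsort(-change)[:retain_tokens]   # most-changed tokens first
    keep = np.sort(keep)                         # restore spatial order
    return keep, curr_frame[keep]
```

Because the budget is fixed (e.g. 50 tokens per frame, matching `CTR_RETAIN_TOKENS` below), prefill cost per frame is constant regardless of how long the stream runs.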
```bash
git clone https://github.com/YIGE24/StreamingTOM.git
cd StreamingTOM

conda create -n streamingtom python=3.10 -y
conda activate streamingtom

pip install torch==2.5.1
pip install transformers==4.53.3
pip install flash-attn==2.8.0.post2 --no-build-isolation
pip install datasets sacrebleu

pip install -e LLaVA-NeXT
pip install -e lmms-eval
```

| Variable | Description | Default |
|---|---|---|
| CTR_RETAIN_TOKENS | Tokens retained per frame | 50 |
| CTR_SIMILARITY_THRESHOLD | Cosine similarity threshold for static/dynamic classification | 0.9 |
| CTR_K | Number of neighbors for DPC clustering | 7 |
| CTR_BETA | Weighting factor for DPC cluster merging | 0.6 |
| OQM_RETRIEVAL_MAX_TOKENS | Token budget for retrieval | 12544 |
| OQM_ENABLE_QUANTIZATION | Enable 4-bit quantization (0 or 1) | 1 |
| OQM_QUANTIZATION_BITS | Quantization bits (2 or 4) | 4 |
| OQM_GROUP_SIZE | KV group size (must equal CTR_RETAIN_TOKENS) | 50 |
| OQM_INIT_TOKEN_COUNT | Number of system prompt tokens to preserve unquantized | 14 |
| OQM_SLIDING_WINDOW_SIZE | Sliding window size for encode phase | 4800 |
| STREAMING_ENCODER_BATCH_SIZE | Vision encoder batch size | 32 |
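The OQM settings above control how KV entries are stored in 4-bit groups and dequantized on retrieval. A minimal sketch of per-group asymmetric min-max quantization is shown below; the helper names are hypothetical and the repository's actual kernels may differ (e.g. packing two 4-bit codes per byte to realize the memory saving).

```python
import numpy as np

def quantize_group_4bit(group):
    """Map one group of KV values to 4-bit codes (0..15) with a
    per-group scale and zero point (asymmetric min-max quantization)."""
    lo, hi = float(group.min()), float(group.max())
    scale = (hi - lo) / 15.0 or 1.0              # guard against flat groups
    codes = np.round((group - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize_group_4bit(codes, scale, lo):
    """Recover approximate float values from 4-bit codes on retrieval."""
    return codes.astype(np.float32) * scale + lo
```

Per-group parameters keep the reconstruction error bounded by half a quantization step within each group, which is why `OQM_GROUP_SIZE` is tied to the per-frame token budget.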
Example: evaluate on VideoMME-Short with 8 GPUs.
```bash
export WRAPPER=streamingtom
export CTR_K=7
export CTR_BETA=0.6
export CTR_SIMILARITY_THRESHOLD=0.9
export CTR_RETAIN_TOKENS=50
export OQM_GROUP_SIZE=50
export OQM_SLIDING_WINDOW_SIZE=4800
export OQM_RETRIEVAL_MAX_TOKENS=12544
export OQM_ENABLE_QUANTIZATION=1
export OQM_QUANTIZATION_BITS=4
export OQM_INIT_TOKEN_COUNT=14
export STREAMING_ENCODER_BATCH_SIZE=32
export STREAMINGTOM_USE_FULL_PROMPT=0

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
accelerate launch --num_processes=8 -m lmms_eval \
    --model llava_onevision \
    --model_args pretrained="lmms-lab/llava-onevision-qwen2-7b-ov",fps=auto \
    --tasks videomme_short \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix LLAVA_OV_STREAMINGTOM \
    --output_path ./results/streamingtom
```

This project is built upon several excellent open-source projects. We thank the teams behind LLaVA-NeXT, lmms-eval, and ReKV for their foundational work.
```bibtex
@inproceedings{chen2026streamingtom,
  title={StreamingTOM: Streaming Token Compression for Efficient Video Understanding},
  author={Chen, Xueyi and Tao, Keda and Shao, Kele and Wang, Huan},
  booktitle={CVPR},
  year={2026}
}
```