Xueyi Chen1,2, Keda Tao1,3, Kele Shao1,4,3, Huan Wang1,*
1Westlake University 2The Chinese University of Hong Kong 3Zhejiang University 4SII
*Corresponding author
Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to the future frames that offline methods exploit, while accumulation causes visual tokens to grow without bound, creating efficiency bottlenecks. Existing approaches, however, regulate only the post-LLM KV cache, leaving the costly pre-LLM prefill untouched. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both the pre-LLM and post-LLM bottlenecks. Causal Temporal Reduction imposes a fixed per-frame token budget, selecting tokens based on adjacent-frame changes and token saliency; by processing only a compact subset of visual tokens, it drastically reduces per-frame prefill cost and ensures predictable latency. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active KV cache bounded regardless of stream length. Experiments show a 15.7x KV-cache compression ratio; compared with the prior state of the art (LiveVLM), StreamingTOM delivers 1.2x lower peak memory and 2x faster time-to-first-token (TTFT). It achieves state-of-the-art accuracy among training-free methods, averaging 63.8% on offline benchmarks, and reaches 55.8% accuracy with a 3.7 score on RVS.
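The Causal Temporal Reduction idea above can be sketched in a few lines. This is a simplified illustration, not the repository's implementation: it scores each token only by its change relative to the previous frame (1 - cosine similarity) and keeps a fixed budget of the most-changed tokens, whereas the actual method also incorporates token saliency and DPC clustering. `ctr_select` is a hypothetical helper name.

```python
import numpy as np

def ctr_select(prev_frame, curr_frame, retain_tokens=50):
    """Keep a fixed per-frame budget of visual tokens, scored by how much
    each token changed versus the previous frame (1 - cosine similarity).

    prev_frame, curr_frame: (num_tokens, dim) arrays of visual tokens.
    Returns the kept token indices (in frame order) and the kept tokens.
    """
    a = prev_frame / np.linalg.norm(prev_frame, axis=1, keepdims=True)
    b = curr_frame / np.linalg.norm(curr_frame, axis=1, keepdims=True)
    change = 1.0 - (a * b).sum(axis=1)           # per-token dissimilarity
    keep = np.argsort(-change)[:retain_tokens]   # most-changed tokens first
    keep = np.sort(keep)                         # restore spatial order
    return keep, curr_frame[keep]
```

Because the budget is fixed (e.g. 50 tokens per frame, matching `CTR_RETAIN_TOKENS` below), prefill cost per frame is constant regardless of how long the stream runs.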
```bash
git clone https://github.com/YIGE24/StreamingTOM.git
cd StreamingTOM

conda create -n streamingtom python=3.10 -y
conda activate streamingtom

pip install torch==2.5.1
pip install transformers==4.53.3
pip install flash-attn==2.8.0.post2 --no-build-isolation
pip install datasets sacrebleu

pip install -e LLaVA-NeXT
pip install -e lmms-eval
```

| Variable | Description | Default |
|---|---|---|
| CTR_RETAIN_TOKENS | Tokens retained per frame | 50 |
| CTR_SIMILARITY_THRESHOLD | Cosine similarity threshold for static/dynamic classification | 0.9 |
| CTR_K | Number of neighbors for DPC clustering | 7 |
| CTR_BETA | Weighting factor for DPC cluster merging | 0.6 |
| OQM_RETRIEVAL_MAX_TOKENS | Token budget for retrieval | 12544 |
| OQM_ENABLE_QUANTIZATION | Enable 4-bit quantization (0 or 1) | 1 |
| OQM_QUANTIZATION_BITS | Quantization bits (2 or 4) | 4 |
| OQM_GROUP_SIZE | KV group size (must equal CTR_RETAIN_TOKENS) | 50 |
| OQM_INIT_TOKEN_COUNT | Number of system prompt tokens to preserve unquantized | 14 |
| OQM_SLIDING_WINDOW_SIZE | Sliding window size for encode phase | 4800 |
| STREAMING_ENCODER_BATCH_SIZE | Vision encoder batch size | 32 |
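The OQM settings above control how KV entries are stored in 4-bit groups and dequantized on retrieval. A minimal sketch of per-group asymmetric min-max quantization is shown below; the helper names are hypothetical and the repository's actual kernels may differ (e.g. packing two 4-bit codes per byte to realize the memory saving).

```python
import numpy as np

def quantize_group_4bit(group):
    """Map one group of KV values to 4-bit codes (0..15) with a
    per-group scale and zero point (asymmetric min-max quantization)."""
    lo, hi = float(group.min()), float(group.max())
    scale = (hi - lo) / 15.0 or 1.0              # guard against flat groups
    codes = np.round((group - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize_group_4bit(codes, scale, lo):
    """Recover approximate float values from 4-bit codes on retrieval."""
    return codes.astype(np.float32) * scale + lo
```

Per-group parameters keep the reconstruction error bounded by half a quantization step within each group, which is why `OQM_GROUP_SIZE` is tied to the per-frame token budget.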
Example: evaluate on VideoMME-Short with 8 GPUs.
```bash
export WRAPPER=streamingtom
export CTR_K=7
export CTR_BETA=0.6
export CTR_SIMILARITY_THRESHOLD=0.9
export CTR_RETAIN_TOKENS=50
export OQM_GROUP_SIZE=50
export OQM_SLIDING_WINDOW_SIZE=4800
export OQM_RETRIEVAL_MAX_TOKENS=12544
export OQM_ENABLE_QUANTIZATION=1
export OQM_QUANTIZATION_BITS=4
export OQM_INIT_TOKEN_COUNT=14
export STREAMING_ENCODER_BATCH_SIZE=32
export STREAMINGTOM_USE_FULL_PROMPT=0

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
accelerate launch --num_processes=8 -m lmms_eval \
    --model llava_onevision \
    --model_args pretrained="lmms-lab/llava-onevision-qwen2-7b-ov",fps=auto \
    --tasks videomme_short \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix LLAVA_OV_STREAMINGTOM \
    --output_path ./results/streamingtom
```

This project is built upon several excellent open-source projects. We thank the teams behind LLaVA-NeXT, lmms-eval, and ReKV for their foundational work.
```bibtex
@inproceedings{chen2026streamingtom,
  title={StreamingTOM: Streaming Token Compression for Efficient Video Understanding},
  author={Chen, Xueyi and Tao, Keda and Shao, Kele and Wang, Huan},
  booktitle={CVPR},
  year={2026}
}
```