FireRedTeam/IVC-Prune

IVC-Prune: Revealing the Implicit Visual Coordinates in LVLMs for Vision Token Pruning

Zhichao Sun, Yidong Ma, Gang Liu, Yibo Chen, Xu Tang, Yao Hu, Yongchao Xu

Xiaohongshu Inc.     Wuhan University

ICLR 2026


Overview

💡 We reveal a fundamental mechanism by which LVLMs process spatial information:

LVLMs implicitly establish visual coordinate systems through Rotary Position Embeddings (RoPE).

Through theoretical analysis, we discover that specific token positions serve as Implicit Visual Coordinates (IVC tokens)—spatial reference points essential for absolute object localization. These positions occur where RoPE's rotation matrices approximate:
  • Identity matrix (real-axis references)

  • 90° rotation matrix (imaginary-axis references)

This provides the first theoretical characterization of spatial reasoning mechanisms in LVLMs.
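As a rough illustration of this criterion (not the authors' code; the per-band check, tolerance, and function name below are assumptions), one can flag, for each RoPE frequency band, the positions whose rotation angle lies close to the two reference rotations:

```python
import numpy as np

def rope_reference_masks(num_pos=64, dim=128, base=10000.0, tol=0.05):
    """For each (position, frequency band) pair, test whether the RoPE
    rotation angle is close to 0 mod 2*pi (identity / real-axis
    reference) or to pi/2 mod 2*pi (90-degree / imaginary-axis
    reference). How these band-level hits are aggregated into IVC
    tokens is the paper's contribution and is not reproduced here."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)          # (dim/2,)
    angles = np.mod(np.outer(np.arange(num_pos), inv_freq),
                    2 * np.pi)                                # (num_pos, dim/2)
    near_identity = np.minimum(angles, 2 * np.pi - angles) < tol  # cos~1, sin~0
    near_quarter = np.abs(angles - np.pi / 2) < tol               # cos~0, sin~1
    return near_identity, near_quarter
```

Position 0, for example, is a trivial real-axis reference: its rotation angle is 0 in every band, so the whole first row of the identity mask is true.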

🚀 Method: IVC-Prune

A training-free, prompt-aware pruning strategy that preserves two crucial token types:

  1. IVC Tokens: Identified by analyzing RoPE's mathematical properties (cosine/sine components across dimensions)
  2. Foreground Tokens: Selected via a robust two-stage process:
    • Stage 1: Semantic seed identification using value-vector similarity (avoiding positional bias)
    • Stage 2: Contextual refinement to capture complete objects

Key Innovation: Single-selection pruning strategy—tokens are selected once at an intermediate layer and applied across all layers, maximizing KV-cache reduction while preserving original position IDs.
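The single-selection idea can be sketched as follows. This is a minimal illustration under stated assumptions: `single_selection_prune`, the score vector, and the keep ratio are stand-ins, not the released implementation, and the IVC + foreground scoring itself is only assumed here.

```python
import numpy as np

def single_selection_prune(vision_tokens, scores, keep_ratio=0.25):
    """Select vision tokens ONCE (at a single intermediate layer) and
    return both the kept tokens and their original indices. Because the
    indices are preserved, they double as the original position ids and
    the same selection can be reused by every subsequent layer and by
    the KV cache."""
    num_keep = max(1, int(len(vision_tokens) * keep_ratio))
    # indices of the top-scoring tokens, restored to original order
    keep_idx = np.sort(np.argpartition(scores, -num_keep)[-num_keep:])
    return vision_tokens[keep_idx], keep_idx
```

Selecting once (rather than re-ranking at every layer) is what lets the KV cache shrink for all layers at the cost of a single scoring pass.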

📝 TODO List:

  • Open-source support for Qwen, LLaVA, InternVL, and DeepSeek

Supported LVLMs


Installation

Our code is built on VLMEvalKit and transformers; we add grounding-evaluation support to VLMEvalKit.

# Step 1: Create and activate conda environment
conda create --name IVCP python=3.10.6 -y
conda activate IVCP

# Step 2: Install PyTorch (cu118)
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118

# Step 3: Install local transformers
pip install -e /path/to/IVCP/transformers

# Step 4: Install flash-attn
pip install flash-attn==2.5.8 --no-build-isolation -v

# Step 5: Install DeepSeek-VL2
cd /path/to/IVCP/DeepSeek-VL2
pip install -e .

# Step 6: Install VLMEvalKit
cd /path/to/IVCP/VLMEvalKit
pip install -e .

# Step 7: Reinstall numba and install extra dependencies
pip uninstall numba -y
pip install numba
pip install qwen_vl_utils

Dataset

For the RefCOCO grounding dataset, we provide MDETR-format annotations on Hugging Face.

Setup Instructions

Step 1: Download Required Files

Step 2: Generate TSV Files

Standard models:

cd IVCP/VLMEvalKit
python convert_to_tsv.py \
    --images_folder /path/to/train2014/ \
    --annotations_files /path/to/finetune_refcoco_val.json \
    --output_dir /path/to/output/

Qwen2.5-VL (coordinates resized to multiples of 28):

cd IVCP/VLMEvalKit
python convert_to_tsv.py \
    --images_folder /path/to/train2014/ \
    --annotations_files /path/to/finetune_refcoco_val.json \
    --output_dir /path/to/output/ \
    --qwen25
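For orientation, a TSV of the general shape consumed by the evaluation code can be written with the standard library alone. The column names and field contents below are assumptions for illustration, not `convert_to_tsv.py`'s actual schema:

```python
import base64
import csv

# One row per referring expression; the image is inlined as base64 so
# the TSV is self-contained. (Hypothetical schema, not the real script's.)
rows = [{
    "index": 0,
    "image": base64.b64encode(b"<jpeg bytes here>").decode(),
    "question": "the man in the red shirt",        # referring expression
    "answer": "[103.2, 45.0, 210.7, 330.1]",       # GT box (x1, y1, x2, y2)
}]

with open("refcoco_val.tsv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]), delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
```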

Step 3: Configure Paths

Modify DATASET_URL in vlmeval/dataset/image_grounding.py:

DATASET_URL = {
    'RefCOCO_testA': '/PATH/TO/refcoco_testA.tsv',
    'RefCOCO_testB': '/PATH/TO/refcoco_testB.tsv',
    'RefCOCO_val': '/PATH/TO/refcoco_val.tsv',
    # ... other splits
}

Note: Always use the --qwen25 flag when evaluating Qwen2.5-VL so that the GT boxes match the model's input grid (multiples of 28).
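The rescaling behind the --qwen25 flag can be sketched as follows. Qwen2.5-VL sees images whose sides are rounded to multiples of 28 (14-pixel patches merged 2×2), so GT boxes must be rescaled accordingly. This is a simplified guess at the flag's effect; `resize_box_for_qwen25` is a hypothetical helper and the real script's rounding may differ:

```python
def resize_box_for_qwen25(box, orig_w, orig_h, factor=28):
    """Rescale an (x1, y1, x2, y2) box from the original image size to
    the nearest size whose sides are multiples of `factor`, matching
    the resolution the model actually sees. (Illustrative only.)"""
    new_w = max(factor, round(orig_w / factor) * factor)
    new_h = max(factor, round(orig_h / factor) * factor)
    sx, sy = new_w / orig_w, new_h / orig_h
    x1, y1, x2, y2 = box
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)
```

For a 100×100 image, the nearest multiple of 28 is 112, so a full-image box (0, 0, 100, 100) would be rescaled to roughly (0, 0, 112, 112).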

Usage

To evaluate a model on grounding tasks, run the corresponding script:

cd IVCP/VLMEvalKit
bash test_ivcp_qwen_grounding.sh

bash test_ivcp_internvl_grounding.sh

bash test_ivcp_deepseekvl_grounding.sh

To evaluate general VQA tasks, run:

bash test_ivcp_qwen_generalvqa.sh

bash test_ivcp_internvl_generalvqa.sh

bash test_ivcp_deepseekvl_generalvqa.sh

bash test_ivcp_llava_generalvqa.sh

Note:
Before running the scripts, modify the dataset path in each script (e.g. lines 2-3 of test_ivcp_qwen_grounding.sh) to point to your actual dataset location, e.g.: export LMUData="/your/dataset/path"

Some VQA benchmarks require an API for answer scoring. Configure your API key following the VLMEvalKit instructions; by default, we use GPT-4o.

Citation

@misc{sun2026ivcprunerevealingimplicitvisual,
      title={IVC-Prune: Revealing the Implicit Visual Coordinates in LVLMs for Vision Token Pruning}, 
      author={Zhichao Sun and Yidong Ma and Gang Liu and Yibo Chen and Xu Tang and Yao Hu and Yongchao Xu},
      year={2026},
      eprint={2602.03060},
      archivePrefix={arXiv},
}
