This repository contains the training code for AVR, an adaptive visual reasoning framework for reducing overthinking in visual reasoning models. AVR decomposes visual reasoning into three cognitive functions: visual perception, logical reasoning, and answer application. It trains the model to choose among direct-answer, perception-only, and full-format responses based on task difficulty.
AVR targets reasoning path redundancy in vision-language reasoning. The goal is to preserve answer correctness while reducing unnecessary token usage through adaptive format selection. In the paper, AVR reduces token usage by 50-90% across benchmarks while maintaining competitive accuracy.
Key features:

- Identifies reasoning path redundancy in visual reasoning models
- Uses multi-format SFT to teach structured, task-adaptive outputs
- Uses FS-GRPO to optimize correctness and efficiency jointly
- Supports direct-answer, perception-only, and full reasoning responses in one model
AVR uses a two-stage training pipeline:
- Supervised fine-tuning teaches the model to follow the structured output formats used by AVR.
- FS-GRPO further optimizes format selection so the model prefers the most efficient response that still remains correct.
The reward used in FS-GRPO combines the following terms (sketched after this list):
- answer correctness
- format-dependent reward shaping
- response-length regularization
- intra-group diversity bonus for encouraging exploration early in training
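As a rough illustration, the sketch below combines these terms in one function; the function name, weights, and format labels are assumptions, not the API of the checked-in adaptive_reward.py.

```python
# Illustrative sketch of the FS-GRPO reward composition; names and
# weights are assumptions, not the actual adaptive_reward.py code.
def avr_reward(
    response: str,
    is_correct: bool,
    fmt: str,                   # "direct", "perception", or "full"
    group_formats: list[str],   # formats sampled for the same prompt
    length_penalty: float = 0.001,
    diversity_weight: float = 0.1,  # decayed over training (see FS-GRPO section)
) -> float:
    # 1. Answer correctness dominates the reward.
    reward = 1.0 if is_correct else 0.0

    # 2. Format-dependent shaping: shorter formats earn a bonus,
    #    but only when the answer is also correct.
    format_bonus = {"direct": 0.2, "perception": 0.1, "full": 0.0}
    if is_correct:
        reward += format_bonus[fmt]

    # 3. Response-length regularization discourages padding any format.
    reward -= length_penalty * len(response.split())

    # 4. Intra-group diversity bonus: reward groups that explore
    #    several formats for the same prompt.
    unique_formats = len(set(group_formats))
    reward += diversity_weight * (unique_formats - 1) / 2.0

    return reward
```

In this sketch the format bonus applies only to correct answers, so brevity is never rewarded at the expense of correctness.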
Repository layout:

- reward_functions: adaptive reward used for FS-GRPO training
- train_configs: SFT and FS-GRPO configuration files
- setup_fs-grpo_env.sh: environment setup script for verl-based training
- train-sft.sh: entry script for SFT with LLaMA-Factory
- train-fs-grpo.sh: entry script for FS-GRPO training with verl
- merge_fs-grpo_ckpt.sh: checkpoint merge helper
- LLaMA-Factory: SFT training framework submodule
- verl: RL training framework submodule
AVR trains the model to select one of three response formats:
- Direct answer

  ```
  <answer>...</answer>
  ```

- Perception plus answer

  ```
  <perception>...</perception>
  <answer>...</answer>
  ```

- Full reasoning

  ```
  <perception>...</perception>
  <reasoning>...</reasoning>
  <answer>...</answer>
  ```
These formats correspond to progressively richer reasoning paths. During training, the model is encouraged to use the shortest format that can still solve the task correctly.
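For illustration only, here is a small helper that maps a generated response to one of these formats based on its tags; this is a sketch, not code from the repository.

```python
import re

# Sketch: classify an AVR response into one of the three formats by
# the tags it contains. Tag semantics follow the formats listed above.
def classify_format(response: str) -> str | None:
    has_perception = re.search(r"<perception>.*?</perception>", response, re.S)
    has_reasoning = re.search(r"<reasoning>.*?</reasoning>", response, re.S)
    has_answer = re.search(r"<answer>.*?</answer>", response, re.S)
    if not has_answer:
        return None  # malformed: every valid format ends with an answer
    if has_perception and has_reasoning:
        return "full"
    if has_perception:
        return "perception"
    if has_reasoning:
        return None  # reasoning without perception is not a valid AVR format
    return "direct"
```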
SFT is run through LLaMA-Factory and teaches the base VLM to emit AVR-style structured responses.
Example entrypoint:
```bash
bash train-sft.sh
```

The current script launches:

```bash
FORCE_TORCHRUN=1 llamafactory-cli train /path/to/AVR/train_configs/qwen3vl_2b_full_sft_all.yaml
```

The released repository keeps a single SFT config in train_configs:

`qwen3vl_2b_full_sft_all.yaml`
Key characteristics of the checked-in SFT config:
- full-parameter fine-tuning
- frozen vision tower and multimodal projector (see the sketch after this list)
- one training epoch in the checked-in configs
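As a rough sketch of what the frozen-vision-tower setup amounts to (the checked-in configs do this through LLaMA-Factory options; the model class, placeholder model id, and submodule name matching below are assumptions):

```python
from transformers import AutoModelForVision2Seq

# Illustration only: the YAML configs freeze these modules via
# LLaMA-Factory options; submodule names vary across VLM architectures.
model = AutoModelForVision2Seq.from_pretrained("<base-vlm>")  # model id elided
for name, param in model.named_parameters():
    # Freeze the vision tower and the multimodal projector; train the rest.
    if "visual" in name or "projector" in name:
        param.requires_grad = False
```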
Before running SFT, make sure:
- the LLaMA-Factory submodule is initialized
- the target conda environment is available
- the training dataset referenced in the YAML config is registered in LLaMA-Factory
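For reference, registering a dataset in LLaMA-Factory means adding an entry to data/dataset_info.json. The sketch below shows a hypothetical entry in LLaMA-Factory's multimodal sharegpt layout; the dataset name and file name are assumptions.

```python
import json
from pathlib import Path

# Sketch: register a hypothetical AVR SFT dataset in LLaMA-Factory's
# data/dataset_info.json. The field names follow LLaMA-Factory's
# multimodal sharegpt layout; the dataset and file names are assumed.
info_path = Path("LLaMA-Factory/data/dataset_info.json")
info = json.loads(info_path.read_text())
info["avr_sft_all"] = {
    "file_name": "avr_sft_all.json",
    "formatting": "sharegpt",
    "columns": {"messages": "messages", "images": "images"},
}
info_path.write_text(json.dumps(info, indent=2))
```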
Use the provided setup script to prepare the verl environment for FS-GRPO:
```bash
bash setup_fs-grpo_env.sh
```

The script:

- creates a Python 3.10 conda environment
- installs `vllm`, `flash-attn`, and the training dependencies
- pins `transformers` to the 4.x series
- installs `verl` in editable mode
- optionally logs into Weights & Biases and Hugging Face if `WANDB_API_KEY` or `HF_TOKEN` are set
After setup, activate the environment with:
```bash
conda activate verl
```

FS-GRPO is run with verl and starts from an SFT checkpoint.
Example entrypoint:
```bash
bash train-fs-grpo.sh --sft_model <path-to-sft-checkpoint> --data_dir <fs-grpo-data-dir> --output_dir <save-dir>
```

The checked-in FS-GRPO setup uses:

- `algorithm.adv_estimator=grpo`
- 8 sampled responses per prompt
- adaptive reward from adaptive_reward.py
- correctness reward with format shaping
- response-length regularization
- diversity bonus with cosine decay over training
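The cosine decay could look like the following sketch; the function name, initial value, and schedule granularity are assumptions, and only the shape (high early, decaying to zero) follows the description above.

```python
import math

# Sketch of a cosine-decayed diversity coefficient: starts at `initial`
# and decays to zero by `total_steps`, so exploration is encouraged
# mostly early in training. Names and values are illustrative.
def diversity_coefficient(step: int, total_steps: int, initial: float = 0.1) -> float:
    progress = min(step / total_steps, 1.0)
    return initial * 0.5 * (1.0 + math.cos(math.pi * progress))
```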
The reward logic is designed to reflect the paper's FS-GRPO objective: the model should preserve correctness while learning when direct answer, perception-only output, or full reasoning is most appropriate.
Model checkpoints and parquet data are expected to be prepared separately before running FS-GRPO.
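As a minimal sketch of what such a parquet file might look like, assuming a column layout similar to verl's common RL-dataset convention (the exact schema expected by train-fs-grpo.sh may differ):

```python
import pandas as pd

# Sketch only: column names follow a common verl RL-dataset layout
# (prompt messages, images, and a rule-based ground truth); the
# example row is invented for illustration.
rows = [
    {
        "data_source": "avr",
        "prompt": [{"role": "user", "content": "<image>How many cubes are in the stack?"}],
        "images": ["images/cubes_001.png"],
        "reward_model": {"style": "rule", "ground_truth": "9"},
    }
]
pd.DataFrame(rows).to_parquet("fs_grpo_train.parquet")
```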
Typical end-to-end workflow:

- Prepare the SFT data with AVR response-format annotations.
- Run SFT to obtain a structured-response checkpoint.
- Prepare FS-GRPO parquet data with prompts, images, and ground-truth answers.
- Launch FS-GRPO from the SFT checkpoint.
- Merge the final actor checkpoint to Hugging Face format if needed.
Notes:

- Helper scripts now default to repository-relative paths and can be overridden with arguments or environment variables when needed.
- The repository includes both LLaMA-Factory and verl as submodules for reproducibility.
- The checked-in FS-GRPO reward implementation is the single maintained reward path for this repository.