Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design (CVPR 2026)
We revisit GRPO training for visual segmentation and detection and propose Dr. Seg, a simple plug-and-play framework featuring a Look-to-Confirm mechanism and a Distribution-Ranked Reward module. It requires no architectural modifications and integrates seamlessly with existing GRPO-based VLLMs. Extensive experiments show that Dr. Seg improves performance in complex visual scenarios while preserving strong generalization.
Paper: 📖 Dr.Seg
Model: 🤗 Dr.Seg-7B
Dataset: 🤗 COCONut
Overview of Dr. Seg:
- Release checkpoint
- Renew README
- Release training code
- Release evaluation code on segmentation
- Release dataset
- Release evaluation code on detection and counting
```bash
git clone https://github.com/eVI-group-SCU/Dr-Seg
cd Dr-Seg
conda create -n drseg python=3.12
conda activate drseg
pip install torch==2.6.0 torchvision==0.21.0
pip install -e . --no-build-isolation
```
Note:
We recommend using 4×80GB GPUs and at least 400GB of RAM.
As a reference, it takes approximately 15 hours to run ~500 training steps on 4× H800 PCIe.
Training Data (thanks to VisionReasoner): 🤗 MultiObject-7K
(1) Download the dataset using this script:
```bash
python training_scripts/download_dataset.py
```
(2) Download the pretrained model using the following commands:
```bash
mkdir pretrained_models
cd pretrained_models
git lfs install
git clone https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct
```
(3) Modify trainer.save_checkpoint_path in training_scripts/run_drseg_7b_4x80G.sh:
```bash
trainer.save_checkpoint_path=your_path_to_checkpoint/${RUN_NAME}
```
(4) Start the Distribution-Ranked Reward module:
```bash
python -u drr_module/serve.py --host 127.0.0.1 --port 50070
```
Note:
Remember to configure Weights & Biases (wandb) correctly to upload training logs.
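The wandb setup can be supplied through its standard environment variables before launching the run; a minimal sketch (the project name here is a placeholder — substitute your own, or export the same variables in your shell):

```python
import os

# Standard wandb environment variables; values below are placeholders.
os.environ["WANDB_PROJECT"] = "drseg"   # hypothetical project name
os.environ["WANDB_MODE"] = "online"     # use "offline" to log locally without uploading
# Either run `wandb login` once, or set WANDB_API_KEY in your environment.
```

If you launch training from a shell script, exporting these variables in the shell before invoking it has the same effect.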
(5) Start training in another terminal using this script:
```bash
bash training_scripts/run_drseg_7b_4x80G.sh
```
Note:
We recommend running training for 400–600 steps.
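Before launching (or while training runs), you can confirm that the Distribution-Ranked Reward module from step (4) is accepting connections; a minimal sketch, with the host and port matching the serve command above:

```python
import socket

def drr_reachable(host: str = "127.0.0.1", port: int = 50070, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the reward server succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

This only checks TCP reachability, not the module's API, but it quickly catches a server that failed to start or a mismatched port.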
(6) Merge the checkpoint into Hugging Face format:
```bash
python3 training_scripts/model_merger.py --local_dir [path_to_your_actor_checkpoint]
```
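After merging, a quick sanity check that the output directory looks like a standard Hugging Face checkpoint can save a failed evaluation run later. The file names below are the usual Hugging Face conventions, not anything specific to this repo:

```python
from pathlib import Path

def looks_like_hf_checkpoint(ckpt_dir: str) -> bool:
    """Heuristic: a merged HF checkpoint should contain a config and weight files."""
    d = Path(ckpt_dir)
    has_config = (d / "config.json").is_file()
    has_weights = any(d.glob("*.safetensors")) or any(d.glob("pytorch_model*.bin"))
    return has_config and has_weights
```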
Evaluation Data (thanks to VisionReasoner):
🤗 ReasonSeg-Val 🤗 ReasonSeg-Test
🤗 refcoco_val 🤗 refcoco_testA
🤗 refcocoplus_val 🤗 refcocoplus_testA
🤗 refcocog_val 🤗 refcocog_testA
(1) Modify REASONING_MODEL_PATH in evaluation_scripts/eval_segmentation_drseg.sh:
```bash
REASONING_MODEL_PATH:=your/path/to/checkpoint
```
(2) (Optional) Modify TEST_DATA_PATH:=Ricky06662/ReasonSeg_val in evaluation_scripts/eval_segmentation_drseg.sh:
```bash
TEST_DATA_PATH:=dataset/you/want/to/eval
```
(3) Start evaluation:
```bash
bash evaluation_scripts/eval_segmentation_drseg.sh
```
```bibtex
@article{sun2026dr,
  title={Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design},
  author={Sun, Haoxiang and Wang, Tao and Tang, Chenwei and Yuan, Li and Lv, Jiancheng},
  journal={arXiv preprint arXiv:2603.00152},
  year={2026}
}
```
This project builds upon several open-source efforts, including VisionReasoner, Seg-Zero, EasyR1, veRL, and COCONut-PanCap. We also use pretrained models from Qwen2.5-VL and SAM2. We sincerely thank the authors and maintainers for releasing high-quality code and models, providing clear documentation and reproducible pipelines, and actively maintaining these projects, all of which significantly facilitated our implementation and evaluation.
