
Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design (CVPR 2026)

We revisit GRPO training for visual segmentation and detection and propose Dr. Seg, a simple plug-and-play framework featuring a Look-to-Confirm mechanism and a Distribution-Ranked Reward module. It requires no architectural modifications and integrates seamlessly with existing GRPO-based VLLMs. Extensive experiments show that Dr. Seg improves performance in complex visual scenarios while preserving strong generalization.

Paper: 📖 Dr.Seg
Model: 🤗 Dr.Seg-7B
Dataset: 🤗 COCONut
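To build intuition for how a distribution-ranked reward might interact with GRPO's group-relative advantages, here is a minimal, purely illustrative sketch. `rank_advantages` is our own hypothetical construction (rank the rewards within a sampled group before z-scoring them), not the released Distribution-Ranked Reward module.

```python
import statistics

def grpo_advantages(rewards):
    """Standard GRPO-style advantages: z-score rewards within a group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mu) / sigma for r in rewards]

def rank_advantages(rewards):
    """Hypothetical ranked variant: replace raw rewards with their
    within-group ranks before normalizing, making the advantages
    insensitive to reward-scale outliers."""
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    ranks = [0.0] * len(rewards)
    for rank, i in enumerate(order):
        ranks[i] = float(rank)
    return grpo_advantages(ranks)
```

Note that rank-based advantages depend only on the ordering of rewards within the group, so two groups with the same ordering receive identical advantages regardless of reward magnitudes.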

Overview of Dr. Seg:

TODO

  • Release checkpoint
  • Update README
  • Release training code
  • Release evaluation code on segmentation
  • Release dataset
  • Release evaluation code on detection and counting

Installation

git clone https://github.com/eVI-group-SCU/Dr-Seg
cd Dr-Seg
conda create -n drseg python=3.12
conda activate drseg
pip install torch==2.6.0 torchvision==0.21.0
pip install -e . --no-build-isolation
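After installation, you may want a quick sanity check that the core dependencies are importable in the `drseg` environment. This helper is our own addition, not part of the repo:

```python
import importlib.util

def check_install(packages=("torch", "torchvision")):
    """Return {package: importable?} for the packages installed above."""
    return {pkg: importlib.util.find_spec(pkg) is not None for pkg in packages}

if __name__ == "__main__":
    for pkg, ok in check_install().items():
        print(f"{pkg}: {'ok' if ok else 'MISSING'}")
```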

Training

Note

We recommend using 4×80GB GPUs and at least 400GB of RAM.
As a reference, it takes approximately 15 hours to run ~500 training steps on 4× H800 PCIe.
Training Data (thanks to VisionReasoner): 🤗 MultiObject-7K

(1) Download the dataset using this script:

python training_scripts/download_dataset.py

(2) Download the pretrained model using the following commands:

mkdir pretrained_models
cd pretrained_models
git lfs install
git clone https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct

(3) Modify trainer.save_checkpoint_path in training_scripts/run_drseg_7b_4x80G.sh:

trainer.save_checkpoint_path=your_path_to_checkpoint/${RUN_NAME}
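If you prefer to patch the script programmatically rather than by hand, a small sketch like the following works for the `key=value` format shown above (the script path and placeholder format are taken from step (3); everything else is our addition):

```python
import re
from pathlib import Path

def set_checkpoint_path(script, new_path):
    """Rewrite trainer.save_checkpoint_path=... in a training script,
    leaving everything else untouched."""
    text = Path(script).read_text()
    patched, n = re.subn(
        r"(trainer\.save_checkpoint_path=)\S+",
        lambda m: m.group(1) + new_path,  # lambda avoids backslash escaping issues
        text,
    )
    if n == 0:
        raise ValueError("trainer.save_checkpoint_path not found")
    Path(script).write_text(patched)
```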

(4) Start the Distribution-Ranked Reward module:

python -u drr_module/serve.py --host 127.0.0.1 --port 50070 
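Before starting training, you can verify that the reward server is accepting connections. The host and port mirror the `serve.py` flags above; the readiness check itself is our own sketch, not something the repo provides:

```python
import socket

def drr_module_ready(host="127.0.0.1", port=50070, timeout=2.0):
    """Return True once a TCP server (the DRR module) accepts
    connections on the given host/port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```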

Note

Remember to configure Weights & Biases (wandb) correctly to upload training logs.

(5) Start training in another terminal using this script:

bash training_scripts/run_drseg_7b_4x80G.sh

Note

We recommend running training for 400–600 steps.

(6) Merge the checkpoint into Hugging Face format:

python3 training_scripts/model_merger.py --local_dir [path_to_your_actor_checkpoint]

Evaluation

(1) Modify REASONING_MODEL_PATH in evaluation_scripts/eval_segmentation_drseg.sh:

REASONING_MODEL_PATH:=your/path/to/checkpoint

(2) (Optional) Modify TEST_DATA_PATH:=Ricky06662/ReasonSeg_val in evaluation_scripts/eval_segmentation_drseg.sh:

TEST_DATA_PATH:=dataset/you/want/to/eval
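The same `NAME:=value` assignments from steps (1)-(2) can be overridden programmatically. This generic helper is our own sketch, written against the variable format shown above:

```python
import re
from pathlib import Path

def override_var(script, name, value):
    """Rewrite a NAME:=value line (the format used in the evaluation
    script) to point at a new model or dataset path."""
    text = Path(script).read_text()
    pattern = rf"(?m)^({re.escape(name)}:=)\S+"
    patched, n = re.subn(pattern, lambda m: m.group(1) + value, text)
    if n == 0:
        raise ValueError(f"{name} not found in {script}")
    Path(script).write_text(patched)
```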

(3) Start evaluation:

bash evaluation_scripts/eval_segmentation_drseg.sh

Citation

@article{sun2026dr,
  title={Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design},
  author={Sun, Haoxiang and Wang, Tao and Tang, Chenwei and Yuan, Li and Lv, Jiancheng},
  journal={arXiv preprint arXiv:2603.00152},
  year={2026}
}

Acknowledgements

This project builds upon several open-source efforts, including VisionReasoner, Seg-Zero, EasyR1, veRL, and COCONut-PanCap. We also utilize pretrained models from Qwen2.5-VL and SAM2. We sincerely thank the authors and maintainers for releasing high-quality code and models, providing clear documentation and reproducible pipelines, and actively maintaining these projects, which significantly facilitated our implementation and evaluation.

