PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training

CVPR 2026 Highlight 🔥

Weifu Fu^1,*,†, Jinyang Li^2,*, Bin-Bin Gao¹, Jialin Li³
Yuhuan Lin¹, Hanqiu Deng¹, Wenbing Tao², Yong Liu¹, Chengjie Wang¹

¹YouTu Lab, Tencent ²Huazhong University of Science and Technology ³Kling Team, Kuaishou Technology

^*Equal Contribution. ^†Corresponding Author.

💡 News

[2026.04.09] 🏆 PET-DINO is selected as a CVPR 2026 Highlight 🔥!
[2026.04.02] 🚀 Code and pre-trained models have been released.
[2026.04.01] 📄 The PET-DINO paper is now available on arXiv.
[2026.02.21] 🎉 PET-DINO is accepted by CVPR 2026!

📖 Introduction

PET-DINO is a universal detector supporting both text and visual prompts.

Alignment-Friendly Visual Prompt Generation (AFVPG): PET-DINO efficiently integrates visual cues while reducing development costs.
The first training strategies for open-set detection: Intra-Batch Parallel Prompting (IBP), Dynamic Memory-Driven Prompting (DMD).
While enhancing the model’s ability to detect complex and domain-specific objects, PET-DINO preserves the original capability of the text prompt pathway and adapts well to diverse real-world scenarios with strong open-set classification ability.

Text Prompt enables detection of objects described by text labels (e.g., "zebra . giraffe . bird"), while Visual Prompt enables detection using visual cues and can be evaluated in two modes: Visual-I and Visual-G.

🛠️ Environment

Please first install following the instructions in the get_started section, then you need to install additional dependency packages:

pip install -r requirements/multimodal.txt
pip install emoji ddd-dataset
pip install git+https://github.com/lvis-dataset/lvis-api.git

Note: The LVIS third-party library does not currently support numpy >= 1.24. Please ensure your numpy version meets the requirements. It is recommended to install numpy==1.23.

📂 Data Preparation

Pretrained Weights

Download the following pretrained models and place them under pretrained/:

Model	Path
Swin-T	`pretrained/swin/swin_tiny_patch4_window7_224.pth`
Swin-L	`pretrained/swin/swin_large_patch4_window7_224.pth`
BERT	`pretrained/bert-base-uncased/`
MM-Grounding-DINO Swin-T	`pretrained/grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_v3det_20231218_095741-e316e297.pth`
MM-Grounding-DINO Swin-L	`pretrained/grounding_dino/grounding_dino_swin-l_pretrain_obj365_goldg-34dcdc53.pth`

For downloading BERT weights, please refer to the MM-Grounding-DINO usage guide.

Datasets

For detailed data preparation instructions, please refer to the MM-Grounding-DINO dataset preparation guide.

Training data: Objects365v1

Evaluation datasets:

The overall data directory structure should look like this:

data/
├── objects365v1/
│   ├── objects365_train.json
│   ├── objects365_val.json
│   ├── o365v1_train_od.json
│   ├── o365v1_label_map.json
│   ├── train/
│   └── val/
├── coco/
│   ├── annotations/
│   │   ├── instances_train2017.json
│   │   ├── instances_val2017.json
│   ├── train2017/
│   └── val2017/
├── lvis/
│   ├── annotations/
│   │   ├── lvis_v1_train.json
│   │   ├── lvis_v1_val.json
│   │   ├── lvis_v1_minival_inserted_image_name.json
│   ├── train2017/
│   └── val2017/
└── odinw/
    ├── AerialMaritimeDrone/
    │   ├── large/
    │   │   ├── test/
    │   │   ├── train/
    │   │   └── valid/
    │   └── tiled/
    ├── AmericanSignLanguageLetters/
    ├── Aquarium/
    └── ...  (35 datasets in total)

🚀 Train

bash tools/dist_train.sh configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py 8 --auto-scale-lr

📊 Evaluation

Checkpoints for quick-start Evaluation and Inference can be downloaded from HuggingFace.

COCO

# Visual-I
CONFIG=configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py
bash tools/dist_test.sh $CONFIG $CHECKPOINT 8 \
    --cfg-options model.test_cfg.prompt_type='Visual'

# Text
CONFIG=configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py
bash tools/dist_test.sh $CONFIG $CHECKPOINT 8 \
    --cfg-options model.test_cfg.prompt_type='Text'

# Visual-G (extract embedding first, then evaluate)
python scripts/image_demo.py data/coco/train2017 \
    $CONFIG --weights $CHECKPOINT \
    --prompt_type 'Visual' \
    --input_od_json 'images/cocoTrain_odvg_16.json' \
    --extract-visual-embedding --out-dir ./outputs/visual_embedding_coco/

CONFIG=configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py
bash tools/dist_test.sh $CONFIG $CHECKPOINT 8 \
    --cfg-options model.test_cfg.prompt_type='Visual' \
    test_cfg.type='CustomTestLoop' \
    test_cfg.prompt_visual_embedding_path='./outputs/visual_embedding_coco/'

LVIS minival

# Visual-I
CONFIG=configs/pet_dino/val/pet_dino_swin-t_8xb4_12e_obj365_evaluate_lvis_minival_vi.py
bash tools/dist_test.sh $CONFIG $CHECKPOINT 8 \
    --cfg-options model.test_cfg.prompt_type='Visual'

# Text
CONFIG=configs/pet_dino/val/pet_dino_swin-t_8xb4_12e_obj365_evaluate_lvis_minival_vg_text.py
bash tools/dist_test.sh $CONFIG $CHECKPOINT 8 \
    --cfg-options model.test_cfg.prompt_type='Text'

# Visual-G
python scripts/image_demo.py data/lvis \
    $CONFIG --weights $CHECKPOINT \
    --prompt_type 'Visual' \
    --input_od_json 'images/lvisTrain_odvg_16.json' \
    --extract-visual-embedding --out-dir ./outputs/visual_embedding_lvis/

CONFIG=configs/pet_dino/val/pet_dino_swin-t_8xb4_12e_obj365_evaluate_lvis_minival_vg_text.py
bash tools/dist_test.sh $CONFIG $CHECKPOINT 8 \
    --cfg-options model.test_cfg.prompt_type='Visual' \
    test_cfg.type='CustomTestLoop' \
    test_cfg.prompt_visual_embedding_path='./outputs/visual_embedding_lvis/'

ODinW35

# Visual-I
CONFIG=configs/pet_dino/val/pet_dino_swin-t_8xb4_12e_obj365_evaluate_odinw35.py
bash tools/dist_test.sh $CONFIG $CHECKPOINT 8 \
    --cfg-options model.test_cfg.prompt_type='Visual'
# Manually copy the output to the `sample_text` variable in `scripts/extract_bbox_map_from_odinw_str_results.py`, then run:
python scripts/extract_bbox_map_from_odinw_str_results.py

# Text
CONFIG=configs/pet_dino/val/pet_dino_swin-t_8xb4_12e_obj365_evaluate_odinw35.py
bash tools/dist_test.sh $CONFIG $CHECKPOINT 8 \
    --cfg-options model.test_cfg.prompt_type='Text'
# Manually copy the output to the `sample_text` variable in `scripts/extract_bbox_map_from_odinw_str_results.py`, then run:
python scripts/extract_bbox_map_from_odinw_str_results.py

# Visual-G
CONFIG=./configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py
ODinW35_CONFIG=./configs/pet_dino/val/pet_dino_swin-t_8xb4_12e_obj365_evaluate_odinw35.py
bash scripts/evaluate_visual-G_of_odinw35.sh $CONFIG $ODinW35_CONFIG $CHECKPOINT 8
# Calculate the average mAP across all ODINW datasets
python scripts/get_avg_map_of_odinw.py

⚡ Inference

Text Prompt

python scripts/image_demo.py images/animals.png \
    configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py \
    --weights $CHECKPOINT --pred-score-thr 0.3 \
    --texts 'zebra. giraffe. bird' -c

Visual Prompt (Bounding Boxes)

python scripts/image_demo.py images/animals.png \
    configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py \
    --weights $CHECKPOINT --pred-score-thr 0.3 \
    --prompt_type 'Visual' \
    --prompt_bboxes '[[1291.6, 679.9, 1536.8, 840.0], [845.0, 181.9, 1210.2, 723.5], [638.0, 470.9, 815.0, 704.4]]' \
    --prompt_bboxes_labels '[30, 2, 46]'

Visual Prompt (Visual Embeddings)

# Extract visual embedding (single image)
python scripts/image_demo.py images/animals.png \
    configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py \
    --weights $CHECKPOINT \
    --prompt_type 'Visual' \
    --prompt_bboxes '[[1291.6, 679.9, 1536.8, 840.0], [845.0, 181.9, 1210.2, 723.5], [638.0, 470.9, 815.0, 704.4]]' \
    --prompt_bboxes_labels '[30, 2, 46]' \
    --extract-visual-embedding --out-dir ./outputs/visual_embedding_animals/

# Inference with visual embedding
python scripts/image_demo.py images/animals.png \
    configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py \
    --weights $CHECKPOINT --pred-score-thr 0.3 \
    --prompt_type 'Visual' \
    --prompt_visual_embedding_path outputs/visual_embedding_animals/30.pt

📜 Citation

If you find this work useful for your research, please consider citing:

@article{fu2026pet,
  title={PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training},
  author={Fu, Weifu and Li, Jinyang and Gao, Bin-Bin and Li, Jialin and Lin, Yuhuan and Deng, Hanqiu and Tao, Wenbing and Liu, Yong and Wang, Chengjie},
  journal={arXiv preprint arXiv:2604.00503},
  year={2026}
}

🤝 Acknowledgement

This project is built upon the following excellent works:

MMDetection: OpenMMLab detection toolbox and benchmark.
MM-Grounding-DINO: An open and comprehensive pipeline for unified object grounding and detection.

License

This project is released under the Apache 2.0 license.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
configs		configs
images		images
mmdet		mmdet
requirements		requirements
scripts		scripts
tools		tools
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
index.html		index.html
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training

CVPR 2026 Highlight 🔥

💡 News

📖 Introduction

🛠️ Environment

📂 Data Preparation

Pretrained Weights

Datasets

🚀 Train

📊 Evaluation

COCO

LVIS minival

ODinW35

⚡ Inference

Text Prompt

Visual Prompt (Bounding Boxes)

Visual Prompt (Visual Embeddings)

📜 Citation

🤝 Acknowledgement

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training

CVPR 2026 Highlight 🔥

💡 News

📖 Introduction

🛠️ Environment

📂 Data Preparation

Pretrained Weights

Datasets

🚀 Train

📊 Evaluation

COCO

LVIS minival

ODinW35

⚡ Inference

Text Prompt

Visual Prompt (Bounding Boxes)

Visual Prompt (Visual Embeddings)

📜 Citation

🤝 Acknowledgement

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages