Weifu Fu1,*,†,
Jinyang Li2,*,
Bin-Bin Gao1,
Jialin Li3,
Yuhuan Lin1,
Hanqiu Deng1,
Wenbing Tao2,
Yong Liu1,
Chengjie Wang1
1YouTu Lab, Tencent
2Huazhong University of Science and Technology
3Kling Team, Kuaishou Technology
*Equal Contribution. †Corresponding Author.
- [2026.04.09] PET-DINO is selected as a CVPR 2026 Highlight 🔥!
- [2026.04.02] Code and pre-trained models have been released.
- [2026.04.01] The PET-DINO paper is now available on arXiv.
- [2026.02.21] PET-DINO is accepted by CVPR 2026!
PET-DINO is a universal detector supporting both text and visual prompts.
- Alignment-Friendly Visual Prompt Generation (AFVPG): PET-DINO efficiently integrates visual cues while reducing development costs.
- The first training strategies for open-set detection: Intra-Batch Parallel Prompting (IBP), Dynamic Memory-Driven Prompting (DMD).
- While enhancing the model's ability to detect complex and domain-specific objects, PET-DINO preserves the original capability of the text prompt pathway and adapts well to diverse real-world scenarios with strong open-set classification ability.
Text Prompt enables detection of objects described by text labels (e.g., "zebra . giraffe . bird"), while Visual Prompt enables detection using visual cues and can be evaluated in two modes: Visual-I and Visual-G.
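As a small illustration of the text prompt format, the sketch below joins a list of class names into the ` . `-separated string shown in the example above. The helper name `build_text_prompt` is hypothetical and not part of this repository; the separator is taken from the example string, and other separators may also be accepted.

```python
def build_text_prompt(class_names):
    """Join class names into a ' . '-separated text prompt.

    Hypothetical helper: the separator mirrors the example
    "zebra . giraffe . bird" from this README.
    """
    # Drop surrounding whitespace and skip empty entries.
    cleaned = [name.strip() for name in class_names if name.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty class name is required")
    return " . ".join(cleaned)


prompt = build_text_prompt(["zebra", "giraffe", "bird"])
print(prompt)  # zebra . giraffe . bird
```

The resulting string can then be passed wherever a text prompt is expected, for example to the `--texts` option of the demo script.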
First set up the base environment by following the instructions in the get_started section, then install the additional dependency packages:
pip install -r requirements/multimodal.txt
pip install emoji ddd-dataset
pip install git+https://github.com/lvis-dataset/lvis-api.git

Note: The LVIS third-party library does not currently support numpy >= 1.24. Please make sure your numpy version meets this requirement; installing numpy==1.23 is recommended.
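The numpy constraint can be checked before running anything that imports the LVIS API. The following is a minimal sketch; the function name `numpy_version_ok` is an assumption made for illustration, and the only fact taken from this README is the `numpy < 1.24` bound.

```python
import re


def numpy_version_ok(version, max_exclusive=(1, 24)):
    """Return True if `version` (e.g. '1.23.5') is below `max_exclusive`.

    Hypothetical helper; only the major and minor components are compared.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) < max_exclusive


if __name__ == "__main__":
    try:
        import numpy
    except ImportError:
        print("numpy is not installed")
    else:
        ok = numpy_version_ok(numpy.__version__)
        status = "OK" if ok else "too new for the LVIS API"
        print(f"numpy {numpy.__version__}: {status}")
```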
Download the following pretrained models and place them under pretrained/:
| Model | Path |
|---|---|
| Swin-T | pretrained/swin/swin_tiny_patch4_window7_224.pth |
| Swin-L | pretrained/swin/swin_large_patch4_window7_224.pth |
| BERT | pretrained/bert-base-uncased/ |
| MM-Grounding-DINO Swin-T | pretrained/grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_v3det_20231218_095741-e316e297.pth |
| MM-Grounding-DINO Swin-L | pretrained/grounding_dino/grounding_dino_swin-l_pretrain_obj365_goldg-34dcdc53.pth |
For downloading BERT weights, please refer to the MM-Grounding-DINO usage guide.
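Before training or evaluation, it can help to confirm that the weights are where the table says they should be. The sketch below lists any expected path that is missing; the paths are copied from the table above, while the helper `missing_pretrained` and the idea of running it from the repository root are assumptions.

```python
from pathlib import Path

# Paths copied from the table of pretrained models in this README.
EXPECTED_PRETRAINED = [
    "pretrained/swin/swin_tiny_patch4_window7_224.pth",
    "pretrained/swin/swin_large_patch4_window7_224.pth",
    "pretrained/bert-base-uncased/",
    "pretrained/grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_v3det_20231218_095741-e316e297.pth",
    "pretrained/grounding_dino/grounding_dino_swin-l_pretrain_obj365_goldg-34dcdc53.pth",
]


def missing_pretrained(root="."):
    """Return the expected paths that do not exist under `root`.

    Hypothetical helper; `root` is assumed to be the repository root.
    """
    root = Path(root)
    return [p for p in EXPECTED_PRETRAINED if not (root / p).exists()]


for path in missing_pretrained():
    print("missing:", path)
```

Only the weights for the backbone you actually use are needed, so a missing Swin-L entry is not necessarily a problem when working with the Swin-T configs.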
For detailed data preparation instructions, please refer to the MM-Grounding-DINO dataset preparation guide.
Training data: Objects365v1
Evaluation datasets: COCO, LVIS (minival), ODinW35
The overall data directory structure should look like this:
data/
├── objects365v1/
│   ├── objects365_train.json
│   ├── objects365_val.json
│   ├── o365v1_train_od.json
│   ├── o365v1_label_map.json
│   ├── train/
│   └── val/
├── coco/
│   ├── annotations/
│   │   ├── instances_train2017.json
│   │   └── instances_val2017.json
│   ├── train2017/
│   └── val2017/
├── lvis/
│   ├── annotations/
│   │   ├── lvis_v1_train.json
│   │   ├── lvis_v1_val.json
│   │   └── lvis_v1_minival_inserted_image_name.json
│   ├── train2017/
│   └── val2017/
└── odinw/
    ├── AerialMaritimeDrone/
    │   ├── large/
    │   │   ├── test/
    │   │   ├── train/
    │   │   └── valid/
    │   └── tiled/
    ├── AmericanSignLanguageLetters/
    ├── Aquarium/
    └── ... (35 datasets in total)
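Since the ODinW benchmark is spread over many folders, a quick count can catch an incomplete download. The sketch below lists the sub-directories of `data/odinw` and compares the count with the 35 datasets mentioned in the listing above; the helper name `list_odinw_datasets` is hypothetical, and it assumes each dataset lives in its own top-level folder.

```python
from pathlib import Path

EXPECTED_ODINW_DATASETS = 35  # number stated in the directory listing


def list_odinw_datasets(odinw_root="data/odinw"):
    """Return the sorted names of dataset folders under `odinw_root`.

    Hypothetical helper; returns an empty list if the folder is absent.
    """
    root = Path(odinw_root)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())


found = list_odinw_datasets()
print(f"found {len(found)} of {EXPECTED_ODINW_DATASETS} ODinW dataset folders")
```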
bash tools/dist_train.sh configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py 8 --auto-scale-lr

Checkpoints for quick-start evaluation and inference can be downloaded from HuggingFace.
# Visual-I
CONFIG=configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py
bash tools/dist_test.sh $CONFIG $CHECKPOINT 8 \
--cfg-options model.test_cfg.prompt_type='Visual'
# Text
CONFIG=configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py
bash tools/dist_test.sh $CONFIG $CHECKPOINT 8 \
--cfg-options model.test_cfg.prompt_type='Text'
# Visual-G (extract embedding first, then evaluate)
python scripts/image_demo.py data/coco/train2017 \
$CONFIG --weights $CHECKPOINT \
--prompt_type 'Visual' \
--input_od_json 'images/cocoTrain_odvg_16.json' \
--extract-visual-embedding --out-dir ./outputs/visual_embedding_coco/
CONFIG=configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py
bash tools/dist_test.sh $CONFIG $CHECKPOINT 8 \
--cfg-options model.test_cfg.prompt_type='Visual' \
test_cfg.type='CustomTestLoop' \
test_cfg.prompt_visual_embedding_path='./outputs/visual_embedding_coco/'

# Visual-I
CONFIG=configs/pet_dino/val/pet_dino_swin-t_8xb4_12e_obj365_evaluate_lvis_minival_vi.py
bash tools/dist_test.sh $CONFIG $CHECKPOINT 8 \
--cfg-options model.test_cfg.prompt_type='Visual'
# Text
CONFIG=configs/pet_dino/val/pet_dino_swin-t_8xb4_12e_obj365_evaluate_lvis_minival_vg_text.py
bash tools/dist_test.sh $CONFIG $CHECKPOINT 8 \
--cfg-options model.test_cfg.prompt_type='Text'
# Visual-G
python scripts/image_demo.py data/lvis \
$CONFIG --weights $CHECKPOINT \
--prompt_type 'Visual' \
--input_od_json 'images/lvisTrain_odvg_16.json' \
--extract-visual-embedding --out-dir ./outputs/visual_embedding_lvis/
CONFIG=configs/pet_dino/val/pet_dino_swin-t_8xb4_12e_obj365_evaluate_lvis_minival_vg_text.py
bash tools/dist_test.sh $CONFIG $CHECKPOINT 8 \
--cfg-options model.test_cfg.prompt_type='Visual' \
test_cfg.type='CustomTestLoop' \
test_cfg.prompt_visual_embedding_path='./outputs/visual_embedding_lvis/'

# Visual-I
CONFIG=configs/pet_dino/val/pet_dino_swin-t_8xb4_12e_obj365_evaluate_odinw35.py
bash tools/dist_test.sh $CONFIG $CHECKPOINT 8 \
--cfg-options model.test_cfg.prompt_type='Visual'
# Manually copy the output to the `sample_text` variable in `scripts/extract_bbox_map_from_odinw_str_results.py`, then run:
python scripts/extract_bbox_map_from_odinw_str_results.py
# Text
CONFIG=configs/pet_dino/val/pet_dino_swin-t_8xb4_12e_obj365_evaluate_odinw35.py
bash tools/dist_test.sh $CONFIG $CHECKPOINT 8 \
--cfg-options model.test_cfg.prompt_type='Text'
# Manually copy the output to the `sample_text` variable in `scripts/extract_bbox_map_from_odinw_str_results.py`, then run:
python scripts/extract_bbox_map_from_odinw_str_results.py
# Visual-G
CONFIG=./configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py
ODinW35_CONFIG=./configs/pet_dino/val/pet_dino_swin-t_8xb4_12e_obj365_evaluate_odinw35.py
bash scripts/evaluate_visual-G_of_odinw35.sh $CONFIG $ODinW35_CONFIG $CHECKPOINT 8
# Calculate the average mAP across all ODINW datasets
python scripts/get_avg_map_of_odinw.py

python scripts/image_demo.py images/animals.png \
configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py \
--weights $CHECKPOINT --pred-score-thr 0.3 \
--texts 'zebra. giraffe. bird' -c

python scripts/image_demo.py images/animals.png \
configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py \
--weights $CHECKPOINT --pred-score-thr 0.3 \
--prompt_type 'Visual' \
--prompt_bboxes '[[1291.6, 679.9, 1536.8, 840.0], [845.0, 181.9, 1210.2, 723.5], [638.0, 470.9, 815.0, 704.4]]' \
--prompt_bboxes_labels '[30, 2, 46]'

# Extract visual embedding (single image)
python scripts/image_demo.py images/animals.png \
configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py \
--weights $CHECKPOINT \
--prompt_type 'Visual' \
--prompt_bboxes '[[1291.6, 679.9, 1536.8, 840.0], [845.0, 181.9, 1210.2, 723.5], [638.0, 470.9, 815.0, 704.4]]' \
--prompt_bboxes_labels '[30, 2, 46]' \
--extract-visual-embedding --out-dir ./outputs/visual_embedding_animals/
# Inference with visual embedding
python scripts/image_demo.py images/animals.png \
configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py \
--weights $CHECKPOINT --pred-score-thr 0.3 \
--prompt_type 'Visual' \
--prompt_visual_embedding_path outputs/visual_embedding_animals/30.pt

If you find this work useful for your research, please consider citing:
@article{fu2026pet,
title={PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training},
author={Fu, Weifu and Li, Jinyang and Gao, Bin-Bin and Li, Jialin and Lin, Yuhuan and Deng, Hanqiu and Tao, Wenbing and Liu, Yong and Wang, Chengjie},
journal={arXiv preprint arXiv:2604.00503},
year={2026}
}

This project is built upon the following excellent works:
- MMDetection: OpenMMLab detection toolbox and benchmark.
- MM-Grounding-DINO: An open and comprehensive pipeline for unified object grounding and detection.
This project is released under the Apache 2.0 license.
