fuweifuvtoo/PET_DINO

PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training

CVPR 2026 Highlight 🔥

Weifu Fu¹,*,†, Jinyang Li²,*, Bin-Bin Gao¹, Jialin Li³
Yuhuan Lin¹, Hanqiu Deng¹, Wenbing Tao², Yong Liu¹, Chengjie Wang¹

¹YouTu Lab, Tencent   ²Huazhong University of Science and Technology   ³Kling Team, Kuaishou Technology

*Equal Contribution.   †Corresponding Author.

arXiv | Home Page

💡 News

  • [2026.04.09] 🏆 PET-DINO is selected as a CVPR 2026 Highlight 🔥!
  • [2026.04.02] 🚀 Code and pre-trained models have been released.
  • [2026.04.01] 📄 The PET-DINO paper is now available on arXiv.
  • [2026.02.21] 🎉 PET-DINO is accepted by CVPR 2026!

📖 Introduction

PET-DINO is a universal detector supporting both text and visual prompts.

  • Alignment-Friendly Visual Prompt Generation (AFVPG): PET-DINO efficiently integrates visual cues while reducing development costs.
  • Prompt-enriched training strategies for open-set detection, presented as the first of their kind: Intra-Batch Parallel Prompting (IBP) and Dynamic Memory-Driven Prompting (DMD).
  • While enhancing the model’s ability to detect complex and domain-specific objects, PET-DINO preserves the original capability of the text prompt pathway and adapts well to diverse real-world scenarios with strong open-set classification ability.

A text prompt detects objects described by text labels (e.g., "zebra . giraffe . bird"). A visual prompt detects objects from visual cues and can be evaluated in two modes, Visual-I and Visual-G. Visual-G uses visual embeddings that are extracted in advance, as shown in the Evaluation section.
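
As a rough illustration of the text prompt format, the label string can be read as a period-separated list of category names. The helper below is a minimal sketch under that assumption; `split_text_prompt` is a hypothetical name and is not the parser used by this repository.

```python
def split_text_prompt(prompt: str) -> list[str]:
    """Split a period-separated prompt into stripped, non-empty category names."""
    # Assumption: a period only ever acts as a separator between categories.
    return [name.strip() for name in prompt.split(".") if name.strip()]


print(split_text_prompt("zebra . giraffe . bird"))
```

Under this reading, "zebra . giraffe . bird" and "zebra. giraffe. bird" name the same three categories. A category name that itself contains a period would be split incorrectly by this sketch.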

🛠️ Environment

First set up the base environment by following the instructions in the get_started section, then install the additional dependencies:

pip install -r requirements/multimodal.txt
pip install emoji ddd-dataset
pip install git+https://github.com/lvis-dataset/lvis-api.git

Note: The LVIS third-party library does not currently support numpy >= 1.24. Please ensure your numpy version meets the requirements. It is recommended to install numpy==1.23.
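
A quick way to confirm the constraint in the note above is to compare the installed numpy version against the 1.24 bound. The following is a minimal sketch; `numpy_is_lvis_compatible` is a hypothetical helper name, and the check only looks at the major and minor version numbers.

```python
def numpy_is_lvis_compatible(version: str) -> bool:
    """Return True if `version` is below 1.24, the bound stated in the note."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) < (1, 24)


if __name__ == "__main__":
    try:
        import numpy
    except ImportError:
        print("numpy is not installed")
    else:
        ok = numpy_is_lvis_compatible(numpy.__version__)
        status = "OK" if ok else "too new for the LVIS API"
        print(f"numpy {numpy.__version__}: {status}")
```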

📂 Data Preparation

Pretrained Weights

Download the following pretrained models and place them under pretrained/:

| Model | Path |
| --- | --- |
| Swin-T | pretrained/swin/swin_tiny_patch4_window7_224.pth |
| Swin-L | pretrained/swin/swin_large_patch4_window7_224.pth |
| BERT | pretrained/bert-base-uncased/ |
| MM-Grounding-DINO Swin-T | pretrained/grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_v3det_20231218_095741-e316e297.pth |
| MM-Grounding-DINO Swin-L | pretrained/grounding_dino/grounding_dino_swin-l_pretrain_obj365_goldg-34dcdc53.pth |

To download the BERT weights, see the MM-Grounding-DINO usage guide.
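
Before training or evaluating, it can help to confirm that the files listed above are in place. The sketch below checks the listed paths relative to a `pretrained/` root; `missing_weights` is a hypothetical helper, and whether every entry is needed depends on the config you run (for example, a Swin-T config presumably does not need the Swin-L files).

```python
from pathlib import Path

# Relative paths copied from the pretrained weights listing above.
EXPECTED_WEIGHTS = [
    "swin/swin_tiny_patch4_window7_224.pth",
    "swin/swin_large_patch4_window7_224.pth",
    "bert-base-uncased",
    "grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_v3det_20231218_095741-e316e297.pth",
    "grounding_dino/grounding_dino_swin-l_pretrain_obj365_goldg-34dcdc53.pth",
]


def missing_weights(root: str = "pretrained") -> list[str]:
    """Return the expected entries that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED_WEIGHTS if not (base / rel).exists()]


if __name__ == "__main__":
    for rel in missing_weights():
        print(f"missing: pretrained/{rel}")
```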

Datasets

For detailed data preparation instructions, please refer to the MM-Grounding-DINO dataset preparation guide.

Training data: Objects365v1

Evaluation datasets: COCO, LVIS (minival), and ODinW35.

The overall data directory structure should look like this:

data/
├── objects365v1/
│   ├── objects365_train.json
│   ├── objects365_val.json
│   ├── o365v1_train_od.json
│   ├── o365v1_label_map.json
│   ├── train/
│   └── val/
├── coco/
│   ├── annotations/
│   │   ├── instances_train2017.json
│   │   └── instances_val2017.json
│   ├── train2017/
│   └── val2017/
├── lvis/
│   ├── annotations/
│   │   ├── lvis_v1_train.json
│   │   ├── lvis_v1_val.json
│   │   └── lvis_v1_minival_inserted_image_name.json
│   ├── train2017/
│   └── val2017/
└── odinw/
    ├── AerialMaritimeDrone/
    │   ├── large/
    │   │   ├── test/
    │   │   ├── train/
    │   │   └── valid/
    │   └── tiled/
    ├── AmericanSignLanguageLetters/
    ├── Aquarium/
    └── ...  (35 datasets in total)
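
The layout above can be checked per dataset before launching a job. This is a minimal sketch that covers only the entries named in the tree (the three ODinW folders shown, not all 35); `check_layout` and `REQUIRED` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# Entries taken from the directory tree above, grouped by dataset folder.
REQUIRED = {
    "objects365v1": [
        "objects365_train.json", "objects365_val.json",
        "o365v1_train_od.json", "o365v1_label_map.json", "train", "val",
    ],
    "coco": [
        "annotations/instances_train2017.json",
        "annotations/instances_val2017.json", "train2017", "val2017",
    ],
    "lvis": [
        "annotations/lvis_v1_train.json", "annotations/lvis_v1_val.json",
        "annotations/lvis_v1_minival_inserted_image_name.json",
        "train2017", "val2017",
    ],
    "odinw": ["AerialMaritimeDrone", "AmericanSignLanguageLetters", "Aquarium"],
}


def check_layout(root: str = "data") -> dict[str, list[str]]:
    """Map each dataset to its missing entries; complete datasets are omitted."""
    base = Path(root)
    report = {}
    for dataset, entries in REQUIRED.items():
        missing = [e for e in entries if not (base / dataset / e).exists()]
        if missing:
            report[dataset] = missing
    return report


if __name__ == "__main__":
    for dataset, missing in check_layout().items():
        print(f"{dataset}: missing {', '.join(missing)}")
```

An empty report means every listed entry was found. Datasets you do not plan to evaluate on can be ignored.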

🚀 Train

bash tools/dist_train.sh configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py 8 --auto-scale-lr

📊 Evaluation

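Checkpoints for quick-start evaluation and inference can be downloaded from HuggingFace. The commands below assume that $CHECKPOINT is set to the path of a downloaded checkpoint and that 8 GPUs are available (the trailing 8 passed to tools/dist_test.sh).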

COCO

# Visual-I
CONFIG=configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py
bash tools/dist_test.sh $CONFIG $CHECKPOINT 8 \
    --cfg-options model.test_cfg.prompt_type='Visual'

# Text
CONFIG=configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py
bash tools/dist_test.sh $CONFIG $CHECKPOINT 8 \
    --cfg-options model.test_cfg.prompt_type='Text'

# Visual-G (extract embedding first, then evaluate)
python scripts/image_demo.py data/coco/train2017 \
    $CONFIG --weights $CHECKPOINT \
    --prompt_type 'Visual' \
    --input_od_json 'images/cocoTrain_odvg_16.json' \
    --extract-visual-embedding --out-dir ./outputs/visual_embedding_coco/

CONFIG=configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py
bash tools/dist_test.sh $CONFIG $CHECKPOINT 8 \
    --cfg-options model.test_cfg.prompt_type='Visual' \
    test_cfg.type='CustomTestLoop' \
    test_cfg.prompt_visual_embedding_path='./outputs/visual_embedding_coco/'

LVIS minival

# Visual-I
CONFIG=configs/pet_dino/val/pet_dino_swin-t_8xb4_12e_obj365_evaluate_lvis_minival_vi.py
bash tools/dist_test.sh $CONFIG $CHECKPOINT 8 \
    --cfg-options model.test_cfg.prompt_type='Visual'

# Text
CONFIG=configs/pet_dino/val/pet_dino_swin-t_8xb4_12e_obj365_evaluate_lvis_minival_vg_text.py
bash tools/dist_test.sh $CONFIG $CHECKPOINT 8 \
    --cfg-options model.test_cfg.prompt_type='Text'

# Visual-G
python scripts/image_demo.py data/lvis \
    $CONFIG --weights $CHECKPOINT \
    --prompt_type 'Visual' \
    --input_od_json 'images/lvisTrain_odvg_16.json' \
    --extract-visual-embedding --out-dir ./outputs/visual_embedding_lvis/

CONFIG=configs/pet_dino/val/pet_dino_swin-t_8xb4_12e_obj365_evaluate_lvis_minival_vg_text.py
bash tools/dist_test.sh $CONFIG $CHECKPOINT 8 \
    --cfg-options model.test_cfg.prompt_type='Visual' \
    test_cfg.type='CustomTestLoop' \
    test_cfg.prompt_visual_embedding_path='./outputs/visual_embedding_lvis/'

ODinW35

# Visual-I
CONFIG=configs/pet_dino/val/pet_dino_swin-t_8xb4_12e_obj365_evaluate_odinw35.py
bash tools/dist_test.sh $CONFIG $CHECKPOINT 8 \
    --cfg-options model.test_cfg.prompt_type='Visual'
# Manually copy the output to the `sample_text` variable in `scripts/extract_bbox_map_from_odinw_str_results.py`, then run:
python scripts/extract_bbox_map_from_odinw_str_results.py

# Text
CONFIG=configs/pet_dino/val/pet_dino_swin-t_8xb4_12e_obj365_evaluate_odinw35.py
bash tools/dist_test.sh $CONFIG $CHECKPOINT 8 \
    --cfg-options model.test_cfg.prompt_type='Text'
# Manually copy the output to the `sample_text` variable in `scripts/extract_bbox_map_from_odinw_str_results.py`, then run:
python scripts/extract_bbox_map_from_odinw_str_results.py

# Visual-G
CONFIG=./configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py
ODinW35_CONFIG=./configs/pet_dino/val/pet_dino_swin-t_8xb4_12e_obj365_evaluate_odinw35.py
bash scripts/evaluate_visual-G_of_odinw35.sh $CONFIG $ODinW35_CONFIG $CHECKPOINT 8
# Calculate the average mAP across all ODinW datasets
python scripts/get_avg_map_of_odinw.py
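
For intuition, averaging across datasets can be read as an unweighted mean of the per-dataset mAP values. The sketch below assumes that reading; it is not the implementation of `scripts/get_avg_map_of_odinw.py`, and the numbers are placeholders rather than PET-DINO results.

```python
from statistics import mean


def average_map(per_dataset_map: dict[str, float]) -> float:
    """Unweighted mean of per-dataset mAP values (each dataset counts once)."""
    if not per_dataset_map:
        raise ValueError("no results to average")
    return mean(list(per_dataset_map.values()))


# Placeholder values for illustration only; these are not measured results.
example = {"dataset_a": 0.25, "dataset_b": 0.75}
print(average_map(example))
```

With an unweighted mean, a small dataset influences the final score as much as a large one.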

⚡ Inference

Text Prompt

python scripts/image_demo.py images/animals.png \
    configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py \
    --weights $CHECKPOINT --pred-score-thr 0.3 \
    --texts 'zebra. giraffe. bird' -c

Visual Prompt (Bounding Boxes)

python scripts/image_demo.py images/animals.png \
    configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py \
    --weights $CHECKPOINT --pred-score-thr 0.3 \
    --prompt_type 'Visual' \
    --prompt_bboxes '[[1291.6, 679.9, 1536.8, 840.0], [845.0, 181.9, 1210.2, 723.5], [638.0, 470.9, 815.0, 704.4]]' \
    --prompt_bboxes_labels '[30, 2, 46]'
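
The two JSON-style arguments have to line up: one label per box. The sketch below validates such a pair before it is passed to the demo script. It assumes boxes are [x1, y1, x2, y2] in pixels, which is inferred from the example values rather than confirmed by the repository; `parse_visual_prompt` is a hypothetical helper.

```python
import json


def parse_visual_prompt(bboxes_arg: str, labels_arg: str) -> list[tuple[int, list[float]]]:
    """Parse both JSON strings and pair each label with its box.

    Assumption: each box is [x1, y1, x2, y2] with x2 > x1 and y2 > y1.
    """
    boxes = json.loads(bboxes_arg)
    labels = json.loads(labels_arg)
    if len(boxes) != len(labels):
        raise ValueError(f"{len(boxes)} boxes but {len(labels)} labels")
    pairs = []
    for label, box in zip(labels, boxes):
        if len(box) != 4:
            raise ValueError(f"box {box} does not have 4 coordinates")
        x1, y1, x2, y2 = box
        if x2 <= x1 or y2 <= y1:
            raise ValueError(f"box {box} has non-positive width or height")
        pairs.append((int(label), [float(v) for v in box]))
    return pairs


pairs = parse_visual_prompt(
    "[[1291.6, 679.9, 1536.8, 840.0], [845.0, 181.9, 1210.2, 723.5]]",
    "[30, 2]",
)
for label, box in pairs:
    print(label, box)
```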

Visual Prompt (Visual Embeddings)

# Extract visual embedding (single image)
python scripts/image_demo.py images/animals.png \
    configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py \
    --weights $CHECKPOINT \
    --prompt_type 'Visual' \
    --prompt_bboxes '[[1291.6, 679.9, 1536.8, 840.0], [845.0, 181.9, 1210.2, 723.5], [638.0, 470.9, 815.0, 704.4]]' \
    --prompt_bboxes_labels '[30, 2, 46]' \
    --extract-visual-embedding --out-dir ./outputs/visual_embedding_animals/

# Inference with visual embedding
python scripts/image_demo.py images/animals.png \
    configs/pet_dino/pet_dino_swin-t_8xb4_12e_obj365.py \
    --weights $CHECKPOINT --pred-score-thr 0.3 \
    --prompt_type 'Visual' \
    --prompt_visual_embedding_path outputs/visual_embedding_animals/30.pt

📜 Citation

If you find this work useful for your research, please consider citing:

@article{fu2026pet,
  title={PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training},
  author={Fu, Weifu and Li, Jinyang and Gao, Bin-Bin and Li, Jialin and Lin, Yuhuan and Deng, Hanqiu and Tao, Wenbing and Liu, Yong and Wang, Chengjie},
  journal={arXiv preprint arXiv:2604.00503},
  year={2026}
}

🤝 Acknowledgement

This project is built upon the following excellent works:

  • MMDetection: OpenMMLab detection toolbox and benchmark.
  • MM-Grounding-DINO: An open and comprehensive pipeline for unified object grounding and detection.

License

This project is released under the Apache 2.0 license.
