SearchAD Devkit

Official development toolkit for SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving — accepted at CVPR 2026.

Felix Embacher, Jonas Uhrig, Marius Cordts, Markus Enzweiler
Mercedes-Benz AG & Esslingen University of Applied Sciences

About SearchAD

Retrieving rare and safety-critical driving scenarios from large-scale datasets is essential for building robust autonomous driving systems. As dataset sizes grow, the key challenge shifts from collecting more data to efficiently identifying the most relevant samples.

SearchAD tackles the needle-in-a-haystack problem: locating extremely rare classes — some appearing fewer than 50 times across the entire SearchAD dataset with more than 423k frames.

Key Facts

Property	Value
Total frames	423,798
Bounding box annotations	513,265+
Rare classes	90
Additional classes	Animal-Real-Other, Animal-Statue-Other, Human-Duty-Other, Object-Movable-Other, Vehicle-Construction-Other
Category groups	9 (Animal, Human, Marking, Object, Rideable, Scene, Sign, Trailer, Vehicle)
Source datasets	11 established AD datasets

Source Datasets

Dataset	Val. Set	Frames	Original Classes	SearchAD Classes	Annotations	Download Instructions
Lost and Found		2,239	42	18	2,098	Download leftImg8bit/
WildDash2	✓	5,068	26	80	5,032	Download wd_public_v2p0.zip and wd_both_02.zip
ACDC	✓	8,012	19	60	7,471	Download rgb_anon_trainvaltest.zip
IDD Segmentation	✓	10,003	30	52	12,192	Download IDD Segmentation (IDD 20k Part I) (18.5 GB)
KITTI		14,999	8	47	9,840	Download left color images of object data set (12 GB)
Cityscapes	✓	24,998	30	75	31,037	Download leftImg8bit_trainvaltest.zip and leftImg8bit_trainextra.zip
Mapillary Vistas	✓	25,000	66	86	35,093	Download mapillary-vistas-dataset_public_v2.0.zip
ECP	✓	47,335	8	76	33,081	Download ECP day and night, train, val, test (6 zip files)
nuScenes	✓	80,314	32	56	166,152	Download Trainval and Test
BDD100K	✓	100,000	12	80	83,102	Download 100K Images
Mapillary Sign	✓	105,830	313	90	128,167	Click Download dataset
SearchAD (combined)		423,798	—	90	513,265

Tasks

SearchAD supports a range of retrieval and perception tasks:

Retrieval & Search

Text-to-image retrieval — language-guided search using text queries
Image-to-image retrieval — vision-guided search using image support sets
Few-shot learning — few-shot learning using samples of the SearchAD train dataset
Fine-tuning — fine-tuning VLMs on the SearchAD train dataset

Beyond Retrieval

Long-tail perception research — benchmark model robustness on extremely rare real-world classes
Open-world and open-vocabulary object detection — evaluate detectors on rare, unseen, or underrepresented categories
Out-of-distribution (OOD) detection — identify rare or anomalous objects outside typical training distributions
Retrieval-driven data curation — use search to actively mine relevant training samples from large unlabeled datasets
Domain adaptation — leverage the diversity of 11 source datasets for cross-domain generalization research

Installation

Install the devkit via pip directly from the repository:

pip install .

Or install in editable mode with development dependencies:

pip install -e ".[dev]"

Requirements: Python >= 3.10. Dependencies (torch, opencv-python, numpy, matplotlib) are installed automatically.

Scripts

All scripts are run from the root of the repository and expect the searchad/ package to be installed or on the Python path.

Download Dataset

Downloads the SearchAD annotations and default queries from HuggingFace and prints step-by-step instructions for downloading the 11 source dataset image archives.

python searchad/download_dataset.py \
    --searchad-dir "/path/to/searchad" \
    --hf-token "hf_xxxxxxxxxxxx"

Argument	Description
`--searchad-dir`	Directory where the SearchAD folder will be created
`--hf-token`	HuggingFace access token for gated datasets (or set `HF_TOKEN` env variable)

The script:

Downloads searchad.zip (~11 MB) from HuggingFace
Extracts annotation JSON files (train, val, test mapping) and the default_queries/ folder into --output-dir
Prints download instructions for all 11 source datasets

Note: Due to the dataset licenses, the 423k source images must be downloaded manually from the individual dataset hosts (see printed instructions or the table above). Only the annotations and default queries are hosted on HuggingFace.

After downloading and extracting everything, your SearchAD directory should look like this:

searchad/
├── ECP/
│   ├── ...
├── IDD_Segmentation/
│   ├── ...
├── acdc/
│   ├── ...
├── bdd100k_images_100k/
│   ├── ...
├── cityscapes/
│   ├── ...
├── kitti/
│   ├── ...
├── lostandfound/
│   ├── ...
├── mapillary_sign/
│   ├── ...
├── mapillary_vistas/
│   ├── ...
├── nuscenes/
│   ├── ...
├── wd_both02/
│   ├── ...
├── wd_publicv2p0/
│   ├── ...
├── searchad_annotations_train.json
├── searchad_annotations_val.json
├── searchad_test_mapping_id_to_imagepath.json
└── default_queries/
    ├── ...

Prepare Image-Level Annotations

Derives image-level annotations from bounding-box annotations for a given split.

python searchad/prepare_image_level_annotations.py \
    --searchad-dir "/path/to/searchad" \
    --split "val"

Argument	Description
`--searchad-dir`	Path to the SearchAD directory
`--split`	Dataset split to process: `train` or `val`

Check Dataset Setup

Verify that your SearchAD directory is complete and correctly structured before running any other scripts.

python searchad/check_searchad_setup.py \
    --searchad-dir "/path/to/searchad"

Argument	Description
`--searchad-dir`	Path to the SearchAD directory to check

The script checks for:

Required annotation JSON files (train, val, test mapping)
Optional generated image-level annotation files
Presence of all expected SearchAD labels in the annotations
default_queries/ folder with one JSON per label
All subdataset directories and that referenced image paths exist on disk
Summary of total image counts per split

Prune SearchAD Subdatasets

Removes all non-essential files from specific subdatasets, keeping only the required image data and license files. Since SearchAD only uses front and back camera images, this significantly reduces disk usage for datasets that include additional sensors, annotations, or camera perspectives.

python searchad/prune_searchad_datasets.py \
    --searchad-dir "/path/to/searchad" \
    --datasets-to-prune \
        "acdc" \
        "bdd100k_images_100k" \
        "cityscapes" \
        "ECP" \
        "IDD_Segmentation" \
        "kitti" \
        "lostandfound" \
        "mapillary_sign" \
        "mapillary_vistas" \
        "nuscenes" \
        "wd_both02" \
        "wd_publicv2p0"

Argument	Description
`--searchad-dir`	Path to the SearchAD directory
`--datasets-to-prune`	Space-separated list of dataset names to prune (e.g., `acdc mapillary_sign`)

Create Dummy Predictions

Generate a random baseline predictions file for a given split, useful for verifying the evaluation script (val set) and the submission script (test set) below.

python searchad/create_dummy_predictions.py \
    --searchad-dir "/path/to/searchad" \
    --predictions-file "/path/to/results_dummy_val.json" \
    --split "val"

Argument	Description
`--searchad-dir`	Path to the SearchAD directory
`--predictions-file`	Path where the output predictions JSON file will be saved
`--split`	Split to generate predictions for: `train`, `val`, or `test`
`--seed`	Random seed for reproducibility (default: `42`)

Evaluate Predictions

Compute metrics (MAP, R-Precision, P@5) for your predictions against the ground-truth annotations.

python searchad/evaluate.py \
    --predictions-file "/path/to/your/results.json" \
    --split "val" \
    --searchad-dir "/path/to/searchad" \
    --scores-output-dir "/path/to/scores_output"

Argument	Description
`--predictions-file`	Path to your predictions JSON file
`--split`	Dataset split to evaluate on (`val` or `train`)
`--searchad-dir`	Path to the SearchAD directory
`--scores-output-dir`	Directory where score outputs will be saved

The predictions file contains a ranked imagelist (37,325 entries for val) for each SearchAD label. Paths are relative to the SearchAD directory and must start with the subdataset name:

{
    "Animal-Real-Cat": [
        "acdc/rgb_anon/train/fog/GOPR0475/GOPR0475_frame_000001_rgb_anon.png",
        "bdd100k_images_100k/images/100k/train/0000f77c-6257be58.jpg",
        "..."
    ],
    "Animal-Real-Cow": ["..."],
    "...",
    "Vehicle-Train": ["..."]
}

Create Submission File

Packages your predictions into a submission.json file ready for upload to the SearchAD Benchmark Server.

python searchad/create_submission_file.py \
    --searchad-dir "/path/to/searchad" \
    --predictions-file "/path/to/your/results.json" \
    --submission-output-dir "/path/to/submission_output" \
    --team-name "YOUR_TEAM_NAME" \
    --model-name "YOUR_MODEL_NAME" \
    --paper-code-affiliation "YOUR_PAPER_CODE_OR_AFFILIATION" \
    --search-mode "Language" \
    --searchad-train "No" \
    --default-queries "Yes"

Argument	Description
`--searchad-dir`	Path to the SearchAD directory
`--predictions-file`	Path to your predictions JSON file
`--submission-output-dir`	Directory where `submission.json` will be saved
`--team-name`	Name of your team
`--model-name`	Name of your model
`--paper-code-affiliation`	Link to paper/code or affiliation
`--search-mode`	Search mode used: `Language`, `Vision`, or `Multimodal`
`--searchad-train`	Whether the SearchAD training set was used: `Yes` or `No`
`--default-queries`	Whether default queries were used: `Yes` or `No`
`--overwrite`	If set, overwrite an existing `submission.json` instead of raising an error

Note: Before uploading submission.json on the benchmark page, click the blue "Login with Hugging Face" button in the bottom left. If the button does not respond, try a different browser — Safari has been observed to silently fail the login, causing an "invalid token" error on submission.

Note: If your submission stays in PROCESSING for more than an hour, your HuggingFace authorization token may be missing a required scope. Go to huggingface.co/settings/connected-applications, revoke the benchmark Space's access, then log in again — you should be prompted to grant read access to private and gated repositories.

Dataset Statistics

Print the label distribution table for the SearchAD dataset, broken down by split optionally per-subdataset using the --by-subdataset flag.

python searchad/print_dataset_statistics.py \
    --searchad-dir "/path/to/searchad" \
    --splits "train" "val" \
    --level "object" \
    --statistics-type "absolute" \
    --output-dir "/path/to/dataset_statistics" \
    --by-subdataset

Argument	Description
`--searchad-dir`	Path to the SearchAD directory
`--splits`	One or more splits to include: `train`, `val` (default: both)
`--level`	`object` counts total bounding-box annotations per label; `image` counts images containing each label (default: `object`)
`--statistics-type`	`absolute` for raw counts; `relative` for each split's fraction of the label total (default: `absolute`)
`--output-dir`	Optional directory to save the table(s) as `.txt` files
`--by-subdataset`	If set, also prints a second table with per-subdataset counts

Visualize Bounding Boxes

Visualize bounding box annotations for a specific label on sample images from a given split.

python searchad/visualize_bbox_labels.py \
    --searchad-dir "/path/to/searchad" \
    --output-dir "/path/to/output_bbox_visualizations" \
    --searchad-label "Animal-Real-Cat" \
    --split "train" \
    --num-images 5

Argument	Description
`--searchad-dir`	Path to the SearchAD directory
`--output-dir`	Directory where visualizations will be saved
`--searchad-label`	Label to visualize (e.g., `Animal-Real-Cat`); omit to visualize all labels
`--split`	Dataset split: `train` or `val`
`--num-images`	Number of images to visualize
`--shorten-labels`	If set, display shortened label names instead of full ones
`--hide-labels`	If set, draw bounding boxes without any label text
`--only-target-label`	If set, only draw boxes for the target label; all other labels in the image are suppressed

Visualize Retrieval Results

Visualize the top-k retrieved images for each query label, highlighted with green borders for correct and red borders for incorrect retrievals, plus bounding box annotations.

python searchad/visualize_retrieval.py \
    --predictions-file "/path/to/your/results.json" \
    --searchad-dir "/path/to/searchad" \
    --visualization-output-dir "/path/to/output_retrieval_visualizations" \
    --split "val" \
    --topk 5 \
    --searchad-label "Animal-Real-Cat" \
    --resize-for-collage \
    --shorten-labels

Argument	Description
`--predictions-file`	Path to your predictions JSON file
`--searchad-dir`	Path to the SearchAD directory
`--visualization-output-dir`	Directory where visualized images and collages will be saved
`--split`	Dataset split: `train` or `val`
`--topk`	Number of top retrieved images to visualize per query (default: `10`)
`--searchad-label`	Optional: visualize only this label (e.g., `Animal-Real-Cat`); omit to visualize all
`--resize-for-collage`	If set, resize images to a consistent 16:9 resolution before drawing
`--shorten-labels`	If set, display shortened label names instead of full ones

Visualize Support Sets

Generate collage visualizations of the vision support sets for each query.

python searchad/visualize_support_sets.py \
    --searchad-dir "/path/to/searchad" \
    --output-collage-dir "/path/to/support_set_collage_visu_output" \
    --crop-size 256 \
    --searchad-label "Animal-Real-Cat"

Argument	Description
`--searchad-dir`	Path to the SearchAD directory
`--output-collage-dir`	Directory where collage images will be saved
`--crop-size`	Size (in pixels) to crop each support set image to
`--searchad-label`	Optional: visualize only this label's support set (e.g. `Animal-Real-Cat`); omit to visualize all

Citation

If you use SearchAD in your research, please cite:

@article{embacher2026searchadlargescalerareimage,
    title={SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving},
    author={Felix Embacher and Jonas Uhrig and Marius Cordts and Markus Enzweiler},
    year={2026},
    eprint={2604.08008},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2604.08008},
}

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
examples		examples
searchad		searchad
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
ruff.toml		ruff.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SearchAD Devkit

About SearchAD

Key Facts

Source Datasets

Tasks

Installation

Scripts

Download Dataset

Prepare Image-Level Annotations

Check Dataset Setup

Prune SearchAD Subdatasets

Create Dummy Predictions

Evaluate Predictions

Create Submission File

Dataset Statistics

Visualize Bounding Boxes

Visualize Retrieval Results

Visualize Support Sets

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

SearchAD Devkit

About SearchAD

Key Facts

Source Datasets

Tasks

Installation

Scripts

Download Dataset

Prepare Image-Level Annotations

Check Dataset Setup

Prune SearchAD Subdatasets

Create Dummy Predictions

Evaluate Predictions

Create Submission File

Dataset Statistics

Visualize Bounding Boxes

Visualize Retrieval Results

Visualize Support Sets

Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages