feat: add bioimage conversion benchmark by ash-0x00 · Pull Request #3 · iscc/twinspect

ash-0x00 · 2026-06-08T09:51:48Z

Summary

Adds a reproducible bioimage_convert_1000 dataset built from public Broad Bioimage Benchmark Collection (BBBC) archives for evaluating iscc-bio IMAGEWALK Data-Code matching under real, default format-conversion drift.

Each selected source image becomes a same-source cluster:

0original*
1variant_ome-tiff.ome.tiff
2variant_tiff.tiff
3variant_png.png
4variant_jp2.jp2
5variant_dicom.dcm
6variant_ics.ics
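
As a minimal sketch of the layout above, the file names of one cluster can be derived from a fixed list of targets. This assumes that the `*` in `0original*` stands for the source file's own suffix; `VARIANT_TARGETS` and `cluster_file_names` are hypothetical names for illustration, not identifiers from the PR.

```python
# Target label -> file extension, copied from the cluster layout above.
VARIANT_TARGETS = [
    ("ome-tiff", ".ome.tiff"),
    ("tiff", ".tiff"),
    ("png", ".png"),
    ("jp2", ".jp2"),
    ("dicom", ".dcm"),
    ("ics", ".ics"),
]


def cluster_file_names(source_suffix: str) -> list[str]:
    """Return the file names for one same-source cluster.

    Assumption: the original keeps its own suffix at index 0 and each
    converted variant takes the next index in target order.
    """
    names = [f"0original{source_suffix}"]
    for index, (label, extension) in enumerate(VARIANT_TARGETS, start=1):
        names.append(f"{index}variant_{label}{extension}")
    return names
```

For a BMP source this yields `0original.bmp` followed by the six variant names listed above.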

The benchmark intentionally uses Bio-Formats default conversion operators only. It does not include synthetic perturbations such as brightness shifts, blur, crop, or custom compression/quality flags.

This update also prepares TwinSpect for directory-backed microscopy outputs: .zarr/OME-NGFF directories are treated as single benchmark media inputs instead of walking and hashing their internal chunk files.
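
One way to get this behavior is to prune directory-backed media during the directory walk. The sketch below is an illustration under stated assumptions, not the code in twinspect/algos/processing.py: `iter_media_items` and `DIRECTORY_MEDIA_SUFFIXES` are hypothetical names, and only the `.zarr` suffix is treated as directory-backed media.

```python
import os
from pathlib import Path

# Directory suffixes treated as one media item (assumption: only ".zarr").
DIRECTORY_MEDIA_SUFFIXES = {".zarr"}


def iter_media_items(root: Path):
    """Yield regular files, plus each .zarr directory as a single item."""
    for current, dirnames, filenames in os.walk(root):
        current_path = Path(current)
        kept = []
        for name in sorted(dirnames):
            if Path(name).suffix.lower() in DIRECTORY_MEDIA_SUFFIXES:
                # Yield the directory itself instead of its chunk files.
                yield current_path / name
            else:
                kept.append(name)
        # Prune in place so os.walk never descends into the .zarr internals.
        dirnames[:] = kept
        for name in sorted(filenames):
            yield current_path / name
```

Assigning to `dirnames[:]` relies on the documented top-down behavior of `os.walk`, which only visits the subdirectories that remain in that list.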

Tooling

Pins OME Bio-Formats bftools 8.5.0
Verifies bftools.zip by SHA-256: 07a3bb1d3de84da3a709655a1008cb2d9b19becc5bad4ae4112633aec9380478
Downloads to ~/.cache/twinspect/bioformats by default
Uses the Java bfconvert launcher, so the pinned converter path is cross-platform, including Windows (bfconvert.bat)
Still supports overrides:
- TWINSPECT_BIOIMAGE_CONVERT_TEMPLATE
- legacy TWINSPECT_BIOIMAGE_CONVERT_BIN for imgcnv-style commands
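
The checksum step can be sketched with the standard library alone. The digest is the one quoted above; `sha256_of` and `verify_archive` are hypothetical helper names, and the PR's actual download code may be structured differently.

```python
import hashlib
from pathlib import Path

# Digest quoted in the Tooling list for bftools.zip (Bio-Formats 8.5.0).
BFTOOLS_SHA256 = "07a3bb1d3de84da3a709655a1008cb2d9b19becc5bad4ae4112633aec9380478"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path: Path, expected: str = BFTOOLS_SHA256) -> None:
    """Raise ValueError if the file's SHA-256 does not match the pinned digest."""
    actual = sha256_of(path)
    if actual != expected:
        raise ValueError(
            f"SHA-256 mismatch for {path}: got {actual}, expected {expected}"
        )
```

Verifying before extraction means a corrupted or substituted download fails early instead of producing converter output that cannot be reproduced.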

Corrected benchmark intent

The task is to find real bioimage format conversions that produce non-identical but matchable IMAGEWALK bitstreams, not to manipulate images manually or force codec settings.

I removed the custom JPEG-compression variants because they were too artificial for this benchmark. I then tried more default target formats, including more exotic/proprietary-ish Bio-Formats targets:

OME-TIFF
TIFF
PNG
JPEG
JPEG 2000 (.jp2)
DICOM (.dcm)
ICS (.ics)
IDS (.ids)
AVI (.avi)
QuickTime (.mov)
plus BMP/GIF/FITS/PCX/PICT/PSD/PGM/PPM probes where Bio-Formats accepted the requested target

Findings from default conversions:

DICOM and ICS are useful exotic/default targets and hash cleanly after fixing TwinSpect's iscc-bio wrapper to tolerate Bio-Formats LazyBioArray planes.
IDS is not a clean single-file benchmark target with bfconvert; it creates a sidecar pair, so I excluded it.
AVI, QuickTime, and default JPEG were not robust across the mixed BBBC manifest, especially for 16-bit samples, so I excluded them from the committed 1000-cluster target set.
On sampled BBBC sources, OME-TIFF/TIFF/PNG/DICOM/ICS generally preserve identical IMAGEWALK Data-Codes.
JPEG 2000 is the best default-only non-identical candidate found so far: it stays identical on some TIFF sources, but produces converter drift on BMP sources. In the smoke run below, BBBC013 BMP → JP2 had Hamming distance 35 on a 64-bit IMAGEWALK Data-Code — non-identical, but not within the usual near-match threshold.
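
For reference, the Hamming distance used in these findings is the number of bit positions in which two codes differ. The sketch below assumes the 64-bit code bodies are already available as integers or as equal-length hex strings; how iscc-bio encodes and exposes the code is not shown in this PR, so the input format here is an assumption.

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Count the differing bits between two equal-width integer codes."""
    return bin(code_a ^ code_b).count("1")


def hamming_distance_hex(hex_a: str, hex_b: str) -> int:
    """Same comparison for codes given as hex strings of equal length."""
    if len(hex_a) != len(hex_b):
        raise ValueError("codes must have the same length")
    return hamming_distance(int(hex_a, 16), int(hex_b, 16))
```

On a 64-bit code the result ranges from 0 (identical) to 64 (every bit flipped), so the reported distance of 35 means slightly more than half of the bits differ.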

The current PR therefore reports the result as it stands: it includes default exotic conversion targets and records the preliminary finding that I have not yet found a default-only proprietary/exotic single-file conversion that is both non-identical and matchable under the current 64-bit IMAGEWALK behavior. The benchmark is now positioned to measure exactly that instead of baking in artificial JPEG-quality pressure.

A known real-world stress case to add next is Zeiss CZI with lossy JPEG-XR compression converted to OME-Zarr via bioformats2raw. In prior investigation, WT A17 Myotype 2-003.czi vs WT A17 Myotype 2-003.zarr showed about 96% exact pixel agreement but different IMAGEWALK hashes: roughly 4% of uint16 pixels differed, with max difference 15789 and mean intensity CZI 576.62 vs Zarr 579.26. That is the right kind of natural decoder/conversion drift: BioIO/Bio-Formats direct CZI decode and bioformats2raw→OME-Zarr can both be technically valid decode paths for a lossy JPEG-XR source. The PR now handles .zarr directories correctly as one benchmark item so this case can be added once the source/converted sample is made reproducible for reviewers.
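
The drift statistics quoted for the CZI case (exact-agreement fraction, maximum difference, per-image mean) can be computed along the following lines. This is a pure-Python sketch over flat pixel sequences; `pixel_drift_stats` is a hypothetical name, and decoding the CZI and OME-Zarr planes into such sequences is assumed to happen elsewhere.

```python
from statistics import fmean


def pixel_drift_stats(pixels_a, pixels_b):
    """Compare two equal-length flat sequences of integer pixel values."""
    if len(pixels_a) != len(pixels_b):
        raise ValueError("pixel sequences must have the same length")
    if not pixels_a:
        raise ValueError("pixel sequences must not be empty")
    # Absolute per-pixel difference; 0 means the two decode paths agree.
    diffs = [abs(a - b) for a, b in zip(pixels_a, pixels_b)]
    exact = sum(1 for d in diffs if d == 0)
    return {
        "exact_fraction": exact / len(diffs),
        "max_abs_diff": max(diffs),
        "mean_a": fmean(pixels_a),
        "mean_b": fmean(pixels_b),
    }
```

For full-size microscopy planes an array library would be the practical choice; the plain-list version is only meant to make the four reported quantities unambiguous.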

Validation

Local checks:

uv run pytest tests/test_processing.py tests/test_bioimage_convert_dataset.py -q
- 16 passed
uv run pytest -q
- 19 passed
uv run ruff check twinspect/algos/processing.py tests/test_processing.py twinspect/datasets/bioimage_convert.py twinspect/algos/iscc_bio.py tests/test_bioimage_convert_dataset.py tests/test_iscc_bio_algo.py
- All checks passed
uv run python -m twinspect datasets
- config loads and lists bioimage_convert_1000

End-to-end smoke conversion using pinned bftools 8.5.0:

Download/source extraction: verified from committed BBBC manifest samples
Conversion: built clusters for representative bbbc005, bbbc006_z00, and bbbc013_bmp samples
Hashing: decoded and hashed all committed variants with bioimage_data_code_iw_64

Preliminary Hamming distances, original vs converted variant:

bbbc005
- OME-TIFF: 0
- TIFF: 0
- PNG: 0
- JP2: 0
- DICOM: 0
- ICS: 0
bbbc006_z00
- OME-TIFF: 0
- TIFF: 0
- PNG: 0
- JP2: 0
- DICOM: 0
- ICS: 0
bbbc013_bmp
- OME-TIFF: 0
- TIFF: 0
- PNG: 0
- JP2: 35
- DICOM: 0
- ICS: 0
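
The distances above can also be kept as data so that non-identical pairs are easy to pick out when the full dataset is regenerated. The values below are copied from the list above; `DISTANCES` and `non_identical_pairs` are hypothetical names for illustration.

```python
# Preliminary Hamming distances reported above (original vs converted variant).
DISTANCES = {
    "bbbc005": {"OME-TIFF": 0, "TIFF": 0, "PNG": 0, "JP2": 0, "DICOM": 0, "ICS": 0},
    "bbbc006_z00": {"OME-TIFF": 0, "TIFF": 0, "PNG": 0, "JP2": 0, "DICOM": 0, "ICS": 0},
    "bbbc013_bmp": {"OME-TIFF": 0, "TIFF": 0, "PNG": 0, "JP2": 35, "DICOM": 0, "ICS": 0},
}


def non_identical_pairs(distances):
    """Return (source, target, distance) for every pair with distance > 0."""
    return [
        (source, target, distance)
        for source, targets in distances.items()
        for target, distance in targets.items()
        if distance > 0
    ]
```

In this smoke run only one of the 18 source/target pairs is non-identical: bbbc013_bmp converted to JP2, at distance 35.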

Caveats

This is a same-source conversion robustness benchmark, not a claim that every target preserves normalized pixels.
The current default-only target set is conservative: targets that failed across the mixed manifest or produced sidecar assets were excluded.
Converted image assets are generated locally and not committed.
Reviewers should regenerate the full 1000-cluster dataset and confirm whether additional source/target pairs produce non-identical-but-matchable IMAGEWALK distances.
The CZI/JPEG-XR → OME-Zarr case is documented as the likely natural drift target but is not yet included in the generated dataset because the PR still needs a reproducible public sample/derived OME-Zarr pair or a pinned bioformats2raw generation path.


ash-0x00 added 7 commits June 8, 2026 11:50

feat: add bioimage conversion benchmark

c8c3562

feat: pin bioimage conversion tooling

39798dd

feat: add bioimage similarity variants

c8139c3

Revert "feat: add bioimage similarity variants"

8d6b7d3

This reverts commit c8139c3.

feat: use real codec conversions for bioimage variants

fde01bf

refine bioimage default conversion targets

907d62c

support zarr media inputs for bioimage drift

28834fb

ash-0x00 wants to merge 7 commits into iscc:main from ash-0x00:feat/bioimage-convert-dataset