feat: add bioimage conversion benchmark #3

Open
ash-0x00 wants to merge 7 commits into iscc:main from ash-0x00:feat/bioimage-convert-dataset

ash-0x00 commented Jun 8, 2026

Summary

Adds a reproducible bioimage_convert_1000 dataset built from public Broad Bioimage Benchmark Collection archives for evaluating iscc-bio IMAGEWALK Data-Code matching under real, default format-conversion drift.

Each selected source image becomes a same-source cluster:

  • 0original*
  • 1variant_ome-tiff.ome.tiff
  • 2variant_tiff.tiff
  • 3variant_png.png
  • 4variant_jp2.jp2
  • 5variant_dicom.dcm
  • 6variant_ics.ics

The benchmark intentionally uses Bio-Formats default conversion operators only. It does not include synthetic perturbations such as brightness shifts, blur, crop, or custom compression/quality flags.

This update also prepares TwinSpect for directory-backed microscopy outputs: .zarr/OME-NGFF directories are treated as single benchmark media inputs instead of walking and hashing their internal chunk files.
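
A minimal sketch of the directory-as-media idea, not the actual TwinSpect code: the function name `iter_media_inputs` and the suffix tuple are assumptions made for illustration. It walks a dataset root, yields ordinary files, and yields each `.zarr` directory as one item without descending into its chunk files.

```python
import os
from pathlib import Path

# Assumption for this sketch: directory-backed media are recognised by suffix only.
DIRECTORY_MEDIA_SUFFIXES = (".zarr",)


def iter_media_inputs(root):
    """Yield media inputs under root, treating .zarr directories as single items."""
    for dirpath, dirnames, filenames in os.walk(root):
        keep = []
        for name in sorted(dirnames):
            if name.lower().endswith(DIRECTORY_MEDIA_SUFFIXES):
                # The whole store is one benchmark item.
                yield Path(dirpath) / name
            else:
                keep.append(name)
        # Prune in place so os.walk never descends into a .zarr store.
        dirnames[:] = keep
        for name in sorted(filenames):
            yield Path(dirpath) / name
```

Pruning `dirnames` in place relies on the default top-down traversal of `os.walk`. If the root itself is a `.zarr` directory, this sketch would still walk into it, so a real implementation would likely check the root first.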

Tooling

  • Pins OME Bio-Formats bftools 8.5.0
  • Verifies bftools.zip by SHA-256: 07a3bb1d3de84da3a709655a1008cb2d9b19becc5bad4ae4112633aec9380478
  • Downloads to ~/.cache/twinspect/bioformats by default
  • Uses the Java bfconvert launcher, so the pinned converter path is cross-platform including Windows (bfconvert.bat)
  • Still supports overrides:
    • TWINSPECT_BIOIMAGE_CONVERT_TEMPLATE
    • legacy TWINSPECT_BIOIMAGE_CONVERT_BIN for imgcnv-style commands
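
The checksum pin above can be sketched as follows. This is an illustrative version, not the code in this PR: `sha256_of` and `verify_archive` are hypothetical helper names, and only the digest value comes from the description.

```python
import hashlib
from pathlib import Path

# Digest quoted in the Tooling list for the pinned bftools.zip.
BFTOOLS_SHA256 = "07a3bb1d3de84da3a709655a1008cb2d9b19becc5bad4ae4112633aec9380478"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=BFTOOLS_SHA256):
    """Raise ValueError if the file's digest differs from the expected one."""
    actual = sha256_of(path)
    if actual != expected:
        raise ValueError(f"SHA-256 mismatch for {path}: got {actual}, expected {expected}")
    return Path(path)
```

Checking the digest before unpacking means a corrupted or substituted download fails early instead of producing converter output that silently differs between machines.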

Corrected benchmark intent

The task is to find real bioimage format conversions that produce non-identical but matchable IMAGEWALK bitstreams, not to manipulate images manually or force codec settings.

I removed the custom JPEG-compression variants because they were too artificial for this benchmark. I then tried more default target formats, including more exotic/proprietary-ish Bio-Formats targets:

  • OME-TIFF
  • TIFF
  • PNG
  • JPEG
  • JPEG 2000 (.jp2)
  • DICOM (.dcm)
  • ICS (.ics)
  • IDS (.ids)
  • AVI (.avi)
  • QuickTime (.mov)
  • plus BMP/GIF/FITS/PCX/PICT/PSD/PGM/PPM probes where Bio-Formats accepted the requested target

Findings from default conversions:

  • DICOM and ICS are useful exotic/default targets and hash cleanly after fixing TwinSpect's iscc-bio wrapper to tolerate Bio-Formats LazyBioArray planes.
  • IDS is not a clean single-file benchmark target with bfconvert; it creates a sidecar pair, so I excluded it.
  • AVI, QuickTime, and default JPEG were not robust across the mixed BBBC manifest, especially for 16-bit samples, so I excluded them from the committed 1000-cluster target set.
  • On sampled BBBC sources, OME-TIFF/TIFF/PNG/DICOM/ICS generally preserve identical IMAGEWALK Data-Codes.
  • JPEG 2000 is the best default-only non-identical candidate found so far: it stays identical on some TIFF sources, but produces converter drift on BMP sources. In the smoke run below, BBBC013 BMP → JP2 had Hamming distance 35 on a 64-bit IMAGEWALK Data-Code: non-identical, but outside the usual near-match threshold.

So the current PR states its result plainly: it includes default exotic conversion targets and records the preliminary finding that I have not yet found a default-only proprietary/exotic single-file conversion that is both non-identical and matchable under the current 64-bit IMAGEWALK behavior. The benchmark is now positioned to measure exactly that instead of baking in artificial JPEG-quality pressure.

A known real-world stress case to add next is Zeiss CZI with lossy JPEG-XR compression converted to OME-Zarr via bioformats2raw. In prior investigation, WT A17 Myotype 2-003.czi vs WT A17 Myotype 2-003.zarr showed about 96% exact pixel agreement but different IMAGEWALK hashes: roughly 4% of uint16 pixels differed, with max difference 15789 and mean intensity CZI 576.62 vs Zarr 579.26. That is the right kind of natural decoder/conversion drift: BioIO/Bio-Formats direct CZI decode and bioformats2raw→OME-Zarr can both be technically valid decode paths for a lossy JPEG-XR source. The PR now handles .zarr directories correctly as one benchmark item so this case can be added once the source/converted sample is made reproducible for reviewers.
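
For reviewers who want to reproduce that kind of comparison once a sample pair is available, here is a rough sketch of the statistics quoted above (exact-agreement fraction, maximum absolute difference, per-image mean). It works on flat sequences of integer pixel values to stay dependency-free; `drift_stats` is a hypothetical helper, and a real check would more likely run on NumPy arrays of the decoded planes.

```python
def drift_stats(a, b):
    """Compare two equal-length sequences of integer pixel values."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    n = len(diffs)
    return {
        "exact_fraction": sum(d == 0 for d in diffs) / n,  # share of identical pixels
        "max_abs_diff": max(diffs),
        "mean_a": sum(a) / n,
        "mean_b": sum(b) / n,
    }
```

For example, `drift_stats([10, 20, 30, 40], [10, 20, 30, 44])` reports an exact fraction of 0.75, a maximum difference of 4, and means of 25.0 and 26.0.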

Validation

Local checks:

  • uv run pytest tests/test_processing.py tests/test_bioimage_convert_dataset.py -q
    • 16 passed
  • uv run pytest -q
    • 19 passed
  • uv run ruff check twinspect/algos/processing.py tests/test_processing.py twinspect/datasets/bioimage_convert.py twinspect/algos/iscc_bio.py tests/test_bioimage_convert_dataset.py tests/test_iscc_bio_algo.py
    • All checks passed
  • uv run python -m twinspect datasets
    • config loads and lists bioimage_convert_1000

End-to-end smoke conversion using pinned bftools 8.5.0:

  • Download/source extraction: verified from committed BBBC manifest samples
  • Conversion: built clusters for representative bbbc005, bbbc006_z00, and bbbc013_bmp samples
  • Hashing: decoded and hashed all committed variants with bioimage_data_code_iw_64

Preliminary Hamming distances, original vs converted variant:

  • bbbc005
    • OME-TIFF: 0
    • TIFF: 0
    • PNG: 0
    • JP2: 0
    • DICOM: 0
    • ICS: 0
  • bbbc006_z00
    • OME-TIFF: 0
    • TIFF: 0
    • PNG: 0
    • JP2: 0
    • DICOM: 0
    • ICS: 0
  • bbbc013_bmp
    • OME-TIFF: 0
    • TIFF: 0
    • PNG: 0
    • JP2: 35
    • DICOM: 0
    • ICS: 0
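
The distances above are bit-level Hamming distances between two 64-bit codes. A minimal sketch, assuming each code is already available as its raw 8-byte body (decoding an ISCC string to those bytes is not shown, and `hamming_distance` is a name chosen here for illustration):

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count differing bits between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    # XOR leaves a 1 bit wherever the inputs differ; count those bits per byte.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))
```

On a 64-bit code the result ranges from 0 to 64, so the JP2 distance of 35 for bbbc013_bmp means more than half of the bits differ.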

Caveats

  • This is a same-source conversion robustness benchmark, not a claim that every target preserves normalized pixels.
  • The current default-only target set is conservative: targets that failed across the mixed manifest or produced sidecar assets were excluded.
  • Converted image assets are generated locally and not committed.
  • Reviewers should regenerate the full 1000-cluster dataset and confirm whether additional source/target pairs produce non-identical-but-matchable IMAGEWALK distances.
  • The CZI/JPEG-XR → OME-Zarr case is documented as the likely natural drift target but is not yet included in the generated dataset because the PR still needs a reproducible public sample/derived OME-Zarr pair or a pinned bioformats2raw generation path.
