diff --git a/docs/design/quality-and-dedupe.md b/docs/design/quality-and-dedupe.md new file mode 100644 index 0000000..1c8f89c --- /dev/null +++ b/docs/design/quality-and-dedupe.md @@ -0,0 +1,155 @@ +# Quality Metrics and Near-Duplicate Deduplication (M4) + +**Status:** Implemented +**Milestone:** M4 +**Issue:** #22 +**Date:** 2026-05-25 + +--- + +## Motivation + +The CCA extractor (M3) annotates every connected component that passes the +size and area filters. Two failure modes degrade downstream HTR training: + +1. **Low-quality crops** — crops with very little ink (noise blobs, stray + marks, artefacts) or fully filled bboxes (ruled lines, bleed-through). +2. **Near-duplicate variants** — the same physical letter glyph annotated more + than once, or different crops of the same ink stroke from overlapping + bounding boxes. Near-dupes inflate the variant count without adding visual + diversity and can bias classifiers toward specific writers. + +M4 addresses both by embedding a per-variant quality metric in the output +schema and running a greedy near-duplicate dedup pass before emitting +`letter_set.json`. + +--- + +## Quality Metric: `ink_ratio` + +### Definition + +``` +ink_ratio = count(foreground pixels in bbox) / (bbox_width × bbox_height) +``` + +Foreground pixels are those with value > 0 in the Otsu-binarised image +(i.e., ink pixels after `THRESH_BINARY_INV`). + +### Rationale + +- **Computable with no extra dependencies** — the binarised array is already + in memory from the crop step. +- **Interpretable** — maps directly onto visual ink density. +- **Actionable** — consumers can threshold on `ink_ratio` to discard near-empty + or fully-filled crops without re-processing scans. +- **Typical range for legible Hebrew glyphs** — 0.10-0.60; outside this band + is a quality signal (not a hard filter in M4, left to consumers). + +### Implementation + +`compute_ink_ratio(binary, glyph)` in `extractor.py`: + +```python +crop = binary[glyph.y : glyph.y + glyph.height, glyph.x : glyph.x + glyph.width] +ink_px = int((crop > 0).sum()) +return ink_px / (glyph.width * glyph.height) +``` + +The value is embedded in the `quality` sub-object of every variant in +`letter_set.json` and validated against the updated `letter_set.schema.json` +(where `quality` is now a **required** field on `variant`). + +--- + +## Near-Duplicate Deduplication: dHash + Hamming Distance + +### Algorithm + +**Step 1 — Compute a 64-bit difference hash (dHash) per crop.** + +dHash is a perceptual hash that captures the gradient structure of an image. +For each glyph crop: + +1. Resize the crop to `(hash_size + 1) x hash_size` pixels using bilinear + interpolation (`hash_size = 8` by default → 9 x 8 = 72 pixels). +2. Compare adjacent pixels in each row: for column `c` in row `r`, set bit 1 + if `pixel[r, c] > pixel[r, c+1]`, else 0. +3. Pack all `hash_size²` (64 at default) bits into a single Python `int`. + +**Step 2 — Greedy single-pass clustering per letter.** + +For each annotated letter (e.g. `'א'`), process variants in arrival order: + +- Compare the candidate's dHash against every already-selected + representative using Hamming distance. +- If the minimum Hamming distance is ≤ `_DEDUP_HAMMING_THRESHOLD` (10), + the candidate is a near-duplicate of that representative: + - Keep whichever has the higher `ink_ratio`. +- If no representative is within threshold, add the candidate as a new + representative. + +**Step 3 — Strip the internal `_dhash` key before writing.** + +dHash is used only during generation; it is not part of the schema and is +removed from every variant dict before `letter_set.json` is written. + +### Design Decisions + +| Decision | Choice | Rationale | +|---|---|---| +| Hash algorithm | dHash | Pure OpenCV (no extra deps); fast; well-suited for binary glyph images | +| Hash size | 8 (64 bits) | Standard; good sensitivity/collision balance for glyph-sized images | +| Hamming threshold | 10 / 64 bits (~15 %) | Conventional loose threshold for perceptual hashing; empirical calibration deferred | +| Dedup scope | Per letter, per writer | Cross-letter dedup is out of scope; cross-writer dedup is a downstream concern | +| Clustering algorithm | Greedy single-pass | O(n²) per letter, but n is small (< 20 variants/letter per writer in practice) | +| Tie-breaking | Higher `ink_ratio` | Proxy for legibility; avoids arbitrary selection | +| dHash in schema | Not included | dHash is a generation-time implementation detail, not a stable output attribute | + +### Threshold Calibration + +The threshold of 10 / 64 bits is a conservative starting point from the +perceptual-hashing literature. Empirical calibration against real HeOCR corpus +scans is deferred to a future sub-PR once a labelled near-duplicate test set is +available. The constant `_DEDUP_HAMMING_THRESHOLD` in `generator.py` is the +single change point. + +--- + +## Schema Changes + +`variant` in `letter_set.schema.json` (Draft 2020-12): + +- `"quality"` added to the `required` array. +- `"quality"` object added with `"ink_ratio"` as the only required property + (`number`, range [0.0, 1.0]). +- `additionalProperties: false` on `quality` allows forward extension via + schema revision. + +--- + +## Public API additions (`extractor.py`) + +| Symbol | Type | Description | +|---|---|---| +| `compute_ink_ratio(binary, glyph)` | `float` | Ink fraction in [0.0, 1.0] | +| `compute_dhash(binary, glyph, *, hash_size=8)` | `int` | 64-bit dHash | +| `hamming_distance(a, b)` | `int` | Bit-wise Hamming distance | + +All three are exported via `__all__` and documented in the module docstring. + +--- + +## Known Limitations / Out of Scope + +- **Hard ink_ratio filtering** — the metric is recorded but no automatic + drop threshold is applied in M4. Consumers filter as needed. +- **Cross-writer dedup** — different writers may produce near-identical glyphs; + dedup is scoped per writer per letter. +- **Clustering quality** — greedy single-pass is order-dependent. A full + clustering (e.g. DBSCAN on 64-bit hash space) is deferred. +- **Nikud merging** — diacritical marks are still emitted as separate blobs; + merging with parent letter bodies deferred to M5. +- **dHash on colour images** — dHash operates on the binarised image; this is + correct for ink-on-white glyph crops but would need adjustment if the + pipeline were extended to colour channels. diff --git a/examples/letter_set/writer_example.json b/examples/letter_set/writer_example.json index c46e1fd..58ca41f 100644 --- a/examples/letter_set/writer_example.json +++ b/examples/letter_set/writer_example.json @@ -25,6 +25,7 @@ "asset_path": "letters/alef/alef-0001.png", "checksum_sha256": "0000000000000000000000000000000000000000000000000000000000000000", "image": { "width_px": 96, "height_px": 96, "format": "png" }, + "quality": { "ink_ratio": 0.31 }, "source": { "scan_entry_id": "example-scan-0001", "scan_url": "https://example.invalid/scans/example-scan-0001", @@ -39,6 +40,7 @@ "asset_path": "letters/alef/alef-0002.png", "checksum_sha256": "1111111111111111111111111111111111111111111111111111111111111111", "image": { "width_px": 92, "height_px": 100, "format": "png" }, + "quality": { "ink_ratio": 0.27 }, "source": { "scan_entry_id": "example-scan-0002", "license": "PDM-1.0", @@ -52,6 +54,7 @@ "asset_path": "letters/bet/bet-0001.png", "checksum_sha256": "2222222222222222222222222222222222222222222222222222222222222222", "image": { "width_px": 88, "height_px": 90, "format": "png" }, + "quality": { "ink_ratio": 0.22 }, "source": { "scan_entry_id": "example-scan-0001", "license": "CC-BY-SA-4.0", @@ -66,6 +69,7 @@ "asset_path": "letters/kaf-final/kaf-final-0001.png", "checksum_sha256": "3333333333333333333333333333333333333333333333333333333333333333", "image": { "width_px": 80, "height_px": 110, "format": "png" }, + "quality": { "ink_ratio": 0.18 }, "source": { "scan_entry_id": "example-scan-0002", "license": "PDM-1.0", diff --git a/pyproject.toml b/pyproject.toml index facc220..be3f9e3 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -33,6 +33,7 @@ lint = [ ] typecheck = [ "mypy>=1.10", + "numpy>=1.20", ] cv = [ "opencv-python-headless>=4.9", diff --git a/src/hletterscriptgen/extractor.py b/src/hletterscriptgen/extractor.py index be0c668..ba55792 100644 --- a/src/hletterscriptgen/extractor.py +++ b/src/hletterscriptgen/extractor.py @@ -18,6 +18,9 @@ * :class:`Glyph` — frozen dataclass for a detected blob's bounding box. * :func:`binarize_scan` — load a scan and return its Otsu-binarised array. * :func:`crop_binary` — crop a glyph from an already-binarised array. +* :func:`compute_ink_ratio` — fraction of ink pixels in a glyph bbox. +* :func:`compute_dhash` — 64-bit difference hash of a glyph crop. +* :func:`hamming_distance` — Hamming distance between two integer hashes. * :func:`extract_glyphs` — detect blobs in a scan via CCA. * :func:`crop_glyph` — convenience wrapper: binarize a scan and crop one blob. """ @@ -285,13 +288,97 @@ def crop_glyph(image_path: Path, glyph: Glyph) -> bytes: return crop_binary(binarize_scan(image_path), glyph) +def compute_ink_ratio(binary: Any, glyph: Glyph) -> float: + """Return the fraction of ink pixels in the glyph bbox. + + Counts foreground pixels (value > 0 in the binarised array) within + the bounding box and divides by the total number of pixels in the box. + This is pure NumPy arithmetic and does **not** require OpenCV. + + The result is in [0.0, 1.0]: + * Near 0.0 — almost-empty crop (noise / whitespace artefact). + * Near 1.0 — fully filled bbox (also suspicious — may be a ruled line + or bleed-through that passed the :data:`DEFAULT_MAX_AREA_FRACTION` + filter). + * 0.10-0.60 — typical range for legible Hebrew glyphs. + + Parameters + ---------- + binary: + A 2-D uint8 NumPy array produced by :func:`binarize_scan`. + glyph: + Bounding box whose ink ratio to compute. + """ + crop = binary[glyph.y : glyph.y + glyph.height, glyph.x : glyph.x + glyph.width] + ink_px = int((crop > 0).sum()) + return ink_px / (glyph.width * glyph.height) + + +def compute_dhash(binary: Any, glyph: Glyph, *, hash_size: int = 8) -> int: + """Compute a difference hash (dHash) for the glyph crop. + + Resizes the crop to ``(hash_size + 1) x hash_size`` pixels and encodes + horizontal pixel differences as a ``hash_size ** 2``-bit integer. + Identical or near-identical glyphs produce hashes with a low + :func:`hamming_distance`; hashes from visually distinct glyphs differ + by many bits. + + The default ``hash_size=8`` yields a 64-bit hash (8 x 8 difference bits) + — a good balance of sensitivity and collision resistance for glyph-sized + images. The bit layout matches ``np.packbits`` row-major order with MSB + first, so ``hash_size ** 2`` must be divisible by 8 for unambiguous + packing; only ``hash_size=8`` is supported in production. + + Parameters + ---------- + binary: + A 2-D uint8 NumPy array produced by :func:`binarize_scan`. + glyph: + Bounding box to hash. + hash_size: + Controls the hash width/height. The hash contains + ``hash_size ** 2`` bits. Defaults to 8. + + Raises + ------ + ExtractionError + When ``opencv-python-headless`` is not installed or ``hash_size`` < 1. + """ + if hash_size < 1: + raise ExtractionError(f"hash_size must be >= 1, got {hash_size}") + import numpy as np + + cv2 = _require_cv2() + crop = binary[glyph.y : glyph.y + glyph.height, glyph.x : glyph.x + glyph.width] + # Resize to (hash_size+1) columns x hash_size rows for horizontal differences. + small = cv2.resize(crop, (hash_size + 1, hash_size)) + # Vectorised: compare each pixel against its right neighbour in one shot. + diff = small[:, :-1] > small[:, 1:] # (hash_size, hash_size) bool array + # Pack bits MSB-first (np.packbits default) and interpret as a big-endian integer. + return int.from_bytes(np.packbits(diff).tobytes(), "big") + + +def hamming_distance(a: int, b: int) -> int: + """Return the Hamming distance (number of differing bits) between two hashes. + + Works on any non-negative integers; meaningful for hashes returned by + :func:`compute_dhash`. A distance of 0 means the hashes are identical; + ≤ 10 out of 64 bits indicates near-identical images at the default + ``hash_size=8``. + """ + return bin(a ^ b).count("1") + + __all__ = [ "DEFAULT_MAX_AREA_FRACTION", "MIN_GLYPH_PX", "ExtractionError", "Glyph", "binarize_scan", + "compute_dhash", + "compute_ink_ratio", "crop_binary", "crop_glyph", "extract_glyphs", + "hamming_distance", ] diff --git a/src/hletterscriptgen/generator.py b/src/hletterscriptgen/generator.py index 2aa46bc..cc57216 100644 --- a/src/hletterscriptgen/generator.py +++ b/src/hletterscriptgen/generator.py @@ -1,4 +1,4 @@ -"""End-to-end glyph extraction pipeline (M3 MVP). +"""End-to-end glyph extraction pipeline (M3/M4). Orchestrates the full generate flow: @@ -7,11 +7,13 @@ 3. For each writer → each scan → each annotated glyph: a. Look up the upstream entry; skip ineligible entries (warn). b. Resolve the scan file path from the upstream checkout. - c. Binarise the scan once, then crop each glyph from the binary array. - d. Write the PNG to the output tree. - e. Record a ``variant`` for the letter_set.v1 document. -4. Build and validate the ``letter_set.v1`` document for each writer. -5. Write ``letter_set.json`` to the writer's output directory. + c. Binarise the scan once, then crop/hash/measure each glyph. + d. Accumulate variant dicts in memory (PNG bytes held, not yet written). +4. Per letter: deduplicate near-duplicate variants by 64-bit dHash + (Hamming ≤ :data:`_DEDUP_HAMMING_THRESHOLD`); keep highest ``ink_ratio``. +5. Write only the surviving PNG assets to the output tree. +6. Build and validate the ``letter_set.v1`` document for each writer. +7. Write ``letter_set.json`` to the writer's output directory. Output tree structure:: @@ -45,7 +47,15 @@ from typing import Any from hletterscriptgen import HEBREW_LETTERS, __version__ -from hletterscriptgen.extractor import ExtractionError, Glyph, binarize_scan, crop_binary +from hletterscriptgen.extractor import ( + ExtractionError, + Glyph, + binarize_scan, + compute_dhash, + compute_ink_ratio, + crop_binary, + hamming_distance, +) from hletterscriptgen.generate_profile import ( GenerateProfile, GlyphAnnotation, @@ -125,6 +135,69 @@ def _resolve_scan_path( return None +# Hamming-distance threshold for near-duplicate dHash clustering. Two glyphs +# whose hashes differ by ≤ this many bits are considered near-duplicates; the +# one with the higher ink_ratio is kept. 10 / 64 bits ≈ 15 % of bits differ, +# which is the conventional loose threshold for perceptual-hash deduplication. +_DEDUP_HAMMING_THRESHOLD: int = 10 + + +def _dedup_letter_variants(variants: list[dict[str, Any]]) -> list[dict[str, Any]]: + """Return a deduplicated copy of one letter's candidate variant list. + + Clusters variants by 64-bit dHash using a greedy single-pass algorithm: + for each variant (in arrival order), check whether it falls within + :data:`_DEDUP_HAMMING_THRESHOLD` Hamming bits of any already-selected + representative. + + * If it **matches** an existing representative and has a **higher** + ``ink_ratio``, the representative's payload (variant_id, asset_path, + checksum, quality, source, png bytes, …) is replaced with the + candidate's — but the **cluster-centre hash** (the original + representative's ``_dhash``) is **preserved**. This prevents the + cluster centre from drifting across successive updates, which would + otherwise cause a chain A ≈ B ≈ C (but A ≁ C) to incorrectly absorb + C into A's cluster. + * If it **matches** but has an equal or lower ``ink_ratio``, the existing + representative is kept unchanged. + * If it does **not match** any representative, it starts a new cluster. + + Internal keys (``_dhash``, ``_png_bytes``) are **not** stripped here; + the caller is responsible for stripping them and writing PNG files after + this function returns. + + Parameters + ---------- + variants: + List of variant dicts, each carrying temporary ``_dhash`` (int) and + ``_png_bytes`` (bytes) keys alongside the schema-visible fields. + + Returns + ------- + list[dict[str, Any]] + Deduplicated list retaining all input keys (including ``_dhash`` and + ``_png_bytes``). + """ + representatives: list[dict[str, Any]] = [] + + for candidate in variants: + c_hash = candidate["_dhash"] + c_ink = candidate["quality"]["ink_ratio"] + matched = False + for rep in representatives: + if hamming_distance(c_hash, rep["_dhash"]) <= _DEDUP_HAMMING_THRESHOLD: + if c_ink > rep["quality"]["ink_ratio"]: + cluster_hash = rep["_dhash"] # preserve cluster centre + rep.update(candidate) + rep["_dhash"] = cluster_hash # restore after bulk update + matched = True + break + if not matched: + representatives.append(dict(candidate)) + + return representatives + + def _extract_variants( writer: WriterAnnotation, entry_index: dict[str, UpstreamEntry], @@ -133,18 +206,27 @@ def _extract_variants( generated_at: str, pending_warnings: list[str], ) -> tuple[dict[str, list[dict[str, Any]]], set[str], set[str]]: - """Crop all glyph variants for one writer and write PNG assets to disk. + """Crop, deduplicate, and write glyph variants for one writer. Returns ``(letters_map, observed_licenses, used_entry_ids)``. Non-fatal issues (missing entries, ineligible scans, crop failures) are appended to ``pending_warnings``. + Pipeline order within this function: + + 1. Accumulate all candidate variant dicts in memory, holding PNG bytes + under the temporary ``_png_bytes`` key (no disk writes yet). + 2. Deduplicate per letter via :func:`_dedup_letter_variants`. + 3. **Only then** write the surviving PNGs to disk, so that no files are + created for variants eliminated by dedup. + 4. Derive ``used_entry_ids`` and ``observed_licenses`` from the survivors, + so that entries or licenses contributed solely by deduped-out variants + are not listed in the output manifest. + The scan image is binarised once per scan file; all glyph crops for that scan share the same binary array, avoiding redundant I/O. """ letters_map: dict[str, list[dict[str, Any]]] = {} - observed_licenses: set[str] = set() - used_entry_ids: set[str] = set() for scan in writer.scans: entry = entry_index.get(scan.entry_id) @@ -196,8 +278,6 @@ def _extract_variants( ) continue - used_entry_ids.add(scan.entry_id) - for glyph_ann in scan.glyphs: if glyph_ann.letter not in HEBREW_LETTERS: pending_warnings.append( @@ -222,12 +302,15 @@ def _extract_variants( ) continue + ink = compute_ink_ratio(binary, glyph) + dhash = compute_dhash(binary, glyph) rel_path = _asset_path(scan.entry_id, glyph_ann.letter, glyph_ann) - out_file = writer_out_dir / rel_path - out_file.parent.mkdir(parents=True, exist_ok=True) - out_file.write_bytes(png_bytes) variant: dict[str, Any] = { + # Internal keys stripped after dedup + file write (not in schema). + "_dhash": dhash, + "_png_bytes": png_bytes, + # Schema-visible fields. "variant_id": _variant_id(scan.entry_id, glyph_ann.letter, glyph_ann), "asset_path": rel_path, "checksum_sha256": _sha256_hex(png_bytes), @@ -236,6 +319,9 @@ def _extract_variants( "height_px": glyph_ann.height, "format": "png", }, + "quality": { + "ink_ratio": ink, + }, "source": { "scan_entry_id": scan.entry_id, "license": license_expr, @@ -252,7 +338,42 @@ def _extract_variants( variant["notes"] = glyph_ann.notes letters_map.setdefault(glyph_ann.letter, []).append(variant) - observed_licenses.add(license_expr) + + # Dedup: collapse near-duplicate variants per letter (Hamming ≤ threshold). + # Warn when any are dropped so callers have visibility into data loss. + for letter in letters_map: + pre_count = len(letters_map[letter]) + letters_map[letter] = _dedup_letter_variants(letters_map[letter]) + dropped = pre_count - len(letters_map[letter]) + if dropped: + pending_warnings.append( + f"writer {writer.writer_id!r}: dropped {dropped} near-duplicate " + f"variant(s) for letter {letter!r} " + f"(dHash Hamming <= {_DEDUP_HAMMING_THRESHOLD})" + ) + + # Write only surviving variants to disk, then strip internal keys. + # Doing this after dedup guarantees no orphaned PNG files for eliminated variants. + for variants in letters_map.values(): + for variant in variants: + png = variant.pop("_png_bytes") + variant.pop("_dhash") + out_file = writer_out_dir / variant["asset_path"] + out_file.parent.mkdir(parents=True, exist_ok=True) + out_file.write_bytes(png) + + # Derive from survivors only: entries or licenses contributed solely by + # deduped-out variants must not appear in the manifest. + used_entry_ids: set[str] = { + v["source"]["scan_entry_id"] + for variants in letters_map.values() + for v in variants + } + observed_licenses: set[str] = { + v["source"]["license"] + for variants in letters_map.values() + for v in variants + } return letters_map, observed_licenses, used_entry_ids diff --git a/src/hletterscriptgen/schemas/letter_set.schema.json b/src/hletterscriptgen/schemas/letter_set.schema.json index 76febb4..14d8950 100644 --- a/src/hletterscriptgen/schemas/letter_set.schema.json +++ b/src/hletterscriptgen/schemas/letter_set.schema.json @@ -142,6 +142,7 @@ "asset_path", "checksum_sha256", "image", + "quality", "source" ], "properties": { @@ -172,6 +173,20 @@ } } }, + "quality": { + "type": "object", + "description": "Per-variant quality metrics computed from the binarised glyph crop during generation.", + "additionalProperties": false, + "required": ["ink_ratio"], + "properties": { + "ink_ratio": { + "type": "number", + "minimum": 0.0, + "maximum": 1.0, + "description": "Fraction of pixels in the bounding box that are foreground (ink) in the binarised image. 0.0 = empty crop, 1.0 = entirely filled. Typical values for legible Hebrew glyphs are 0.10–0.60." + } + } + }, "source": { "type": "object", "additionalProperties": false, diff --git a/tests/test_extractor.py b/tests/test_extractor.py index b62f384..a97a3d0 100644 --- a/tests/test_extractor.py +++ b/tests/test_extractor.py @@ -13,8 +13,12 @@ MIN_GLYPH_PX, ExtractionError, Glyph, + binarize_scan, + compute_dhash, + compute_ink_ratio, crop_glyph, extract_glyphs, + hamming_distance, ) # --------------------------------------------------------------------------- @@ -241,3 +245,126 @@ def test_crop_glyph_raises_on_out_of_bounds(tmp_path: Path) -> None: out_of_bounds = Glyph(x=40, y=40, width=20, height=20) # extends past 50x50 with pytest.raises(ExtractionError, match="outside"): crop_glyph(scan, out_of_bounds) + + +# --------------------------------------------------------------------------- +# compute_ink_ratio +# --------------------------------------------------------------------------- + + +def _binarize_array(img: np.ndarray) -> np.ndarray: # type: ignore[type-arg] + """Binarise a synthetic BGR image the same way binarize_scan does.""" + grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) + _, binary = cv2.threshold(grey, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU) + return binary + + +def test_ink_ratio_fully_filled() -> None: + """A glyph bbox that is entirely ink should have ratio 1.0.""" + img = _white_image(50, 50) + _draw_black_rect(img, x=0, y=0, w=50, h=50) # entire image is black + binary = _binarize_array(img) + glyph = Glyph(x=0, y=0, width=50, height=50) + ratio = compute_ink_ratio(binary, glyph) + assert ratio == pytest.approx(1.0, abs=1e-6) + + +def test_ink_ratio_empty_crop() -> None: + """A crop with no ink pixels should have ratio 0.0.""" + img = _white_image(50, 50) # entirely white — no ink + binary = _binarize_array(img) + glyph = Glyph(x=0, y=0, width=50, height=50) + ratio = compute_ink_ratio(binary, glyph) + assert ratio == pytest.approx(0.0, abs=1e-6) + + +def test_ink_ratio_partial_fill() -> None: + """Half-filled crop should yield ratio ≈ 0.5.""" + # 20x10 bbox: fill the top 10x10 half with black + img = _white_image(40, 20) + _draw_black_rect(img, x=0, y=0, w=10, h=10) + binary = _binarize_array(img) + glyph = Glyph(x=0, y=0, width=20, height=10) + ratio = compute_ink_ratio(binary, glyph) + # 10x10 ink in a 20x10 bbox → exactly 0.5 + assert ratio == pytest.approx(0.5, abs=1e-6) + + +def test_ink_ratio_is_in_unit_interval(tmp_path: Path) -> None: + """ink_ratio must always be in [0.0, 1.0] for any binary input.""" + img = _white_image(100, 100) + _draw_black_rect(img, x=10, y=10, w=30, h=30) + scan = tmp_path / "scan.png" + _save_png(img, scan) + binary = binarize_scan(scan) + glyph = Glyph(x=10, y=10, width=30, height=30) + ratio = compute_ink_ratio(binary, glyph) + assert 0.0 <= ratio <= 1.0 + + +# --------------------------------------------------------------------------- +# compute_dhash / hamming_distance +# --------------------------------------------------------------------------- + + +def test_dhash_returns_integer() -> None: + """compute_dhash must return a plain int.""" + img = _white_image(50, 50) + _draw_black_rect(img, x=5, y=5, w=20, h=20) + binary = _binarize_array(img) + glyph = Glyph(x=5, y=5, width=20, height=20) + h = compute_dhash(binary, glyph) + assert isinstance(h, int) + + +def test_dhash_identical_glyphs_have_zero_distance() -> None: + """The same crop hashed twice should produce identical hashes.""" + img = _white_image(60, 60) + _draw_black_rect(img, x=10, y=10, w=30, h=30) + binary = _binarize_array(img) + glyph = Glyph(x=10, y=10, width=30, height=30) + h1 = compute_dhash(binary, glyph) + h2 = compute_dhash(binary, glyph) + assert hamming_distance(h1, h2) == 0 + + +def test_dhash_different_glyphs_have_positive_distance() -> None: + """Two structurally different crops should produce hashes with distance > 0. + + Build the binary array directly (skip Otsu) so that glyph A has ink on + its left half and glyph B has ink on its right half. The resulting + horizontal-difference patterns are mirror images → hashes differ. + """ + binary = np.zeros((40, 100), dtype=np.uint8) + # Glyph A: left 20 px of a 40x30 bbox are ink, right 20 px are background. + binary[5:35, 0:20] = 255 + # Glyph B: right 20 px of a 40x30 bbox are ink, left 20 px are background. + binary[5:35, 80:100] = 255 + g_a = Glyph(x=0, y=5, width=40, height=30) + g_b = Glyph(x=60, y=5, width=40, height=30) + ha = compute_dhash(binary, g_a) + hb = compute_dhash(binary, g_b) + assert hamming_distance(ha, hb) > 0 + + +def test_hamming_distance_identical() -> None: + assert hamming_distance(0b1010, 0b1010) == 0 + + +def test_hamming_distance_all_differ() -> None: + """0x00 vs 0xFF for an 8-bit value should give distance 8.""" + assert hamming_distance(0x00, 0xFF) == 8 + + +def test_hamming_distance_single_bit() -> None: + assert hamming_distance(0b0001, 0b0000) == 1 + + +def test_dhash_default_is_64_bits() -> None: + """Default hash_size=8 should produce a value expressible in ≤ 64 bits.""" + img = _white_image(50, 50) + _draw_black_rect(img, x=5, y=5, w=20, h=20) + binary = _binarize_array(img) + glyph = Glyph(x=5, y=5, width=20, height=20) + h = compute_dhash(binary, glyph) + assert 0 <= h < (1 << 64) diff --git a/tests/test_generator.py b/tests/test_generator.py index d3c0e70..ba2b2da 100644 --- a/tests/test_generator.py +++ b/tests/test_generator.py @@ -5,6 +5,7 @@ import json import subprocess from pathlib import Path +from typing import Any from unittest.mock import MagicMock, patch import pytest @@ -13,7 +14,12 @@ import numpy as np # noqa: E402 from hletterscriptgen.generate_profile import load_generate_profile # noqa: E402 -from hletterscriptgen.generator import GeneratorError, generate # noqa: E402 +from hletterscriptgen.generator import ( # noqa: E402 + GeneratorError, + GeneratorWarning, + _dedup_letter_variants, + generate, +) # --------------------------------------------------------------------------- # Helpers — synthetic upstream checkout + scan @@ -343,11 +349,12 @@ def test_generate_document_structure_mocked(tmp_path: Path) -> None: output_dir = tmp_path / "out" profile = load_generate_profile(profile_path) - fake_binary = MagicMock() - with patch("hletterscriptgen.generator.binarize_scan", return_value=fake_binary): + with patch("hletterscriptgen.generator.binarize_scan", return_value=MagicMock()): with patch("hletterscriptgen.generator.crop_binary", return_value=_FAKE_PNG): - paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00") + with patch("hletterscriptgen.generator.compute_ink_ratio", return_value=0.3): + with patch("hletterscriptgen.generator.compute_dhash", return_value=0): + paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00") assert len(paths) == 1 doc = json.loads(paths[0].read_text(encoding="utf-8")) @@ -368,7 +375,9 @@ def test_generate_mocked_validates(tmp_path: Path) -> None: profile = load_generate_profile(profile_path) with patch("hletterscriptgen.generator.binarize_scan", return_value=MagicMock()): with patch("hletterscriptgen.generator.crop_binary", return_value=_FAKE_PNG): - paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00") + with patch("hletterscriptgen.generator.compute_ink_ratio", return_value=0.3): + with patch("hletterscriptgen.generator.compute_dhash", return_value=0): + paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00") doc = json.loads(paths[0].read_text(encoding="utf-8")) result = validate_document(doc) @@ -384,7 +393,9 @@ def test_generate_mocked_writes_png_assets(tmp_path: Path) -> None: profile = load_generate_profile(profile_path) with patch("hletterscriptgen.generator.binarize_scan", return_value=MagicMock()): with patch("hletterscriptgen.generator.crop_binary", return_value=_FAKE_PNG): - paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00") + with patch("hletterscriptgen.generator.compute_ink_ratio", return_value=0.3): + with patch("hletterscriptgen.generator.compute_dhash", return_value=0): + paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00") doc = json.loads(paths[0].read_text(encoding="utf-8")) writer_dir = paths[0].parent @@ -426,7 +437,9 @@ def test_generate_mocked_glyph_notes_in_variant(tmp_path: Path) -> None: profile = load_generate_profile(p) with patch("hletterscriptgen.generator.binarize_scan", return_value=MagicMock()): with patch("hletterscriptgen.generator.crop_binary", return_value=_FAKE_PNG): - paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00") + with patch("hletterscriptgen.generator.compute_ink_ratio", return_value=0.3): + with patch("hletterscriptgen.generator.compute_dhash", return_value=0): + paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00") doc = json.loads(paths[0].read_text(encoding="utf-8")) variant = doc["letters"]["א"][0] @@ -442,7 +455,356 @@ def test_generate_mocked_config_hash_in_document(tmp_path: Path) -> None: profile = load_generate_profile(profile_path) with patch("hletterscriptgen.generator.binarize_scan", return_value=MagicMock()): with patch("hletterscriptgen.generator.crop_binary", return_value=_FAKE_PNG): - paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00") + with patch("hletterscriptgen.generator.compute_ink_ratio", return_value=0.3): + with patch("hletterscriptgen.generator.compute_dhash", return_value=0): + paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00") doc = json.loads(paths[0].read_text(encoding="utf-8")) assert doc["generator"]["config_hash"] == profile.config_hash + + +# --------------------------------------------------------------------------- +# _dedup_letter_variants — isolated unit tests +# --------------------------------------------------------------------------- + +# Build a minimal variant dict that satisfies the function's expectations. +# Internal keys (_dhash, _png_bytes) must be present; schema keys are minimal. +def _v( + vid: str, + ink: float, + dhash: int, + *, + entry_id: str = "e1", + license_id: str = "PDM-1.0", +) -> dict[str, Any]: + return { + "_dhash": dhash, + "_png_bytes": b"\x89PNG stub", + "variant_id": vid, + "asset_path": f"glyphs/א/{vid}.png", + "checksum_sha256": "0" * 64, + "image": {"width_px": 20, "height_px": 20, "format": "png"}, + "quality": {"ink_ratio": ink}, + "source": { + "scan_entry_id": entry_id, + "license": license_id, + "bbox_in_source": {"x": 0, "y": 0, "width": 20, "height": 20}, + }, + "extracted_at": "2025-01-01T00:00:00+00:00", + } + + +def test_dedup_single_variant_returned_unchanged() -> None: + """A single variant survives dedup unmodified.""" + v = _v("v1", 0.3, dhash=0) + result = _dedup_letter_variants([v]) + assert len(result) == 1 + assert result[0]["variant_id"] == "v1" + assert result[0]["_dhash"] == 0 # caller strips; still present here + + +def test_dedup_identical_hashes_keeps_higher_ink_ratio() -> None: + """Two variants with Hamming distance 0 collapse to one; higher ink wins.""" + low = _v("v-low", ink=0.20, dhash=0) + high = _v("v-high", ink=0.45, dhash=0) + result = _dedup_letter_variants([low, high]) + assert len(result) == 1 + assert result[0]["quality"]["ink_ratio"] == pytest.approx(0.45) + assert result[0]["variant_id"] == "v-high" + + +def test_dedup_identical_hashes_first_wins_when_ink_equal_or_lower() -> None: + """When the representative already has better ink, it is not replaced.""" + first = _v("v-first", ink=0.50, dhash=7) + second = _v("v-second", ink=0.30, dhash=7) + result = _dedup_letter_variants([first, second]) + assert len(result) == 1 + assert result[0]["variant_id"] == "v-first" + + +def test_dedup_distinct_hashes_both_survive() -> None: + """Two variants with Hamming distance > threshold both survive.""" + a = _v("v-a", ink=0.3, dhash=0x0000_0000_0000_0000) + b = _v("v-b", ink=0.3, dhash=0xFFFF_FFFF_FFFF_FFFF) + result = _dedup_letter_variants([a, b]) + assert len(result) == 2 + + +def test_dedup_threshold_boundary_at_exactly_threshold() -> None: + """Hamming distance exactly equal to the threshold is still a near-dupe.""" + from hletterscriptgen.generator import _DEDUP_HAMMING_THRESHOLD + + # Build a hash that differs by exactly _DEDUP_HAMMING_THRESHOLD bits from 0. + border_hash = (1 << _DEDUP_HAMMING_THRESHOLD) - 1 # lowest N bits set + assert bin(border_hash).count("1") == _DEDUP_HAMMING_THRESHOLD + + a = _v("v-a", ink=0.3, dhash=0) + b = _v("v-b", ink=0.4, dhash=border_hash) + result = _dedup_letter_variants([a, b]) + assert len(result) == 1 + + +def test_dedup_threshold_boundary_one_over_survives() -> None: + """Hamming distance one above threshold is distinct; both variants survive.""" + from hletterscriptgen.generator import _DEDUP_HAMMING_THRESHOLD + + over_hash = (1 << (_DEDUP_HAMMING_THRESHOLD + 1)) - 1 # N+1 bits set + assert bin(over_hash).count("1") == _DEDUP_HAMMING_THRESHOLD + 1 + + a = _v("v-a", ink=0.3, dhash=0) + b = _v("v-b", ink=0.3, dhash=over_hash) + result = _dedup_letter_variants([a, b]) + assert len(result) == 2 + + +def test_dedup_cluster_centre_preserved_prevents_drift() -> None: + """Cluster centre hash must not drift when the representative is replaced. + + Scenario (hashes chosen so that |A-B| <= threshold but |A-C| > threshold, + yet |B-C| <= threshold): + + - A processed first → cluster centre = A_hash. + - B processed: near A → B replaces A (higher ink); centre stays A_hash. + - C processed: compared against A_hash (not B_hash). + |A-C| > threshold → C survives as a new cluster. + + Without the fix, C would be compared against B_hash, |B-C| <= threshold, + and C would be incorrectly absorbed. + """ + # A_hash = 0 (all zeros, 64 bits) + # B_hash = 0xFF (8 bits set): hamming(A,B) = 8 <= 10 (near-dupe) + # C_hash = 0xFFF (12 bits set): hamming(A,C) = 12 > 10 (distinct from A) + # but hamming(B,C) = hamming(0xFF, 0xFFF) = 4 <= 10 (near B) + A_hash = 0x00 + B_hash = 0xFF # 8 bits set; hamming(A,B)=8 <= 10 + C_hash = 0xFFF # 12 bits set; hamming(A,C)=12 > 10; hamming(B,C)=4 <= 10 + + a = _v("v-a", ink=0.20, dhash=A_hash) + b = _v("v-b", ink=0.40, dhash=B_hash) # better ink → replaces A in cluster + c = _v("v-c", ink=0.30, dhash=C_hash) # must survive as its own cluster + + result = _dedup_letter_variants([a, b, c]) + assert len(result) == 2, ( + "C should survive as a distinct cluster (|A-C|=12 > threshold), " + "but the cluster centre drifted to B_hash and absorbed C" + ) + ids = {r["variant_id"] for r in result} + assert "v-b" in ids # winner of first cluster + assert "v-c" in ids # distinct second cluster + + +def test_dedup_internal_keys_not_stripped() -> None: + """_dedup_letter_variants must NOT strip _dhash or _png_bytes — that is the caller's job.""" + v = _v("v1", 0.3, dhash=42) + result = _dedup_letter_variants([v]) + assert "_dhash" in result[0] + assert "_png_bytes" in result[0] + + +# --------------------------------------------------------------------------- +# M4: quality metrics +# --------------------------------------------------------------------------- + + +def test_generate_variant_has_quality_ink_ratio(tmp_path: Path) -> None: + """Every generated variant must carry a quality.ink_ratio in [0, 1].""" + upstream = _make_upstream_checkout(tmp_path) + profile_path = _make_profile(tmp_path, upstream) + output_dir = tmp_path / "out" + + profile = load_generate_profile(profile_path) + paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00") + + doc = json.loads(paths[0].read_text(encoding="utf-8")) + for variants in doc["letters"].values(): + for variant in variants: + assert "quality" in variant, f"variant {variant['variant_id']!r} missing 'quality'" + ink = variant["quality"]["ink_ratio"] + assert isinstance(ink, float), f"ink_ratio must be float, got {type(ink)}" + assert 0.0 <= ink <= 1.0, f"ink_ratio out of range: {ink}" + + +def test_generate_variant_ink_ratio_nonzero_for_solid_blobs(tmp_path: Path) -> None: + """Solid black blobs should produce an ink_ratio > 0.""" + upstream = _make_upstream_checkout(tmp_path) + profile_path = _make_profile(tmp_path, upstream) + output_dir = tmp_path / "out" + + profile = load_generate_profile(profile_path) + paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00") + + doc = json.loads(paths[0].read_text(encoding="utf-8")) + for variants in doc["letters"].values(): + for variant in variants: + assert variant["quality"]["ink_ratio"] > 0.0 + + +def test_generate_mocked_variant_has_quality(tmp_path: Path) -> None: + """Mock path: quality.ink_ratio must still be present and valid.""" + upstream = _make_upstream_checkout_no_cv2(tmp_path) + profile_path = _make_profile(tmp_path, upstream) + output_dir = tmp_path / "out" + + profile = load_generate_profile(profile_path) + with patch("hletterscriptgen.generator.binarize_scan", return_value=MagicMock()): + with patch("hletterscriptgen.generator.crop_binary", return_value=_FAKE_PNG): + with patch("hletterscriptgen.generator.compute_ink_ratio", return_value=0.25): + with patch("hletterscriptgen.generator.compute_dhash", return_value=0): + paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00") + + doc = json.loads(paths[0].read_text(encoding="utf-8")) + for variants in doc["letters"].values(): + for variant in variants: + assert variant["quality"]["ink_ratio"] == pytest.approx(0.25) + + +# --------------------------------------------------------------------------- +# M4: near-duplicate deduplication +# --------------------------------------------------------------------------- + + +def test_generate_dedup_removes_near_duplicate(tmp_path: Path) -> None: + """Two glyphs with identical dHash (Hamming 0) should collapse to one.""" + upstream = _make_upstream_checkout_no_cv2(tmp_path) + + # Two glyphs annotated on the same entry, same letter — will get identical + # hashes because compute_dhash is patched to return the same value. + profile_data = { + "upstream_checkout": str(upstream), + "writers": [ + { + "writer_id": "writer_test_a", + "attribution_method": "manual_review", + "scans": [ + { + "entry_id": "test__writer_a__p0001", + "glyphs": [ + {"letter": "א", "x": 5, "y": 5, "width": 20, "height": 20}, + {"letter": "א", "x": 5, "y": 5, "width": 20, "height": 20}, + ], + } + ], + } + ], + } + p = tmp_path / "profile.json" + p.write_text(json.dumps(profile_data), encoding="utf-8") + + output_dir = tmp_path / "out" + profile = load_generate_profile(p) + + ts = "2025-01-01T00:00:00+00:00" + with pytest.warns(GeneratorWarning, match="near-duplicate"): + with patch("hletterscriptgen.generator.binarize_scan", return_value=MagicMock()): + with patch("hletterscriptgen.generator.crop_binary", return_value=_FAKE_PNG): + with patch("hletterscriptgen.generator.compute_ink_ratio", return_value=0.3): + with patch("hletterscriptgen.generator.compute_dhash", return_value=42): + paths = generate(profile, output_dir, generated_at=ts) + + doc = json.loads(paths[0].read_text(encoding="utf-8")) + # Both annotations had the same hash → dedup leaves exactly one variant + assert len(doc["letters"]["א"]) == 1 + + +def test_generate_dedup_keeps_higher_ink_ratio(tmp_path: Path) -> None: + """When two near-dupes differ in ink_ratio, the higher one survives.""" + upstream = _make_upstream_checkout_no_cv2(tmp_path) + + profile_data = { + "upstream_checkout": str(upstream), + "writers": [ + { + "writer_id": "writer_test_a", + "attribution_method": "manual_review", + "scans": [ + { + "entry_id": "test__writer_a__p0001", + "glyphs": [ + {"letter": "א", "x": 5, "y": 5, "width": 20, "height": 20}, + {"letter": "א", "x": 5, "y": 5, "width": 20, "height": 20}, + ], + } + ], + } + ], + } + p = tmp_path / "profile.json" + p.write_text(json.dumps(profile_data), encoding="utf-8") + + output_dir = tmp_path / "out" + profile = load_generate_profile(p) + + # First call returns 0.20, second returns 0.45 (better) + ink_side_effect = [0.20, 0.45] + + ts = "2025-01-01T00:00:00+00:00" + with pytest.warns(GeneratorWarning, match="near-duplicate"): + with patch("hletterscriptgen.generator.binarize_scan", return_value=MagicMock()): + with patch("hletterscriptgen.generator.crop_binary", return_value=_FAKE_PNG): + with patch("hletterscriptgen.generator.compute_ink_ratio", side_effect=ink_side_effect): # noqa: E501 + with patch("hletterscriptgen.generator.compute_dhash", return_value=0): + paths = generate(profile, output_dir, generated_at=ts) + + doc = json.loads(paths[0].read_text(encoding="utf-8")) + assert len(doc["letters"]["א"]) == 1 + assert doc["letters"]["א"][0]["quality"]["ink_ratio"] == pytest.approx(0.45) + + +def test_generate_dedup_keeps_distinct_glyphs(tmp_path: Path) -> None: + """Two glyphs with Hamming distance > threshold must both survive dedup.""" + upstream = _make_upstream_checkout_no_cv2(tmp_path) + profile_data = { + "upstream_checkout": str(upstream), + "writers": [ + { + "writer_id": "writer_test_a", + "attribution_method": "manual_review", + "scans": [ + { + "entry_id": "test__writer_a__p0001", + "glyphs": [ + {"letter": "א", "x": 5, "y": 5, "width": 20, "height": 20}, + {"letter": "א", "x": 35, "y": 5, "width": 20, "height": 20}, + ], + } + ], + } + ], + } + p = tmp_path / "profile.json" + p.write_text(json.dumps(profile_data), encoding="utf-8") + + output_dir = tmp_path / "out" + profile = load_generate_profile(p) + + # Hashes differ by >> 10 bits → both survive + with patch("hletterscriptgen.generator.binarize_scan", return_value=MagicMock()): + with patch("hletterscriptgen.generator.crop_binary", return_value=_FAKE_PNG): + with patch("hletterscriptgen.generator.compute_ink_ratio", return_value=0.3): + with patch( + "hletterscriptgen.generator.compute_dhash", + side_effect=[0x0000_0000_0000_0000, 0xFFFF_FFFF_FFFF_FFFF], + ): + paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00") + + doc = json.loads(paths[0].read_text(encoding="utf-8")) + assert len(doc["letters"]["א"]) == 2 + + +def test_generate_variant_no_dhash_in_output(tmp_path: Path) -> None: + """The internal _dhash key must not leak into the written letter_set.json.""" + upstream = _make_upstream_checkout_no_cv2(tmp_path) + profile_path = _make_profile(tmp_path, upstream) + output_dir = tmp_path / "out" + + profile = load_generate_profile(profile_path) + with patch("hletterscriptgen.generator.binarize_scan", return_value=MagicMock()): + with patch("hletterscriptgen.generator.crop_binary", return_value=_FAKE_PNG): + with patch("hletterscriptgen.generator.compute_ink_ratio", return_value=0.3): + with patch("hletterscriptgen.generator.compute_dhash", return_value=0): + paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00") + + doc = json.loads(paths[0].read_text(encoding="utf-8")) + for variants in doc["letters"].values(): + for variant in variants: + assert "_dhash" not in variant