HeOCR · shaypal5 · May 25, 2026 · May 24, 2026 · May 24, 2026 · May 25, 2026
@@ -0,0 +1,155 @@
+# Quality Metrics and Near-Duplicate Deduplication (M4)
+
+**Status:** Implemented  
+**Milestone:** M4  
+**Issue:** #22  
+**Date:** 2026-05-25
+
+---
+
+## Motivation
+
+The CCA extractor (M3) annotates every connected component that passes the
+size and area filters.  Two failure modes degrade downstream HTR training:
+
+1. **Low-quality crops** — crops with very little ink (noise blobs, stray
+   marks, artefacts) or fully filled bboxes (ruled lines, bleed-through).
+2. **Near-duplicate variants** — the same physical letter glyph annotated more
+   than once, or different crops of the same ink stroke from overlapping
+   bounding boxes.  Near-dupes inflate the variant count without adding visual
+   diversity and can bias classifiers toward specific writers.
+
+M4 addresses both by embedding a per-variant quality metric in the output
+schema and running a greedy near-duplicate dedup pass before emitting
+`letter_set.json`.
+
+---
+
+## Quality Metric: `ink_ratio`
+
+### Definition
+
+```
+ink_ratio = count(foreground pixels in bbox) / (bbox_width × bbox_height)
+```
+
+Foreground pixels are those with value > 0 in the Otsu-binarised image
+(i.e., ink pixels after `THRESH_BINARY_INV`).
+
+### Rationale
+
+- **Computable with no extra dependencies** — the binarised array is already
+  in memory from the crop step.
+- **Interpretable** — maps directly onto visual ink density.
+- **Actionable** — consumers can threshold on `ink_ratio` to discard near-empty
+  or fully-filled crops without re-processing scans.
+- **Typical range for legible Hebrew glyphs** — 0.10-0.60; outside this band
+  is a quality signal (not a hard filter in M4, left to consumers).
+
+### Implementation
+
+`compute_ink_ratio(binary, glyph)` in `extractor.py`:
+
+```python
+crop = binary[glyph.y : glyph.y + glyph.height, glyph.x : glyph.x + glyph.width]
+ink_px = int((crop > 0).sum())
+return ink_px / (glyph.width * glyph.height)
+```
+
+The value is embedded in the `quality` sub-object of every variant in
+`letter_set.json` and validated against the updated `letter_set.schema.json`
+(where `quality` is now a **required** field on `variant`).
+
+---
+
+## Near-Duplicate Deduplication: dHash + Hamming Distance
+
+### Algorithm
+
+**Step 1 — Compute a 64-bit difference hash (dHash) per crop.**
+
+dHash is a perceptual hash that captures the gradient structure of an image.
+For each glyph crop:
+
+1. Resize the crop to `(hash_size + 1) x hash_size` pixels using bilinear
+   interpolation (`hash_size = 8` by default → 9 x 8 = 72 pixels).
+2. Compare adjacent pixels in each row: for column `c` in row `r`, set bit 1
+   if `pixel[r, c] > pixel[r, c+1]`, else 0.
+3. Pack all `hash_size²` (64 at default) bits into a single Python `int`.
+
+**Step 2 — Greedy single-pass clustering per letter.**
+
+For each annotated letter (e.g. `'א'`), process variants in arrival order:
+
+- Compare the candidate's dHash against every already-selected
+  representative using Hamming distance.
+- If the minimum Hamming distance is ≤ `_DEDUP_HAMMING_THRESHOLD` (10),
+  the candidate is a near-duplicate of that representative:
+  - Keep whichever has the higher `ink_ratio`.
+- If no representative is within threshold, add the candidate as a new
+  representative.
+
+**Step 3 — Strip the internal `_dhash` key before writing.**
+
+dHash is used only during generation; it is not part of the schema and is
+removed from every variant dict before `letter_set.json` is written.
+
+### Design Decisions
+
+| Decision | Choice | Rationale |
+|---|---|---|
+| Hash algorithm | dHash | Pure OpenCV (no extra deps); fast; well-suited for binary glyph images |
+| Hash size | 8 (64 bits) | Standard; good sensitivity/collision balance for glyph-sized images |
+| Hamming threshold | 10 / 64 bits (~15 %) | Conventional loose threshold for perceptual hashing; empirical calibration deferred |
+| Dedup scope | Per letter, per writer | Cross-letter dedup is out of scope; cross-writer dedup is a downstream concern |
+| Clustering algorithm | Greedy single-pass | O(n²) per letter, but n is small (< 20 variants/letter per writer in practice) |
+| Tie-breaking | Higher `ink_ratio` | Proxy for legibility; avoids arbitrary selection |
+| dHash in schema | Not included | dHash is a generation-time implementation detail, not a stable output attribute |
+
+### Threshold Calibration
+
+The threshold of 10 / 64 bits is a conservative starting point from the
+perceptual-hashing literature.  Empirical calibration against real HeOCR corpus
+scans is deferred to a future sub-PR once a labelled near-duplicate test set is
+available.  The constant `_DEDUP_HAMMING_THRESHOLD` in `generator.py` is the
+single change point.
+
+---
+
+## Schema Changes
+
+`variant` in `letter_set.schema.json` (Draft 2020-12):
+
+- `"quality"` added to the `required` array.
+- `"quality"` object added with `"ink_ratio"` as the only required property
+  (`number`, range [0.0, 1.0]).
+- `additionalProperties: false` on `quality` allows forward extension via
+  schema revision.
+
+---
+
+## Public API additions (`extractor.py`)
+
+| Symbol | Type | Description |
+|---|---|---|
+| `compute_ink_ratio(binary, glyph)` | `float` | Ink fraction in [0.0, 1.0] |
+| `compute_dhash(binary, glyph, *, hash_size=8)` | `int` | 64-bit dHash |
+| `hamming_distance(a, b)` | `int` | Bit-wise Hamming distance |
+
+All three are exported via `__all__` and documented in the module docstring.
+
+---
+
+## Known Limitations / Out of Scope
+
+- **Hard ink_ratio filtering** — the metric is recorded but no automatic
+  drop threshold is applied in M4.  Consumers filter as needed.
+- **Cross-writer dedup** — different writers may produce near-identical glyphs;
+  dedup is scoped per writer per letter.
+- **Clustering quality** — greedy single-pass is order-dependent.  A full
+  clustering (e.g. DBSCAN on 64-bit hash space) is deferred.
+- **Nikud merging** — diacritical marks are still emitted as separate blobs;
+  merging with parent letter bodies deferred to M5.
+- **dHash on colour images** — dHash operates on the binarised image; this is
+  correct for ink-on-white glyph crops but would need adjustment if the
+  pipeline were extended to colour channels.
@@ -25,6 +25,7 @@
         "asset_path": "letters/alef/alef-0001.png",
         "checksum_sha256": "0000000000000000000000000000000000000000000000000000000000000000",
         "image": { "width_px": 96, "height_px": 96, "format": "png" },
+        "quality": { "ink_ratio": 0.31 },
         "source": {
           "scan_entry_id": "example-scan-0001",
           "scan_url": "https://example.invalid/scans/example-scan-0001",
@@ -39,6 +40,7 @@
         "asset_path": "letters/alef/alef-0002.png",
         "checksum_sha256": "1111111111111111111111111111111111111111111111111111111111111111",
         "image": { "width_px": 92, "height_px": 100, "format": "png" },
+        "quality": { "ink_ratio": 0.27 },
         "source": {
           "scan_entry_id": "example-scan-0002",
           "license": "PDM-1.0",
@@ -52,6 +54,7 @@
         "asset_path": "letters/bet/bet-0001.png",
         "checksum_sha256": "2222222222222222222222222222222222222222222222222222222222222222",
         "image": { "width_px": 88, "height_px": 90, "format": "png" },
+        "quality": { "ink_ratio": 0.22 },
         "source": {
           "scan_entry_id": "example-scan-0001",
           "license": "CC-BY-SA-4.0",
@@ -66,6 +69,7 @@
         "asset_path": "letters/kaf-final/kaf-final-0001.png",
         "checksum_sha256": "3333333333333333333333333333333333333333333333333333333333333333",
         "image": { "width_px": 80, "height_px": 110, "format": "png" },
+        "quality": { "ink_ratio": 0.18 },
         "source": {
           "scan_entry_id": "example-scan-0002",
           "license": "PDM-1.0",

@@ -33,6 +33,7 @@ lint = [
 ]
 typecheck = [
   "mypy>=1.10",
+  "numpy>=1.20",
 ]
 cv = [
   "opencv-python-headless>=4.9",

@@ -18,6 +18,9 @@
 * :class:`Glyph` — frozen dataclass for a detected blob's bounding box.
 * :func:`binarize_scan` — load a scan and return its Otsu-binarised array.
 * :func:`crop_binary` — crop a glyph from an already-binarised array.
+* :func:`compute_ink_ratio` — fraction of ink pixels in a glyph bbox.
+* :func:`compute_dhash` — 64-bit difference hash of a glyph crop.
+* :func:`hamming_distance` — Hamming distance between two integer hashes.
 * :func:`extract_glyphs` — detect blobs in a scan via CCA.
 * :func:`crop_glyph` — convenience wrapper: binarize a scan and crop one blob.
 """
@@ -285,13 +288,97 @@ def crop_glyph(image_path: Path, glyph: Glyph) -> bytes:
     return crop_binary(binarize_scan(image_path), glyph)
 
 
+def compute_ink_ratio(binary: Any, glyph: Glyph) -> float:
+    """Return the fraction of ink pixels in the glyph bbox.
+
+    Counts foreground pixels (value > 0 in the binarised array) within
+    the bounding box and divides by the total number of pixels in the box.
+    This is pure NumPy arithmetic and does **not** require OpenCV.
+
+    The result is in [0.0, 1.0]:
+    * Near 0.0 — almost-empty crop (noise / whitespace artefact).
+    * Near 1.0 — fully filled bbox (also suspicious — may be a ruled line
+      or bleed-through that passed the :data:`DEFAULT_MAX_AREA_FRACTION`
+      filter).
+    * 0.10-0.60 — typical range for legible Hebrew glyphs.
+
+    Parameters
+    ----------
+    binary:
+        A 2-D uint8 NumPy array produced by :func:`binarize_scan`.
+    glyph:
+        Bounding box whose ink ratio to compute.
+    """
+    crop = binary[glyph.y : glyph.y + glyph.height, glyph.x : glyph.x + glyph.width]
+    ink_px = int((crop > 0).sum())
+    return ink_px / (glyph.width * glyph.height)
+
+
+def compute_dhash(binary: Any, glyph: Glyph, *, hash_size: int = 8) -> int:
+    """Compute a difference hash (dHash) for the glyph crop.
+
+    Resizes the crop to ``(hash_size + 1) x hash_size`` pixels and encodes
+    horizontal pixel differences as a ``hash_size ** 2``-bit integer.
+    Identical or near-identical glyphs produce hashes with a low
+    :func:`hamming_distance`; hashes from visually distinct glyphs differ
+    by many bits.
+
+    The default ``hash_size=8`` yields a 64-bit hash (8 x 8 difference bits)
+    — a good balance of sensitivity and collision resistance for glyph-sized
+    images.  The bit layout matches ``np.packbits`` row-major order with MSB
+    first, so ``hash_size ** 2`` must be divisible by 8 for unambiguous
+    packing; only ``hash_size=8`` is supported in production.
+
+    Parameters
+    ----------
+    binary:
+        A 2-D uint8 NumPy array produced by :func:`binarize_scan`.
+    glyph:
+        Bounding box to hash.
+    hash_size:
+        Controls the hash width/height.  The hash contains
+        ``hash_size ** 2`` bits.  Defaults to 8.
+
+    Raises
+    ------
+    ExtractionError
+        When ``opencv-python-headless`` is not installed or ``hash_size`` < 1.
+    """
+    if hash_size < 1:
+        raise ExtractionError(f"hash_size must be >= 1, got {hash_size}")
+    import numpy as np
+
+    cv2 = _require_cv2()
+    crop = binary[glyph.y : glyph.y + glyph.height, glyph.x : glyph.x + glyph.width]
+    # Resize to (hash_size+1) columns x hash_size rows for horizontal differences.
+    small = cv2.resize(crop, (hash_size + 1, hash_size))
+    # Vectorised: compare each pixel against its right neighbour in one shot.
+    diff = small[:, :-1] > small[:, 1:]  # (hash_size, hash_size) bool array
+    # Pack bits MSB-first (np.packbits default) and interpret as a big-endian integer.
+    return int.from_bytes(np.packbits(diff).tobytes(), "big")
+
+
+def hamming_distance(a: int, b: int) -> int:
+    """Return the Hamming distance (number of differing bits) between two hashes.
+
+    Works on any non-negative integers; meaningful for hashes returned by
+    :func:`compute_dhash`.  A distance of 0 means the hashes are identical;
+    ≤ 10 out of 64 bits indicates near-identical images at the default
+    ``hash_size=8``.
+    """
+    return bin(a ^ b).count("1")
+
+
 __all__ = [
     "DEFAULT_MAX_AREA_FRACTION",
     "MIN_GLYPH_PX",
     "ExtractionError",
     "Glyph",
     "binarize_scan",
+    "compute_dhash",
+    "compute_ink_ratio",
     "crop_binary",
     "crop_glyph",
     "extract_glyphs",
+    "hamming_distance",
 ]