155 changes: 155 additions & 0 deletions docs/design/quality-and-dedupe.md
@@ -0,0 +1,155 @@
# Quality Metrics and Near-Duplicate Deduplication (M4)

**Status:** Implemented
**Milestone:** M4
**Issue:** #22
**Date:** 2026-05-25

---

## Motivation

The CCA extractor (M3) annotates every connected component that passes the
size and area filters. Two failure modes degrade downstream HTR training:

1. **Low-quality crops** — crops with very little ink (noise blobs, stray
marks, artefacts) or fully filled bboxes (ruled lines, bleed-through).
2. **Near-duplicate variants** — the same physical letter glyph annotated more
than once, or different crops of the same ink stroke from overlapping
bounding boxes. Near-dupes inflate the variant count without adding visual
diversity and can bias classifiers toward specific writers.

M4 addresses both by embedding a per-variant quality metric in the output
schema and running a greedy near-duplicate dedup pass before emitting
`letter_set.json`.

---

## Quality Metric: `ink_ratio`

### Definition

```
ink_ratio = count(foreground pixels in bbox) / (bbox_width × bbox_height)
```

Foreground pixels are those with value > 0 in the Otsu-binarised image
(i.e., ink pixels after `THRESH_BINARY_INV`).

### Rationale

- **Computable with no extra dependencies** — the binarised array is already
in memory from the crop step.
- **Interpretable** — maps directly onto visual ink density.
- **Actionable** — consumers can threshold on `ink_ratio` to discard near-empty
or fully-filled crops without re-processing scans.
- **Typical range for legible Hebrew glyphs** — 0.10-0.60; outside this band
is a quality signal (not a hard filter in M4, left to consumers).
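A minimal sketch of how a consumer might apply that band when reading `letter_set.json`. The helper name `filter_by_ink_ratio`, the constants, and the two non-alef sample entries are hypothetical; only the `quality.ink_ratio` and `asset_path` fields come from the schema described here.

```python
from typing import Any

# Band quoted above as typical for legible Hebrew glyphs. These bounds are
# illustrative consumer-side defaults; M4 itself does not enforce them.
INK_RATIO_MIN = 0.10
INK_RATIO_MAX = 0.60


def filter_by_ink_ratio(
    variants: list[dict[str, Any]],
    lo: float = INK_RATIO_MIN,
    hi: float = INK_RATIO_MAX,
) -> list[dict[str, Any]]:
    """Keep variants whose quality.ink_ratio lies in [lo, hi] (hypothetical helper)."""
    return [v for v in variants if lo <= v["quality"]["ink_ratio"] <= hi]


# Toy input: one plausible glyph, one near-empty crop, one fully filled crop.
variants = [
    {"asset_path": "letters/alef/alef-0001.png", "quality": {"ink_ratio": 0.31}},
    {"asset_path": "letters/alef/noise.png", "quality": {"ink_ratio": 0.02}},
    {"asset_path": "letters/alef/ruled-line.png", "quality": {"ink_ratio": 0.97}},
]
kept = filter_by_ink_ratio(variants)
print([v["asset_path"] for v in kept])
```

Because the metric is stored per variant, a consumer can tighten or loosen the band without re-running extraction.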

### Implementation

`compute_ink_ratio(binary, glyph)` in `extractor.py`:

```python
crop = binary[glyph.y : glyph.y + glyph.height, glyph.x : glyph.x + glyph.width]
ink_px = int((crop > 0).sum())
return ink_px / (glyph.width * glyph.height)
```

The value is embedded in the `quality` sub-object of every variant in
`letter_set.json` and validated against the updated `letter_set.schema.json`
(where `quality` is now a **required** field on `variant`).

---

## Near-Duplicate Deduplication: dHash + Hamming Distance

### Algorithm

**Step 1 — Compute a 64-bit difference hash (dHash) per crop.**

dHash is a perceptual hash that captures the gradient structure of an image.
For each glyph crop:

1. Resize the crop to `(hash_size + 1) x hash_size` pixels using bilinear
interpolation (`hash_size = 8` by default → 9 x 8 = 72 pixels).
2. Compare adjacent pixels in each row: for column `c` in row `r`, set bit 1
if `pixel[r, c] > pixel[r, c+1]`, else 0.
3. Pack all `hash_size²` (64 at default) bits into a single Python `int`.
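Sub-steps 2 and 3 can be sketched in pure NumPy, assuming the crop has already been resized to `hash_size` rows by `hash_size + 1` columns. The function name `dhash_from_grid` and the toy grids are illustrative and are not part of `extractor.py`.

```python
import numpy as np


def dhash_from_grid(small: np.ndarray) -> int:
    """Encode horizontal gradients of a (hash_size, hash_size + 1) grid as an int."""
    # A bit is 1 where a pixel is strictly greater than its right-hand neighbour.
    diff = small[:, :-1] > small[:, 1:]
    # Pack the flattened (row-major) bits MSB-first, then read them as a big-endian int.
    return int.from_bytes(np.packbits(diff).tobytes(), "big")


# Toy 8 x 9 grid whose values fall from left to right in every row:
# every comparison is true, so all 64 bits are set.
falling = np.tile(np.arange(9, 0, -1, dtype=np.uint8), (8, 1))
# The mirrored grid rises from left to right, so no bit is set.
rising = falling[:, ::-1]

print(hex(dhash_from_grid(falling)))  # 0xffffffffffffffff
print(dhash_from_grid(rising))        # 0
```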

**Step 2 — Greedy single-pass clustering per letter.**

For each annotated letter (e.g. `'א'`), process variants in arrival order:

- Compare the candidate's dHash against every already-selected
representative using Hamming distance.
- If the minimum Hamming distance is ≤ `_DEDUP_HAMMING_THRESHOLD` (10),
the candidate is a near-duplicate of that representative:
- Keep whichever has the higher `ink_ratio`.
- If no representative is within threshold, add the candidate as a new
representative.
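The clustering rule can be sketched as follows. This is an illustration under assumptions, not the code in `generator.py`: the function name `dedup_variants` is hypothetical, each variant is assumed to be a dict carrying `_dhash` and `quality.ink_ratio`, and on an exact `ink_ratio` tie the earlier representative is kept.

```python
from typing import Any

DEDUP_HAMMING_THRESHOLD = 10  # mirrors the documented default


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two non-negative integers."""
    return bin(a ^ b).count("1")


def dedup_variants(
    variants: list[dict[str, Any]],
    threshold: int = DEDUP_HAMMING_THRESHOLD,
) -> list[dict[str, Any]]:
    """Greedy single-pass dedup for one letter (hypothetical helper)."""
    reps: list[dict[str, Any]] = []
    for cand in variants:
        # Find the nearest representative that is within the threshold, if any.
        best_i, best_d = -1, threshold + 1
        for i, rep in enumerate(reps):
            d = hamming(cand["_dhash"], rep["_dhash"])
            if d < best_d:
                best_i, best_d = i, d
        if best_i < 0:
            reps.append(cand)  # nothing close enough: new representative
        elif cand["quality"]["ink_ratio"] > reps[best_i]["quality"]["ink_ratio"]:
            reps[best_i] = cand  # near-duplicate with more ink replaces the old one
    return reps


# Toy hashes: "b" is 3 bits away from "a"; "c" is far from both.
variants = [
    {"asset_path": "a.png", "_dhash": 0b000, "quality": {"ink_ratio": 0.30}},
    {"asset_path": "b.png", "_dhash": 0b111, "quality": {"ink_ratio": 0.35}},
    {"asset_path": "c.png", "_dhash": 2**64 - 1, "quality": {"ink_ratio": 0.20}},
]
print([v["asset_path"] for v in dedup_variants(variants)])  # ['b.png', 'c.png']
```

Note that a replaced representative also changes the hash that later candidates are compared against, which is one source of the order dependence inherent in a greedy pass.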

**Step 3 — Strip the internal `_dhash` key before writing.**

dHash is used only during generation; it is not part of the schema and is
removed from every variant dict before `letter_set.json` is written.
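A minimal sketch of the stripping step, assuming variants are plain dicts; the helper name `strip_dhash` is hypothetical.

```python
from typing import Any


def strip_dhash(variant: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the variant without the internal `_dhash` key."""
    return {key: value for key, value in variant.items() if key != "_dhash"}


variant = {
    "asset_path": "letters/alef/alef-0001.png",
    "quality": {"ink_ratio": 0.31},
    "_dhash": 0x0F0F_0F0F_0F0F_0F0F,
}
clean = strip_dhash(variant)
print(sorted(clean))  # ['asset_path', 'quality']
```

Returning a copy rather than deleting the key in place leaves the in-memory variant usable for any later step that still needs the hash.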

### Design Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Hash algorithm | dHash | Pure OpenCV (no extra deps); fast; well-suited for binary glyph images |
| Hash size | 8 (64 bits) | Standard; good sensitivity/collision balance for glyph-sized images |
| Hamming threshold | 10 / 64 bits (~16 %) | Conventional loose threshold for perceptual hashing; empirical calibration deferred |
| Dedup scope | Per letter, per writer | Cross-letter dedup is out of scope; cross-writer dedup is a downstream concern |
| Clustering algorithm | Greedy single-pass | O(n²) per letter, but n is small (< 20 variants/letter per writer in practice) |
| Representative selection | Higher `ink_ratio` | Proxy for legibility; avoids arbitrary selection between near-duplicates |
| dHash in schema | Not included | dHash is a generation-time implementation detail, not a stable output attribute |

### Threshold Calibration

The threshold of 10 / 64 bits is an uncalibrated starting point taken from
common perceptual-hashing practice. Empirical calibration against real HeOCR
corpus scans is deferred to a future sub-PR once a labelled near-duplicate test
set is available. The constant `_DEDUP_HAMMING_THRESHOLD` in `generator.py` is
the single place to change it.

---

## Schema Changes

`variant` in `letter_set.schema.json` (Draft 2020-12):

- `"quality"` added to the `required` array.
- `"quality"` object added with `"ink_ratio"` as the only required property
(`number`, range [0.0, 1.0]).
- `additionalProperties: false` on `quality` rejects unknown keys, so adding
further metrics requires a schema revision.
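The constraints on `quality` can be sketched as a Python dict plus a hand-rolled check. This is an illustration only: the use of `minimum` and `maximum` to express the [0.0, 1.0] range is an assumption about how the real `letter_set.schema.json` is written, and `quality_is_valid` is a hypothetical helper, not a replacement for a JSON Schema validator.

```python
from typing import Any

# Assumed shape of the `quality` sub-schema (keyword choice is illustrative).
QUALITY_SCHEMA: dict[str, Any] = {
    "type": "object",
    "required": ["ink_ratio"],
    "additionalProperties": False,
    "properties": {
        "ink_ratio": {"type": "number", "minimum": 0.0, "maximum": 1.0},
    },
}


def quality_is_valid(quality: Any, schema: dict[str, Any] = QUALITY_SCHEMA) -> bool:
    """Check a `quality` object against the constraints listed above."""
    if not isinstance(quality, dict):
        return False
    keys = set(quality)
    if not set(schema["required"]) <= keys:
        return False  # a required property is missing
    if schema["additionalProperties"] is False and not keys <= set(schema["properties"]):
        return False  # an unknown property is present
    spec = schema["properties"]["ink_ratio"]
    value = quality["ink_ratio"]
    # bool is a subclass of int in Python but is not a JSON number.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return spec["minimum"] <= value <= spec["maximum"]


print(quality_is_valid({"ink_ratio": 0.31}))                    # True
print(quality_is_valid({"ink_ratio": 0.31, "sharpness": 0.9}))  # False
```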

---

## Public API additions (`extractor.py`)

| Symbol | Returns | Description |
|---|---|---|
| `compute_ink_ratio(binary, glyph)` | `float` | Ink fraction in [0.0, 1.0] |
| `compute_dhash(binary, glyph, *, hash_size=8)` | `int` | dHash of `hash_size²` bits (64 at the default) |
| `hamming_distance(a, b)` | `int` | Bit-wise Hamming distance |

All three are exported via `__all__` and documented in the module docstring.
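A short worked example of the Hamming comparison against the default threshold of 10. The function body repeats the one-line implementation from `extractor.py` so the snippet runs on its own; the hash values are made up for illustration.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two non-negative integers."""
    return bin(a ^ b).count("1")


THRESHOLD = 10  # documented default for 64-bit hashes

h1 = 0xFFFF_0000_FFFF_0000
h2 = h1 ^ 0xFF   # flips the lowest 8 bits
h3 = h1 ^ 0xFFF  # flips the lowest 12 bits

print(hamming_distance(h1, h2), hamming_distance(h1, h2) <= THRESHOLD)  # 8 True
print(hamming_distance(h1, h3), hamming_distance(h1, h3) <= THRESHOLD)  # 12 False
```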

---

## Known Limitations / Out of Scope

- **Hard ink_ratio filtering** — the metric is recorded but no automatic
drop threshold is applied in M4. Consumers filter as needed.
- **Cross-writer dedup** — different writers may produce near-identical glyphs;
dedup is scoped per writer per letter.
- **Clustering quality** — greedy single-pass is order-dependent. A full
clustering (e.g. DBSCAN on 64-bit hash space) is deferred.
- **Nikud merging** — diacritical marks are still emitted as separate blobs;
merging with parent letter bodies deferred to M5.
- **dHash on colour images** — dHash operates on the binarised image; this is
correct for ink-on-white glyph crops but would need adjustment if the
pipeline were extended to colour channels.
4 changes: 4 additions & 0 deletions examples/letter_set/writer_example.json
@@ -25,6 +25,7 @@
"asset_path": "letters/alef/alef-0001.png",
"checksum_sha256": "0000000000000000000000000000000000000000000000000000000000000000",
"image": { "width_px": 96, "height_px": 96, "format": "png" },
"quality": { "ink_ratio": 0.31 },
"source": {
"scan_entry_id": "example-scan-0001",
"scan_url": "https://example.invalid/scans/example-scan-0001",
@@ -39,6 +40,7 @@
"asset_path": "letters/alef/alef-0002.png",
"checksum_sha256": "1111111111111111111111111111111111111111111111111111111111111111",
"image": { "width_px": 92, "height_px": 100, "format": "png" },
"quality": { "ink_ratio": 0.27 },
"source": {
"scan_entry_id": "example-scan-0002",
"license": "PDM-1.0",
@@ -52,6 +54,7 @@
"asset_path": "letters/bet/bet-0001.png",
"checksum_sha256": "2222222222222222222222222222222222222222222222222222222222222222",
"image": { "width_px": 88, "height_px": 90, "format": "png" },
"quality": { "ink_ratio": 0.22 },
"source": {
"scan_entry_id": "example-scan-0001",
"license": "CC-BY-SA-4.0",
@@ -66,6 +69,7 @@
"asset_path": "letters/kaf-final/kaf-final-0001.png",
"checksum_sha256": "3333333333333333333333333333333333333333333333333333333333333333",
"image": { "width_px": 80, "height_px": 110, "format": "png" },
"quality": { "ink_ratio": 0.18 },
"source": {
"scan_entry_id": "example-scan-0002",
"license": "PDM-1.0",
1 change: 1 addition & 0 deletions pyproject.toml
@@ -33,6 +33,7 @@ lint = [
]
typecheck = [
"mypy>=1.10",
"numpy>=1.20",
]
cv = [
"opencv-python-headless>=4.9",
87 changes: 87 additions & 0 deletions src/hletterscriptgen/extractor.py
@@ -18,6 +18,9 @@
* :class:`Glyph` — frozen dataclass for a detected blob's bounding box.
* :func:`binarize_scan` — load a scan and return its Otsu-binarised array.
* :func:`crop_binary` — crop a glyph from an already-binarised array.
* :func:`compute_ink_ratio` — fraction of ink pixels in a glyph bbox.
* :func:`compute_dhash` — 64-bit difference hash of a glyph crop.
* :func:`hamming_distance` — Hamming distance between two integer hashes.
* :func:`extract_glyphs` — detect blobs in a scan via CCA.
* :func:`crop_glyph` — convenience wrapper: binarize a scan and crop one blob.
"""
@@ -285,13 +288,97 @@ def crop_glyph(image_path: Path, glyph: Glyph) -> bytes:
return crop_binary(binarize_scan(image_path), glyph)


def compute_ink_ratio(binary: Any, glyph: Glyph) -> float:
"""Return the fraction of ink pixels in the glyph bbox.

Counts foreground pixels (value > 0 in the binarised array) within
the bounding box and divides by the total number of pixels in the box.
This is pure NumPy arithmetic and does **not** require OpenCV.

The result is in [0.0, 1.0]:
* Near 0.0 — almost-empty crop (noise / whitespace artefact).
* Near 1.0 — fully filled bbox (also suspicious — may be a ruled line
or bleed-through that passed the :data:`DEFAULT_MAX_AREA_FRACTION`
filter).
* 0.10-0.60 — typical range for legible Hebrew glyphs.

Parameters
----------
binary:
A 2-D uint8 NumPy array produced by :func:`binarize_scan`.
glyph:
Bounding box whose ink ratio to compute.
"""
crop = binary[glyph.y : glyph.y + glyph.height, glyph.x : glyph.x + glyph.width]
ink_px = int((crop > 0).sum())
return ink_px / (glyph.width * glyph.height)


def compute_dhash(binary: Any, glyph: Glyph, *, hash_size: int = 8) -> int:
"""Compute a difference hash (dHash) for the glyph crop.

Resizes the crop to ``(hash_size + 1) x hash_size`` pixels and encodes
horizontal pixel differences as a ``hash_size ** 2``-bit integer.
Identical or near-identical glyphs produce hashes with a low
:func:`hamming_distance`; hashes from visually distinct glyphs differ
by many bits.

The default ``hash_size=8`` yields a 64-bit hash (8 x 8 difference bits)
— a good balance of sensitivity and collision resistance for glyph-sized
images. The bit layout matches ``np.packbits`` row-major order with MSB
first, so ``hash_size ** 2`` must be divisible by 8 for unambiguous
packing; only ``hash_size=8`` is supported in production.

Parameters
----------
binary:
A 2-D uint8 NumPy array produced by :func:`binarize_scan`.
glyph:
Bounding box to hash.
hash_size:
Controls the hash width/height. The hash contains
``hash_size ** 2`` bits. Defaults to 8.

Raises
------
ExtractionError
When ``opencv-python-headless`` is not installed or ``hash_size`` < 1.
"""
if hash_size < 1:
raise ExtractionError(f"hash_size must be >= 1, got {hash_size}")
import numpy as np

cv2 = _require_cv2()
crop = binary[glyph.y : glyph.y + glyph.height, glyph.x : glyph.x + glyph.width]
# Resize to (hash_size+1) columns x hash_size rows for horizontal differences.
small = cv2.resize(crop, (hash_size + 1, hash_size))
# Vectorised: compare each pixel against its right neighbour in one shot.
diff = small[:, :-1] > small[:, 1:] # (hash_size, hash_size) bool array
# Pack bits MSB-first (np.packbits default) and interpret as a big-endian integer.
return int.from_bytes(np.packbits(diff).tobytes(), "big")


def hamming_distance(a: int, b: int) -> int:
"""Return the Hamming distance (number of differing bits) between two hashes.

Works on any non-negative integers; meaningful for hashes returned by
:func:`compute_dhash`. A distance of 0 means the hashes are identical;
≤ 10 out of 64 bits indicates near-identical images at the default
``hash_size=8``.
"""
return bin(a ^ b).count("1")


__all__ = [
"DEFAULT_MAX_AREA_FRACTION",
"MIN_GLYPH_PX",
"ExtractionError",
"Glyph",
"binarize_scan",
"compute_dhash",
"compute_ink_ratio",
"crop_binary",
"crop_glyph",
"extract_glyphs",
"hamming_distance",
]