diff --git a/docs/design/quality-and-dedupe.md b/docs/design/quality-and-dedupe.md
new file mode 100644
index 0000000..1c8f89c
--- /dev/null
+++ b/docs/design/quality-and-dedupe.md
@@ -0,0 +1,155 @@
+# Quality Metrics and Near-Duplicate Deduplication (M4)
+
+**Status:** Implemented  
+**Milestone:** M4  
+**Issue:** #22  
+**Date:** 2026-05-25
+
+---
+
+## Motivation
+
+The CCA extractor (M3) annotates every connected component that passes the
+size and area filters.  Two failure modes degrade downstream HTR training:
+
+1. **Low-quality crops** — crops with very little ink (noise blobs, stray
+   marks, artefacts) or fully filled bboxes (ruled lines, bleed-through).
+2. **Near-duplicate variants** — the same physical letter glyph annotated more
+   than once, or different crops of the same ink stroke from overlapping
+   bounding boxes.  Near-dupes inflate the variant count without adding visual
+   diversity and can bias classifiers toward specific writers.
+
+M4 addresses both by embedding a per-variant quality metric in the output
+schema and running a greedy near-duplicate dedup pass before emitting
+`letter_set.json`.
+
+---
+
+## Quality Metric: `ink_ratio`
+
+### Definition
+
+```
+ink_ratio = count(foreground pixels in bbox) / (bbox_width × bbox_height)
+```
+
+Foreground pixels are those with value > 0 in the Otsu-binarised image
+(i.e., ink pixels after `THRESH_BINARY_INV`).
+
+### Rationale
+
+- **Computable with no extra dependencies** — the binarised array is already
+  in memory from the crop step.
+- **Interpretable** — maps directly onto visual ink density.
+- **Actionable** — consumers can threshold on `ink_ratio` to discard near-empty
+  or fully-filled crops without re-processing scans.
+- **Typical range for legible Hebrew glyphs** — 0.10-0.60; outside this band
+  is a quality signal (not a hard filter in M4, left to consumers).
+
+### Implementation
+
+`compute_ink_ratio(binary, glyph)` in `extractor.py`:
+
+```python
+crop = binary[glyph.y : glyph.y + glyph.height, glyph.x : glyph.x + glyph.width]
+ink_px = int((crop > 0).sum())
+return ink_px / (glyph.width * glyph.height)
+```
+
+The value is embedded in the `quality` sub-object of every variant in
+`letter_set.json` and validated against the updated `letter_set.schema.json`
+(where `quality` is now a **required** field on `variant`).
+
+---
+
+## Near-Duplicate Deduplication: dHash + Hamming Distance
+
+### Algorithm
+
+**Step 1 — Compute a 64-bit difference hash (dHash) per crop.**
+
+dHash is a perceptual hash that captures the gradient structure of an image.
+For each glyph crop:
+
+1. Resize the crop to `(hash_size + 1) x hash_size` pixels using bilinear
+   interpolation (`hash_size = 8` by default → 9 x 8 = 72 pixels).
+2. Compare adjacent pixels in each row: for column `c` in row `r`, set bit 1
+   if `pixel[r, c] > pixel[r, c+1]`, else 0.
+3. Pack all `hash_size²` (64 at default) bits into a single Python `int`.
+
+**Step 2 — Greedy single-pass clustering per letter.**
+
+For each annotated letter (e.g. `'א'`), process variants in arrival order:
+
+- Compare the candidate's dHash against every already-selected
+  representative using Hamming distance.
+- If the minimum Hamming distance is ≤ `_DEDUP_HAMMING_THRESHOLD` (10),
+  the candidate is a near-duplicate of that representative:
+  - Keep whichever has the higher `ink_ratio`.
+- If no representative is within threshold, add the candidate as a new
+  representative.
+
+**Step 3 — Strip the internal `_dhash` key before writing.**
+
+dHash is used only during generation; it is not part of the schema and is
+removed from every variant dict before `letter_set.json` is written.
+
+### Design Decisions
+
+| Decision | Choice | Rationale |
+|---|---|---|
+| Hash algorithm | dHash | Pure OpenCV (no extra deps); fast; well-suited for binary glyph images |
+| Hash size | 8 (64 bits) | Standard; good sensitivity/collision balance for glyph-sized images |
+| Hamming threshold | 10 / 64 bits (~15 %) | Conventional loose threshold for perceptual hashing; empirical calibration deferred |
+| Dedup scope | Per letter, per writer | Cross-letter dedup is out of scope; cross-writer dedup is a downstream concern |
+| Clustering algorithm | Greedy single-pass | O(n²) per letter, but n is small (< 20 variants/letter per writer in practice) |
+| Tie-breaking | Higher `ink_ratio` | Proxy for legibility; avoids arbitrary selection |
+| dHash in schema | Not included | dHash is a generation-time implementation detail, not a stable output attribute |
+
+### Threshold Calibration
+
+The threshold of 10 / 64 bits is a conservative starting point from the
+perceptual-hashing literature.  Empirical calibration against real HeOCR corpus
+scans is deferred to a future sub-PR once a labelled near-duplicate test set is
+available.  The constant `_DEDUP_HAMMING_THRESHOLD` in `generator.py` is the
+single change point.
+
+---
+
+## Schema Changes
+
+`variant` in `letter_set.schema.json` (Draft 2020-12):
+
+- `"quality"` added to the `required` array.
+- `"quality"` object added with `"ink_ratio"` as the only required property
+  (`number`, range [0.0, 1.0]).
+- `additionalProperties: false` on `quality` allows forward extension via
+  schema revision.
+
+---
+
+## Public API additions (`extractor.py`)
+
+| Symbol | Type | Description |
+|---|---|---|
+| `compute_ink_ratio(binary, glyph)` | `float` | Ink fraction in [0.0, 1.0] |
+| `compute_dhash(binary, glyph, *, hash_size=8)` | `int` | 64-bit dHash |
+| `hamming_distance(a, b)` | `int` | Bit-wise Hamming distance |
+
+All three are exported via `__all__` and documented in the module docstring.
+
+---
+
+## Known Limitations / Out of Scope
+
+- **Hard ink_ratio filtering** — the metric is recorded but no automatic
+  drop threshold is applied in M4.  Consumers filter as needed.
+- **Cross-writer dedup** — different writers may produce near-identical glyphs;
+  dedup is scoped per writer per letter.
+- **Clustering quality** — greedy single-pass is order-dependent.  A full
+  clustering (e.g. DBSCAN on 64-bit hash space) is deferred.
+- **Nikud merging** — diacritical marks are still emitted as separate blobs;
+  merging with parent letter bodies deferred to M5.
+- **dHash on colour images** — dHash operates on the binarised image; this is
+  correct for ink-on-white glyph crops but would need adjustment if the
+  pipeline were extended to colour channels.
diff --git a/examples/letter_set/writer_example.json b/examples/letter_set/writer_example.json
index c46e1fd..58ca41f 100644
--- a/examples/letter_set/writer_example.json
+++ b/examples/letter_set/writer_example.json
@@ -25,6 +25,7 @@
         "asset_path": "letters/alef/alef-0001.png",
         "checksum_sha256": "0000000000000000000000000000000000000000000000000000000000000000",
         "image": { "width_px": 96, "height_px": 96, "format": "png" },
+        "quality": { "ink_ratio": 0.31 },
         "source": {
           "scan_entry_id": "example-scan-0001",
           "scan_url": "https://example.invalid/scans/example-scan-0001",
@@ -39,6 +40,7 @@
         "asset_path": "letters/alef/alef-0002.png",
         "checksum_sha256": "1111111111111111111111111111111111111111111111111111111111111111",
         "image": { "width_px": 92, "height_px": 100, "format": "png" },
+        "quality": { "ink_ratio": 0.27 },
         "source": {
           "scan_entry_id": "example-scan-0002",
           "license": "PDM-1.0",
@@ -52,6 +54,7 @@
         "asset_path": "letters/bet/bet-0001.png",
         "checksum_sha256": "2222222222222222222222222222222222222222222222222222222222222222",
         "image": { "width_px": 88, "height_px": 90, "format": "png" },
+        "quality": { "ink_ratio": 0.22 },
         "source": {
           "scan_entry_id": "example-scan-0001",
           "license": "CC-BY-SA-4.0",
@@ -66,6 +69,7 @@
         "asset_path": "letters/kaf-final/kaf-final-0001.png",
         "checksum_sha256": "3333333333333333333333333333333333333333333333333333333333333333",
         "image": { "width_px": 80, "height_px": 110, "format": "png" },
+        "quality": { "ink_ratio": 0.18 },
         "source": {
           "scan_entry_id": "example-scan-0002",
           "license": "PDM-1.0",
diff --git a/pyproject.toml b/pyproject.toml
index facc220..be3f9e3 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -33,6 +33,7 @@ lint = [
 ]
 typecheck = [
   "mypy>=1.10",
+  "numpy>=1.20",
 ]
 cv = [
   "opencv-python-headless>=4.9",
diff --git a/src/hletterscriptgen/extractor.py b/src/hletterscriptgen/extractor.py
index be0c668..ba55792 100644
--- a/src/hletterscriptgen/extractor.py
+++ b/src/hletterscriptgen/extractor.py
@@ -18,6 +18,9 @@
 * :class:`Glyph` — frozen dataclass for a detected blob's bounding box.
 * :func:`binarize_scan` — load a scan and return its Otsu-binarised array.
 * :func:`crop_binary` — crop a glyph from an already-binarised array.
+* :func:`compute_ink_ratio` — fraction of ink pixels in a glyph bbox.
+* :func:`compute_dhash` — 64-bit difference hash of a glyph crop.
+* :func:`hamming_distance` — Hamming distance between two integer hashes.
 * :func:`extract_glyphs` — detect blobs in a scan via CCA.
 * :func:`crop_glyph` — convenience wrapper: binarize a scan and crop one blob.
 """
@@ -285,13 +288,97 @@ def crop_glyph(image_path: Path, glyph: Glyph) -> bytes:
     return crop_binary(binarize_scan(image_path), glyph)
 
 
+def compute_ink_ratio(binary: Any, glyph: Glyph) -> float:
+    """Return the fraction of ink pixels in the glyph bbox.
+
+    Counts foreground pixels (value > 0 in the binarised array) within
+    the bounding box and divides by the total number of pixels in the box.
+    This is pure NumPy arithmetic and does **not** require OpenCV.
+
+    The result is in [0.0, 1.0]:
+    * Near 0.0 — almost-empty crop (noise / whitespace artefact).
+    * Near 1.0 — fully filled bbox (also suspicious — may be a ruled line
+      or bleed-through that passed the :data:`DEFAULT_MAX_AREA_FRACTION`
+      filter).
+    * 0.10-0.60 — typical range for legible Hebrew glyphs.
+
+    Parameters
+    ----------
+    binary:
+        A 2-D uint8 NumPy array produced by :func:`binarize_scan`.
+    glyph:
+        Bounding box whose ink ratio to compute.
+    """
+    crop = binary[glyph.y : glyph.y + glyph.height, glyph.x : glyph.x + glyph.width]
+    ink_px = int((crop > 0).sum())
+    return ink_px / (glyph.width * glyph.height)
+
+
+def compute_dhash(binary: Any, glyph: Glyph, *, hash_size: int = 8) -> int:
+    """Compute a difference hash (dHash) for the glyph crop.
+
+    Resizes the crop to ``(hash_size + 1) x hash_size`` pixels and encodes
+    horizontal pixel differences as a ``hash_size ** 2``-bit integer.
+    Identical or near-identical glyphs produce hashes with a low
+    :func:`hamming_distance`; hashes from visually distinct glyphs differ
+    by many bits.
+
+    The default ``hash_size=8`` yields a 64-bit hash (8 x 8 difference bits)
+    — a good balance of sensitivity and collision resistance for glyph-sized
+    images.  The bit layout matches ``np.packbits`` row-major order with MSB
+    first, so ``hash_size ** 2`` must be divisible by 8 for unambiguous
+    packing; only ``hash_size=8`` is supported in production.
+
+    Parameters
+    ----------
+    binary:
+        A 2-D uint8 NumPy array produced by :func:`binarize_scan`.
+    glyph:
+        Bounding box to hash.
+    hash_size:
+        Controls the hash width/height.  The hash contains
+        ``hash_size ** 2`` bits.  Defaults to 8.
+
+    Raises
+    ------
+    ExtractionError
+        When ``opencv-python-headless`` is not installed or ``hash_size`` < 1.
+    """
+    if hash_size < 1:
+        raise ExtractionError(f"hash_size must be >= 1, got {hash_size}")
+    import numpy as np
+
+    cv2 = _require_cv2()
+    crop = binary[glyph.y : glyph.y + glyph.height, glyph.x : glyph.x + glyph.width]
+    # Resize to (hash_size+1) columns x hash_size rows for horizontal differences.
+    small = cv2.resize(crop, (hash_size + 1, hash_size))
+    # Vectorised: compare each pixel against its right neighbour in one shot.
+    diff = small[:, :-1] > small[:, 1:]  # (hash_size, hash_size) bool array
+    # Pack bits MSB-first (np.packbits default) and interpret as a big-endian integer.
+    return int.from_bytes(np.packbits(diff).tobytes(), "big")
+
+
+def hamming_distance(a: int, b: int) -> int:
+    """Return the Hamming distance (number of differing bits) between two hashes.
+
+    Works on any non-negative integers; meaningful for hashes returned by
+    :func:`compute_dhash`.  A distance of 0 means the hashes are identical;
+    ≤ 10 out of 64 bits indicates near-identical images at the default
+    ``hash_size=8``.
+    """
+    return bin(a ^ b).count("1")
+
+
 __all__ = [
     "DEFAULT_MAX_AREA_FRACTION",
     "MIN_GLYPH_PX",
     "ExtractionError",
     "Glyph",
     "binarize_scan",
+    "compute_dhash",
+    "compute_ink_ratio",
     "crop_binary",
     "crop_glyph",
     "extract_glyphs",
+    "hamming_distance",
 ]
diff --git a/src/hletterscriptgen/generator.py b/src/hletterscriptgen/generator.py
index 2aa46bc..cc57216 100644
--- a/src/hletterscriptgen/generator.py
+++ b/src/hletterscriptgen/generator.py
@@ -1,4 +1,4 @@
-"""End-to-end glyph extraction pipeline (M3 MVP).
+"""End-to-end glyph extraction pipeline (M3/M4).
 
 Orchestrates the full generate flow:
 
@@ -7,11 +7,13 @@
 3. For each writer → each scan → each annotated glyph:
    a. Look up the upstream entry; skip ineligible entries (warn).
    b. Resolve the scan file path from the upstream checkout.
-   c. Binarise the scan once, then crop each glyph from the binary array.
-   d. Write the PNG to the output tree.
-   e. Record a ``variant`` for the letter_set.v1 document.
-4. Build and validate the ``letter_set.v1`` document for each writer.
-5. Write ``letter_set.json`` to the writer's output directory.
+   c. Binarise the scan once, then crop/hash/measure each glyph.
+   d. Accumulate variant dicts in memory (PNG bytes held, not yet written).
+4. Per letter: deduplicate near-duplicate variants by 64-bit dHash
+   (Hamming ≤ :data:`_DEDUP_HAMMING_THRESHOLD`); keep highest ``ink_ratio``.
+5. Write only the surviving PNG assets to the output tree.
+6. Build and validate the ``letter_set.v1`` document for each writer.
+7. Write ``letter_set.json`` to the writer's output directory.
 
 Output tree structure::
 
@@ -45,7 +47,15 @@
 from typing import Any
 
 from hletterscriptgen import HEBREW_LETTERS, __version__
-from hletterscriptgen.extractor import ExtractionError, Glyph, binarize_scan, crop_binary
+from hletterscriptgen.extractor import (
+    ExtractionError,
+    Glyph,
+    binarize_scan,
+    compute_dhash,
+    compute_ink_ratio,
+    crop_binary,
+    hamming_distance,
+)
 from hletterscriptgen.generate_profile import (
     GenerateProfile,
     GlyphAnnotation,
@@ -125,6 +135,69 @@ def _resolve_scan_path(
     return None
 
 
+# Hamming-distance threshold for near-duplicate dHash clustering.  Two glyphs
+# whose hashes differ by ≤ this many bits are considered near-duplicates; the
+# one with the higher ink_ratio is kept.  10 / 64 bits ≈ 15 % of bits differ,
+# which is the conventional loose threshold for perceptual-hash deduplication.
+_DEDUP_HAMMING_THRESHOLD: int = 10
+
+
+def _dedup_letter_variants(variants: list[dict[str, Any]]) -> list[dict[str, Any]]:
+    """Return a deduplicated copy of one letter's candidate variant list.
+
+    Clusters variants by 64-bit dHash using a greedy single-pass algorithm:
+    for each variant (in arrival order), check whether it falls within
+    :data:`_DEDUP_HAMMING_THRESHOLD` Hamming bits of any already-selected
+    representative.
+
+    * If it **matches** an existing representative and has a **higher**
+      ``ink_ratio``, the representative's payload (variant_id, asset_path,
+      checksum, quality, source, png bytes, …) is replaced with the
+      candidate's — but the **cluster-centre hash** (the original
+      representative's ``_dhash``) is **preserved**.  This prevents the
+      cluster centre from drifting across successive updates, which would
+      otherwise cause a chain A ≈ B ≈ C (but A ≁ C) to incorrectly absorb
+      C into A's cluster.
+    * If it **matches** but has an equal or lower ``ink_ratio``, the existing
+      representative is kept unchanged.
+    * If it does **not match** any representative, it starts a new cluster.
+
+    Internal keys (``_dhash``, ``_png_bytes``) are **not** stripped here;
+    the caller is responsible for stripping them and writing PNG files after
+    this function returns.
+
+    Parameters
+    ----------
+    variants:
+        List of variant dicts, each carrying temporary ``_dhash`` (int) and
+        ``_png_bytes`` (bytes) keys alongside the schema-visible fields.
+
+    Returns
+    -------
+    list[dict[str, Any]]
+        Deduplicated list retaining all input keys (including ``_dhash`` and
+        ``_png_bytes``).
+    """
+    representatives: list[dict[str, Any]] = []
+
+    for candidate in variants:
+        c_hash = candidate["_dhash"]
+        c_ink = candidate["quality"]["ink_ratio"]
+        matched = False
+        for rep in representatives:
+            if hamming_distance(c_hash, rep["_dhash"]) <= _DEDUP_HAMMING_THRESHOLD:
+                if c_ink > rep["quality"]["ink_ratio"]:
+                    cluster_hash = rep["_dhash"]  # preserve cluster centre
+                    rep.update(candidate)
+                    rep["_dhash"] = cluster_hash  # restore after bulk update
+                matched = True
+                break
+        if not matched:
+            representatives.append(dict(candidate))
+
+    return representatives
+
+
 def _extract_variants(
     writer: WriterAnnotation,
     entry_index: dict[str, UpstreamEntry],
@@ -133,18 +206,27 @@ def _extract_variants(
     generated_at: str,
     pending_warnings: list[str],
 ) -> tuple[dict[str, list[dict[str, Any]]], set[str], set[str]]:
-    """Crop all glyph variants for one writer and write PNG assets to disk.
+    """Crop, deduplicate, and write glyph variants for one writer.
 
     Returns ``(letters_map, observed_licenses, used_entry_ids)``.
     Non-fatal issues (missing entries, ineligible scans, crop failures) are
     appended to ``pending_warnings``.
 
+    Pipeline order within this function:
+
+    1. Accumulate all candidate variant dicts in memory, holding PNG bytes
+       under the temporary ``_png_bytes`` key (no disk writes yet).
+    2. Deduplicate per letter via :func:`_dedup_letter_variants`.
+    3. **Only then** write the surviving PNGs to disk, so that no files are
+       created for variants eliminated by dedup.
+    4. Derive ``used_entry_ids`` and ``observed_licenses`` from the survivors,
+       so that entries or licenses contributed solely by deduped-out variants
+       are not listed in the output manifest.
+
     The scan image is binarised once per scan file; all glyph crops for that
     scan share the same binary array, avoiding redundant I/O.
     """
     letters_map: dict[str, list[dict[str, Any]]] = {}
-    observed_licenses: set[str] = set()
-    used_entry_ids: set[str] = set()
 
     for scan in writer.scans:
         entry = entry_index.get(scan.entry_id)
@@ -196,8 +278,6 @@ def _extract_variants(
             )
             continue
 
-        used_entry_ids.add(scan.entry_id)
-
         for glyph_ann in scan.glyphs:
             if glyph_ann.letter not in HEBREW_LETTERS:
                 pending_warnings.append(
@@ -222,12 +302,15 @@ def _extract_variants(
                 )
                 continue
 
+            ink = compute_ink_ratio(binary, glyph)
+            dhash = compute_dhash(binary, glyph)
             rel_path = _asset_path(scan.entry_id, glyph_ann.letter, glyph_ann)
-            out_file = writer_out_dir / rel_path
-            out_file.parent.mkdir(parents=True, exist_ok=True)
-            out_file.write_bytes(png_bytes)
 
             variant: dict[str, Any] = {
+                # Internal keys stripped after dedup + file write (not in schema).
+                "_dhash": dhash,
+                "_png_bytes": png_bytes,
+                # Schema-visible fields.
                 "variant_id": _variant_id(scan.entry_id, glyph_ann.letter, glyph_ann),
                 "asset_path": rel_path,
                 "checksum_sha256": _sha256_hex(png_bytes),
@@ -236,6 +319,9 @@ def _extract_variants(
                     "height_px": glyph_ann.height,
                     "format": "png",
                 },
+                "quality": {
+                    "ink_ratio": ink,
+                },
                 "source": {
                     "scan_entry_id": scan.entry_id,
                     "license": license_expr,
@@ -252,7 +338,42 @@ def _extract_variants(
                 variant["notes"] = glyph_ann.notes
 
             letters_map.setdefault(glyph_ann.letter, []).append(variant)
-            observed_licenses.add(license_expr)
+
+    # Dedup: collapse near-duplicate variants per letter (Hamming ≤ threshold).
+    # Warn when any are dropped so callers have visibility into data loss.
+    for letter in letters_map:
+        pre_count = len(letters_map[letter])
+        letters_map[letter] = _dedup_letter_variants(letters_map[letter])
+        dropped = pre_count - len(letters_map[letter])
+        if dropped:
+            pending_warnings.append(
+                f"writer {writer.writer_id!r}: dropped {dropped} near-duplicate "
+                f"variant(s) for letter {letter!r} "
+                f"(dHash Hamming <= {_DEDUP_HAMMING_THRESHOLD})"
+            )
+
+    # Write only surviving variants to disk, then strip internal keys.
+    # Doing this after dedup guarantees no orphaned PNG files for eliminated variants.
+    for variants in letters_map.values():
+        for variant in variants:
+            png = variant.pop("_png_bytes")
+            variant.pop("_dhash")
+            out_file = writer_out_dir / variant["asset_path"]
+            out_file.parent.mkdir(parents=True, exist_ok=True)
+            out_file.write_bytes(png)
+
+    # Derive from survivors only: entries or licenses contributed solely by
+    # deduped-out variants must not appear in the manifest.
+    used_entry_ids: set[str] = {
+        v["source"]["scan_entry_id"]
+        for variants in letters_map.values()
+        for v in variants
+    }
+    observed_licenses: set[str] = {
+        v["source"]["license"]
+        for variants in letters_map.values()
+        for v in variants
+    }
 
     return letters_map, observed_licenses, used_entry_ids
 
diff --git a/src/hletterscriptgen/schemas/letter_set.schema.json b/src/hletterscriptgen/schemas/letter_set.schema.json
index 76febb4..14d8950 100644
--- a/src/hletterscriptgen/schemas/letter_set.schema.json
+++ b/src/hletterscriptgen/schemas/letter_set.schema.json
@@ -142,6 +142,7 @@
         "asset_path",
         "checksum_sha256",
         "image",
+        "quality",
         "source"
       ],
       "properties": {
@@ -172,6 +173,20 @@
             }
           }
         },
+        "quality": {
+          "type": "object",
+          "description": "Per-variant quality metrics computed from the binarised glyph crop during generation.",
+          "additionalProperties": false,
+          "required": ["ink_ratio"],
+          "properties": {
+            "ink_ratio": {
+              "type": "number",
+              "minimum": 0.0,
+              "maximum": 1.0,
+              "description": "Fraction of pixels in the bounding box that are foreground (ink) in the binarised image. 0.0 = empty crop, 1.0 = entirely filled. Typical values for legible Hebrew glyphs are 0.10–0.60."
+            }
+          }
+        },
         "source": {
           "type": "object",
           "additionalProperties": false,
diff --git a/tests/test_extractor.py b/tests/test_extractor.py
index b62f384..a97a3d0 100644
--- a/tests/test_extractor.py
+++ b/tests/test_extractor.py
@@ -13,8 +13,12 @@
     MIN_GLYPH_PX,
     ExtractionError,
     Glyph,
+    binarize_scan,
+    compute_dhash,
+    compute_ink_ratio,
     crop_glyph,
     extract_glyphs,
+    hamming_distance,
 )
 
 # ---------------------------------------------------------------------------
@@ -241,3 +245,126 @@ def test_crop_glyph_raises_on_out_of_bounds(tmp_path: Path) -> None:
     out_of_bounds = Glyph(x=40, y=40, width=20, height=20)  # extends past 50x50
     with pytest.raises(ExtractionError, match="outside"):
         crop_glyph(scan, out_of_bounds)
+
+
+# ---------------------------------------------------------------------------
+# compute_ink_ratio
+# ---------------------------------------------------------------------------
+
+
+def _binarize_array(img: np.ndarray) -> np.ndarray:  # type: ignore[type-arg]
+    """Binarise a synthetic BGR image the same way binarize_scan does."""
+    grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
+    _, binary = cv2.threshold(grey, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
+    return binary
+
+
+def test_ink_ratio_fully_filled() -> None:
+    """A glyph bbox that is entirely ink should have ratio 1.0."""
+    img = _white_image(50, 50)
+    _draw_black_rect(img, x=0, y=0, w=50, h=50)  # entire image is black
+    binary = _binarize_array(img)
+    glyph = Glyph(x=0, y=0, width=50, height=50)
+    ratio = compute_ink_ratio(binary, glyph)
+    assert ratio == pytest.approx(1.0, abs=1e-6)
+
+
+def test_ink_ratio_empty_crop() -> None:
+    """A crop with no ink pixels should have ratio 0.0."""
+    img = _white_image(50, 50)  # entirely white — no ink
+    binary = _binarize_array(img)
+    glyph = Glyph(x=0, y=0, width=50, height=50)
+    ratio = compute_ink_ratio(binary, glyph)
+    assert ratio == pytest.approx(0.0, abs=1e-6)
+
+
+def test_ink_ratio_partial_fill() -> None:
+    """Half-filled crop should yield ratio ≈ 0.5."""
+    # 20x10 bbox: fill the top 10x10 half with black
+    img = _white_image(40, 20)
+    _draw_black_rect(img, x=0, y=0, w=10, h=10)
+    binary = _binarize_array(img)
+    glyph = Glyph(x=0, y=0, width=20, height=10)
+    ratio = compute_ink_ratio(binary, glyph)
+    # 10x10 ink in a 20x10 bbox → exactly 0.5
+    assert ratio == pytest.approx(0.5, abs=1e-6)
+
+
+def test_ink_ratio_is_in_unit_interval(tmp_path: Path) -> None:
+    """ink_ratio must always be in [0.0, 1.0] for any binary input."""
+    img = _white_image(100, 100)
+    _draw_black_rect(img, x=10, y=10, w=30, h=30)
+    scan = tmp_path / "scan.png"
+    _save_png(img, scan)
+    binary = binarize_scan(scan)
+    glyph = Glyph(x=10, y=10, width=30, height=30)
+    ratio = compute_ink_ratio(binary, glyph)
+    assert 0.0 <= ratio <= 1.0
+
+
+# ---------------------------------------------------------------------------
+# compute_dhash / hamming_distance
+# ---------------------------------------------------------------------------
+
+
+def test_dhash_returns_integer() -> None:
+    """compute_dhash must return a plain int."""
+    img = _white_image(50, 50)
+    _draw_black_rect(img, x=5, y=5, w=20, h=20)
+    binary = _binarize_array(img)
+    glyph = Glyph(x=5, y=5, width=20, height=20)
+    h = compute_dhash(binary, glyph)
+    assert isinstance(h, int)
+
+
+def test_dhash_identical_glyphs_have_zero_distance() -> None:
+    """The same crop hashed twice should produce identical hashes."""
+    img = _white_image(60, 60)
+    _draw_black_rect(img, x=10, y=10, w=30, h=30)
+    binary = _binarize_array(img)
+    glyph = Glyph(x=10, y=10, width=30, height=30)
+    h1 = compute_dhash(binary, glyph)
+    h2 = compute_dhash(binary, glyph)
+    assert hamming_distance(h1, h2) == 0
+
+
+def test_dhash_different_glyphs_have_positive_distance() -> None:
+    """Two structurally different crops should produce hashes with distance > 0.
+
+    Build the binary array directly (skip Otsu) so that glyph A has ink on
+    its left half and glyph B has ink on its right half.  The resulting
+    horizontal-difference patterns are mirror images → hashes differ.
+    """
+    binary = np.zeros((40, 100), dtype=np.uint8)
+    # Glyph A: left 20 px of a 40x30 bbox are ink, right 20 px are background.
+    binary[5:35, 0:20] = 255
+    # Glyph B: right 20 px of a 40x30 bbox are ink, left 20 px are background.
+    binary[5:35, 80:100] = 255
+    g_a = Glyph(x=0, y=5, width=40, height=30)
+    g_b = Glyph(x=60, y=5, width=40, height=30)
+    ha = compute_dhash(binary, g_a)
+    hb = compute_dhash(binary, g_b)
+    assert hamming_distance(ha, hb) > 0
+
+
+def test_hamming_distance_identical() -> None:
+    assert hamming_distance(0b1010, 0b1010) == 0
+
+
+def test_hamming_distance_all_differ() -> None:
+    """0x00 vs 0xFF for an 8-bit value should give distance 8."""
+    assert hamming_distance(0x00, 0xFF) == 8
+
+
+def test_hamming_distance_single_bit() -> None:
+    assert hamming_distance(0b0001, 0b0000) == 1
+
+
+def test_dhash_default_is_64_bits() -> None:
+    """Default hash_size=8 should produce a value expressible in ≤ 64 bits."""
+    img = _white_image(50, 50)
+    _draw_black_rect(img, x=5, y=5, w=20, h=20)
+    binary = _binarize_array(img)
+    glyph = Glyph(x=5, y=5, width=20, height=20)
+    h = compute_dhash(binary, glyph)
+    assert 0 <= h < (1 << 64)
diff --git a/tests/test_generator.py b/tests/test_generator.py
index d3c0e70..ba2b2da 100644
--- a/tests/test_generator.py
+++ b/tests/test_generator.py
@@ -5,6 +5,7 @@
 import json
 import subprocess
 from pathlib import Path
+from typing import Any
 from unittest.mock import MagicMock, patch
 
 import pytest
@@ -13,7 +14,12 @@
 import numpy as np  # noqa: E402
 
 from hletterscriptgen.generate_profile import load_generate_profile  # noqa: E402
-from hletterscriptgen.generator import GeneratorError, generate  # noqa: E402
+from hletterscriptgen.generator import (  # noqa: E402
+    GeneratorError,
+    GeneratorWarning,
+    _dedup_letter_variants,
+    generate,
+)
 
 # ---------------------------------------------------------------------------
 # Helpers — synthetic upstream checkout + scan
@@ -343,11 +349,12 @@ def test_generate_document_structure_mocked(tmp_path: Path) -> None:
     output_dir = tmp_path / "out"
 
     profile = load_generate_profile(profile_path)
-    fake_binary = MagicMock()
 
-    with patch("hletterscriptgen.generator.binarize_scan", return_value=fake_binary):
+    with patch("hletterscriptgen.generator.binarize_scan", return_value=MagicMock()):
         with patch("hletterscriptgen.generator.crop_binary", return_value=_FAKE_PNG):
-            paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00")
+            with patch("hletterscriptgen.generator.compute_ink_ratio", return_value=0.3):
+                with patch("hletterscriptgen.generator.compute_dhash", return_value=0):
+                    paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00")
 
     assert len(paths) == 1
     doc = json.loads(paths[0].read_text(encoding="utf-8"))
@@ -368,7 +375,9 @@ def test_generate_mocked_validates(tmp_path: Path) -> None:
     profile = load_generate_profile(profile_path)
     with patch("hletterscriptgen.generator.binarize_scan", return_value=MagicMock()):
         with patch("hletterscriptgen.generator.crop_binary", return_value=_FAKE_PNG):
-            paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00")
+            with patch("hletterscriptgen.generator.compute_ink_ratio", return_value=0.3):
+                with patch("hletterscriptgen.generator.compute_dhash", return_value=0):
+                    paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00")
 
     doc = json.loads(paths[0].read_text(encoding="utf-8"))
     result = validate_document(doc)
@@ -384,7 +393,9 @@ def test_generate_mocked_writes_png_assets(tmp_path: Path) -> None:
     profile = load_generate_profile(profile_path)
     with patch("hletterscriptgen.generator.binarize_scan", return_value=MagicMock()):
         with patch("hletterscriptgen.generator.crop_binary", return_value=_FAKE_PNG):
-            paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00")
+            with patch("hletterscriptgen.generator.compute_ink_ratio", return_value=0.3):
+                with patch("hletterscriptgen.generator.compute_dhash", return_value=0):
+                    paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00")
 
     doc = json.loads(paths[0].read_text(encoding="utf-8"))
     writer_dir = paths[0].parent
@@ -426,7 +437,9 @@ def test_generate_mocked_glyph_notes_in_variant(tmp_path: Path) -> None:
     profile = load_generate_profile(p)
     with patch("hletterscriptgen.generator.binarize_scan", return_value=MagicMock()):
         with patch("hletterscriptgen.generator.crop_binary", return_value=_FAKE_PNG):
-            paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00")
+            with patch("hletterscriptgen.generator.compute_ink_ratio", return_value=0.3):
+                with patch("hletterscriptgen.generator.compute_dhash", return_value=0):
+                    paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00")
 
     doc = json.loads(paths[0].read_text(encoding="utf-8"))
     variant = doc["letters"]["א"][0]
@@ -442,7 +455,356 @@ def test_generate_mocked_config_hash_in_document(tmp_path: Path) -> None:
     profile = load_generate_profile(profile_path)
     with patch("hletterscriptgen.generator.binarize_scan", return_value=MagicMock()):
         with patch("hletterscriptgen.generator.crop_binary", return_value=_FAKE_PNG):
-            paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00")
+            with patch("hletterscriptgen.generator.compute_ink_ratio", return_value=0.3):
+                with patch("hletterscriptgen.generator.compute_dhash", return_value=0):
+                    paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00")
 
     doc = json.loads(paths[0].read_text(encoding="utf-8"))
     assert doc["generator"]["config_hash"] == profile.config_hash
+
+
+# ---------------------------------------------------------------------------
+# _dedup_letter_variants — isolated unit tests
+# ---------------------------------------------------------------------------
+
+# Build a minimal variant dict that satisfies the function's expectations.
+# Internal keys (_dhash, _png_bytes) must be present; schema keys are minimal.
+def _v(
+    vid: str,
+    ink: float,
+    dhash: int,
+    *,
+    entry_id: str = "e1",
+    license_id: str = "PDM-1.0",
+) -> dict[str, Any]:
+    return {
+        "_dhash": dhash,
+        "_png_bytes": b"\x89PNG stub",
+        "variant_id": vid,
+        "asset_path": f"glyphs/א/{vid}.png",
+        "checksum_sha256": "0" * 64,
+        "image": {"width_px": 20, "height_px": 20, "format": "png"},
+        "quality": {"ink_ratio": ink},
+        "source": {
+            "scan_entry_id": entry_id,
+            "license": license_id,
+            "bbox_in_source": {"x": 0, "y": 0, "width": 20, "height": 20},
+        },
+        "extracted_at": "2025-01-01T00:00:00+00:00",
+    }
+
+
+def test_dedup_single_variant_returned_unchanged() -> None:
+    """A single variant survives dedup unmodified."""
+    v = _v("v1", 0.3, dhash=0)
+    result = _dedup_letter_variants([v])
+    assert len(result) == 1
+    assert result[0]["variant_id"] == "v1"
+    assert result[0]["_dhash"] == 0  # caller strips; still present here
+
+
+def test_dedup_identical_hashes_keeps_higher_ink_ratio() -> None:
+    """Two variants with Hamming distance 0 collapse to one; higher ink wins."""
+    low = _v("v-low", ink=0.20, dhash=0)
+    high = _v("v-high", ink=0.45, dhash=0)
+    result = _dedup_letter_variants([low, high])
+    assert len(result) == 1
+    assert result[0]["quality"]["ink_ratio"] == pytest.approx(0.45)
+    assert result[0]["variant_id"] == "v-high"
+
+
+def test_dedup_identical_hashes_first_wins_when_ink_equal_or_lower() -> None:
+    """When the representative already has better ink, it is not replaced."""
+    first = _v("v-first", ink=0.50, dhash=7)
+    second = _v("v-second", ink=0.30, dhash=7)
+    result = _dedup_letter_variants([first, second])
+    assert len(result) == 1
+    assert result[0]["variant_id"] == "v-first"
+
+
+def test_dedup_distinct_hashes_both_survive() -> None:
+    """Two variants with Hamming distance > threshold both survive."""
+    a = _v("v-a", ink=0.3, dhash=0x0000_0000_0000_0000)
+    b = _v("v-b", ink=0.3, dhash=0xFFFF_FFFF_FFFF_FFFF)
+    result = _dedup_letter_variants([a, b])
+    assert len(result) == 2
+
+
+def test_dedup_threshold_boundary_at_exactly_threshold() -> None:
+    """Hamming distance exactly equal to the threshold is still a near-dupe."""
+    from hletterscriptgen.generator import _DEDUP_HAMMING_THRESHOLD
+
+    # Build a hash that differs by exactly _DEDUP_HAMMING_THRESHOLD bits from 0.
+    border_hash = (1 << _DEDUP_HAMMING_THRESHOLD) - 1  # lowest N bits set
+    assert bin(border_hash).count("1") == _DEDUP_HAMMING_THRESHOLD
+
+    a = _v("v-a", ink=0.3, dhash=0)
+    b = _v("v-b", ink=0.4, dhash=border_hash)
+    result = _dedup_letter_variants([a, b])
+    assert len(result) == 1
+
+
+def test_dedup_threshold_boundary_one_over_survives() -> None:
+    """Hamming distance one above threshold is distinct; both variants survive."""
+    from hletterscriptgen.generator import _DEDUP_HAMMING_THRESHOLD
+
+    over_hash = (1 << (_DEDUP_HAMMING_THRESHOLD + 1)) - 1  # N+1 bits set
+    assert bin(over_hash).count("1") == _DEDUP_HAMMING_THRESHOLD + 1
+
+    a = _v("v-a", ink=0.3, dhash=0)
+    b = _v("v-b", ink=0.3, dhash=over_hash)
+    result = _dedup_letter_variants([a, b])
+    assert len(result) == 2
+
+
+def test_dedup_cluster_centre_preserved_prevents_drift() -> None:
+    """Cluster centre hash must not drift when the representative is replaced.
+
+    Scenario (hashes chosen so that |A-B| <= threshold but |A-C| > threshold,
+    yet |B-C| <= threshold):
+
+    - A processed first → cluster centre = A_hash.
+    - B processed: near A → B replaces A (higher ink); centre stays A_hash.
+    - C processed: compared against A_hash (not B_hash).
+      |A-C| > threshold → C survives as a new cluster.
+
+    Without the fix, C would be compared against B_hash, |B-C| <= threshold,
+    and C would be incorrectly absorbed.
+    """
+    # A_hash = 0 (all zeros, 64 bits)
+    # B_hash = 0xFF (8 bits set): hamming(A,B) = 8 <= 10 (near-dupe)
+    # C_hash = 0xFFF (12 bits set): hamming(A,C) = 12 > 10 (distinct from A)
+    #          but hamming(B,C) = hamming(0xFF, 0xFFF) = 4 <= 10 (near B)
+    A_hash = 0x00
+    B_hash = 0xFF        # 8 bits set; hamming(A,B)=8 <= 10
+    C_hash = 0xFFF       # 12 bits set; hamming(A,C)=12 > 10; hamming(B,C)=4 <= 10
+
+    a = _v("v-a", ink=0.20, dhash=A_hash)
+    b = _v("v-b", ink=0.40, dhash=B_hash)  # better ink → replaces A in cluster
+    c = _v("v-c", ink=0.30, dhash=C_hash)  # must survive as its own cluster
+
+    result = _dedup_letter_variants([a, b, c])
+    assert len(result) == 2, (
+        "C should survive as a distinct cluster (|A-C|=12 > threshold), "
+        "but the cluster centre drifted to B_hash and absorbed C"
+    )
+    ids = {r["variant_id"] for r in result}
+    assert "v-b" in ids  # winner of first cluster
+    assert "v-c" in ids  # distinct second cluster
+
+
+def test_dedup_internal_keys_not_stripped() -> None:
+    """_dedup_letter_variants must NOT strip _dhash or _png_bytes — that is the caller's job."""
+    v = _v("v1", 0.3, dhash=42)
+    result = _dedup_letter_variants([v])
+    assert "_dhash" in result[0]
+    assert "_png_bytes" in result[0]
+
+
+# ---------------------------------------------------------------------------
+# M4: quality metrics
+# ---------------------------------------------------------------------------
+
+
+def test_generate_variant_has_quality_ink_ratio(tmp_path: Path) -> None:
+    """Every generated variant must carry a quality.ink_ratio in [0, 1]."""
+    upstream = _make_upstream_checkout(tmp_path)
+    profile_path = _make_profile(tmp_path, upstream)
+    output_dir = tmp_path / "out"
+
+    profile = load_generate_profile(profile_path)
+    paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00")
+
+    doc = json.loads(paths[0].read_text(encoding="utf-8"))
+    for variants in doc["letters"].values():
+        for variant in variants:
+            assert "quality" in variant, f"variant {variant['variant_id']!r} missing 'quality'"
+            ink = variant["quality"]["ink_ratio"]
+            assert isinstance(ink, float), f"ink_ratio must be float, got {type(ink)}"
+            assert 0.0 <= ink <= 1.0, f"ink_ratio out of range: {ink}"
+
+
+def test_generate_variant_ink_ratio_nonzero_for_solid_blobs(tmp_path: Path) -> None:
+    """Solid black blobs should produce an ink_ratio > 0."""
+    upstream = _make_upstream_checkout(tmp_path)
+    profile_path = _make_profile(tmp_path, upstream)
+    output_dir = tmp_path / "out"
+
+    profile = load_generate_profile(profile_path)
+    paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00")
+
+    doc = json.loads(paths[0].read_text(encoding="utf-8"))
+    for variants in doc["letters"].values():
+        for variant in variants:
+            assert variant["quality"]["ink_ratio"] > 0.0
+
+
+def test_generate_mocked_variant_has_quality(tmp_path: Path) -> None:
+    """Mock path: quality.ink_ratio must still be present and valid."""
+    upstream = _make_upstream_checkout_no_cv2(tmp_path)
+    profile_path = _make_profile(tmp_path, upstream)
+    output_dir = tmp_path / "out"
+
+    profile = load_generate_profile(profile_path)
+    with patch("hletterscriptgen.generator.binarize_scan", return_value=MagicMock()):
+        with patch("hletterscriptgen.generator.crop_binary", return_value=_FAKE_PNG):
+            with patch("hletterscriptgen.generator.compute_ink_ratio", return_value=0.25):
+                with patch("hletterscriptgen.generator.compute_dhash", return_value=0):
+                    paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00")
+
+    doc = json.loads(paths[0].read_text(encoding="utf-8"))
+    for variants in doc["letters"].values():
+        for variant in variants:
+            assert variant["quality"]["ink_ratio"] == pytest.approx(0.25)
+
+
+# ---------------------------------------------------------------------------
+# M4: near-duplicate deduplication
+# ---------------------------------------------------------------------------
+
+
+def test_generate_dedup_removes_near_duplicate(tmp_path: Path) -> None:
+    """Two glyphs with identical dHash (Hamming 0) should collapse to one."""
+    upstream = _make_upstream_checkout_no_cv2(tmp_path)
+
+    # Two glyphs annotated on the same entry, same letter — will get identical
+    # hashes because compute_dhash is patched to return the same value.
+    profile_data = {
+        "upstream_checkout": str(upstream),
+        "writers": [
+            {
+                "writer_id": "writer_test_a",
+                "attribution_method": "manual_review",
+                "scans": [
+                    {
+                        "entry_id": "test__writer_a__p0001",
+                        "glyphs": [
+                            {"letter": "א", "x": 5, "y": 5, "width": 20, "height": 20},
+                            {"letter": "א", "x": 5, "y": 5, "width": 20, "height": 20},
+                        ],
+                    }
+                ],
+            }
+        ],
+    }
+    p = tmp_path / "profile.json"
+    p.write_text(json.dumps(profile_data), encoding="utf-8")
+
+    output_dir = tmp_path / "out"
+    profile = load_generate_profile(p)
+
+    ts = "2025-01-01T00:00:00+00:00"
+    with pytest.warns(GeneratorWarning, match="near-duplicate"):
+        with patch("hletterscriptgen.generator.binarize_scan", return_value=MagicMock()):
+            with patch("hletterscriptgen.generator.crop_binary", return_value=_FAKE_PNG):
+                with patch("hletterscriptgen.generator.compute_ink_ratio", return_value=0.3):
+                    with patch("hletterscriptgen.generator.compute_dhash", return_value=42):
+                        paths = generate(profile, output_dir, generated_at=ts)
+
+    doc = json.loads(paths[0].read_text(encoding="utf-8"))
+    # Both annotations had the same hash → dedup leaves exactly one variant
+    assert len(doc["letters"]["א"]) == 1
+
+
+def test_generate_dedup_keeps_higher_ink_ratio(tmp_path: Path) -> None:
+    """When two near-dupes differ in ink_ratio, the higher one survives."""
+    upstream = _make_upstream_checkout_no_cv2(tmp_path)
+
+    profile_data = {
+        "upstream_checkout": str(upstream),
+        "writers": [
+            {
+                "writer_id": "writer_test_a",
+                "attribution_method": "manual_review",
+                "scans": [
+                    {
+                        "entry_id": "test__writer_a__p0001",
+                        "glyphs": [
+                            {"letter": "א", "x": 5, "y": 5, "width": 20, "height": 20},
+                            {"letter": "א", "x": 5, "y": 5, "width": 20, "height": 20},
+                        ],
+                    }
+                ],
+            }
+        ],
+    }
+    p = tmp_path / "profile.json"
+    p.write_text(json.dumps(profile_data), encoding="utf-8")
+
+    output_dir = tmp_path / "out"
+    profile = load_generate_profile(p)
+
+    # First call returns 0.20, second returns 0.45 (better)
+    ink_side_effect = [0.20, 0.45]
+
+    ts = "2025-01-01T00:00:00+00:00"
+    with pytest.warns(GeneratorWarning, match="near-duplicate"):
+        with patch("hletterscriptgen.generator.binarize_scan", return_value=MagicMock()):
+            with patch("hletterscriptgen.generator.crop_binary", return_value=_FAKE_PNG):
+                with patch("hletterscriptgen.generator.compute_ink_ratio", side_effect=ink_side_effect):  # noqa: E501
+                    with patch("hletterscriptgen.generator.compute_dhash", return_value=0):
+                        paths = generate(profile, output_dir, generated_at=ts)
+
+    doc = json.loads(paths[0].read_text(encoding="utf-8"))
+    assert len(doc["letters"]["א"]) == 1
+    assert doc["letters"]["א"][0]["quality"]["ink_ratio"] == pytest.approx(0.45)
+
+
+def test_generate_dedup_keeps_distinct_glyphs(tmp_path: Path) -> None:
+    """Two glyphs with Hamming distance > threshold must both survive dedup."""
+    upstream = _make_upstream_checkout_no_cv2(tmp_path)
+    profile_data = {
+        "upstream_checkout": str(upstream),
+        "writers": [
+            {
+                "writer_id": "writer_test_a",
+                "attribution_method": "manual_review",
+                "scans": [
+                    {
+                        "entry_id": "test__writer_a__p0001",
+                        "glyphs": [
+                            {"letter": "א", "x": 5, "y": 5, "width": 20, "height": 20},
+                            {"letter": "א", "x": 35, "y": 5, "width": 20, "height": 20},
+                        ],
+                    }
+                ],
+            }
+        ],
+    }
+    p = tmp_path / "profile.json"
+    p.write_text(json.dumps(profile_data), encoding="utf-8")
+
+    output_dir = tmp_path / "out"
+    profile = load_generate_profile(p)
+
+    # Hashes differ by >> 10 bits → both survive
+    with patch("hletterscriptgen.generator.binarize_scan", return_value=MagicMock()):
+        with patch("hletterscriptgen.generator.crop_binary", return_value=_FAKE_PNG):
+            with patch("hletterscriptgen.generator.compute_ink_ratio", return_value=0.3):
+                with patch(
+                    "hletterscriptgen.generator.compute_dhash",
+                    side_effect=[0x0000_0000_0000_0000, 0xFFFF_FFFF_FFFF_FFFF],
+                ):
+                    paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00")
+
+    doc = json.loads(paths[0].read_text(encoding="utf-8"))
+    assert len(doc["letters"]["א"]) == 2
+
+
+def test_generate_variant_no_dhash_in_output(tmp_path: Path) -> None:
+    """The internal _dhash key must not leak into the written letter_set.json."""
+    upstream = _make_upstream_checkout_no_cv2(tmp_path)
+    profile_path = _make_profile(tmp_path, upstream)
+    output_dir = tmp_path / "out"
+
+    profile = load_generate_profile(profile_path)
+    with patch("hletterscriptgen.generator.binarize_scan", return_value=MagicMock()):
+        with patch("hletterscriptgen.generator.crop_binary", return_value=_FAKE_PNG):
+            with patch("hletterscriptgen.generator.compute_ink_ratio", return_value=0.3):
+                with patch("hletterscriptgen.generator.compute_dhash", return_value=0):
+                    paths = generate(profile, output_dir, generated_at="2025-01-01T00:00:00+00:00")
+
+    doc = json.loads(paths[0].read_text(encoding="utf-8"))
+    for variants in doc["letters"].values():
+        for variant in variants:
+            assert "_dhash" not in variant