feat(m4): variant quality metrics and near-duplicate deduplication#23

Merged: shaypal5 merged 3 commits into main from feat/m4-quality-dedupe on May 25, 2026

@shaypal5

Contributor

Summary

Implements M4: per-variant ink-ratio quality metric and greedy dHash near-duplicate deduplication.

Changes

extractor.py — three new public functions:

  • compute_ink_ratio(binary, glyph) → float — fraction of ink pixels in the bbox; range [0.0, 1.0]
  • compute_dhash(binary, glyph, *, hash_size=8) → int — 64-bit difference hash (perceptual hash based on horizontal pixel gradients)
  • hamming_distance(a, b) → int — bit-wise Hamming distance between two integer hashes
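The three signatures above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in extractor.py: the `Glyph` dataclass is a hypothetical stand-in for the real glyph type, ink pixels are assumed to be non-zero, and a nearest-neighbour index sample stands in for whatever resize call the real `compute_dhash` uses.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class Glyph:
    """Hypothetical bounding box; the real glyph type is not shown in this PR."""
    x: int
    y: int
    w: int
    h: int


def _crop(binary: np.ndarray, glyph: Glyph) -> np.ndarray:
    # Slice the glyph's bounding box out of the binarised page.
    return binary[glyph.y : glyph.y + glyph.h, glyph.x : glyph.x + glyph.w]


def compute_ink_ratio(binary: np.ndarray, glyph: Glyph) -> float:
    """Fraction of ink pixels in the bbox (assumption: ink is non-zero)."""
    crop = _crop(binary, glyph)
    if crop.size == 0:
        return 0.0
    return float(np.count_nonzero(crop)) / float(crop.size)


def compute_dhash(binary: np.ndarray, glyph: Glyph, *, hash_size: int = 8) -> int:
    """Difference hash: one bit per horizontal neighbour comparison."""
    if hash_size < 1:
        raise ValueError("hash_size must be >= 1")
    crop = _crop(binary, glyph)
    if crop.size == 0:
        return 0
    # Nearest-neighbour sample down to hash_size rows by hash_size + 1 columns
    # (a stand-in for a proper image resize).
    rows = np.arange(hash_size) * crop.shape[0] // hash_size
    cols = np.arange(hash_size + 1) * crop.shape[1] // (hash_size + 1)
    small = crop[np.ix_(rows, cols)].astype(np.int16)
    # A bit is set where brightness increases from left to right.
    bits = small[:, 1:] > small[:, :-1]
    return int.from_bytes(np.packbits(bits.ravel()).tobytes(), "big")


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two non-negative hashes differ."""
    return bin(a ^ b).count("1")
```

With the default `hash_size=8` the comparison grid is 8 by 8, which is where the 64-bit width comes from.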

letter_set.schema.json: quality is now a required field on every variant. For example:

"quality": { "ink_ratio": 0.31 }
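To make the new requirement concrete, here is a small hand-rolled check of a variant object. It is only a sketch: the function name `check_variant_quality` is hypothetical, and the [0.0, 1.0] bound is taken from the `compute_ink_ratio` description rather than from the schema file, which may or may not enforce it.

```python
from typing import Any


def check_variant_quality(variant: dict[str, Any]) -> list[str]:
    """Return a list of problems with the variant's quality block (empty if OK)."""
    quality = variant.get("quality")
    if not isinstance(quality, dict):
        return ["missing required object 'quality'"]

    ink_ratio = quality.get("ink_ratio")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(ink_ratio, bool) or not isinstance(ink_ratio, (int, float)):
        return ["'quality.ink_ratio' must be a number"]
    if not 0.0 <= ink_ratio <= 1.0:
        return ["'quality.ink_ratio' must lie in [0.0, 1.0]"]
    return []
```

A variant carrying `"quality": {"ink_ratio": 0.31}` passes, while one without a `quality` key is reported.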

generator.py — two pipeline additions per writer:

  1. Compute ink_ratio and _dhash for each glyph crop after binarisation.
  2. Greedy dedup pass (_dedup_letter_variants) per letter: collapses near-duplicates (Hamming ≤ 10) keeping the variant with the highest ink_ratio. The internal _dhash key is stripped before the document is written.
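A minimal sketch of the greedy pass, assuming each variant is a dict carrying `_dhash` and `quality.ink_ratio`. The public function name and the choice to track the cluster centre as a separate tuple element are illustrative; the real `_dedup_letter_variants` may store the centre hash differently. The centre is held fixed when the representative changes, in line with the cluster-drift fix described in the second commit of this PR.

```python
from typing import Any

Variant = dict[str, Any]


def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def dedup_letter_variants(
    variants: list[Variant], *, threshold: int = 10
) -> list[Variant]:
    """Collapse near-duplicates for one letter, keeping the inkiest per cluster."""
    # Each cluster is (centre_hash, best_variant). The centre is the hash of the
    # first variant seen and never changes, so clusters cannot drift.
    clusters: list[tuple[int, Variant]] = []
    for variant in variants:
        for i, (centre, best) in enumerate(clusters):
            if hamming_distance(centre, variant["_dhash"]) <= threshold:
                # Same cluster: swap the representative only if ink_ratio is higher.
                if variant["quality"]["ink_ratio"] > best["quality"]["ink_ratio"]:
                    clusters[i] = (centre, variant)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append((variant["_dhash"], variant))
    return [best for _, best in clusters]
```

In this sketch, hashes 0x00, 0xFF and 0xFFFF sit 8 bits apart pairwise along the chain but 16 bits apart end to end, so the third variant starts its own cluster instead of being absorbed through the second.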

Tests — 20 new test cases:

  • Extractor: ink_ratio (fully filled, empty, partial, unit-interval), dHash (type, identity, distinctness, bit width), Hamming distance (edge cases)
  • Generator: quality embedding (real cv2 path + mocked path), dedup (remove near-dupe, keep best ink_ratio, keep distinct glyphs, no _dhash leakage)
  • All pre-existing mock tests updated to patch compute_ink_ratio and compute_dhash.

examples/letter_set/writer_example.json — all four fixture variants updated with plausible quality.ink_ratio values.

docs/design/quality-and-dedupe.md — design record covering algorithm spec, decision table, schema changes, and known limitations.

Key design decisions

| Decision | Choice |
| --- | --- |
| Quality metric | ink_ratio (ink fraction from existing binary array; no extra deps) |
| Hash algorithm | dHash (pure OpenCV, fast, good for binary glyph images) |
| Hamming threshold | 10 of 64 bits; conventional loose threshold, calibration deferred |
| Tie-breaking | Higher ink_ratio wins |
| dHash in schema | No (generation-time detail only) |

CI

All 167 tests pass, 91.5% coverage, ruff clean, mypy strict clean.

Closes #22

🤖 Generated with Claude Code

shaypal5 and others added 3 commits May 25, 2026 00:31
- extractor: add compute_ink_ratio, compute_dhash, hamming_distance
- schema: quality.ink_ratio is now required on every variant
- generator: embed quality.ink_ratio per variant; greedy dHash dedup pass
  collapses near-duplicates (Hamming ≤ 10) per letter, keeping highest
  ink_ratio; _dhash key stripped before writing letter_set.json
- examples: add quality.ink_ratio to all fixtures in writer_example.json
- tests: 20 new tests covering ink_ratio, dHash, hamming_distance,
  quality embedding, dedup logic (remove dupes, keep best ink_ratio,
  keep distinct glyphs, no _dhash leakage)
- docs: add docs/design/quality-and-dedupe.md design record

Closes #22

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Correctness fixes:
- generator: write PNG files only after dedup, not before; eliminates
  orphaned files for variants dropped during near-duplicate clustering
- generator: preserve cluster-centre _dhash when updating representative
  with a higher-ink_ratio candidate; prevents greedy cluster from drifting
  across successive substitutions (A->B->C chain bug)
- generator: derive used_entry_ids and observed_licenses from post-dedup
  survivors only; entries/licenses contributed solely by deduped-out
  variants no longer appear in the manifest
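The ordering these fixes describe (dedup first, then derive manifest fields and write files from survivors only) might look like the sketch below. The field names `entry_id`, `license` and `file`, and the helper name `finalise_survivors`, are hypothetical; only `_dhash` and `_png_bytes` come from this PR.

```python
from pathlib import Path
from typing import Any

Variant = dict[str, Any]

# Internal, generation-time keys that must not reach the written document.
INTERNAL_KEYS = ("_dhash", "_png_bytes")


def finalise_survivors(
    survivors: list[Variant], out_dir: Path
) -> tuple[list[Variant], list[str], list[str]]:
    """Write PNGs and build manifest fields from post-dedup survivors only."""
    # Entries and licenses that belonged solely to dropped variants never
    # appear here, because dropped variants are simply not in `survivors`.
    used_entry_ids = sorted({v["entry_id"] for v in survivors})
    observed_licenses = sorted({v["license"] for v in survivors})

    public: list[Variant] = []
    for variant in survivors:
        # Files are written after dedup, so no orphaned PNGs are left behind.
        (out_dir / variant["file"]).write_bytes(variant["_png_bytes"])
        # The caller strips internal keys once the survivor has been written.
        public.append({k: v for k, v in variant.items() if k not in INTERNAL_KEYS})
    return public, used_entry_ids, observed_licenses
```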

Design fixes:
- extractor: remove spurious _require_cv2() from compute_ink_ratio; the
  function is pure NumPy arithmetic and has no OpenCV dependency
- extractor: vectorise compute_dhash with np.packbits instead of Python
  nested loop over individual numpy elements; also add hash_size >= 1
  validation
- generator: update module docstring to describe M3/M4 pipeline steps

Observability:
- generator: emit GeneratorWarning when dedup drops variants, with letter
  and count; callers now have visibility into silent data loss
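One way such a warning could be emitted, using only the standard `warnings` module. The `GeneratorWarning` class here is a local stand-in (its real base class and message format are not shown in this PR), and `warn_if_dropped` is a hypothetical helper name.

```python
import warnings


class GeneratorWarning(UserWarning):
    """Stand-in for the project's warning category (base class is an assumption)."""


def warn_if_dropped(letter: str, before: int, after: int) -> int:
    """Warn when dedup removed variants for a letter; return the number dropped."""
    dropped = before - after
    if dropped > 0:
        warnings.warn(
            f"dedup dropped {dropped} near-duplicate variant(s) for letter {letter!r}",
            GeneratorWarning,
            stacklevel=2,  # attribute the warning to the caller
        )
    return dropped
```

Callers that treat dropped variants as an error could promote the category with `warnings.simplefilter("error", GeneratorWarning)`, and tests can assert on it with `pytest.warns(GeneratorWarning)`.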

Test quality:
- test_generator: remove dead fake_binary MagicMock wiring in
  test_generate_mocked_variant_has_quality (was overridden by patch)
- test_generator: add 8 isolated unit tests for _dedup_letter_variants
  including threshold boundary, cluster-drift regression, single variant,
  and internal-key-not-stripped assertions
- test_generator: two integration dedup tests now use pytest.warns to
  assert the GeneratorWarning is actually emitted
- _dedup_letter_variants: no longer strips _dhash/_png_bytes; caller
  strips both after writing survivors (cleaner separation of concerns)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

compute_dhash now uses numpy (np.packbits) for vectorised hashing.
The mypy CI job installs only .[typecheck] (no cv extra), so numpy was
not available and mypy reported import-not-found. NumPy 1.20+ ships its
own type annotations (py.typed), so adding it to typecheck gives real
type coverage rather than a suppressed import.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@shaypal5 shaypal5 merged commit 562ef14 into main May 25, 2026
7 checks passed
@shaypal5 shaypal5 deleted the feat/m4-quality-dedupe branch May 25, 2026 04:06
Development

Successfully merging this pull request may close these issues.

M4 — Variant quality and dedupe
