diff --git a/AGENTS.md b/AGENTS.md index bb6207d..cf2d1ad 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -26,7 +26,7 @@ hletterscriptgen validate examples/letter_set/writer_example.json --format json scans into per-writer letter-glyph image sets. - `hletterscript` (separate repo) owns the **published letter-set datasets**. Do not commit generated glyph images to this repo. -- `public-domain-hand-written-hebrew-scans` (separate repo) owns +- `hash` (separate repo) owns **upstream scans** and their rights records. - `hocrsyngen`, `hocrgen`, `HeOCR`, `HeOCRsynth` are downstream consumers. Do not import them from `hletterscriptgen` and do not build their @@ -44,7 +44,7 @@ hletterscriptgen validate examples/letter_set/writer_example.json --format json ## Rights-carryover rules - Every variant must carry a `source.scan_entry_id` that resolves against - the upstream `public-domain-hand-written-hebrew-scans` index, plus a + the upstream `hash` index, plus a `source.license` matching the upstream record. The generator never invents, broadens, or relicenses upstream rights. - `license_summary.licenses` must include every distinct license that diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 07b80ef..d1fd624 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -7,7 +7,7 @@ Thanks for considering a contribution to `hletterscriptgen`. This repo holds the **code** that produces per-writer Hebrew letter-glyph image sets. It does **not** host the letter-set images themselves (those live in `HeOCR/hletterscript`), and it does **not** ingest upstream scans -(those live in `HeOCR/public-domain-hand-written-hebrew-scans`). Please +(those live in `HeOCR/hash`). Please keep PRs aligned with that boundary; cross-repo concerns belong upstream or downstream. diff --git a/LICENSE-POLICY.md b/LICENSE-POLICY.md index d12413b..bc9a186 100644 --- a/LICENSE-POLICY.md +++ b/LICENSE-POLICY.md @@ -16,7 +16,7 @@ rules apply to each layer. ## 2. 
Generated letter-set datasets The generator processes scans from -[`HeOCR/public-domain-hand-written-hebrew-scans`](https://github.com/HeOCR/public-domain-hand-written-hebrew-scans). +[`HeOCR/hash`](https://github.com/HeOCR/hash). That upstream repository uses a compound licensing model with rights recorded **per scan**. `hletterscriptgen` follows the same posture: diff --git a/README.md b/README.md index b7dccfb..66ddedb 100644 --- a/README.md +++ b/README.md @@ -1,52 +1,73 @@ # hletterscriptgen -Generator framework for **per-writer Hebrew letter-glyph image sets**, built -on rights-clean upstream scans of handwritten Hebrew documents. - -`hletterscriptgen` is part of the [HeOCR](https://github.com/HeOCR) project. -It consumes scan-level records from -[`HeOCR/public-domain-hand-written-hebrew-scans`](https://github.com/HeOCR/public-domain-hand-written-hebrew-scans) -and produces letter-set datasets that land in -[`HeOCR/hletterscript`](https://github.com/HeOCR/hletterscript). Downstream, -[`HeOCR/hocrsyngen`](https://github.com/HeOCR/hocrsyngen) composes those -glyphs into synthetic Hebrew handwritten pages, which -[`HeOCR/hocrgen`](https://github.com/HeOCR/hocrgen) folds into -[`HeOCR/HeOCR`](https://github.com/HeOCR/HeOCR) and -[`HeOCR/HeOCRsynth`](https://github.com/HeOCR/HeOCRsynth). - -## Repository scope - -This repository contains the **code, schemas, and contracts** that produce -letter sets. It does **not** host the letter-set image data; that lives in -`HeOCR/hletterscript`. +[](https://github.com/HeOCR/hletterscriptgen/actions/workflows/ci.yml) + + + -What lives here: +Created by [Shay Palachy Affek](http://www.shaypalachy.com/). + +Generator framework for per-writer Hebrew handwritten letter-glyph image +sets. It turns rights-clean HASH scan records plus human-reviewed glyph +annotations into deterministic `letter_set.v1` outputs for +[`HeOCR/hletterscript`](https://github.com/HeOCR/hletterscript). -- A Python package and CLI (`hletterscriptgen`). 
-- The `letter_set.v1` JSON Schema and a fixture example - (`examples/letter_set/writer_example.json`). -- Validation tooling (`hletterscriptgen validate`). -- CI that enforces schema and tooling invariants. -- Licensing policy and rights-carryover rules - ([`LICENSE-POLICY.md`](LICENSE-POLICY.md)). + -What does **not** live here: +## At a Glance -- Actual extracted glyph images (→ `HeOCR/hletterscript`). -- Page-scan ingestion or rights curation (→ - `HeOCR/public-domain-hand-written-hebrew-scans`). -- Document composition (→ `HeOCR/hocrsyngen`). -- Dataset orchestration, governance, release assembly, or publication (→ - `HeOCR/hocrgen` / `HeOCR/HeOCR` / `HeOCR/HeOCRsynth`). +| Field | Value | +| --- | --- | +| Role | Generate and validate per-writer Hebrew letter-set outputs | +| Input | HASH scan metadata, scan image files, and generation profiles | +| Output | `letter_set.v1` JSON plus cropped glyph PNG assets | +| Public contract | [`docs/letter_set_v1.md`](docs/letter_set_v1.md) | +| Example fixture | [`examples/letter_set/writer_example.json`](examples/letter_set/writer_example.json) | +| Main CLI | `hletterscriptgen` | +| Code license | MIT | +| Generated glyph rights | Per-variant rights inherited from upstream scans | -## Position in the HeOCR system +## What This Repository Owns -`hletterscriptgen` reads rights-clean scans from -`public-domain-hand-written-hebrew-scans`, produces per-writer letter -sets that land in `hletterscript`, and ultimately feeds `hocrsyngen` / -`hocrgen` / `HeOCR` / `HeOCRsynth`. See -[`docs/repository_scope.md`](docs/repository_scope.md) for the full -diagram and per-repo responsibilities. +This repository contains the Python package, CLI, schema, and validation +contracts for creating writer-level Hebrew letter sets. It does not host +the published glyph dataset itself; generated and curated data belongs in +[`HeOCR/hletterscript`](https://github.com/HeOCR/hletterscript). 
+ +What lives here: + +- `hletterscriptgen`, the Python package and command-line interface. +- The `letter_set.v1` JSON Schema and fixture example. +- Generation-profile parsing, upstream eligibility checks, glyph + extraction helpers, checksum calculation, and output validation. +- CI, release workflow, and rights-carryover policy. + +What lives elsewhere: + +- Page-scan ingestion and rights curation: + [`HeOCR/hash`](https://github.com/HeOCR/hash). +- Published letter-glyph image sets: + [`HeOCR/hletterscript`](https://github.com/HeOCR/hletterscript). +- Synthetic document composition: + [`HeOCR/hocrsyngen`](https://github.com/HeOCR/hocrsyngen). +- Dataset orchestration and release assembly: + [`HeOCR/hocrgen`](https://github.com/HeOCR/hocrgen), + [`HeOCR/HeOCR`](https://github.com/HeOCR/HeOCR), and + [`HeOCR/HeOCRsynth`](https://github.com/HeOCR/HeOCRsynth). + +## Pipeline Position + +```mermaid +flowchart LR + HASH["HeOCR/hash<br/>rights-clean scans"] --> PROFILE["generation profile<br/>writer + glyph bboxes"] + PROFILE --> GEN["hletterscriptgen<br/>crop, hash, dedupe, validate"] + GEN --> DATA["HeOCR/hletterscript<br/>letter_set.v1 + glyph PNGs"] + DATA --> SYN["HeOCR/hocrsyngen<br/>synthetic pages"] + SYN --> OCR["HeOCR / HeOCRsynth<br/>OCR and HTR datasets"] +``` + +See [`docs/repository_scope.md`](docs/repository_scope.md) for the full +ecosystem boundary map and per-repository responsibilities. ## Install @@ -54,7 +75,13 @@ diagram and per-repo responsibilities. python -m pip install -e ".[test]" ``` -Requires Python 3.11+. +For development: + +```bash +python -m pip install -e ".[dev]" +``` + +Requires Python 3.11 or newer. ## CLI @@ -62,34 +89,53 @@ Requires Python 3.11+. 
hletterscriptgen version hletterscriptgen schema --format json hletterscriptgen validate examples/letter_set/writer_example.json -hletterscriptgen validate examples/letter_set/writer_example.json --format json +hletterscriptgen check-eligible path/to/hash/data/index/entries.jsonl +hletterscriptgen scan-blobs path/to/scan.png --format json +hletterscriptgen generate --profile generate_profile.json --output ./out ``` -The `generate` subcommand is reserved for the upcoming extraction pipeline -and is not yet implemented; see [`docs/roadmap.md`](docs/roadmap.md). +The `generate` command expects a human-curated generation profile that +names writer IDs, upstream scan entries, and glyph bounding boxes. The +output is one directory per writer, each containing `letter_set.json` and +the surviving cropped glyph assets. + +## The `letter_set.v1` Contract -## The `letter_set.v1` contract +The bundled schema describes one writer's letter-glyph collection: -The bundled JSON Schema describes a per-writer letter set: +- `writer_id` identifies the writer set. +- `writer_provenance` records how the writer attribution was established. +- `upstream` pins the exact HASH revision used by the generation run. +- `letters` maps Hebrew letters and final forms to one or more glyph + variants. +- Each variant carries an asset path, checksum, image metadata, source + scan ID, bounding box, license, and rights evidence. +- `license_summary` summarizes the distinct variant-level licenses but + does not replace per-variant rights metadata. -- One document per **writer**. -- `letters` maps each Hebrew letter (base or final form, `U+05D0`–`U+05EA`) - to one or more **variants** extracted from upstream scans by that writer. -- Each variant carries an `asset_path`, a SHA-256 checksum, image metadata, - and **per-variant source rights**, so license evidence flows through from - upstream into any downstream composition. +Read the full contract in [`docs/letter_set_v1.md`](docs/letter_set_v1.md). 
-See [`docs/letter_set_v1.md`](docs/letter_set_v1.md) for the full -explanation and field-by-field notes. +## Validate Locally + +```bash +python -m ruff check . +python -m mypy +python -m pytest +hletterscriptgen validate examples/letter_set/writer_example.json +``` ## Licensing -- Code in this repository: MIT (see [`LICENSE`](LICENSE)). -- Generated letter sets: carry per-variant upstream rights — see - [`LICENSE-POLICY.md`](LICENSE-POLICY.md). The generator does not - relicense glyphs. +- Code in this repository is MIT licensed. See [`LICENSE`](LICENSE). +- Generated letter sets carry per-variant upstream rights. See + [`LICENSE-POLICY.md`](LICENSE-POLICY.md); the generator records rights + evidence but does not relicense glyphs. ## Contributing -See [`CONTRIBUTING.md`](CONTRIBUTING.md). For agent collaborators, see -[`AGENTS.md`](AGENTS.md). +See [`CONTRIBUTING.md`](CONTRIBUTING.md). Agent collaborators should also +read [`AGENTS.md`](AGENTS.md). + +## Credits + +Created by [Shay Palachy Affek ](http://www.shaypalachy.com/) [[GitHub](https://github.com/shaypal5)] diff --git a/SECURITY.md b/SECURITY.md index 6530e3b..07ae56d 100644 --- a/SECURITY.md +++ b/SECURITY.md @@ -33,7 +33,7 @@ please report it. Two paths, in preference order: privacy. For takedown of an upstream scan, report directly in -[`HeOCR/public-domain-hand-written-hebrew-scans`](https://github.com/HeOCR/public-domain-hand-written-hebrew-scans); +[`HeOCR/hash`](https://github.com/HeOCR/hash); once an upstream scan is removed or relicensed, regenerated letter sets must drop or update the affected variants. @@ -52,7 +52,7 @@ In scope: Out of scope here (report to the relevant upstream / downstream repo): -- Rights records on upstream scans — `HeOCR/public-domain-hand-written-hebrew-scans`. +- Rights records on upstream scans — `HeOCR/hash`. - Published letter-set datasets — `HeOCR/hletterscript`. - Composed synthetic pages — `HeOCR/hocrsyngen`. - Release-level governance — `HeOCR/hocrgen`. 
diff --git a/docs/README.md b/docs/README.md index 58f64e7..fe663ce 100644 --- a/docs/README.md +++ b/docs/README.md @@ -1,14 +1,20 @@ # Documentation index +Created by [Shay Palachy Affek](http://www.shaypalachy.com/). + - [Repository scope](repository_scope.md) — what this repo owns and what it does not, plus the canonical ecosystem diagram. - [Architecture](architecture.md) — code layout and the validation pipeline. - [`letter_set.v1` contract](letter_set_v1.md) — output schema, field-by-field. - [Upstream integration](upstream_integration.md) — how scans from - `public-domain-hand-written-hebrew-scans` feed in. + `hash` feed in. - [Downstream handoff](downstream_handoff.md) — how outputs land in `hletterscript` and onward. - [Roadmap](roadmap.md) — staged milestones beyond the scaffolding. ## Design drafts -- [Letter extraction pipeline (draft)](design/letter_extraction.md) — sketch of the future `generate` pipeline; nothing implemented yet. +- [Letter extraction pipeline](design/letter_extraction.md) — design notes for the `generate` pipeline and extraction workflow. + +## Credits + +Created by [Shay Palachy Affek ](http://www.shaypalachy.com/) [[GitHub](https://github.com/shaypal5)] diff --git a/docs/assets/hletterscriptgen-output-contract.png b/docs/assets/hletterscriptgen-output-contract.png new file mode 100644 index 0000000..4c52612 Binary files /dev/null and b/docs/assets/hletterscriptgen-output-contract.png differ diff --git a/docs/design/letter_extraction.md b/docs/design/letter_extraction.md index 90db963..44687eb 100644 --- a/docs/design/letter_extraction.md +++ b/docs/design/letter_extraction.md @@ -10,7 +10,7 @@ ## Goal Turn rights-clean handwritten Hebrew page scans (upstream: -`HeOCR/public-domain-hand-written-hebrew-scans`) into per-writer +`HeOCR/hash`) into per-writer `letter_set.v1` documents plus their referenced glyph image assets. 
## Sketch diff --git a/docs/design/segmentation-approach.md b/docs/design/segmentation-approach.md index 3ceb0f0..fd0c104 100644 --- a/docs/design/segmentation-approach.md +++ b/docs/design/segmentation-approach.md @@ -19,16 +19,17 @@ path from Option A to Option B/C. ## Evidence from the upstream corpus -Investigation target: `HeOCR/public-domain-hand-written-hebrew-scans` (GitHub, inspected via -`gh api` — local clone not present at time of spike). +Investigation target: `HeOCR/hash` — HASH (Hebrew Archive of Scanned Handwriting) (GitHub, +inspected via `gh api` at spike time; corpus has grown considerably since). | Finding | Detail | |---------|--------| -| Total entries in `data/index/entries.jsonl` | 60 | -| `transcription.status` distribution | `"none"`: 60 / 60 | -| Non-null `alto_path` | 0 / 60 | -| Non-null `hocr_path` | 0 / 60 | -| Non-null `text_path` | 0 / 60 | +| Total entries in `data/index/entries.jsonl` (spike) | 60 | +| Total entries (as of 2026-05) | 373 (111 sources, 48 unique creators) | +| `transcription.status` distribution | `"none"`: all entries at spike time | +| Non-null `alto_path` | 0 at spike time | +| Non-null `hocr_path` | 0 at spike time | +| Non-null `text_path` | 0 at spike time | | Unique `files[].role` values across all entries | `"original"` only | | Scan directories inspected (`data/scans/`) | `commons__begani_netatikha` (representative sample) — contains only the JPEG scan, no sidecars | @@ -38,9 +39,9 @@ The file-role enum (`original`, `normalized`, `thumbnail`, `transcription`, `met `transcription` role, but zero entries exercise it. 
Source files consulted: -- `HeOCR/public-domain-hand-written-hebrew-scans/schemas/entry.schema.json` -- `HeOCR/public-domain-hand-written-hebrew-scans/data/index/entries.jsonl` -- `HeOCR/public-domain-hand-written-hebrew-scans/data/scans/commons__begani_netatikha/` +- `HeOCR/hash/schemas/entry.schema.json` +- `HeOCR/hash/data/index/entries.jsonl` +- `HeOCR/hash/data/scans/commons__begani_netatikha/` --- diff --git a/docs/design/writer_attribution.md b/docs/design/writer_attribution.md index f64df20..fa391d1 100644 --- a/docs/design/writer_attribution.md +++ b/docs/design/writer_attribution.md @@ -19,7 +19,7 @@ the local upstream checkout and declares one or more writer blocks. ```json { - "upstream_path": "../public-domain-hand-written-hebrew-scans", + "upstream_path": "../hash", "writers": [ { "writer_id": "writer_bialik", diff --git a/docs/letter_set_v1.md b/docs/letter_set_v1.md index 0c82bc4..6a8b240 100644 --- a/docs/letter_set_v1.md +++ b/docs/letter_set_v1.md @@ -24,7 +24,7 @@ is exercised by CI and must remain valid. }, "generated_at": "2026-05-12T00:00:00Z", "upstream": { - "repo": "HeOCR/public-domain-hand-written-hebrew-scans", + "repo": "HeOCR/hash", "revision": "" }, "letters": { @@ -58,7 +58,7 @@ labels only. If in doubt, omit. **Required.** Records how the writer identity was established and which upstream scan entries are attributed to them. `source_repo` is normally -`HeOCR/public-domain-hand-written-hebrew-scans`; `source_entry_ids` are +`HeOCR/hash`; `source_entry_ids` are the upstream `entries.jsonl` ids. `attribution_method` is a short tag (e.g. `collection_metadata`, `manual_review`, `fixture`). @@ -115,7 +115,7 @@ A mapping from a single Hebrew letter character (base or final form, | `asset_path` | POSIX path relative to the letter-set root. No leading `/` (schema-enforced); no `..` segment (cross-field-enforced). | | `checksum_sha256` | Lowercase SHA-256 hex digest of the asset bytes. 
Real letter sets must use real checksums; the example fixture's all-zero/all-one digests are intentional placeholders. | | `image.{width_px,height_px,format}` | Image metadata. `format` ∈ `png`, `webp`, `tiff`. | -| `source.scan_entry_id` | Upstream entry id (resolves in `public-domain-hand-written-hebrew-scans`). Cross-field validator checks it appears in `writer_provenance.source_entry_ids`. | +| `source.scan_entry_id` | Upstream entry id (resolves in `hash`). Cross-field validator checks it appears in `writer_provenance.source_entry_ids`. | | `source.scan_url` | Optional URL pointer to the source scan. RFC 3986 URI; checked when format-checking is enabled. | | `source.license` | One of the accepted SPDX / `LicenseRef-*` identifiers (see `$defs.license_id` in the schema). Extending the allow-list requires a schema change. | | `source.rights_evidence` | Optional free-form note or URL with rights evidence. | diff --git a/docs/repository_scope.md b/docs/repository_scope.md index 978d0e1..a6de1b2 100644 --- a/docs/repository_scope.md +++ b/docs/repository_scope.md @@ -7,7 +7,7 @@ intentionally narrow. ## Position in the HeOCR system (canonical) ``` -public-domain-hand-written-hebrew-scans (full-page scans, PD / CC / CC-BY) +hash (full-page scans, PD / CC / CC-BY) │ ▼ hletterscriptgen (code/framework — this repo) @@ -39,7 +39,7 @@ copy it — only one diagram should ever rot. 
| Concern | Where it lives | | --- | --- | -| Hosting page scans and rights records | `HeOCR/public-domain-hand-written-hebrew-scans` | +| Hosting page scans and rights records | `HeOCR/hash` | | Hosting per-writer letter-glyph datasets | `HeOCR/hletterscript` | | Composing synthetic Hebrew handwritten pages | `HeOCR/hocrsyngen` | | Dataset orchestration, governance, release assembly, publication | `HeOCR/hocrgen` | diff --git a/docs/upstream_integration.md b/docs/upstream_integration.md index b99c802..a371a61 100644 --- a/docs/upstream_integration.md +++ b/docs/upstream_integration.md @@ -1,7 +1,7 @@ # Upstream integration `hletterscriptgen` consumes scans from -[`HeOCR/public-domain-hand-written-hebrew-scans`](https://github.com/HeOCR/public-domain-hand-written-hebrew-scans). +[`HeOCR/hash`](https://github.com/HeOCR/hash) — HASH (Hebrew Archive of Scanned Handwriting). That upstream repo holds the authoritative rights records; this repo defers to them. diff --git a/examples/demo_candidate/demo_writer_0001/letter_set.json b/examples/demo_candidate/demo_writer_0001/letter_set.json new file mode 100644 index 0000000..7b89f01 --- /dev/null +++ b/examples/demo_candidate/demo_writer_0001/letter_set.json @@ -0,0 +1,846 @@ +{ + "schema_version": "letter_set.v1", + "writer_id": "demo_writer_0001", + "writer_label": "Demo Writer (synthetic glyphs — not a real person)", + "writer_provenance": { + "source_repo": "HeOCR/public-domain-hand-written-hebrew-scans", + "source_entry_ids": [ + "demo__manuscript_scan__p0001", + "demo__manuscript_scan__p0002", + "demo__manuscript_scan__p0003" + ], + "attribution_method": "fixture", + "notes": "Synthetic demo generated by scripts/make_demo_candidate.py." 
+ }, + "generator": { + "name": "hletterscriptgen", + "version": "0.1.0.dev0", + "config_hash": "0000000000000000000000000000000000000000000000000000000000000000" + }, + "generated_at": "2026-05-25T00:00:00Z", + "upstream": { + "repo": "HeOCR/public-domain-hand-written-hebrew-scans", + "revision": "0000000000000000000000000000000000000000" + }, + "letters": { + "א": [ + { + "variant_id": "alef-0001", + "asset_path": "letters/alef/alef-0001.png", + "checksum_sha256": "2ef89ff0e905f64d5dd863006e007032e4bdb8f74f0b64bca43138e64da1e5c2", + "image": { + "width_px": 48, + "height_px": 56, + "format": "png" + }, + "quality": { + "ink_ratio": 0.074 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0001", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0001", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 62, + "y": 88, + "width": 48, + "height": 56 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 1, 48×56px)." + }, + { + "variant_id": "alef-0002", + "asset_path": "letters/alef/alef-0002.png", + "checksum_sha256": "b0a5281e5a0bbf5ae2ba7295b3239daa58c1d60c3bf2775e866a064ec97cf1d1", + "image": { + "width_px": 44, + "height_px": 52, + "format": "png" + }, + "quality": { + "ink_ratio": 0.1241 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0002", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0002", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 74, + "y": 96, + "width": 44, + "height": 52 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 2, 44×52px)." 
+ }, + { + "variant_id": "alef-0003", + "asset_path": "letters/alef/alef-0003.png", + "checksum_sha256": "091f65d989cb7ec8ebdf43397a3c5bfa4988ea5cfe6e4123acaef9929bb18ba3", + "image": { + "width_px": 52, + "height_px": 60, + "format": "png" + }, + "quality": { + "ink_ratio": 0.0679 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0003", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0003", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 86, + "y": 104, + "width": 52, + "height": 60 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 3, 52×60px)." + } + ], + "ב": [ + { + "variant_id": "bet-0001", + "asset_path": "letters/bet/bet-0001.png", + "checksum_sha256": "05c6ee099fa57b110666933e7088044649fe296fca01e4802ef3e3e6391404a0", + "image": { + "width_px": 50, + "height_px": 40, + "format": "png" + }, + "quality": { + "ink_ratio": 0.0965 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0001", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0001", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 62, + "y": 88, + "width": 50, + "height": 40 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 1, 50×40px)." 
+ }, + { + "variant_id": "bet-0002", + "asset_path": "letters/bet/bet-0002.png", + "checksum_sha256": "0c0f8760e89ffb96ba835f025ada634bb6df3d986be000ee405d2feb4a4ceccd", + "image": { + "width_px": 46, + "height_px": 44, + "format": "png" + }, + "quality": { + "ink_ratio": 0.1388 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0002", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0002", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 74, + "y": 96, + "width": 46, + "height": 44 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 2, 46×44px)." + } + ], + "ג": [ + { + "variant_id": "gimel-0001", + "asset_path": "letters/gimel/gimel-0001.png", + "checksum_sha256": "9a89d53ce670ac3a4e3ac8eebb97233f03bdc9098daac8a59d9233260179f5bf", + "image": { + "width_px": 44, + "height_px": 50, + "format": "png" + }, + "quality": { + "ink_ratio": 0.0645 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0001", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0001", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 62, + "y": 88, + "width": 44, + "height": 50 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 1, 44×50px)." 
+ }, + { + "variant_id": "gimel-0002", + "asset_path": "letters/gimel/gimel-0002.png", + "checksum_sha256": "c304c1e71a99be9c6651188c4fdb81a3e1513d04b83c5d7a32e58ed247808ea4", + "image": { + "width_px": 48, + "height_px": 48, + "format": "png" + }, + "quality": { + "ink_ratio": 0.0964 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0002", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0002", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 74, + "y": 96, + "width": 48, + "height": 48 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 2, 48×48px)." + } + ], + "ד": [ + { + "variant_id": "dalet-0001", + "asset_path": "letters/dalet/dalet-0001.png", + "checksum_sha256": "a6c825edf23804e605c1dc61ec4fd66558fb240faddf94701ae9ad3bdb8ae269", + "image": { + "width_px": 46, + "height_px": 40, + "format": "png" + }, + "quality": { + "ink_ratio": 0.063 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0001", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0001", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 62, + "y": 88, + "width": 46, + "height": 40 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 1, 46×40px)." 
+ } + ], + "ה": [ + { + "variant_id": "he-0001", + "asset_path": "letters/he/he-0001.png", + "checksum_sha256": "f940e276f17510122bd53c1f85417a002f75cc28081f9ce0170089de916a1ea7", + "image": { + "width_px": 50, + "height_px": 42, + "format": "png" + }, + "quality": { + "ink_ratio": 0.0781 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0001", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0001", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 62, + "y": 88, + "width": 50, + "height": 42 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 1, 50×42px)." + }, + { + "variant_id": "he-0002", + "asset_path": "letters/he/he-0002.png", + "checksum_sha256": "1283d973075e5b71a4d05210e50edbdf56d64d71e684effc034ea12c0e2910ae", + "image": { + "width_px": 46, + "height_px": 46, + "format": "png" + }, + "quality": { + "ink_ratio": 0.1191 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0002", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0002", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 74, + "y": 96, + "width": 46, + "height": 46 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 2, 46×46px)." 
+ } + ], + "ו": [ + { + "variant_id": "vav-0001", + "asset_path": "letters/vav/vav-0001.png", + "checksum_sha256": "abe5e6e52a13c167275d7715fec3e8073202d3ce5f748b97ba218c88527c6ae7", + "image": { + "width_px": 30, + "height_px": 48, + "format": "png" + }, + "quality": { + "ink_ratio": 0.0611 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0001", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0001", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 62, + "y": 88, + "width": 30, + "height": 48 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 1, 30×48px)." + }, + { + "variant_id": "vav-0002", + "asset_path": "letters/vav/vav-0002.png", + "checksum_sha256": "bc70358bbf5b103196eb039f7122fa9ac871c0fac52c37c96332fe545b1ab6bb", + "image": { + "width_px": 28, + "height_px": 52, + "format": "png" + }, + "quality": { + "ink_ratio": 0.0927 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0002", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0002", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 74, + "y": 96, + "width": 28, + "height": 52 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 2, 28×52px)." 
+ } + ], + "ז": [ + { + "variant_id": "zayin-0001", + "asset_path": "letters/zayin/zayin-0001.png", + "checksum_sha256": "2f4cc7416a831f3ffd86eb8fd76566d2e2728c0c7617a62d1f562a9e1fcb8c83", + "image": { + "width_px": 38, + "height_px": 44, + "format": "png" + }, + "quality": { + "ink_ratio": 0.0813 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0001", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0001", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 62, + "y": 88, + "width": 38, + "height": 44 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 1, 38×44px)." + } + ], + "מ": [ + { + "variant_id": "mem-0001", + "asset_path": "letters/mem/mem-0001.png", + "checksum_sha256": "91bb63433fd11ee59095f8aff963cd2ea1f1797cd75aec98b98c4bc8014cbb68", + "image": { + "width_px": 50, + "height_px": 44, + "format": "png" + }, + "quality": { + "ink_ratio": 0.1068 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0001", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0001", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 62, + "y": 88, + "width": 50, + "height": 44 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 1, 50×44px)." 
+ }, + { + "variant_id": "mem-0002", + "asset_path": "letters/mem/mem-0002.png", + "checksum_sha256": "7672722274950c767be5416b6c27d518ff4ba7179028175d3d438f0211b129b2", + "image": { + "width_px": 54, + "height_px": 48, + "format": "png" + }, + "quality": { + "ink_ratio": 0.1501 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0002", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0002", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 74, + "y": 96, + "width": 54, + "height": 48 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 2, 54×48px)." + } + ], + "נ": [ + { + "variant_id": "nun-0001", + "asset_path": "letters/nun/nun-0001.png", + "checksum_sha256": "e2ffc93a2278c25afcd4bec213f5806066e93aace9d5234459a427ae26249e9c", + "image": { + "width_px": 44, + "height_px": 50, + "format": "png" + }, + "quality": { + "ink_ratio": 0.0677 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0001", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0001", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 62, + "y": 88, + "width": 44, + "height": 50 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 1, 44×50px)." 
+ }, + { + "variant_id": "nun-0002", + "asset_path": "letters/nun/nun-0002.png", + "checksum_sha256": "98c3b67939c7e9a1df6cbc43e19ee34f59be4a7dbd3ad10bbed0af5cc21bb172", + "image": { + "width_px": 40, + "height_px": 48, + "format": "png" + }, + "quality": { + "ink_ratio": 0.1141 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0002", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0002", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 74, + "y": 96, + "width": 40, + "height": 48 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 2, 40×48px)." + } + ], + "ס": [ + { + "variant_id": "samekh-0001", + "asset_path": "letters/samekh/samekh-0001.png", + "checksum_sha256": "dc1d4523769d69885f9beed8f076a5f7f8f087a15111ae3ead334c402f56bbdc", + "image": { + "width_px": 46, + "height_px": 46, + "format": "png" + }, + "quality": { + "ink_ratio": 0.1248 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0001", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0001", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 62, + "y": 88, + "width": 46, + "height": 46 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 1, 46×46px)." 
+ } + ], + "ע": [ + { + "variant_id": "ayin-0001", + "asset_path": "letters/ayin/ayin-0001.png", + "checksum_sha256": "0799c0a6530b682bc0088ba09c6e7a9f89f7d1f807ef0c721a6c96e7456c9a31", + "image": { + "width_px": 50, + "height_px": 50, + "format": "png" + }, + "quality": { + "ink_ratio": 0.0748 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0001", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0001", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 62, + "y": 88, + "width": 50, + "height": 50 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 1, 50×50px)." + }, + { + "variant_id": "ayin-0002", + "asset_path": "letters/ayin/ayin-0002.png", + "checksum_sha256": "ff2b19b4236d0adb23fe7531d88a41fd8e84f6ac67683d23fe4e2d263c670fde", + "image": { + "width_px": 46, + "height_px": 48, + "format": "png" + }, + "quality": { + "ink_ratio": 0.1277 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0002", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0002", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 74, + "y": 96, + "width": 46, + "height": 48 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 2, 46×48px)." 
+ } + ], + "פ": [ + { + "variant_id": "pe-0001", + "asset_path": "letters/pe/pe-0001.png", + "checksum_sha256": "f6466d14e85853dcd67dea5fd18959139cb82303462b9434bfd4e28bbbe1d2b2", + "image": { + "width_px": 48, + "height_px": 52, + "format": "png" + }, + "quality": { + "ink_ratio": 0.0869 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0001", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0001", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 62, + "y": 88, + "width": 48, + "height": 52 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 1, 48×52px)." + }, + { + "variant_id": "pe-0002", + "asset_path": "letters/pe/pe-0002.png", + "checksum_sha256": "4f49f0b48ed34f00b79d476c26571f8fff9198a88662745fa570d4a042b0ae9e", + "image": { + "width_px": 44, + "height_px": 48, + "format": "png" + }, + "quality": { + "ink_ratio": 0.1402 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0002", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0002", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 74, + "y": 96, + "width": 44, + "height": 48 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 2, 44×48px)." 
+ } + ], + "ר": [ + { + "variant_id": "resh-0001", + "asset_path": "letters/resh/resh-0001.png", + "checksum_sha256": "d26966328e67060e18b24e257fdf5389909819a22e1bf8b47eb491bd67311ffa", + "image": { + "width_px": 46, + "height_px": 44, + "format": "png" + }, + "quality": { + "ink_ratio": 0.0613 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0001", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0001", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 62, + "y": 88, + "width": 46, + "height": 44 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 1, 46×44px)." + }, + { + "variant_id": "resh-0002", + "asset_path": "letters/resh/resh-0002.png", + "checksum_sha256": "e6a6514e94d3371b16f12e092b10ab3f6896abb1252cf85e430f1962c78d95db", + "image": { + "width_px": 42, + "height_px": 46, + "format": "png" + }, + "quality": { + "ink_ratio": 0.0963 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0002", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0002", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 74, + "y": 96, + "width": 42, + "height": 46 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 2, 42×46px)." 
+ } + ], + "ש": [ + { + "variant_id": "shin-0001", + "asset_path": "letters/shin/shin-0001.png", + "checksum_sha256": "f61026afe0c2642a1ba2532da56056d6b32c67c128f639db283c9a3b67769dad", + "image": { + "width_px": 54, + "height_px": 48, + "format": "png" + }, + "quality": { + "ink_ratio": 0.103 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0001", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0001", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 62, + "y": 88, + "width": 54, + "height": 48 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 1, 54×48px)." + }, + { + "variant_id": "shin-0002", + "asset_path": "letters/shin/shin-0002.png", + "checksum_sha256": "292fef5dd799e31a440bff0fc0e9cfeb4177d9f95c6159f9302b08e8171ccd90", + "image": { + "width_px": 50, + "height_px": 52, + "format": "png" + }, + "quality": { + "ink_ratio": 0.1635 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0002", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0002", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 74, + "y": 96, + "width": 50, + "height": 52 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 2, 50×52px)." 
+ }, + { + "variant_id": "shin-0003", + "asset_path": "letters/shin/shin-0003.png", + "checksum_sha256": "1e4e82355d65d9488717f3ab0043039979e45d216049efb97705fe21683817fc", + "image": { + "width_px": 56, + "height_px": 44, + "format": "png" + }, + "quality": { + "ink_ratio": 0.1035 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0003", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0003", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 86, + "y": 104, + "width": 56, + "height": 44 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 3, 56×44px)." + } + ], + "ת": [ + { + "variant_id": "tav-0001", + "asset_path": "letters/tav/tav-0001.png", + "checksum_sha256": "cdb39a8ca5008b107ead9a5d9267fa4881febb98dc57d4fe158a090d4c7810dd", + "image": { + "width_px": 50, + "height_px": 46, + "format": "png" + }, + "quality": { + "ink_ratio": 0.0835 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0001", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0001", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 62, + "y": 88, + "width": 50, + "height": 46 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 1, 50×46px)." 
+ }, + { + "variant_id": "tav-0002", + "asset_path": "letters/tav/tav-0002.png", + "checksum_sha256": "2385d3de5f7ff01eb13180bb86d4ee9cecd045ec954c75ab18ca647fa0910f7c", + "image": { + "width_px": 48, + "height_px": 50, + "format": "png" + }, + "quality": { + "ink_ratio": 0.1237 + }, + "source": { + "scan_entry_id": "demo__manuscript_scan__p0002", + "scan_url": "https://example.invalid/scans/demo__manuscript_scan__p0002", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": { + "x": 74, + "y": 96, + "width": 48, + "height": 50 + } + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": "Synthetic demo glyph (variant 2, 48×50px)." + } + ] + }, + "license_summary": { + "licenses": [ + "PDM-1.0" + ], + "notes": "All variants are synthetic demo fixtures under PDM-1.0." + } +} \ No newline at end of file diff --git a/examples/demo_candidate/demo_writer_0001/letters/alef/alef-0001.png b/examples/demo_candidate/demo_writer_0001/letters/alef/alef-0001.png new file mode 100644 index 0000000..dbb364e Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/alef/alef-0001.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/alef/alef-0002.png b/examples/demo_candidate/demo_writer_0001/letters/alef/alef-0002.png new file mode 100644 index 0000000..cc3f6b8 Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/alef/alef-0002.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/alef/alef-0003.png b/examples/demo_candidate/demo_writer_0001/letters/alef/alef-0003.png new file mode 100644 index 0000000..7045b8b Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/alef/alef-0003.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/ayin/ayin-0001.png b/examples/demo_candidate/demo_writer_0001/letters/ayin/ayin-0001.png new file mode 100644 index 0000000..caa0bad Binary files /dev/null and 
b/examples/demo_candidate/demo_writer_0001/letters/ayin/ayin-0001.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/ayin/ayin-0002.png b/examples/demo_candidate/demo_writer_0001/letters/ayin/ayin-0002.png new file mode 100644 index 0000000..3eec4d9 Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/ayin/ayin-0002.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/bet/bet-0001.png b/examples/demo_candidate/demo_writer_0001/letters/bet/bet-0001.png new file mode 100644 index 0000000..2b55acf Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/bet/bet-0001.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/bet/bet-0002.png b/examples/demo_candidate/demo_writer_0001/letters/bet/bet-0002.png new file mode 100644 index 0000000..da0e905 Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/bet/bet-0002.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/dalet/dalet-0001.png b/examples/demo_candidate/demo_writer_0001/letters/dalet/dalet-0001.png new file mode 100644 index 0000000..098f725 Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/dalet/dalet-0001.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/gimel/gimel-0001.png b/examples/demo_candidate/demo_writer_0001/letters/gimel/gimel-0001.png new file mode 100644 index 0000000..cf8c9e8 Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/gimel/gimel-0001.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/gimel/gimel-0002.png b/examples/demo_candidate/demo_writer_0001/letters/gimel/gimel-0002.png new file mode 100644 index 0000000..7340a64 Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/gimel/gimel-0002.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/he/he-0001.png 
b/examples/demo_candidate/demo_writer_0001/letters/he/he-0001.png new file mode 100644 index 0000000..a61126c Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/he/he-0001.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/he/he-0002.png b/examples/demo_candidate/demo_writer_0001/letters/he/he-0002.png new file mode 100644 index 0000000..e972583 Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/he/he-0002.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/mem/mem-0001.png b/examples/demo_candidate/demo_writer_0001/letters/mem/mem-0001.png new file mode 100644 index 0000000..9a8e7df Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/mem/mem-0001.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/mem/mem-0002.png b/examples/demo_candidate/demo_writer_0001/letters/mem/mem-0002.png new file mode 100644 index 0000000..6f9621d Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/mem/mem-0002.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/nun/nun-0001.png b/examples/demo_candidate/demo_writer_0001/letters/nun/nun-0001.png new file mode 100644 index 0000000..ce6e38b Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/nun/nun-0001.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/nun/nun-0002.png b/examples/demo_candidate/demo_writer_0001/letters/nun/nun-0002.png new file mode 100644 index 0000000..6e90472 Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/nun/nun-0002.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/pe/pe-0001.png b/examples/demo_candidate/demo_writer_0001/letters/pe/pe-0001.png new file mode 100644 index 0000000..95f0c0e Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/pe/pe-0001.png differ diff --git 
a/examples/demo_candidate/demo_writer_0001/letters/pe/pe-0002.png b/examples/demo_candidate/demo_writer_0001/letters/pe/pe-0002.png new file mode 100644 index 0000000..ce8a2df Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/pe/pe-0002.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/resh/resh-0001.png b/examples/demo_candidate/demo_writer_0001/letters/resh/resh-0001.png new file mode 100644 index 0000000..0491d5c Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/resh/resh-0001.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/resh/resh-0002.png b/examples/demo_candidate/demo_writer_0001/letters/resh/resh-0002.png new file mode 100644 index 0000000..3b57860 Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/resh/resh-0002.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/samekh/samekh-0001.png b/examples/demo_candidate/demo_writer_0001/letters/samekh/samekh-0001.png new file mode 100644 index 0000000..7218826 Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/samekh/samekh-0001.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/shin/shin-0001.png b/examples/demo_candidate/demo_writer_0001/letters/shin/shin-0001.png new file mode 100644 index 0000000..bcc7d1e Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/shin/shin-0001.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/shin/shin-0002.png b/examples/demo_candidate/demo_writer_0001/letters/shin/shin-0002.png new file mode 100644 index 0000000..f648192 Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/shin/shin-0002.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/shin/shin-0003.png b/examples/demo_candidate/demo_writer_0001/letters/shin/shin-0003.png new file mode 100644 index 0000000..3e80489 Binary files 
/dev/null and b/examples/demo_candidate/demo_writer_0001/letters/shin/shin-0003.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/tav/tav-0001.png b/examples/demo_candidate/demo_writer_0001/letters/tav/tav-0001.png new file mode 100644 index 0000000..4b90db0 Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/tav/tav-0001.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/tav/tav-0002.png b/examples/demo_candidate/demo_writer_0001/letters/tav/tav-0002.png new file mode 100644 index 0000000..7b9575c Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/tav/tav-0002.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/vav/vav-0001.png b/examples/demo_candidate/demo_writer_0001/letters/vav/vav-0001.png new file mode 100644 index 0000000..6de868c Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/vav/vav-0001.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/vav/vav-0002.png b/examples/demo_candidate/demo_writer_0001/letters/vav/vav-0002.png new file mode 100644 index 0000000..0db6af5 Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/vav/vav-0002.png differ diff --git a/examples/demo_candidate/demo_writer_0001/letters/zayin/zayin-0001.png b/examples/demo_candidate/demo_writer_0001/letters/zayin/zayin-0001.png new file mode 100644 index 0000000..76e1d55 Binary files /dev/null and b/examples/demo_candidate/demo_writer_0001/letters/zayin/zayin-0001.png differ diff --git a/examples/letter_set/writer_example.json b/examples/letter_set/writer_example.json index 58ca41f..ae5fed2 100644 --- a/examples/letter_set/writer_example.json +++ b/examples/letter_set/writer_example.json @@ -3,7 +3,7 @@ "writer_id": "example-writer-0001", "writer_label": "Example Writer (fixture only — not a real person)", "writer_provenance": { - "source_repo": "HeOCR/public-domain-hand-written-hebrew-scans", + 
"source_repo": "HeOCR/hash", "source_entry_ids": ["example-scan-0001", "example-scan-0002"], "attribution_method": "fixture", "notes": "Fixture document used to validate the letter_set.v1 schema in CI. No real glyph images are referenced." @@ -15,7 +15,7 @@ }, "generated_at": "2026-05-12T00:00:00Z", "upstream": { - "repo": "HeOCR/public-domain-hand-written-hebrew-scans", + "repo": "HeOCR/hash", "revision": "0000000000000000000000000000000000000000" }, "letters": { diff --git a/pyproject.toml b/pyproject.toml index be3f9e3..a9580f9 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -79,6 +79,10 @@ select = [ [tool.ruff.lint.per-file-ignores] "tests/*" = ["B"] +# reviewer.py uses Hebrew characters as intentional dict keys — RUF001 false positives. +"src/hletterscriptgen/reviewer.py" = ["RUF001"] +# make_demo_candidate.py uses Hebrew characters as intentional dict keys and docstring content. +"scripts/make_demo_candidate.py" = ["RUF001", "RUF002"] [tool.mypy] python_version = "3.11" diff --git a/scripts/make_demo_candidate.py b/scripts/make_demo_candidate.py new file mode 100644 index 0000000..8b47669 --- /dev/null +++ b/scripts/make_demo_candidate.py @@ -0,0 +1,511 @@ +#!/usr/bin/env python3 +"""Generate a synthetic demo release candidate for reviewing with the review UI. + +Creates: + + examples/demo_candidate/ + demo_writer_0001/ + letter_set.json — schema-valid letter_set.v1 document + letters/ + / + .png — synthetic glyph images (cv2-rendered) + +Usage:: + + python3 scripts/make_demo_candidate.py [--output DIR] + +Requires the ``cv`` extra (opencv-python-headless):: + + pip install -e ".[cv]" +""" +from __future__ import annotations + +import argparse +import hashlib +import json +import struct +import zlib +from pathlib import Path + +REPO_ROOT = Path(__file__).resolve().parent.parent +DEFAULT_OUT = REPO_ROOT / "examples" / "demo_candidate" + +# Import version from the installed package so the fixture stays accurate. 
+try: + from hletterscriptgen import __version__ as _VERSION +except ImportError: + _VERSION = "unknown" + +# --------------------------------------------------------------------------- +# Minimal stdlib-only PNG encoder +# --------------------------------------------------------------------------- + + +def _png_from_pixels(pixels: list[list[int]]) -> bytes: + """Encode a 2-D list of 0-255 grayscale values as a PNG (stdlib only).""" + height = len(pixels) + width = len(pixels[0]) if height else 0 + + # Each row is prefixed by a filter byte (0 = None). + raw = b"".join(b"\x00" + bytes(row) for row in pixels) + compressed = zlib.compress(raw, 9) + + def chunk(tag: bytes, data: bytes) -> bytes: + crc = zlib.crc32(tag + data) & 0xFFFF_FFFF + return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc) + + # IHDR: width(4) height(4), then 1 byte each: bit depth=8, + # color type=0 (grayscale), compression=0, filter=0, interlace=0 + ihdr_data = struct.pack(">II", width, height) + bytes([8, 0, 0, 0, 0]) + + return ( + b"\x89PNG\r\n\x1a\n" + + chunk(b"IHDR", ihdr_data) + + chunk(b"IDAT", compressed) + + chunk(b"IEND", b"") + ) + + +# --------------------------------------------------------------------------- +# Letter shape primitives +# --------------------------------------------------------------------------- + +WHITE = 255 +BLACK = 0 + + +def _blank(w: int, h: int) -> list[list[int]]: + return [[WHITE] * w for _ in range(h)] + + +def _hline(px: list[list[int]], y: int, x0: int, x1: int, t: int = 2) -> None: + """Draw a horizontal line.""" + h, w = len(px), len(px[0]) + for dy in range(t): + row = y + dy + if 0 <= row < h: + for x in range(max(0, x0), min(w, x1 + 1)): + px[row][x] = BLACK + + +def _vline(px: list[list[int]], x: int, y0: int, y1: int, t: int = 2) -> None: + """Draw a vertical line.""" + h, w = len(px), len(px[0]) + for dx in range(t): + col = x + dx + if 0 <= col < w: + for y in range(max(0, y0), min(h, y1 + 1)): + px[y][col] = BLACK + + +def _diag(px: 
list[list[int]], x0: int, y0: int, x1: int, y1: int, t: int = 2) -> None: + """Draw a straight line via integer Bresenham.""" + dx = abs(x1 - x0) + dy = abs(y1 - y0) + sx = 1 if x1 > x0 else -1 + sy = 1 if y1 > y0 else -1 + h, w = len(px), len(px[0]) + err = dx - dy + x, y = x0, y0 + while True: + for ox in range(t): + for oy in range(t): + nx, ny = x + ox, y + oy + if 0 <= nx < w and 0 <= ny < h: + px[ny][nx] = BLACK + if x == x1 and y == y1: + break + e2 = 2 * err + if e2 > -dy: + err -= dy + x += sx + if e2 < dx: + err += dx + y += sy + + +# --------------------------------------------------------------------------- +# Per-letter shape generators +# Any function here accepts (width, height, variant_index) and returns +# a PNG bytes object. +# --------------------------------------------------------------------------- + +def _shape_alef(w: int, h: int, v: int) -> bytes: + """Alef (א): diagonal cross + horizontal bar.""" + px = _blank(w, h) + t = max(1, 2 + (v % 2)) + # Main diagonal top-right to bottom-left + _diag(px, w - w // 5, h // 6, w // 5, h - h // 5, t) + # Left fork: bottom-left up to centre + _diag(px, w // 5, h - h // 5, w // 3, h // 2, t) + # Right fork: top-right down to centre + _diag(px, w - w // 5, h // 6, 2 * w // 3, h // 2, t) + # Small horizontal bar at centre-left + _hline(px, h // 2, w // 4, w // 2, t) + return _png_from_pixels(px) + + +def _shape_bet(w: int, h: int, v: int) -> bytes: + """Bet (ב): top horizontal + right vertical + bottom bar.""" + px = _blank(w, h) + t = max(1, 2 + (v % 2)) + _hline(px, h // 6, w // 6, w - w // 6, t) # top + _vline(px, w - w // 5, h // 6, h - h // 5, t) # right + _hline(px, h - h // 5, w // 5, w - w // 5, t) # bottom + # Tiny left foot + _vline(px, w // 6, h // 6, h // 3, t) + return _png_from_pixels(px) + + +def _shape_gimel(w: int, h: int, v: int) -> bytes: + """Gimel (ג): right vertical + top horizontal + right-down hook.""" + px = _blank(w, h) + t = max(1, 2 + (v % 2)) + _vline(px, w - w // 4, h // 6, h 
- h // 4, t) # right vertical + _hline(px, h // 6, w // 5, w - w // 4, t) # top horizontal + # Hook at bottom-right going down-left + _diag(px, w - w // 4, h - h // 4, w // 2, h - h // 8, t) + return _png_from_pixels(px) + + +def _shape_dalet(w: int, h: int, v: int) -> bytes: + """Dalet (ד): top horizontal + right vertical (no foot).""" + px = _blank(w, h) + t = max(1, 2 + (v % 2)) + _hline(px, h // 6, w // 6, w - w // 6, t) # top + _vline(px, w - w // 5, h // 6, h - h // 5, t) # right + return _png_from_pixels(px) + + +def _shape_he(w: int, h: int, v: int) -> bytes: + """He (ה): dalet + detached left vertical.""" + px = _blank(w, h) + t = max(1, 2 + (v % 2)) + _hline(px, h // 6, w // 6, w - w // 6, t) + _vline(px, w - w // 5, h // 6, h - h // 5, t) + # Detached left vertical (doesn't touch top) + _vline(px, w // 5, h // 3, h - h // 5, t) + return _png_from_pixels(px) + + +def _shape_vav(w: int, h: int, v: int) -> bytes: + """Vav (ו): short cap + descending vertical.""" + px = _blank(w, h) + t = max(1, 2 + (v % 2)) + # Cap + _hline(px, h // 8, w // 3, 2 * w // 3, t) + _vline(px, w // 2 - t // 2, h // 8, 5 * h // 6, t) + return _png_from_pixels(px) + + +def _shape_zayin(w: int, h: int, v: int) -> bytes: + """Zayin (ז): top bar (long) + short descender from right.""" + px = _blank(w, h) + t = max(1, 2 + (v % 2)) + _hline(px, h // 8, w // 6, 5 * w // 6, t + 1) # wide cap + _vline(px, 3 * w // 4, h // 8, 5 * h // 6, t) # right descender + return _png_from_pixels(px) + + +def _shape_mem(w: int, h: int, v: int) -> bytes: + """Mem (מ): closed square with left opening.""" + px = _blank(w, h) + t = max(1, 2 + (v % 2)) + _hline(px, h // 8, w // 5, w - w // 5, t) # top + _vline(px, w - w // 5, h // 8, h - h // 5, t) # right + _hline(px, h - h // 5, w // 5, w - w // 5, t) # bottom + _vline(px, w // 5, h // 4, h - h // 5, t) # left (partial, open at top) + return _png_from_pixels(px) + + +def _shape_nun(w: int, h: int, v: int) -> bytes: + """Nun (נ): hook with descender.""" + 
px = _blank(w, h) + t = max(1, 2 + (v % 2)) + _hline(px, h // 8, w // 4, 3 * w // 4, t) + _vline(px, 3 * w // 4, h // 8, h // 2, t) + # Descending diagonal from hook + _diag(px, 3 * w // 4, h // 2, w // 4, h - h // 8, t) + return _png_from_pixels(px) + + +def _shape_samekh(w: int, h: int, v: int) -> bytes: + """Samekh (ס): closed rectangle.""" + px = _blank(w, h) + t = max(1, 2 + (v % 2)) + _hline(px, h // 8, w // 6, 5 * w // 6, t) + _hline(px, h - h // 8, w // 6, 5 * w // 6, t) + _vline(px, w // 6, h // 8, h - h // 8, t) + _vline(px, 5 * w // 6 - t, h // 8, h - h // 8, t) + return _png_from_pixels(px) + + +def _shape_ayin(w: int, h: int, v: int) -> bytes: + """Ayin (ע): two diagonals meeting at bottom.""" + px = _blank(w, h) + t = max(1, 2 + (v % 2)) + cx = w // 2 + bot = h - h // 8 + _diag(px, w // 6, h // 8, cx, bot, t) + _diag(px, 5 * w // 6, h // 8, cx, bot, t) + return _png_from_pixels(px) + + +def _shape_pe(w: int, h: int, v: int) -> bytes: + """Pe (פ): circular top + descender.""" + px = _blank(w, h) + t = max(1, 2 + (v % 2)) + _hline(px, h // 8, w // 5, 4 * w // 5, t) + _vline(px, 4 * w // 5, h // 8, h // 2, t) + _hline(px, h // 2, w // 4, 4 * w // 5, t) + _vline(px, w // 4, h // 4, h - h // 8, t) + return _png_from_pixels(px) + + +def _shape_resh(w: int, h: int, v: int) -> bytes: + """Resh (ר): top bar + right vertical descender.""" + px = _blank(w, h) + t = max(1, 2 + (v % 2)) + _hline(px, h // 8, w // 5, 4 * w // 5, t) + _vline(px, 4 * w // 5, h // 8, h - h // 8, t) + return _png_from_pixels(px) + + +def _shape_shin(w: int, h: int, v: int) -> bytes: + """Shin (ש): three prongs.""" + px = _blank(w, h) + t = max(1, 2 + (v % 2)) + bot = h - h // 6 + _vline(px, w // 5, h // 8, bot, t) + _vline(px, w // 2 - t // 2, h // 5, bot, t) + _vline(px, 4 * w // 5, h // 8, bot, t) + _hline(px, bot, w // 5, 4 * w // 5, t) + return _png_from_pixels(px) + + +def _shape_tav(w: int, h: int, v: int) -> bytes: + """Tav (ת): top bar + left & right descenders, right foot.""" + 
px = _blank(w, h) + t = max(1, 2 + (v % 2)) + _hline(px, h // 8, w // 6, 5 * w // 6, t) + _vline(px, w // 6, h // 8, h - h // 5, t) + _vline(px, 5 * w // 6, h // 8, h - h // 3, t) + _hline(px, h - h // 3, 5 * w // 6, 5 * w // 6 + t + 3, t) # right foot + return _png_from_pixels(px) + + +# Mapping: Unicode char → shape function +_SHAPES = { + "א": _shape_alef, + "ב": _shape_bet, + "ג": _shape_gimel, + "ד": _shape_dalet, + "ה": _shape_he, + "ו": _shape_vav, + "ז": _shape_zayin, + "מ": _shape_mem, + "נ": _shape_nun, + "ס": _shape_samekh, + "ע": _shape_ayin, + "פ": _shape_pe, + "ר": _shape_resh, + "ש": _shape_shin, + "ת": _shape_tav, +} + +# (letter_char, letter_name, variants): each variant is a (width, height) tuple. +_DEMO_LETTERS: list[tuple[str, str, list[tuple[int, int]]]] = [ + ("א", "alef", [(48, 56), (44, 52), (52, 60)]), + ("ב", "bet", [(50, 40), (46, 44)]), + ("ג", "gimel", [(44, 50), (48, 48)]), + ("ד", "dalet", [(46, 40)]), + ("ה", "he", [(50, 42), (46, 46)]), + ("ו", "vav", [(30, 48), (28, 52)]), + ("ז", "zayin", [(38, 44)]), + ("מ", "mem", [(50, 44), (54, 48)]), + ("נ", "nun", [(44, 50), (40, 48)]), + ("ס", "samekh", [(46, 46)]), + ("ע", "ayin", [(50, 50), (46, 48)]), + ("פ", "pe", [(48, 52), (44, 48)]), + ("ר", "resh", [(46, 44), (42, 46)]), + ("ש", "shin", [(54, 48), (50, 52), (56, 44)]), + ("ת", "tav", [(50, 46), (48, 50)]), +] + + +# --------------------------------------------------------------------------- +# Ink-ratio computation +# +# extractor.compute_ink_ratio operates on an OpenCV numpy array, so it cannot +# be called here (this script is intentionally stdlib-only). The logic below +# is equivalent for the grayscale PNGs produced by _png_from_pixels(). 
+# --------------------------------------------------------------------------- + + +def _ink_ratio(png_bytes: bytes) -> float: + """Parse a grayscale PNG and compute ink fraction (pixels < 128 / total).""" + data = png_bytes + + def _read_chunk(pos: int) -> tuple[bytes, bytes, int]: + length = struct.unpack_from(">I", data, pos)[0] + tag = data[pos + 4 : pos + 8] + chunk_data = data[pos + 8 : pos + 8 + length] + return tag, chunk_data, pos + 12 + length + + # Parse IHDR + pos = 8 # skip signature + tag, ihdr, pos = _read_chunk(pos) + width, height = struct.unpack_from(">II", ihdr) + + # Collect IDAT chunks + idat_raw = b"" + while pos < len(data): + tag, cdata, pos = _read_chunk(pos) + if tag == b"IDAT": + idat_raw += cdata + elif tag == b"IEND": + break + + raw = zlib.decompress(idat_raw) + # Each row: 1 filter byte + width bytes + ink = 0 + total = width * height + for row in range(height): + row_data = raw[row * (width + 1) + 1 : row * (width + 1) + 1 + width] + ink += sum(1 for b in row_data if b < 128) + return ink / total if total else 0.0 + + +# --------------------------------------------------------------------------- +# Release candidate builder +# --------------------------------------------------------------------------- + + +def _letter_set_name(name: str) -> str: + """Convert letter name to letter_set asset path fragment.""" + return name.replace("_", "-") + + +def build_demo_candidate(out_dir: Path) -> Path: + """Build the demo release candidate tree under *out_dir*. + + Returns the path to the ``letter_set.json`` file. 
+ """ + writer_id = "demo_writer_0001" + writer_dir = out_dir / writer_id + + letters_dict: dict[str, list[dict]] = {} + + for char, name, size_list in _DEMO_LETTERS: + shape_fn = _SHAPES.get(char) + if shape_fn is None: + print(f" skip {char} (no shape defined)") + continue + + variants: list[dict] = [] + for i, (w, h) in enumerate(size_list, start=1): + png_bytes = shape_fn(w, h, i - 1) + ink = _ink_ratio(png_bytes) + sha = hashlib.sha256(png_bytes).hexdigest() + + # Asset path relative to letter_set.json + fname = f"{name}-{i:04d}" + asset_rel = f"letters/{_letter_set_name(name)}/{fname}.png" + img_path = writer_dir / asset_rel + img_path.parent.mkdir(parents=True, exist_ok=True) + img_path.write_bytes(png_bytes) + + scan_entry = f"demo__manuscript_scan__p{i:04d}" + variants.append({ + "variant_id": f"{name}-{i:04d}", + "asset_path": asset_rel, + "checksum_sha256": sha, + "image": {"width_px": w, "height_px": h, "format": "png"}, + "quality": {"ink_ratio": round(ink, 4)}, + "source": { + "scan_entry_id": scan_entry, + "scan_url": f"https://example.invalid/scans/{scan_entry}", + "license": "PDM-1.0", + "rights_evidence": "Demo fixture — synthetic glyph, no real provenance.", + "bbox_in_source": {"x": 50 + i * 12, "y": 80 + i * 8, "width": w, "height": h}, + }, + "extracted_at": "2026-05-25T00:00:00Z", + "notes": f"Synthetic demo glyph (variant {i}, {w}x{h}px).", + }) + + letters_dict[char] = variants + print(f" {char} ({name}): {len(variants)} variant(s)") + + # Collect unique scan entry IDs and licenses + source_entry_ids = sorted({ + v["source"]["scan_entry_id"] + for vlist in letters_dict.values() + for v in vlist + }) + licenses = sorted({ + v["source"]["license"] + for vlist in letters_dict.values() + for v in vlist + }) + + letter_set = { + "schema_version": "letter_set.v1", + "writer_id": writer_id, + "writer_label": "Demo Writer (synthetic glyphs — not a real person)", + "writer_provenance": { + "source_repo": 
"HeOCR/public-domain-hand-written-hebrew-scans", + "source_entry_ids": source_entry_ids, + "attribution_method": "fixture", + "notes": "Synthetic demo generated by scripts/make_demo_candidate.py.", + }, + "generator": { + "name": "hletterscriptgen", + "version": _VERSION, + "config_hash": "0" * 64, + }, + "generated_at": "2026-05-25T00:00:00Z", + "upstream": { + "repo": "HeOCR/public-domain-hand-written-hebrew-scans", + "revision": "0" * 40, + }, + "letters": letters_dict, + "license_summary": { + "licenses": licenses, + "notes": "All variants are synthetic demo fixtures under PDM-1.0.", + }, + } + + ls_path = writer_dir / "letter_set.json" + ls_path.write_text( + json.dumps(letter_set, ensure_ascii=False, indent=2), + encoding="utf-8", + ) + return ls_path + + +def main() -> None: + ap = argparse.ArgumentParser(description="Generate a synthetic demo release candidate.") + ap.add_argument( + "--output", + type=Path, + default=DEFAULT_OUT, + metavar="DIR", + help=f"Output directory (default: {DEFAULT_OUT.relative_to(REPO_ROOT)})", + ) + args = ap.parse_args() + + out_dir = args.output + print(f"Generating demo release candidate in: {out_dir}") + ls_path = build_demo_candidate(out_dir) + total = sum( + len(v) for v in json.loads(ls_path.read_text())["letters"].values() + ) + print(f"\nWrote {ls_path}") + print(f"Total: {total} synthetic variants across {len(_DEMO_LETTERS)} letters") + print() + print("To review:") + print(f" hletterscriptgen review {ls_path}") + + +if __name__ == "__main__": + main() diff --git a/src/hletterscriptgen/cli.py b/src/hletterscriptgen/cli.py index 6214da1..4fe8550 100644 --- a/src/hletterscriptgen/cli.py +++ b/src/hletterscriptgen/cli.py @@ -132,6 +132,33 @@ def _build_parser() -> argparse.ArgumentParser: help="Output format (default: json).", ) + review_p = sub.add_parser( + "review", + help="Serve a local browser-based review UI for a letter_set.json file.", + ) + review_p.add_argument( + "path", + type=Path, + help="Path to a 
letter_set.json file produced by 'generate'.", + ) + review_p.add_argument( + "--port", + type=int, + default=8765, + metavar="N", + help="Local port to serve on (default: 8765).", + ) + review_p.add_argument( + "--feedback", + type=Path, + default=None, + metavar="FILE", + help=( + "Path to the feedback JSON file (read on load, written on save). " + "Defaults to .review_feedback.json next to the letter-set file." + ), + ) + return parser @@ -275,6 +302,17 @@ def _cmd_scan_blobs(args: argparse.Namespace) -> int: return EXIT_OK +def _cmd_review(args: argparse.Namespace) -> int: + from hletterscriptgen.reviewer import serve + + try: + serve(args.path, port=args.port, feedback_path=args.feedback) + except (FileNotFoundError, ValueError) as exc: + print(str(exc), file=sys.stderr) + return EXIT_INPUT_ERROR + return EXIT_OK + + def main(argv: list[str] | None = None) -> int: parser = _build_parser() args = parser.parse_args(argv) @@ -291,5 +329,7 @@ def main(argv: list[str] | None = None) -> int: return _cmd_check_eligible(args) if args.command == "scan-blobs": return _cmd_scan_blobs(args) + if args.command == "review": + return _cmd_review(args) - parser.error(f"unknown command: {args.command}") + raise AssertionError(f"unhandled command: {args.command}") diff --git a/src/hletterscriptgen/generate_profile.py b/src/hletterscriptgen/generate_profile.py index 7b5ad60..cefffd6 100644 --- a/src/hletterscriptgen/generate_profile.py +++ b/src/hletterscriptgen/generate_profile.py @@ -19,7 +19,7 @@ Profile JSON shape:: { - "upstream_checkout": "../public-domain-hand-written-hebrew-scans", + "upstream_checkout": "../hash", "writers": [ { "writer_id": "writer_bialik", diff --git a/src/hletterscriptgen/reviewer.py b/src/hletterscriptgen/reviewer.py new file mode 100644 index 0000000..c4060dd --- /dev/null +++ b/src/hletterscriptgen/reviewer.py @@ -0,0 +1,804 @@ +"""Local browser-based review server for a letter_set.v1 release candidate. 
+ +Serves a single-page review UI that lets you scroll through every variant in a +``letter_set.json``, mark each one as *accepted*, *rejected*, or *changes +requested*, add free-text comments, and persist the feedback to a JSON file. + +Usage via the CLI:: + + hletterscriptgen review path/to/letter_set.json [--port 8765] + +Usage as a library:: + + from hletterscriptgen.reviewer import serve + from pathlib import Path + serve(Path("out/my_writer/letter_set.json"), port=8765) + +The feedback file (``.review_feedback.json`` next to the letter-set by +default, or the path passed as ``--feedback``) is auto-created on the first +``POST /feedback`` and read back on page load so a review session can be +resumed. +""" + +from __future__ import annotations + +import http.server +import json +import webbrowser +from html import escape as _esc +from pathlib import Path +from typing import Any + +_FEEDBACK_FILENAME = ".review_feedback.json" + +_MIME_MAP: dict[str, str] = { + "png": "image/png", + "jpg": "image/jpeg", + "jpeg": "image/jpeg", + "webp": "image/webp", + "tiff": "image/tiff", +} + +_LETTER_NAMES: dict[str, str] = { + "א": "Alef", + "ב": "Bet", + "ג": "Gimel", + "ד": "Dalet", + "ה": "He", + "ו": "Vav", + "ז": "Zayin", + "ח": "Het", + "ט": "Tet", + "י": "Yod", + "כ": "Kaf", + "ך": "Kaf (final)", + "ל": "Lamed", + "מ": "Mem", + "ם": "Mem (final)", + "נ": "Nun", + "ן": "Nun (final)", + "ס": "Samekh", + "ע": "Ayin", + "פ": "Pe", + "ף": "Pe (final)", + "צ": "Tsadi", + "ץ": "Tsadi (final)", + "ק": "Qof", + "ר": "Resh", + "ש": "Shin", + "ת": "Tav", +} + + +# --------------------------------------------------------------------------- +# HTML template pieces (pure strings — no f-string so curly-braces are safe) +# --------------------------------------------------------------------------- + +_CSS = """ +*,*::before,*::after{box-sizing:border-box;margin:0;padding:0} +body{font-family:system-ui,-apple-system,sans-serif;font-size:14px; 
+background:#f0f2f5;color:#1a1a2e;display:flex;flex-direction:column;min-height:100vh} +code{font-size:.82em;background:#f0f0f0;padding:1px 4px;border-radius:3px} + +/* top header */ +.top-header{background:#1a1a2e;color:#fff;padding:.75rem 1.5rem; + display:flex;align-items:center;gap:1rem;flex-wrap:wrap; + position:sticky;top:0;z-index:100;box-shadow:0 2px 8px rgba(0,0,0,.35)} +.top-header h1{font-size:1rem;font-weight:600;white-space:nowrap} +.subtitle{font-size:.78rem;color:#aaa;margin-top:1px} +.progress-wrap{flex:1;min-width:180px} +.progress-label{font-size:.73rem;color:#bbb;margin-bottom:3px} +.progress-bar{height:6px;background:#333;border-radius:3px;overflow:hidden} +.progress-fill{height:100%;background:#4caf50;border-radius:3px;transition:width .3s} +.header-actions{display:flex;gap:.5rem} +.hdr-btn{padding:.3rem .75rem;font-size:.78rem;border-radius:4px; + cursor:pointer;border:none;font-weight:500} +.btn-export{background:#4a90d9;color:#fff} +.btn-export:hover{background:#357ab8} +.btn-accept-all{background:#2e7d32;color:#fff} +.btn-accept-all:hover{background:#1b5e20} + +/* layout */ +.main-layout{display:flex;flex:1} + +/* sidebar */ +.sidebar{width:195px;min-width:195px;background:#fff;border-right:1px solid #dde; + position:sticky;top:53px;height:calc(100vh - 53px);overflow-y:auto; + padding:.75rem .4rem} +.sidebar-title{font-size:.68rem;text-transform:uppercase;letter-spacing:.08em; + color:#aaa;padding:.25rem .6rem .5rem} +.letter-nav-item{display:flex;align-items:center;gap:.4rem;padding:.38rem .6rem; + border-radius:5px;text-decoration:none;color:inherit;margin-bottom:1px; + transition:background .12s;cursor:pointer} +.letter-nav-item:hover{background:#f0f2f8} +.letter-nav-item.active{background:#e8edf8;color:#1a3a7a;font-weight:600} +.lni-char{font-size:1.15rem;min-width:22px;text-align:center;direction:rtl} +.lni-name{flex:1;font-size:.78rem;color:#666;overflow:hidden; + white-space:nowrap;text-overflow:ellipsis} 
+.lni-count{font-size:.68rem;background:#eee;color:#777; + padding:1px 5px;border-radius:8px;flex-shrink:0} +.lni-dots{display:flex;gap:2px;margin-left:2px;flex-shrink:0} +.dot{width:7px;height:7px;border-radius:50%;background:#ccc;display:inline-block} +.dot.accept{background:#4caf50} +.dot.reject{background:#f44336} +.dot.changes{background:#ff9800} + +/* content */ +.content{flex:1;padding:1.25rem 1.5rem;min-width:0} + +/* letter sections */ +.letter-section{margin-bottom:2.25rem} +.letter-section-header{display:flex;align-items:center;gap:.75rem; + margin-bottom:1rem;padding-bottom:.5rem;border-bottom:2px solid #d0d4e8} +.lsh-char{font-size:2rem;direction:rtl;color:#1a1a2e;line-height:1} +.lsh-name{font-size:1.1rem;font-weight:600;color:#1a1a2e} +.lsh-count{font-size:.78rem;color:#888;background:#eee; + padding:2px 8px;border-radius:10px} + +/* variant cards */ +.variant-card{background:#fff;border-radius:8px;padding:1rem; + margin-bottom:.85rem;box-shadow:0 1px 3px rgba(0,0,0,.08); + border:2px solid transparent;transition:border-color .15s} +.variant-card.verdict-accept{border-color:#4caf50} +.variant-card.verdict-reject{border-color:#f44336} +.variant-card.verdict-changes{border-color:#ff9800} +.variant-card.dirty{border-style:dashed} + +.card-header{display:flex;align-items:center;gap:.6rem; + margin-bottom:.75rem;flex-wrap:wrap} +.card-id{font-family:monospace;font-size:.83rem;color:#555} +.card-letter{font-size:.88rem;color:#444;direction:rtl} +.quality-badge{font-size:.7rem;padding:2px 7px;border-radius:10px; + font-weight:600;margin-left:auto;flex-shrink:0} +.quality-ok{background:#e8f5e9;color:#2e7d32} +.quality-warn{background:#fff8e1;color:#e65100} +.quality-low{background:#fce4ec;color:#c62828} +.verdict-badge{font-size:.72rem;padding:2px 8px;border-radius:10px; + font-weight:600;flex-shrink:0} +.vb-accept{background:#4caf50;color:#fff} +.vb-reject{background:#f44336;color:#fff} +.vb-changes{background:#ff9800;color:#fff} + 
+.card-body{display:flex;gap:1.25rem;flex-wrap:wrap} + +.card-image{display:flex;flex-direction:column;align-items:center; + gap:.4rem;min-width:80px} +.glyph-img{image-rendering:pixelated;max-width:180px;min-width:48px; + border:1px solid #ddd;background:#fff;width:auto;height:auto} +.glyph-missing{width:80px;height:80px;background:#f5f5f5;border:1px dashed #ccc; + display:flex;align-items:center;justify-content:center;font-size:.7rem; + color:#aaa;text-align:center;padding:.5rem;border-radius:4px} +.image-dims{font-size:.68rem;color:#aaa} + +.card-meta{flex:1;min-width:180px} +.meta-table{border-collapse:collapse;width:100%;font-size:.8rem} +.meta-table th{text-align:left;color:#999;padding:2px 8px 2px 0; + white-space:nowrap;font-weight:500;vertical-align:top} +.meta-table td{padding:2px 0;color:#333;word-break:break-all} + +.card-review{flex:1;min-width:210px;display:flex;flex-direction:column;gap:.5rem} +.verdict-btns{display:flex;gap:.4rem;flex-wrap:wrap} +.verdict-btn{flex:1;padding:.38rem .5rem;font-size:.8rem; + border:2px solid transparent;border-radius:5px;cursor:pointer; + font-weight:500;background:#f5f5f5;color:#333;transition:all .15s; + min-width:80px} +.verdict-btn:hover{transform:translateY(-1px);box-shadow:0 2px 5px rgba(0,0,0,.15)} +.btn-accept{border-color:#4caf50} +.btn-accept:hover,.btn-accept.active{background:#4caf50;color:#fff} +.btn-reject{border-color:#f44336} +.btn-reject:hover,.btn-reject.active{background:#f44336;color:#fff} +.btn-changes{border-color:#ff9800} +.btn-changes:hover,.btn-changes.active{background:#ff9800;color:#fff} + +.comment-box{width:100%;font-size:.8rem;border:1px solid #ddd;border-radius:4px; + padding:.38rem .5rem;resize:vertical;font-family:inherit;color:#333} +.comment-box:focus{outline:none;border-color:#4a90d9} +.card-actions{display:flex;align-items:center;gap:.75rem} +.save-btn{padding:.3rem .9rem;font-size:.78rem;border:none;border-radius:4px; + background:#4a90d9;color:#fff;cursor:pointer;font-weight:500} 
+.save-btn:hover{background:#357ab8} +.saved-ok{font-size:.73rem;color:#4caf50} + +.variant-card.highlight{animation:hl .8s ease-out} +@keyframes hl{0%{box-shadow:0 0 0 4px #4a90d9}100%{box-shadow:none}} + +#toast{position:fixed;bottom:1.5rem;right:1.5rem;background:#222;color:#fff; + padding:.45rem 1rem;border-radius:6px;font-size:.8rem;z-index:999; + opacity:0;transition:opacity .25s;pointer-events:none} +#toast.show{opacity:1} +""" + +# Note: all JS curly braces are literal — this is NOT an f-string. +# Dynamic values are spliced in via .replace() calls in _build_html(). +_SCRIPT = r""" +const ALL_IDS = __ALL_IDS__; +let feedback = {}; +let dirty = new Set(); +let _currentVerdict = {}; + +// --- Feedback persistence --- +async function loadFeedback() { + try { + const r = await fetch('/feedback'); + if (r.ok) { feedback = await r.json(); restoreUI(); } + } catch(e) {} +} + +function restoreUI() { + for (const [vid, fb] of Object.entries(feedback)) { + if (fb.verdict) _applyVerdict(vid, fb.verdict); + const box = document.getElementById('comment-' + vid); + if (box && fb.comment) box.value = fb.comment; + markSaved(vid); + } + updateProgress(); + updateSidebar(); +} + +async function _persistFeedback() { + try { + await fetch('/feedback', { + method: 'POST', + headers: {'Content-Type': 'application/json'}, + body: JSON.stringify(feedback), + }); + dirty.clear(); + } catch(e) { showToast('Save failed: ' + e.message, true); } +} + +// --- Verdict --- +function setVerdict(vid, verdict) { + const prev = _currentVerdict[vid]; + if (prev === verdict) { + delete _currentVerdict[vid]; + _applyVerdict(vid, null); + if (feedback[vid]) delete feedback[vid].verdict; + } else { + _currentVerdict[vid] = verdict; + _applyVerdict(vid, verdict); + feedback[vid] = feedback[vid] || {}; + feedback[vid].verdict = verdict; + } + updateProgress(); + updateSidebar(); + dirty.add(vid); +} + +function _applyVerdict(vid, verdict) { + _currentVerdict[vid] = verdict; + const card = 
document.getElementById('card-' + vid); + if (!card) return; + card.classList.remove('verdict-accept', 'verdict-reject', 'verdict-changes'); + if (verdict) card.classList.add('verdict-' + verdict); + + card.querySelectorAll('.verdict-btn').forEach(b => b.classList.remove('active')); + if (verdict) { + const btn = card.querySelector('.verdict-btn[data-verdict="' + verdict + '"]'); + if (btn) btn.classList.add('active'); + } + + const badge = document.getElementById('verdict-badge-' + vid); + if (badge) { + const labels = {accept: '✅ Accepted', reject: '❌ Rejected', changes: '🔄 Changes'}; + badge.textContent = verdict ? (labels[verdict] || verdict) : ''; + badge.className = 'verdict-badge' + (verdict ? ' vb-' + verdict : ''); + } +} + +function markDirty(vid) { + dirty.add(vid); + const card = document.getElementById('card-' + vid); + if (card && !feedback[vid]?.verdict) card.classList.add('dirty'); +} + +// --- Save card --- +async function saveCard(vid) { + const box = document.getElementById('comment-' + vid); + const comment = (box && box.value.trim()) || null; + feedback[vid] = feedback[vid] || {}; + if (comment) feedback[vid].comment = comment; + else delete feedback[vid].comment; + if (_currentVerdict[vid]) feedback[vid].verdict = _currentVerdict[vid]; + if (!Object.keys(feedback[vid]).length) delete feedback[vid]; + await _persistFeedback(); + markSaved(vid); + const card = document.getElementById('card-' + vid); + if (card) card.classList.remove('dirty'); + showToast('Saved ' + vid); +} + +function markSaved(vid) { + const el = document.getElementById('saved-' + vid); + if (el) { el.textContent = '✓ saved'; setTimeout(() => { if(el) el.textContent=''; }, 2500); } +} + +// --- Progress --- +function updateProgress() { + const reviewed = ALL_IDS.filter(id => feedback[id]?.verdict).length; + const total = ALL_IDS.length; + const pct = total ? 
(reviewed / total * 100).toFixed(0) : 0; + const lbl = document.getElementById('progress-label'); + const fill = document.getElementById('progress-fill'); + if (lbl) lbl.textContent = reviewed + ' / ' + total + ' reviewed'; + if (fill) fill.style.width = pct + '%'; +} + +// --- Sidebar dots --- +function _letterAnchorId(char) { + const pts = [...char].map(c => 'u' + c.codePointAt(0).toString(16).padStart(4, '0')); + return 'letter-' + pts.join(''); +} +function updateSidebar() { + document.querySelectorAll('.letter-nav-item').forEach(nav => { + const letter = nav.dataset.letter; + const dotsEl = nav.querySelector('.lni-dots'); + if (!dotsEl || !letter) return; + const sid = _letterAnchorId(letter); + const section = document.getElementById(sid); + if (!section) return; + const cards = section.querySelectorAll('.variant-card'); + dotsEl.innerHTML = ''; + cards.forEach(card => { + const vid = card.dataset.variantId; + const v = feedback[vid]?.verdict; + const dot = document.createElement('span'); + dot.className = 'dot' + (v ? 
' ' + v : ''); + dotsEl.appendChild(dot); + }); + }); +} + +// --- Accept all unreviewed --- +async function acceptAllUnreviewed() { + const unrev = ALL_IDS.filter(id => !feedback[id]?.verdict); + if (!unrev.length) { showToast('All variants already reviewed'); return; } + unrev.forEach(vid => { + _currentVerdict[vid] = 'accept'; + feedback[vid] = { ...(feedback[vid] || {}), verdict: 'accept' }; + _applyVerdict(vid, 'accept'); + }); + await _persistFeedback(); + updateProgress(); + updateSidebar(); + showToast('Accepted ' + unrev.length + ' unreviewed variant(s)'); +} + +// --- Export --- +function exportFeedback() { + const blob = new Blob([JSON.stringify(feedback, null, 2)], {type: 'application/json'}); + const url = URL.createObjectURL(blob); + const a = document.createElement('a'); + a.href = url; a.download = 'review_feedback.json'; a.click(); + URL.revokeObjectURL(url); +} + +// --- Toast --- +let _toastTimer = null; +function showToast(msg, err=false) { + const el = document.getElementById('toast'); + if (!el) return; + el.textContent = msg; + el.style.background = err ? 
'#c00' : '#222'; + el.classList.add('show'); + if (_toastTimer) clearTimeout(_toastTimer); + _toastTimer = setTimeout(() => el.classList.remove('show'), 3200); +} + +// --- Event delegation (replaces inline onclick/oninput) --- +document.addEventListener('click', e => { + const vbtn = e.target.closest('.verdict-btn[data-vid]'); + if (vbtn) { setVerdict(vbtn.dataset.vid, vbtn.dataset.verdict); return; } + const sbtn = e.target.closest('.save-btn[data-vid]'); + if (sbtn) { saveCard(sbtn.dataset.vid); } +}); +document.addEventListener('input', e => { + const box = e.target.closest('.comment-box[data-vid]'); + if (box) markDirty(box.dataset.vid); +}); + +// --- Active sidebar highlight on scroll --- +const _io = new IntersectionObserver(entries => { + for (const e of entries) { + if (e.isIntersecting) { + document.querySelectorAll('.letter-nav-item').forEach(n => n.classList.remove('active')); + const nav = document.querySelector('.letter-nav-item[href="#' + e.target.id + '"]'); + if (nav) nav.classList.add('active'); + } + } +}, {rootMargin: '-5% 0px -85% 0px'}); +document.querySelectorAll('.letter-section').forEach(s => _io.observe(s)); + +// Warn on unsaved changes +window.addEventListener('beforeunload', e => { + if (dirty.size > 0) { e.preventDefault(); e.returnValue = ''; } +}); + +loadFeedback(); +""" + + +# --------------------------------------------------------------------------- +# HTML builders +# --------------------------------------------------------------------------- + + +def _letter_anchor(char: str) -> str: + """Stable ASCII anchor for a Hebrew Unicode character, e.g. 
'u05d0'.""" + return "".join(f"u{ord(c):04x}" for c in char) + + +def _ink_quality(ink_ratio: float) -> tuple[str, str]: + """(label, css_class) for an ink_ratio in [0, 1].""" + if ink_ratio < 0.08: + return "Very sparse", "quality-low" + if ink_ratio < 0.15: + return "Sparse", "quality-warn" + if ink_ratio <= 0.60: + return "Normal", "quality-ok" + return "Dense", "quality-warn" + + +def _build_variant_card( + variant: dict[str, Any], + letter_char: str, + base_dir: Path, + images: dict[str, Path], +) -> str: + """Build HTML for one variant card. + + Populates *images* with ``{variant_id: absolute_path}`` for variants whose + asset file exists; the HTTP handler serves them at ``/image/``. + """ + vid = variant["variant_id"] + vid_attr = _esc(vid) + + try: + w = variant["image"]["width_px"] + h = variant["image"]["height_px"] + fmt = variant["image"]["format"] + ink_ratio: float = variant["quality"]["ink_ratio"] + except (KeyError, TypeError) as exc: + raise ValueError(f"Malformed variant {vid!r}: missing field {exc}") from exc + + q_label, q_cls = _ink_quality(ink_ratio) + + source = variant.get("source", {}) + scan_id = _esc(str(source.get("scan_entry_id", "—"))) + lic = _esc(str(source.get("license", "—"))) + bbox = source.get("bbox_in_source", {}) + bbox_str = ( + f"x={bbox.get('x')}, y={bbox.get('y')}, " + f"w={bbox.get('width')}, h={bbox.get('height')}" + if bbox + else "—" + ) + letter_name = _esc(_LETTER_NAMES.get(letter_char, letter_char)) + + asset_path = base_dir / variant["asset_path"] + if asset_path.exists(): + images[vid] = asset_path + img_html = ( + f'' + ) + else: + img_html = 'Image not found' + + return ( + f'\n' + f' \n' + f' {vid_attr}\n' + f' {_esc(letter_char)} {letter_name}\n' + f' {q_label} ({ink_ratio:.2f})\n' + f' \n' + f' \n' + f' \n' + f' {img_html}' + f' {w}\xd7{h} px\n' + f' \n' + f' Format{_esc(str(fmt))}\n' + f' Size{w}\xd7{h} px\n' + f' Ink ratio{ink_ratio:.3f}\n' + f' Source{scan_id}\n' + f' License{lic}\n' + f' Bbox{bbox_str}\n' 
+ f' \n' + f' \n' + f' \n' + f' ✅ Accept\n' + f' ❌ Reject\n' + f' 🔄 Changes\n' + f' \n' + f' \n' + f' \n' + f' Save\n' + f' \n' + f' \n' + f' \n' + f' \n' + f'\n' + ) + + +def _build_sidebar(letter_set: dict[str, Any]) -> str: + parts: list[str] = [] + for char, variants in letter_set.get("letters", {}).items(): + name = _LETTER_NAMES.get(char, char) + anchor = _letter_anchor(char) + count = len(variants) + dots = "".join('' for _ in variants) + parts.append( + f'' + f'{_esc(char)}' + f'{_esc(name)}' + f'{count}' + f'{dots}' + f'' + ) + return "\n".join(parts) + + +def _build_sections( + letter_set: dict[str, Any], base_dir: Path +) -> tuple[str, list[str], dict[str, Path]]: + """Return (sections_html, all_variant_ids, images_map). + + *images_map* maps variant_id → absolute asset path for every variant whose + file exists on disk; used by the HTTP handler to serve ``/image/``. + """ + sections: list[str] = [] + all_ids: list[str] = [] + images: dict[str, Path] = {} + + for char, variants in letter_set.get("letters", {}).items(): + name = _LETTER_NAMES.get(char, char) + anchor = _letter_anchor(char) + n = len(variants) + s_label = "variant" if n == 1 else "variants" + cards = "".join(_build_variant_card(v, char, base_dir, images) for v in variants) + all_ids.extend(v["variant_id"] for v in variants) + sections.append( + f'\n' + f' ' + f'{_esc(char)}' + f'{_esc(name)}' + f'{n} {s_label}' + f'\n{cards}\n' + ) + return "".join(sections), all_ids, images + + +def _build_html( + letter_set: dict[str, Any], + base_dir: Path, +) -> tuple[str, dict[str, Path]]: + """Build the review page HTML and return ``(html_str, images_map)``.""" + writer_id = letter_set.get("writer_id", "") + writer_label = letter_set.get("writer_label", "") + generated_at = letter_set.get("generated_at", "") + date_str = generated_at[:10] if generated_at else "—" + + letters = letter_set.get("letters", {}) + total = sum(len(v) for v in letters.values()) + + sidebar_html = _build_sidebar(letter_set) + 
sections_html, all_ids, images = _build_sections(letter_set, base_dir) + + script = _SCRIPT.replace("__ALL_IDS__", json.dumps(all_ids)) + + label_line = ( + f'{_esc(writer_label)}\n' if writer_label else "" + ) + title_esc = _esc(writer_id) + + html_str = ( + "\n" + '\n\n' + '\n' + '\n' + f"Review — {title_esc}\n" + f"\n" + "\n\n" + '\n' + " \n" + f' \U0001f50c Review — {title_esc}\n' + f" {label_line}" + f' Generated: {date_str}' + f" · {total} variant{'s' if total != 1 else ''}\n" + " \n" + ' \n' + ' ' + f'0 / {total} reviewed\n' + ' ' + '\n' + " \n" + ' \n' + ' ' + "⬇ Export\n" + ' ' + "✅ Accept unreviewed\n" + " \n" + "\n" + '\n' + ' \n' + ' Letters\n' + f" {sidebar_html}\n" + " \n" + ' \n' + f" {sections_html}\n" + " \n" + "\n" + '\n' + f"\n" + "\n\n" + ) + return html_str, images + + +# --------------------------------------------------------------------------- +# HTTP server +# --------------------------------------------------------------------------- + + +class _ReviewHandler(http.server.BaseHTTPRequestHandler): + """Minimal handler: serves pre-built HTML, images, and manages feedback JSON. + + Concrete values for ``_html``, ``_feedback_path``, and ``_images`` must be + provided by a subclass (``serve()`` creates one per invocation to avoid + shared class-level state). 
+ """ + + _html: str + _feedback_path: Path + _images: dict[str, Path] + + def log_message(self, fmt: str, *args: object) -> None: # silence request log + pass + + def do_GET(self) -> None: + if self.path == "/": + body = self._html.encode("utf-8") + self.send_response(200) + self.send_header("Content-Type", "text/html; charset=utf-8") + self.send_header("Content-Length", str(len(body))) + self.end_headers() + self.wfile.write(body) + + elif self.path == "/feedback": + data: dict[str, Any] = {} + fp = self._feedback_path + if fp.exists(): + try: + data = json.loads(fp.read_text(encoding="utf-8")) + except json.JSONDecodeError: + pass + body = json.dumps(data, ensure_ascii=False, indent=2).encode("utf-8") + self.send_response(200) + self.send_header("Content-Type", "application/json") + self.send_header("Content-Length", str(len(body))) + self.end_headers() + self.wfile.write(body) + + elif self.path.startswith("/image/"): + # Strip query string / fragment; dict lookup prevents path traversal. + vid = self.path[len("/image/"):].split("?")[0].split("#")[0] + img_path = self._images.get(vid) + if img_path is None or not img_path.exists(): + self.send_response(404) + self.end_headers() + return + ext = img_path.suffix.lower().lstrip(".") + mime = _MIME_MAP.get(ext, "image/png") + body = img_path.read_bytes() + self.send_response(200) + self.send_header("Content-Type", mime) + self.send_header("Content-Length", str(len(body))) + self.end_headers() + self.wfile.write(body) + + else: + self.send_response(404) + self.end_headers() + + def do_POST(self) -> None: + length = int(self.headers.get("Content-Length", 0)) + if self.path == "/feedback": + raw = self.rfile.read(length) + try: + data = json.loads(raw.decode("utf-8")) + except json.JSONDecodeError: + self.send_response(400) + self.end_headers() + return + # Atomic write: write to a sibling .tmp file then replace(). + # Prevents a corrupt feedback file if the process is killed mid-write. 
+ tmp = self._feedback_path.with_suffix(".tmp") + tmp.write_text( + json.dumps(data, ensure_ascii=False, indent=2, sort_keys=True), + encoding="utf-8", + ) + tmp.replace(self._feedback_path) + self.send_response(204) + self.end_headers() + else: + if length > 0: + self.rfile.read(length) # drain body before responding to avoid TCP RST + self.send_response(404) + self.end_headers() + + +# --------------------------------------------------------------------------- +# Public API +# --------------------------------------------------------------------------- + + +def serve( + path: Path, + *, + port: int = 8765, + feedback_path: Path | None = None, +) -> None: + """Build a review page from *path* and serve it on *localhost:port*. + + Parameters + ---------- + path: + Absolute or relative path to a ``letter_set.json`` file. PNG assets + are resolved relative to ``path.parent``. + port: + TCP port to listen on (default: 8765). + feedback_path: + Where to read/write feedback JSON. Defaults to + ``.review_feedback.json`` next to *path*. + """ + path = Path(path).resolve() + base_dir = path.parent + if feedback_path is None: + feedback_path = base_dir / _FEEDBACK_FILENAME + + try: + letter_set: dict[str, Any] = json.loads(path.read_text(encoding="utf-8")) + except FileNotFoundError as exc: + raise FileNotFoundError(f"letter_set file not found: {path}") from exc + except json.JSONDecodeError as exc: + raise ValueError(f"letter_set file is not valid JSON: {exc}") from exc + + html_str, images = _build_html(letter_set, base_dir) + + # Create a fresh handler subclass per invocation so each server has its own + # isolated state rather than mutating shared class attributes. 
+ class _Handler(_ReviewHandler): + _html = html_str + _feedback_path = feedback_path # type: ignore[assignment] + _images = images + + server = http.server.HTTPServer(("127.0.0.1", port), _Handler) + url = f"http://localhost:{port}/" + writer_id = letter_set.get("writer_id", path.name) + total = sum(len(v) for v in letter_set.get("letters", {}).values()) + + print(f"Review server: {url}") + print(f"Writer: {writer_id} ({total} variant(s))") + print(f"Feedback file: {feedback_path}") + print("Press Ctrl-C to stop.") + print() + + # The socket is bound and listening after HTTPServer.__init__(), so the + # browser can connect immediately without a timing delay. + webbrowser.open(url) + try: + server.serve_forever() + except KeyboardInterrupt: + print("\nStopped.") + + +__all__ = ["serve"] diff --git a/src/hletterscriptgen/schemas/letter_set.schema.json b/src/hletterscriptgen/schemas/letter_set.schema.json index 14d8950..31f0d14 100644 --- a/src/hletterscriptgen/schemas/letter_set.schema.json +++ b/src/hletterscriptgen/schemas/letter_set.schema.json @@ -37,7 +37,7 @@ "source_repo": { "type": "string", "minLength": 1, - "description": "Upstream repository identifier, e.g. 'HeOCR/public-domain-hand-written-hebrew-scans'." + "description": "Upstream repository identifier, e.g. 'HeOCR/hash'." }, "source_entry_ids": { "type": "array", @@ -81,7 +81,7 @@ "repo": { "type": "string", "minLength": 1, - "description": "Upstream repository identifier in 'owner/name' form, e.g. 'HeOCR/public-domain-hand-written-hebrew-scans'." + "description": "Upstream repository identifier in 'owner/name' form, e.g. 'HeOCR/hash'." }, "revision": { "type": "string", diff --git a/src/hletterscriptgen/upstream.py b/src/hletterscriptgen/upstream.py index c50b406..85e2e38 100644 --- a/src/hletterscriptgen/upstream.py +++ b/src/hletterscriptgen/upstream.py @@ -1,7 +1,7 @@ -"""Upstream integration: read and filter ``public-domain-hand-written-hebrew-scans``. 
+"""Upstream integration: read and filter ``HeOCR/hash`` (HASH). This module is read-only: it consumes a local checkout of the upstream -scan corpus (``HeOCR/public-domain-hand-written-hebrew-scans``) and +scan corpus (``HeOCR/hash``) and exposes the records the generator pipeline actually needs. The full upstream contract is broader than what is modelled here; see ``schemas/entry.schema.json`` in the upstream repo. @@ -166,7 +166,7 @@ class UpstreamPin: """The ``(repo, revision)`` pair written to ``letter_set.v1.upstream``. ``repo`` is the ``owner/name`` form of the upstream remote (e.g. - ``"HeOCR/public-domain-hand-written-hebrew-scans"``). ``revision`` + ``"HeOCR/hash"``). ``revision`` is the full SHA of the pinned ``HEAD`` commit. """ diff --git a/tests/fixtures/attribution/writer_profile.json b/tests/fixtures/attribution/writer_profile.json index 3f9960d..c17cd46 100644 --- a/tests/fixtures/attribution/writer_profile.json +++ b/tests/fixtures/attribution/writer_profile.json @@ -1,5 +1,5 @@ { - "upstream_path": "../public-domain-hand-written-hebrew-scans", + "upstream_path": "../hash", "writers": [ { "writer_id": "writer_bialik", diff --git a/tests/test_attribution.py b/tests/test_attribution.py index 4b18283..a30cc35 100644 --- a/tests/test_attribution.py +++ b/tests/test_attribution.py @@ -71,7 +71,7 @@ def test_roundtrip_parse_fixture() -> None: profile = load_attribution(PROFILE_PATH) assert isinstance(profile, WriterProfile) - assert profile.upstream_path == Path("../public-domain-hand-written-hebrew-scans") + assert profile.upstream_path == Path("../hash") writers_by_id = {w.writer_id: w for w in profile.writers} assert set(writers_by_id) == {"writer_bialik", "writer_herzl"} diff --git a/tests/test_generator.py b/tests/test_generator.py index ba2b2da..9b060a6 100644 --- a/tests/test_generator.py +++ b/tests/test_generator.py @@ -64,7 +64,7 @@ def _make_upstream_checkout(tmp_path: Path) -> Path: _git(repo, "config", "user.email", "test@example.com") 
_git(repo, "config", "user.name", "Test") _git(repo, "config", "commit.gpgsign", "false") - _git(repo, "remote", "add", "origin", "https://github.com/HeOCR/public-domain-hand-written-hebrew-scans.git") + _git(repo, "remote", "add", "origin", "https://github.com/HeOCR/hash.git") # entries.jsonl index_dir = repo / "data" / "index" @@ -156,7 +156,7 @@ def test_generate_letter_set_content(tmp_path: Path) -> None: doc = json.loads(paths[0].read_text(encoding="utf-8")) assert doc["schema_version"] == "letter_set.v1" assert doc["writer_id"] == "writer_test_a" - assert doc["upstream"]["repo"] == "HeOCR/public-domain-hand-written-hebrew-scans" + assert doc["upstream"]["repo"] == "HeOCR/hash" assert doc["generator"]["name"] == "hletterscriptgen" # Both annotated letters must appear assert "א" in doc["letters"] @@ -324,7 +324,7 @@ def _make_upstream_checkout_no_cv2(tmp_path: Path) -> Path: _git(repo, "config", "user.email", "test@example.com") _git(repo, "config", "user.name", "Test") _git(repo, "config", "commit.gpgsign", "false") - _git(repo, "remote", "add", "origin", "https://github.com/HeOCR/public-domain-hand-written-hebrew-scans.git") + _git(repo, "remote", "add", "origin", "https://github.com/HeOCR/hash.git") index_dir = repo / "data" / "index" index_dir.mkdir(parents=True) diff --git a/tests/test_reviewer.py b/tests/test_reviewer.py new file mode 100644 index 0000000..ebad573 --- /dev/null +++ b/tests/test_reviewer.py @@ -0,0 +1,599 @@ +"""Tests for hletterscriptgen.reviewer — review app HTML builder and HTTP server.""" + +from __future__ import annotations + +import http.client +import json +import struct +import threading +import zlib +from http.server import HTTPServer +from pathlib import Path +from typing import Any + +import pytest + +from hletterscriptgen.reviewer import ( + _build_html, + _build_sections, + _build_sidebar, + _build_variant_card, + _ink_quality, + _letter_anchor, + _ReviewHandler, + serve, +) + +# 
---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+
+def _minimal_png(width: int = 4, height: int = 4) -> bytes:
+    """Return a tiny valid grayscale PNG for testing."""
+    raw = b"".join(b"\x00" + bytes([128] * width) for _ in range(height))
+    compressed = zlib.compress(raw, 1)
+
+    def chunk(tag: bytes, data: bytes) -> bytes:
+        crc = zlib.crc32(tag + data) & 0xFFFF_FFFF
+        return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)
+
+    ihdr_data = struct.pack(">II", width, height) + bytes([8, 0, 0, 0, 0])
+    return (
+        b"\x89PNG\r\n\x1a\n"
+        + chunk(b"IHDR", ihdr_data)
+        + chunk(b"IDAT", compressed)
+        + chunk(b"IEND", b"")
+    )
+
+
+def _make_letter_set(
+    writer_id: str = "w1",
+    letters: dict[str, list[dict[str, Any]]] | None = None,
+) -> dict[str, Any]:
+    if letters is None:
+        letters = {
+            "א": [
+                {
+                    "variant_id": "alef-0001",
+                    "asset_path": "letters/alef/alef-0001.png",
+                    "checksum_sha256": "a" * 64,
+                    "image": {"width_px": 32, "height_px": 40, "format": "png"},
+                    "quality": {"ink_ratio": 0.25},
+                    "source": {
+                        "scan_entry_id": "scan-001",
+                        "license": "PDM-1.0",
+                        "bbox_in_source": {"x": 10, "y": 20, "width": 32, "height": 40},
+                    },
+                }
+            ]
+        }
+    return {
+        "schema_version": "letter_set.v1",
+        "writer_id": writer_id,
+        "writer_label": "Test Writer",
+        "generated_at": "2026-05-25T00:00:00Z",
+        "letters": letters,
+    }
+
+
+def _start_one_shot_server(
+    html: str,
+    feedback_path: Path,
+    images: dict[str, Path] | None = None,
+) -> tuple[int, HTTPServer]:
+    """Start an HTTPServer bound to a free port; return (port, server).
+
+    Creates a fresh _ReviewHandler subclass per call so tests are isolated:
+    no shared class-level state between invocations.
+    """
+    class _TestHandler(_ReviewHandler):
+        pass
+
+    _TestHandler._html = html
+    _TestHandler._feedback_path = feedback_path
+    _TestHandler._images = images if images is not None else {}
+
+    srv = HTTPServer(("127.0.0.1", 0), _TestHandler)
+    return srv.server_address[1], srv
+
+
+# ---------------------------------------------------------------------------
+# _letter_anchor
+# ---------------------------------------------------------------------------
+
+
+def test_letter_anchor_alef() -> None:
+    assert _letter_anchor("א") == "u05d0"
+
+
+def test_letter_anchor_resh() -> None:
+    assert _letter_anchor("ר") == "u05e8"
+
+
+def test_letter_anchor_tav() -> None:
+    assert _letter_anchor("ת") == "u05ea"
+
+
+# ---------------------------------------------------------------------------
+# _ink_quality
+# ---------------------------------------------------------------------------
+
+
+def test_ink_quality_very_sparse() -> None:
+    label, cls = _ink_quality(0.03)
+    assert label == "Very sparse"
+    assert cls == "quality-low"
+
+
+def test_ink_quality_sparse() -> None:
+    label, cls = _ink_quality(0.10)
+    assert label == "Sparse"
+    assert cls == "quality-warn"
+
+
+def test_ink_quality_normal() -> None:
+    label, cls = _ink_quality(0.30)
+    assert label == "Normal"
+    assert cls == "quality-ok"
+
+
+def test_ink_quality_normal_upper_boundary() -> None:
+    label, cls = _ink_quality(0.60)
+    assert label == "Normal"
+    assert cls == "quality-ok"
+
+
+def test_ink_quality_dense() -> None:
+    label, cls = _ink_quality(0.80)
+    assert label == "Dense"
+    assert cls == "quality-warn"
+
+
+# ---------------------------------------------------------------------------
+# _build_variant_card
+# ---------------------------------------------------------------------------
+
+
+def test_variant_card_contains_variant_id(tmp_path: Path) -> None:
+    variant: dict[str, Any] = {
+        "variant_id": "alef-0042",
+        "asset_path": "letters/alef/alef-0042.png",
+        "checksum_sha256": "a" * 64,
+        "image": {"width_px": 30, "height_px": 40, "format": "png"},
+        "quality": {"ink_ratio": 0.28},
+        "source": {
+            "scan_entry_id": "s001",
+            "license": "PDM-1.0",
+            "bbox_in_source": {"x": 5, "y": 10, "width": 30, "height": 40},
+        },
+    }
+    images: dict[str, Path] = {}
+    html = _build_variant_card(variant, "א", tmp_path, images)
+    assert "alef-0042" in html
+    assert 'id="card-alef-0042"' in html
+
+
+def test_variant_card_missing_image_shows_fallback(tmp_path: Path) -> None:
+    variant: dict[str, Any] = {
+        "variant_id": "alef-0001",
+        "asset_path": "letters/alef/missing.png",
+        "checksum_sha256": "a" * 64,
+        "image": {"width_px": 30, "height_px": 40, "format": "png"},
+        "quality": {"ink_ratio": 0.25},
+        "source": {
+            "scan_entry_id": "s001",
+            "license": "PDM-1.0",
+            "bbox_in_source": {"x": 5, "y": 10, "width": 30, "height": 40},
+        },
+    }
+    images: dict[str, Path] = {}
+    html = _build_variant_card(variant, "א", tmp_path, images)
+    assert "glyph-missing" in html
+    assert " None:
+    png_path = tmp_path / "letters" / "alef"
+    png_path.mkdir(parents=True)
+    (png_path / "alef-0001.png").write_bytes(_minimal_png())
+
+    variant: dict[str, Any] = {
+        "variant_id": "alef-0001",
+        "asset_path": "letters/alef/alef-0001.png",
+        "checksum_sha256": "a" * 64,
+        "image": {"width_px": 4, "height_px": 4, "format": "png"},
+        "quality": {"ink_ratio": 0.28},
+        "source": {
+            "scan_entry_id": "s001",
+            "license": "PDM-1.0",
+            "bbox_in_source": {"x": 5, "y": 10, "width": 4, "height": 4},
+        },
+    }
+    images: dict[str, Path] = {}
+    html = _build_variant_card(variant, "א", tmp_path, images)
+    assert ' None:
+    """Dense ink_ratio should produce the 'quality-warn' class."""
+    variant: dict[str, Any] = {
+        "variant_id": "v1",
+        "asset_path": "x.png",
+        "checksum_sha256": "a" * 64,
+        "image": {"width_px": 10, "height_px": 10, "format": "png"},
+        "quality": {"ink_ratio": 0.90},
+        "source": {"scan_entry_id": "s", "license": "PDM-1.0",
+                   "bbox_in_source": {"x": 0, "y": 0, "width": 10, "height": 10}},
+    }
+    images: dict[str, Path] = {}
+    html = _build_variant_card(variant, "ב", tmp_path, images)
+    assert "quality-warn" in html
+
+
+def test_variant_card_uses_data_attributes_not_inline_handlers(tmp_path: Path) -> None:
+    variant: dict[str, Any] = {
+        "variant_id": "alef-0001",
+        "asset_path": "x.png",
+        "checksum_sha256": "a" * 64,
+        "image": {"width_px": 10, "height_px": 10, "format": "png"},
+        "quality": {"ink_ratio": 0.25},
+        "source": {"scan_entry_id": "s", "license": "PDM-1.0",
+                   "bbox_in_source": {"x": 0, "y": 0, "width": 10, "height": 10}},
+    }
+    images: dict[str, Path] = {}
+    html = _build_variant_card(variant, "א", tmp_path, images)
+    assert 'onclick=' not in html
+    assert 'oninput=' not in html
+    assert 'data-vid=' in html
+    assert 'data-verdict=' in html
+
+
+def test_variant_card_malformed_variant_raises(tmp_path: Path) -> None:
+    bad: dict[str, Any] = {
+        "variant_id": "v1",
+        "asset_path": "x.png",
+        # missing "image" and "quality"
+    }
+    with pytest.raises(ValueError, match="Malformed variant"):
+        _build_variant_card(bad, "א", tmp_path, {})
+
+
+# ---------------------------------------------------------------------------
+# _build_sidebar
+# ---------------------------------------------------------------------------
+
+
+def test_build_sidebar_contains_all_letters(tmp_path: Path) -> None:
+    ls = _make_letter_set(
+        letters={
+            "א": [{"variant_id": "a1", "asset_path": "x.png",
+                   "checksum_sha256": "a" * 64,
+                   "image": {"width_px": 1, "height_px": 1, "format": "png"},
+                   "quality": {"ink_ratio": 0.2},
+                   "source": {"scan_entry_id": "s", "license": "PDM-1.0",
+                              "bbox_in_source": {"x": 0, "y": 0, "width": 1, "height": 1}}}],
+            "ב": [{"variant_id": "b1", "asset_path": "y.png",
+                   "checksum_sha256": "b" * 64,
+                   "image": {"width_px": 1, "height_px": 1, "format": "png"},
+                   "quality": {"ink_ratio": 0.3},
+                   "source": {"scan_entry_id": "s", "license": "PDM-1.0",
+                              "bbox_in_source": {"x": 0, "y": 0, "width": 1, "height": 1}}}],
+        }
+    )
+    html = _build_sidebar(ls)
+    assert "u05d0" in html  # Alef anchor
+    assert "u05d1" in html  # Bet anchor
+    assert "letter-nav-item" in html
+
+
+def test_build_sidebar_empty_letters() -> None:
+    html = _build_sidebar({"letters": {}})
+    assert html == ""
+
+
+# ---------------------------------------------------------------------------
+# _build_sections
+# ---------------------------------------------------------------------------
+
+
+def test_build_sections_returns_all_ids(tmp_path: Path) -> None:
+    ls = _make_letter_set()
+    _, ids, _ = _build_sections(ls, tmp_path)
+    assert ids == ["alef-0001"]
+
+
+def test_build_sections_html_contains_section_id(tmp_path: Path) -> None:
+    ls = _make_letter_set()
+    html, _, _ = _build_sections(ls, tmp_path)
+    assert 'id="letter-u05d0"' in html
+
+
+def test_build_sections_images_map_populated_when_file_exists(tmp_path: Path) -> None:
+    png_dir = tmp_path / "letters" / "alef"
+    png_dir.mkdir(parents=True)
+    (png_dir / "alef-0001.png").write_bytes(_minimal_png())
+    ls = _make_letter_set()
+    _, _, images = _build_sections(ls, tmp_path)
+    assert "alef-0001" in images
+
+
+def test_build_sections_images_map_empty_when_file_missing(tmp_path: Path) -> None:
+    ls = _make_letter_set()  # asset file does not exist in tmp_path
+    _, _, images = _build_sections(ls, tmp_path)
+    assert images == {}
+
+
+# ---------------------------------------------------------------------------
+# _build_html
+# ---------------------------------------------------------------------------
+
+
+def test_build_html_contains_writer_id(tmp_path: Path) -> None:
+    ls = _make_letter_set(writer_id="my-writer-007")
+    html, _ = _build_html(ls, tmp_path)
+    assert "my-writer-007" in html
+
+
+def test_build_html_contains_progress_elements(tmp_path: Path) -> None:
+    ls = _make_letter_set()
+    html, _ = _build_html(ls, tmp_path)
+    assert "progress-fill" in html
+    assert "progress-label" in html
+
+
+def test_build_html_embeds_all_ids_in_script(tmp_path: Path) -> None:
+    ls = _make_letter_set()
+    html, _ = _build_html(ls, tmp_path)
+    assert '"alef-0001"' in html  # variant_id appears in the JS ALL_IDS array
+
+
+def test_build_html_no_writer_label_skips_label_div(tmp_path: Path) -> None:
+    ls = _make_letter_set()
+    del ls["writer_label"]
+    html, _ = _build_html(ls, tmp_path)
+    assert "Test Writer" not in html
+
+
+def test_build_html_is_valid_html_scaffold(tmp_path: Path) -> None:
+    ls = _make_letter_set()
+    html, _ = _build_html(ls, tmp_path)
+    assert html.startswith("")
+    assert "
{scan_id}