HeOCR · shaypal5 · May 25, 2026 · May 31, 2026
@@ -26,7 +26,7 @@ hletterscriptgen validate examples/letter_set/writer_example.json --format json
   scans into per-writer letter-glyph image sets.
 - `hletterscript` (separate repo) owns the **published letter-set datasets**.
   Do not commit generated glyph images to this repo.
-- `public-domain-hand-written-hebrew-scans` (separate repo) owns
+- `hash` (separate repo) owns
   **upstream scans** and their rights records.
 - `hocrsyngen`, `hocrgen`, `HeOCR`, `HeOCRsynth` are downstream consumers.
   Do not import them from `hletterscriptgen` and do not build their
@@ -44,7 +44,7 @@ hletterscriptgen validate examples/letter_set/writer_example.json --format json
 ## Rights-carryover rules
 
 - Every variant must carry a `source.scan_entry_id` that resolves against
-  the upstream `public-domain-hand-written-hebrew-scans` index, plus a
+  the upstream `hash` index, plus a
   `source.license` matching the upstream record. The generator never
   invents, broadens, or relicenses upstream rights.
 - `license_summary.licenses` must include every distinct license that

@@ -7,7 +7,7 @@ Thanks for considering a contribution to `hletterscriptgen`.
 This repo holds the **code** that produces per-writer Hebrew letter-glyph
 image sets. It does **not** host the letter-set images themselves (those
 live in `HeOCR/hletterscript`), and it does **not** ingest upstream scans
-(those live in `HeOCR/public-domain-hand-written-hebrew-scans`). Please
+(those live in `HeOCR/hash`). Please
 keep PRs aligned with that boundary; cross-repo concerns belong upstream
 or downstream.
 

@@ -16,7 +16,7 @@ rules apply to each layer.
 ## 2. Generated letter-set datasets
 
 The generator processes scans from
-[`HeOCR/public-domain-hand-written-hebrew-scans`](https://github.com/HeOCR/public-domain-hand-written-hebrew-scans).
+[`HeOCR/hash`](https://github.com/HeOCR/hash).
 That upstream repository uses a compound licensing model with rights recorded
 **per scan**. `hletterscriptgen` follows the same posture:
 

@@ -1,95 +1,141 @@
 # hletterscriptgen
 
-Generator framework for **per-writer Hebrew letter-glyph image sets**, built
-on rights-clean upstream scans of handwritten Hebrew documents.
-
-`hletterscriptgen` is part of the [HeOCR](https://github.com/HeOCR) project.
-It consumes scan-level records from
-[`HeOCR/public-domain-hand-written-hebrew-scans`](https://github.com/HeOCR/public-domain-hand-written-hebrew-scans)
-and produces letter-set datasets that land in
-[`HeOCR/hletterscript`](https://github.com/HeOCR/hletterscript). Downstream,
-[`HeOCR/hocrsyngen`](https://github.com/HeOCR/hocrsyngen) composes those
-glyphs into synthetic Hebrew handwritten pages, which
-[`HeOCR/hocrgen`](https://github.com/HeOCR/hocrgen) folds into
-[`HeOCR/HeOCR`](https://github.com/HeOCR/HeOCR) and
-[`HeOCR/HeOCRsynth`](https://github.com/HeOCR/HeOCRsynth).
-
-## Repository scope
-
-This repository contains the **code, schemas, and contracts** that produce
-letter sets. It does **not** host the letter-set image data; that lives in
-`HeOCR/hletterscript`.
+[![CI](https://github.com/HeOCR/hletterscriptgen/actions/workflows/ci.yml/badge.svg)](https://github.com/HeOCR/hletterscriptgen/actions/workflows/ci.yml)
+![License](https://img.shields.io/badge/license-MIT-green)
+![Python](https://img.shields.io/badge/python-3.11%2B-blue)
+![Schema](https://img.shields.io/badge/schema-letter__set.v1-blue)
 
-What lives here:
+Created by [Shay Palachy Affek](http://www.shaypalachy.com/).
+
+Generator framework for per-writer Hebrew handwritten letter-glyph image
+sets. It turns rights-clean HASH scan records plus human-reviewed glyph
+annotations into deterministic `letter_set.v1` outputs for
+[`HeOCR/hletterscript`](https://github.com/HeOCR/hletterscript).
 
-- A Python package and CLI (`hletterscriptgen`).
-- The `letter_set.v1` JSON Schema and a fixture example
-  (`examples/letter_set/writer_example.json`).
-- Validation tooling (`hletterscriptgen validate`).
-- CI that enforces schema and tooling invariants.
-- Licensing policy and rights-carryover rules
-  ([`LICENSE-POLICY.md`](LICENSE-POLICY.md)).
+![hletterscriptgen output contract diagram](docs/assets/hletterscriptgen-output-contract.png)
 
-What does **not** live here:
+## At a Glance
 
-- Actual extracted glyph images (→ `HeOCR/hletterscript`).
-- Page-scan ingestion or rights curation (→
-  `HeOCR/public-domain-hand-written-hebrew-scans`).
-- Document composition (→ `HeOCR/hocrsyngen`).
-- Dataset orchestration, governance, release assembly, or publication (→
-  `HeOCR/hocrgen` / `HeOCR/HeOCR` / `HeOCR/HeOCRsynth`).
+| Field | Value |
+| --- | --- |
+| Role | Generate and validate per-writer Hebrew letter-set outputs |
+| Input | HASH scan metadata, scan image files, and generation profiles |
+| Output | `letter_set.v1` JSON plus cropped glyph PNG assets |
+| Public contract | [`docs/letter_set_v1.md`](docs/letter_set_v1.md) |
+| Example fixture | [`examples/letter_set/writer_example.json`](examples/letter_set/writer_example.json) |
+| Main CLI | `hletterscriptgen` |
+| Code license | MIT |
+| Generated glyph rights | Per-variant rights inherited from upstream scans |
 
-## Position in the HeOCR system
+## What This Repository Owns
 
-`hletterscriptgen` reads rights-clean scans from
-`public-domain-hand-written-hebrew-scans`, produces per-writer letter
-sets that land in `hletterscript`, and ultimately feeds `hocrsyngen` /
-`hocrgen` / `HeOCR` / `HeOCRsynth`. See
-[`docs/repository_scope.md`](docs/repository_scope.md) for the full
-diagram and per-repo responsibilities.
+This repository contains the Python package, CLI, schema, and validation
+contracts for creating writer-level Hebrew letter sets. It does not host
+the published glyph dataset itself; generated and curated data belongs in
+[`HeOCR/hletterscript`](https://github.com/HeOCR/hletterscript).
+
+What lives here:
+
+- `hletterscriptgen`, the Python package and command-line interface.
+- The `letter_set.v1` JSON Schema and fixture example.
+- Generation-profile parsing, upstream eligibility checks, glyph
+  extraction helpers, checksum calculation, and output validation.
+- CI, release workflow, and rights-carryover policy.
+
+What lives elsewhere:
+
+- Page-scan ingestion and rights curation:
+  [`HeOCR/hash`](https://github.com/HeOCR/hash).
+- Published letter-glyph image sets:
+  [`HeOCR/hletterscript`](https://github.com/HeOCR/hletterscript).
+- Synthetic document composition:
+  [`HeOCR/hocrsyngen`](https://github.com/HeOCR/hocrsyngen).
+- Dataset orchestration and release assembly:
+  [`HeOCR/hocrgen`](https://github.com/HeOCR/hocrgen),
+  [`HeOCR/HeOCR`](https://github.com/HeOCR/HeOCR), and
+  [`HeOCR/HeOCRsynth`](https://github.com/HeOCR/HeOCRsynth).
+
+## Pipeline Position
+
+```mermaid
+flowchart LR
+    HASH["HeOCR/hash<br/>rights-clean scans"] --> PROFILE["generation profile<br/>writer + glyph bboxes"]
+    PROFILE --> GEN["hletterscriptgen<br/>crop, hash, dedupe, validate"]
+    GEN --> DATA["HeOCR/hletterscript<br/>letter_set.v1 + glyph PNGs"]
+    DATA --> SYN["HeOCR/hocrsyngen<br/>synthetic pages"]
+    SYN --> OCR["HeOCR / HeOCRsynth<br/>OCR and HTR datasets"]
+```
+
+See [`docs/repository_scope.md`](docs/repository_scope.md) for the full
+ecosystem boundary map and per-repository responsibilities.
 
 ## Install
 
 ```bash
 python -m pip install -e ".[test]"
 ```
 
-Requires Python 3.11+.
+For development:
+
+```bash
+python -m pip install -e ".[dev]"
+```
+
+Requires Python 3.11 or newer.
 
 ## CLI
 
 ```bash
 hletterscriptgen version
 hletterscriptgen schema --format json
 hletterscriptgen validate examples/letter_set/writer_example.json
-hletterscriptgen validate examples/letter_set/writer_example.json --format json
+hletterscriptgen check-eligible path/to/hash/data/index/entries.jsonl
+hletterscriptgen scan-blobs path/to/scan.png --format json
+hletterscriptgen generate --profile generate_profile.json --output ./out
 ```
 
-The `generate` subcommand is reserved for the upcoming extraction pipeline
-and is not yet implemented; see [`docs/roadmap.md`](docs/roadmap.md).
+The `generate` command expects a human-curated generation profile that
+names writer IDs, upstream scan entries, and glyph bounding boxes. The
+output is one directory per writer, each containing `letter_set.json` and
+the surviving cropped glyph assets.
+
+## The `letter_set.v1` Contract
 
-## The `letter_set.v1` contract
+The bundled schema describes one writer's letter-glyph collection:
 
-The bundled JSON Schema describes a per-writer letter set:
+- `writer_id` identifies the writer set.
+- `writer_provenance` records how the writer attribution was established.
+- `upstream` pins the exact HASH revision used by the generation run.
+- `letters` maps Hebrew letters and final forms to one or more glyph
+  variants.
+- Each variant carries an asset path, checksum, image metadata, source
+  scan ID, bounding box, license, and rights evidence.
+- `license_summary` summarizes the distinct variant-level licenses but
+  does not replace per-variant rights metadata.
 
-- One document per **writer**.
-- `letters` maps each Hebrew letter (base or final form, `U+05D0`–`U+05EA`)
-  to one or more **variants** extracted from upstream scans by that writer.
-- Each variant carries an `asset_path`, a SHA-256 checksum, image metadata,
-  and **per-variant source rights**, so license evidence flows through from
-  upstream into any downstream composition.
+Read the full contract in [`docs/letter_set_v1.md`](docs/letter_set_v1.md).
 
-See [`docs/letter_set_v1.md`](docs/letter_set_v1.md) for the full
-explanation and field-by-field notes.
+## Validate Locally
+
+```bash
+python -m ruff check .
+python -m mypy
+python -m pytest
+hletterscriptgen validate examples/letter_set/writer_example.json
+```
 
 ## Licensing
 
-- Code in this repository: MIT (see [`LICENSE`](LICENSE)).
-- Generated letter sets: carry per-variant upstream rights — see
-  [`LICENSE-POLICY.md`](LICENSE-POLICY.md). The generator does not
-  relicense glyphs.
+- Code in this repository is MIT licensed. See [`LICENSE`](LICENSE).
+- Generated letter sets carry per-variant upstream rights. See
+  [`LICENSE-POLICY.md`](LICENSE-POLICY.md); the generator records rights
+  evidence but does not relicense glyphs.
 
 ## Contributing
 
-See [`CONTRIBUTING.md`](CONTRIBUTING.md). For agent collaborators, see
-[`AGENTS.md`](AGENTS.md).
+See [`CONTRIBUTING.md`](CONTRIBUTING.md). Agent collaborators should also
+read [`AGENTS.md`](AGENTS.md).
+
+## Credits
+
+Created by [Shay Palachy Affek](http://www.shaypalachy.com/) [[GitHub](https://github.com/shaypal5)]
@@ -33,7 +33,7 @@ please report it. Two paths, in preference order:
    privacy.
 
 For takedown of an upstream scan, report directly in
-[`HeOCR/public-domain-hand-written-hebrew-scans`](https://github.com/HeOCR/public-domain-hand-written-hebrew-scans);
+[`HeOCR/hash`](https://github.com/HeOCR/hash);
 once an upstream scan is removed or relicensed, regenerated letter
 sets must drop or update the affected variants.
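
A minimal sketch of the "drop" half of that rule, assuming a letter set is loaded as a plain dict in which `letters` maps each letter to a list of variant dicts and every variant stores its upstream id under `source.scan_entry_id`. The helper name `drop_removed_scans` is hypothetical and is not part of the `hletterscriptgen` package or CLI.

```python
from typing import Any


def drop_removed_scans(letter_set: dict[str, Any], removed_ids: set[str]) -> dict[str, Any]:
    """Return a copy of letter_set without variants cropped from removed scans.

    Assumption: letter_set["letters"] maps a letter to a list of variant dicts.
    """
    kept_letters: dict[str, list[dict[str, Any]]] = {}
    for letter, variants in letter_set["letters"].items():
        kept = [v for v in variants if v["source"]["scan_entry_id"] not in removed_ids]
        if kept:  # a letter with no surviving variants is dropped entirely
            kept_letters[letter] = kept
    # Shallow copy: the input dict is left untouched.
    return {**letter_set, "letters": kept_letters}
```

A real regeneration would presumably also refresh `writer_provenance.source_entry_ids` and `license_summary` and then re-run `hletterscriptgen validate`; this sketch covers only the variant filter.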
 
@@ -52,7 +52,7 @@ In scope:
 
 Out of scope here (report to the relevant upstream / downstream repo):
 
-- Rights records on upstream scans — `HeOCR/public-domain-hand-written-hebrew-scans`.
+- Rights records on upstream scans — `HeOCR/hash`.
 - Published letter-set datasets — `HeOCR/hletterscript`.
 - Composed synthetic pages — `HeOCR/hocrsyngen`.
 - Release-level governance — `HeOCR/hocrgen`.
@@ -1,14 +1,20 @@
 # Documentation index
 
+Created by [Shay Palachy Affek](http://www.shaypalachy.com/).
+
 - [Repository scope](repository_scope.md) — what this repo owns and what it does not, plus the canonical ecosystem diagram.
 - [Architecture](architecture.md) — code layout and the validation pipeline.
 - [`letter_set.v1` contract](letter_set_v1.md) — output schema, field-by-field.
 - [Upstream integration](upstream_integration.md) — how scans from
-  `public-domain-hand-written-hebrew-scans` feed in.
+  `hash` feed in.
 - [Downstream handoff](downstream_handoff.md) — how outputs land in
   `hletterscript` and onward.
 - [Roadmap](roadmap.md) — staged milestones beyond the scaffolding.
 
 ## Design drafts
 
-- [Letter extraction pipeline (draft)](design/letter_extraction.md) — sketch of the future `generate` pipeline; nothing implemented yet.
+- [Letter extraction pipeline](design/letter_extraction.md) — design notes for the `generate` pipeline and extraction workflow.
+
+## Credits
+
+Created by [Shay Palachy Affek](http://www.shaypalachy.com/) [[GitHub](https://github.com/shaypal5)]
@@ -10,7 +10,7 @@
 ## Goal
 
 Turn rights-clean handwritten Hebrew page scans (upstream:
-`HeOCR/public-domain-hand-written-hebrew-scans`) into per-writer
+`HeOCR/hash`) into per-writer
 `letter_set.v1` documents plus their referenced glyph image assets.
 
 ## Sketch

@@ -19,16 +19,17 @@ path from Option A to Option B/C.
 
 ## Evidence from the upstream corpus
 
-Investigation target: `HeOCR/public-domain-hand-written-hebrew-scans` (GitHub, inspected via
-`gh api` — local clone not present at time of spike).
+Investigation target: `HeOCR/hash`, the HASH (Hebrew Archive of Scanned Handwriting) repository on GitHub,
+inspected via `gh api` at spike time; the corpus has grown considerably since.
 
 | Finding | Detail |
 |---------|--------|
-| Total entries in `data/index/entries.jsonl` | 60 |
-| `transcription.status` distribution | `"none"`: 60 / 60 |
-| Non-null `alto_path` | 0 / 60 |
-| Non-null `hocr_path` | 0 / 60 |
-| Non-null `text_path` | 0 / 60 |
+| Total entries in `data/index/entries.jsonl` (spike) | 60 |
+| Total entries (as of 2026-05) | 373 (111 sources, 48 unique creators) |
+| `transcription.status` distribution | `"none"`: all entries at spike time |
+| Non-null `alto_path` | 0 at spike time |
+| Non-null `hocr_path` | 0 at spike time |
+| Non-null `text_path` | 0 at spike time |
 | Unique `files[].role` values across all entries | `"original"` only |
 | Scan directories inspected (`data/scans/`) | `commons__begani_netatikha` (representative sample) — contains only the JPEG scan, no sidecars |
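
The counts in this table can be re-checked against a current checkout with a short tally over the JSONL index. The sketch below is illustrative and is not the script used for the spike. It assumes each line of `entries.jsonl` is one JSON object with an optional `transcription.status` field and an optional `files` list whose items carry a `role`. The function name `tally_entries` is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def tally_entries(lines: Iterable[str]) -> tuple[int, Counter, Counter]:
    """Count entries, transcription statuses, and file roles in JSONL lines."""
    statuses = Counter()
    roles = Counter()
    total = 0
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the index
        entry = json.loads(line)
        total += 1
        # A missing or null "transcription" object is counted under None.
        statuses[(entry.get("transcription") or {}).get("status")] += 1
        for file_record in entry.get("files", []):
            roles[file_record.get("role")] += 1
    return total, statuses, roles
```

Because an open file is an iterable of lines, it can be passed directly, for example `with open("path/to/hash/data/index/entries.jsonl", encoding="utf-8") as f: print(tally_entries(f))`.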
 
@@ -38,9 +39,9 @@ The file-role enum (`original`, `normalized`, `thumbnail`, `transcription`, `met
 `transcription` role, but zero entries exercise it.
 
 Source files consulted:
-- `HeOCR/public-domain-hand-written-hebrew-scans/schemas/entry.schema.json`
-- `HeOCR/public-domain-hand-written-hebrew-scans/data/index/entries.jsonl`
-- `HeOCR/public-domain-hand-written-hebrew-scans/data/scans/commons__begani_netatikha/`
+- `HeOCR/hash/schemas/entry.schema.json`
+- `HeOCR/hash/data/index/entries.jsonl`
+- `HeOCR/hash/data/scans/commons__begani_netatikha/`
 
 ---
 

@@ -19,7 +19,7 @@ the local upstream checkout and declares one or more writer blocks.
 
 ```json
 {
-  "upstream_path": "../public-domain-hand-written-hebrew-scans",
+  "upstream_path": "../hash",
   "writers": [
     {
       "writer_id": "writer_bialik",

@@ -24,7 +24,7 @@ is exercised by CI and must remain valid.
   },
   "generated_at": "2026-05-12T00:00:00Z",
   "upstream": {
-    "repo": "HeOCR/public-domain-hand-written-hebrew-scans",
+    "repo": "HeOCR/hash",
     "revision": "<git commit sha>"
   },
   "letters": {
@@ -58,7 +58,7 @@ labels only. If in doubt, omit.
 
 **Required.** Records how the writer identity was established and which
 upstream scan entries are attributed to them. `source_repo` is normally
-`HeOCR/public-domain-hand-written-hebrew-scans`; `source_entry_ids` are
+`HeOCR/hash`; `source_entry_ids` are
 the upstream `entries.jsonl` ids. `attribution_method` is a short tag
 (e.g. `collection_metadata`, `manual_review`, `fixture`).
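
The variant table in this contract notes that a cross-field validator checks each `source.scan_entry_id` against `writer_provenance.source_entry_ids`. A rough sketch of that relationship follows. It is not the project's validator, the function name `unattributed_scan_ids` is hypothetical, and it assumes `letters` maps each letter to a list of variant dicts.

```python
from typing import Any


def unattributed_scan_ids(letter_set: dict[str, Any]) -> set[str]:
    """Return scan ids used by variants but missing from writer_provenance."""
    declared = set(letter_set["writer_provenance"]["source_entry_ids"])
    used = {
        variant["source"]["scan_entry_id"]
        for variants in letter_set["letters"].values()  # assumed: list per letter
        for variant in variants
    }
    return used - declared
```

An empty result means every variant traces back to a scan entry that the writer block declares.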
 
@@ -115,7 +115,7 @@ A mapping from a single Hebrew letter character (base or final form,
 | `asset_path` | POSIX path relative to the letter-set root. No leading `/` (schema-enforced); no `..` segment (cross-field-enforced). |
 | `checksum_sha256` | Lowercase SHA-256 hex digest of the asset bytes. Real letter sets must use real checksums; the example fixture's all-zero/all-one digests are intentional placeholders. |
 | `image.{width_px,height_px,format}` | Image metadata. `format` ∈ `png`, `webp`, `tiff`. |
-| `source.scan_entry_id` | Upstream entry id (resolves in `public-domain-hand-written-hebrew-scans`). Cross-field validator checks it appears in `writer_provenance.source_entry_ids`. |
+| `source.scan_entry_id` | Upstream entry id (resolves in `hash`). Cross-field validator checks it appears in `writer_provenance.source_entry_ids`. |
 | `source.scan_url` | Optional URL pointer to the source scan. RFC 3986 URI; checked when format-checking is enabled. |
 | `source.license` | One of the accepted SPDX / `LicenseRef-*` identifiers (see `$defs.license_id` in the schema). Extending the allow-list requires a schema change. |
 | `source.rights_evidence` | Optional free-form note or URL with rights evidence. |
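
The `asset_path` and `checksum_sha256` rows can be illustrated with the standard library alone. This is a sketch under assumptions rather than the repository's own implementation, and both helper names (`sha256_hex`, `is_safe_asset_path`) are hypothetical.

```python
import hashlib
from pathlib import PurePosixPath


def sha256_hex(asset_bytes: bytes) -> str:
    """Lowercase SHA-256 hex digest of the asset bytes."""
    return hashlib.sha256(asset_bytes).hexdigest()  # hexdigest() is lowercase


def is_safe_asset_path(asset_path: str) -> bool:
    """True for a relative POSIX path with no leading '/' and no '..' segment."""
    if not asset_path or asset_path.startswith("/"):
        return False
    # PurePosixPath keeps '..' segments as-is, so they can be detected here.
    return ".." not in PurePosixPath(asset_path).parts
```

For a file on disk, `sha256_hex(Path(p).read_bytes())` gives the value to compare against the recorded digest.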

@@ -7,7 +7,7 @@ intentionally narrow.
 ## Position in the HeOCR system (canonical)
 
 ```
-public-domain-hand-written-hebrew-scans   (full-page scans, PD / CC / CC-BY)
+hash   (full-page scans, PD / CC / CC-BY)
         │
         ▼
 hletterscriptgen   (code/framework — this repo)
@@ -39,7 +39,7 @@ copy it — only one diagram should ever rot.
 
 | Concern | Where it lives |
 | --- | --- |
-| Hosting page scans and rights records | `HeOCR/public-domain-hand-written-hebrew-scans` |
+| Hosting page scans and rights records | `HeOCR/hash` |
 | Hosting per-writer letter-glyph datasets | `HeOCR/hletterscript` |
 | Composing synthetic Hebrew handwritten pages | `HeOCR/hocrsyngen` |
 | Dataset orchestration, governance, release assembly, publication | `HeOCR/hocrgen` |

@@ -1,7 +1,7 @@
 # Upstream integration
 
 `hletterscriptgen` consumes scans from
-[`HeOCR/public-domain-hand-written-hebrew-scans`](https://github.com/HeOCR/public-domain-hand-written-hebrew-scans).
+[`HeOCR/hash`](https://github.com/HeOCR/hash) — HASH (Hebrew Archive of Scanned Handwriting).
 That upstream repo holds the authoritative rights records; this repo
 defers to them.