diff --git a/README.md b/README.md
index 6f76298..66ddedb 100644
--- a/README.md
+++ b/README.md
@@ -1,52 +1,73 @@
 # hletterscriptgen
 
-Generator framework for **per-writer Hebrew letter-glyph image sets**, built
-on rights-clean upstream scans of handwritten Hebrew documents.
-
-`hletterscriptgen` is part of the [HeOCR](https://github.com/HeOCR) project.
-It consumes scan-level records from
-[`HeOCR/hash`](https://github.com/HeOCR/hash) (HASH — Hebrew Archive of Scanned Handwriting)
-and produces letter-set datasets that land in
-[`HeOCR/hletterscript`](https://github.com/HeOCR/hletterscript). Downstream,
-[`HeOCR/hocrsyngen`](https://github.com/HeOCR/hocrsyngen) composes those
-glyphs into synthetic Hebrew handwritten pages, which
-[`HeOCR/hocrgen`](https://github.com/HeOCR/hocrgen) folds into
-[`HeOCR/HeOCR`](https://github.com/HeOCR/HeOCR) and
-[`HeOCR/HeOCRsynth`](https://github.com/HeOCR/HeOCRsynth).
-
-## Repository scope
-
-This repository contains the **code, schemas, and contracts** that produce
-letter sets. It does **not** host the letter-set image data; that lives in
-`HeOCR/hletterscript`.
+[![CI](https://github.com/HeOCR/hletterscriptgen/actions/workflows/ci.yml/badge.svg)](https://github.com/HeOCR/hletterscriptgen/actions/workflows/ci.yml)
+![License](https://img.shields.io/badge/license-MIT-green)
+![Python](https://img.shields.io/badge/python-3.11%2B-blue)
+![Schema](https://img.shields.io/badge/schema-letter__set.v1-blue)
 
-What lives here:
+Created by [Shay Palachy Affek](http://www.shaypalachy.com/).
+
+Generator framework for per-writer Hebrew handwritten letter-glyph image
+sets. It turns rights-clean HASH scan records plus human-reviewed glyph
+annotations into deterministic `letter_set.v1` outputs for
+[`HeOCR/hletterscript`](https://github.com/HeOCR/hletterscript).
 
-- A Python package and CLI (`hletterscriptgen`).
-- The `letter_set.v1` JSON Schema and a fixture example
-  (`examples/letter_set/writer_example.json`).
-- Validation tooling (`hletterscriptgen validate`).
-- CI that enforces schema and tooling invariants.
-- Licensing policy and rights-carryover rules
-  ([`LICENSE-POLICY.md`](LICENSE-POLICY.md)).
+![hletterscriptgen output contract diagram](docs/assets/hletterscriptgen-output-contract.png)
 
-What does **not** live here:
+## At a Glance
 
-- Actual extracted glyph images (→ `HeOCR/hletterscript`).
-- Page-scan ingestion or rights curation (→
-  `HeOCR/hash`).
-- Document composition (→ `HeOCR/hocrsyngen`).
-- Dataset orchestration, governance, release assembly, or publication (→
-  `HeOCR/hocrgen` / `HeOCR/HeOCR` / `HeOCR/HeOCRsynth`).
+| Field | Value |
+| --- | --- |
+| Role | Generate and validate per-writer Hebrew letter-set outputs |
+| Input | HASH scan metadata, scan image files, and generation profiles |
+| Output | `letter_set.v1` JSON plus cropped glyph PNG assets |
+| Public contract | [`docs/letter_set_v1.md`](docs/letter_set_v1.md) |
+| Example fixture | [`examples/letter_set/writer_example.json`](examples/letter_set/writer_example.json) |
+| Main CLI | `hletterscriptgen` |
+| Code license | MIT |
+| Generated glyph rights | Per-variant rights inherited from upstream scans |
 
-## Position in the HeOCR system
+## What This Repository Owns
 
-`hletterscriptgen` reads rights-clean scans from
-HASH (`HeOCR/hash`), produces per-writer letter
-sets that land in `hletterscript`, and ultimately feeds `hocrsyngen` /
-`hocrgen` / `HeOCR` / `HeOCRsynth`. See
-[`docs/repository_scope.md`](docs/repository_scope.md) for the full
-diagram and per-repo responsibilities.
+This repository contains the Python package, CLI, schema, and validation
+contracts for creating writer-level Hebrew letter sets. It does not host
+the published glyph dataset itself; generated and curated data belongs in
+[`HeOCR/hletterscript`](https://github.com/HeOCR/hletterscript).
+
+What lives here:
+
+- `hletterscriptgen`, the Python package and command-line interface.
+- The `letter_set.v1` JSON Schema and fixture example.
+- Generation-profile parsing, upstream eligibility checks, glyph
+  extraction helpers, checksum calculation, and output validation.
+- CI, release workflow, and rights-carryover policy.
+
+What lives elsewhere:
+
+- Page-scan ingestion and rights curation:
+  [`HeOCR/hash`](https://github.com/HeOCR/hash).
+- Published letter-glyph image sets:
+  [`HeOCR/hletterscript`](https://github.com/HeOCR/hletterscript).
+- Synthetic document composition:
+  [`HeOCR/hocrsyngen`](https://github.com/HeOCR/hocrsyngen).
+- Dataset orchestration and release assembly:
+  [`HeOCR/hocrgen`](https://github.com/HeOCR/hocrgen),
+  [`HeOCR/HeOCR`](https://github.com/HeOCR/HeOCR), and
+  [`HeOCR/HeOCRsynth`](https://github.com/HeOCR/HeOCRsynth).
+
+## Pipeline Position
+
+```mermaid
+flowchart LR
+    HASH["HeOCR/hash<br/>rights-clean scans"] --> PROFILE["generation profile<br/>writer + glyph bboxes"]
+    PROFILE --> GEN["hletterscriptgen<br/>crop, hash, dedupe, validate"]
+    GEN --> DATA["HeOCR/hletterscript<br/>letter_set.v1 + glyph PNGs"]
+    DATA --> SYN["HeOCR/hocrsyngen<br/>synthetic pages"]
+    SYN --> OCR["HeOCR / HeOCRsynth<br/>OCR and HTR datasets"]
+```
+
+See [`docs/repository_scope.md`](docs/repository_scope.md) for the full
+ecosystem boundary map and per-repository responsibilities.
 
 ## Install
 
@@ -54,7 +75,13 @@ diagram and per-repo responsibilities.
 python -m pip install -e ".[test]"
 ```
 
-Requires Python 3.11+.
+For development:
+
+```bash
+python -m pip install -e ".[dev]"
+```
+
+Requires Python 3.11 or newer.
 
 ## CLI
 
@@ -62,34 +89,53 @@ Requires Python 3.11+.
 hletterscriptgen version
 hletterscriptgen schema --format json
 hletterscriptgen validate examples/letter_set/writer_example.json
-hletterscriptgen validate examples/letter_set/writer_example.json --format json
+hletterscriptgen check-eligible path/to/hash/data/index/entries.jsonl
+hletterscriptgen scan-blobs path/to/scan.png --format json
+hletterscriptgen generate --profile generate_profile.json --output ./out
 ```
 
-The `generate` subcommand is reserved for the upcoming extraction pipeline
-and is not yet implemented; see [`docs/roadmap.md`](docs/roadmap.md).
+The `generate` command expects a human-curated generation profile that
+names writer IDs, upstream scan entries, and glyph bounding boxes. The
+output is one directory per writer, each containing `letter_set.json` and
+the surviving cropped glyph assets.
+
+## The `letter_set.v1` Contract
 
-## The `letter_set.v1` contract
+The bundled schema describes one writer's letter-glyph collection:
 
-The bundled JSON Schema describes a per-writer letter set:
+- `writer_id` identifies the writer set.
+- `writer_provenance` records how the writer attribution was established.
+- `upstream` pins the exact HASH revision used by the generation run.
+- `letters` maps Hebrew letters and final forms to one or more glyph
+  variants.
+- Each variant carries an asset path, checksum, image metadata, source
+  scan ID, bounding box, license, and rights evidence.
+- `license_summary` summarizes the distinct variant-level licenses but
+  does not replace per-variant rights metadata.
 
-- One document per **writer**.
-- `letters` maps each Hebrew letter (base or final form, `U+05D0`–`U+05EA`)
-  to one or more **variants** extracted from upstream scans by that writer.
-- Each variant carries an `asset_path`, a SHA-256 checksum, image metadata,
-  and **per-variant source rights**, so license evidence flows through from
-  upstream into any downstream composition.
+Read the full contract in [`docs/letter_set_v1.md`](docs/letter_set_v1.md).
 
-See [`docs/letter_set_v1.md`](docs/letter_set_v1.md) for the full
-explanation and field-by-field notes.
+## Validate Locally
+
+```bash
+python -m ruff check .
+python -m mypy
+python -m pytest
+hletterscriptgen validate examples/letter_set/writer_example.json
+```
 
 ## Licensing
 
-- Code in this repository: MIT (see [`LICENSE`](LICENSE)).
-- Generated letter sets: carry per-variant upstream rights — see
-  [`LICENSE-POLICY.md`](LICENSE-POLICY.md). The generator does not
-  relicense glyphs.
+- Code in this repository is MIT licensed. See [`LICENSE`](LICENSE).
+- Generated letter sets carry per-variant upstream rights. See
+  [`LICENSE-POLICY.md`](LICENSE-POLICY.md); the generator records rights
+  evidence but does not relicense glyphs.
 
 ## Contributing
 
-See [`CONTRIBUTING.md`](CONTRIBUTING.md). For agent collaborators, see
-[`AGENTS.md`](AGENTS.md).
+See [`CONTRIBUTING.md`](CONTRIBUTING.md). Agent collaborators should also
+read [`AGENTS.md`](AGENTS.md).
+
+## Credits
+
+Created by [Shay Palachy Affek](http://www.shaypalachy.com/) [[GitHub](https://github.com/shaypal5)]
diff --git a/docs/README.md b/docs/README.md
index 69e3ae8..fe663ce 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -1,5 +1,7 @@
 # Documentation index
 
+Created by [Shay Palachy Affek](http://www.shaypalachy.com/).
+
 - [Repository scope](repository_scope.md) — what this repo owns and what it does not, plus the canonical ecosystem diagram.
 - [Architecture](architecture.md) — code layout and the validation pipeline.
 - [`letter_set.v1` contract](letter_set_v1.md) — output schema, field-by-field.
@@ -11,4 +13,8 @@
 
 ## Design drafts
 
-- [Letter extraction pipeline (draft)](design/letter_extraction.md) — sketch of the future `generate` pipeline; nothing implemented yet.
+- [Letter extraction pipeline](design/letter_extraction.md) — design notes for the `generate` pipeline and extraction workflow.
+
+## Credits
+
+Created by [Shay Palachy Affek](http://www.shaypalachy.com/) [[GitHub](https://github.com/shaypal5)]
diff --git a/docs/assets/hletterscriptgen-output-contract.png b/docs/assets/hletterscriptgen-output-contract.png
new file mode 100644
index 0000000..4c52612
Binary files /dev/null and b/docs/assets/hletterscriptgen-output-contract.png differ