168 changes: 107 additions & 61 deletions README.md
@@ -1,95 +1,141 @@
# hletterscriptgen

Generator framework for **per-writer Hebrew letter-glyph image sets**, built
on rights-clean upstream scans of handwritten Hebrew documents.

`hletterscriptgen` is part of the [HeOCR](https://github.com/HeOCR) project.
It consumes scan-level records from
[`HeOCR/hash`](https://github.com/HeOCR/hash) (HASH — Hebrew Archive of Scanned Handwriting)
and produces letter-set datasets that land in
[`HeOCR/hletterscript`](https://github.com/HeOCR/hletterscript). Downstream,
[`HeOCR/hocrsyngen`](https://github.com/HeOCR/hocrsyngen) composes those
glyphs into synthetic Hebrew handwritten pages, which
[`HeOCR/hocrgen`](https://github.com/HeOCR/hocrgen) folds into
[`HeOCR/HeOCR`](https://github.com/HeOCR/HeOCR) and
[`HeOCR/HeOCRsynth`](https://github.com/HeOCR/HeOCRsynth).

## Repository scope

This repository contains the **code, schemas, and contracts** that produce
letter sets. It does **not** host the letter-set image data; that lives in
`HeOCR/hletterscript`.
[![CI](https://github.com/HeOCR/hletterscriptgen/actions/workflows/ci.yml/badge.svg)](https://github.com/HeOCR/hletterscriptgen/actions/workflows/ci.yml)
![License](https://img.shields.io/badge/license-MIT-green)
![Python](https://img.shields.io/badge/python-3.11%2B-blue)
![Schema](https://img.shields.io/badge/schema-letter__set.v1-blue)

What lives here:
Created by [Shay Palachy Affek](http://www.shaypalachy.com/).

Generator framework for per-writer Hebrew handwritten letter-glyph image
sets. It turns rights-clean HASH scan records plus human-reviewed glyph
annotations into deterministic `letter_set.v1` outputs for
[`HeOCR/hletterscript`](https://github.com/HeOCR/hletterscript).

- A Python package and CLI (`hletterscriptgen`).
- The `letter_set.v1` JSON Schema and a fixture example
(`examples/letter_set/writer_example.json`).
- Validation tooling (`hletterscriptgen validate`).
- CI that enforces schema and tooling invariants.
- Licensing policy and rights-carryover rules
([`LICENSE-POLICY.md`](LICENSE-POLICY.md)).
![hletterscriptgen output contract diagram](docs/assets/hletterscriptgen-output-contract.png)

What does **not** live here:
## At a Glance

- Actual extracted glyph images (→ `HeOCR/hletterscript`).
- Page-scan ingestion or rights curation (→
`HeOCR/hash`).
- Document composition (→ `HeOCR/hocrsyngen`).
- Dataset orchestration, governance, release assembly, or publication (→
`HeOCR/hocrgen` / `HeOCR/HeOCR` / `HeOCR/HeOCRsynth`).
| Field | Value |
| --- | --- |
| Role | Generate and validate per-writer Hebrew letter-set outputs |
| Input | HASH scan metadata, scan image files, and generation profiles |
| Output | `letter_set.v1` JSON plus cropped glyph PNG assets |
| Public contract | [`docs/letter_set_v1.md`](docs/letter_set_v1.md) |
| Example fixture | [`examples/letter_set/writer_example.json`](examples/letter_set/writer_example.json) |
| Main CLI | `hletterscriptgen` |
| Code license | MIT |
| Generated glyph rights | Per-variant rights inherited from upstream scans |

## Position in the HeOCR system
## What This Repository Owns

`hletterscriptgen` reads rights-clean scans from
HASH (`HeOCR/hash`), produces per-writer letter
sets that land in `hletterscript`, and ultimately feeds `hocrsyngen` /
`hocrgen` / `HeOCR` / `HeOCRsynth`. See
[`docs/repository_scope.md`](docs/repository_scope.md) for the full
diagram and per-repo responsibilities.
This repository contains the Python package, CLI, schema, and validation
contracts for creating writer-level Hebrew letter sets. It does not host
the published glyph dataset itself; generated and curated data belongs in
[`HeOCR/hletterscript`](https://github.com/HeOCR/hletterscript).

What lives here:

- `hletterscriptgen`, the Python package and command-line interface.
- The `letter_set.v1` JSON Schema and fixture example.
- Generation-profile parsing, upstream eligibility checks, glyph
extraction helpers, checksum calculation, and output validation.
- CI, release workflow, and rights-carryover policy.

What lives elsewhere:

- Page-scan ingestion and rights curation:
[`HeOCR/hash`](https://github.com/HeOCR/hash).
- Published letter-glyph image sets:
[`HeOCR/hletterscript`](https://github.com/HeOCR/hletterscript).
- Synthetic document composition:
[`HeOCR/hocrsyngen`](https://github.com/HeOCR/hocrsyngen).
- Dataset orchestration and release assembly:
[`HeOCR/hocrgen`](https://github.com/HeOCR/hocrgen),
[`HeOCR/HeOCR`](https://github.com/HeOCR/HeOCR), and
[`HeOCR/HeOCRsynth`](https://github.com/HeOCR/HeOCRsynth).

## Pipeline Position

```mermaid
flowchart LR
HASH["HeOCR/hash<br/>rights-clean scans"] --> PROFILE["generation profile<br/>writer + glyph bboxes"]
PROFILE --> GEN["hletterscriptgen<br/>crop, hash, dedupe, validate"]
GEN --> DATA["HeOCR/hletterscript<br/>letter_set.v1 + glyph PNGs"]
DATA --> SYN["HeOCR/hocrsyngen<br/>synthetic pages"]
SYN --> OCR["HeOCR / HeOCRsynth<br/>OCR and HTR datasets"]
```

See [`docs/repository_scope.md`](docs/repository_scope.md) for the full
ecosystem boundary map and per-repository responsibilities.

## Install

```bash
python -m pip install -e ".[test]"
```

Requires Python 3.11+.
For development:

```bash
python -m pip install -e ".[dev]"
```

Requires Python 3.11 or newer.

## CLI

```bash
hletterscriptgen version
hletterscriptgen schema --format json
hletterscriptgen validate examples/letter_set/writer_example.json
hletterscriptgen validate examples/letter_set/writer_example.json --format json
hletterscriptgen check-eligible path/to/hash/data/index/entries.jsonl
hletterscriptgen scan-blobs path/to/scan.png --format json
hletterscriptgen generate --profile generate_profile.json --output ./out
```

The `generate` subcommand is reserved for the upcoming extraction pipeline
and is not yet implemented; see [`docs/roadmap.md`](docs/roadmap.md).
The `generate` command expects a human-curated generation profile that
names writer IDs, upstream scan entries, and glyph bounding boxes. The
output is one directory per writer, each containing `letter_set.json` and
the surviving cropped glyph assets.
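
As a rough illustration of that output layout, the following sketch walks an output directory, treats every subdirectory that contains a `letter_set.json` as one writer, and computes a SHA-256 digest for each PNG found under it. It is not part of the `hletterscriptgen` package: the helper names `sha256_hex` and `summarize_output` are hypothetical, and the assumption that glyph assets are `.png` files somewhere below the writer directory is inferred from the description above rather than taken from the contract.

```python
import hashlib
import json
from pathlib import Path


def sha256_hex(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def summarize_output(out_dir: Path) -> dict[str, dict[str, str]]:
    """Map each writer directory name to {relative PNG path: SHA-256 digest}.

    Hypothetical helper: assumes one subdirectory per writer, each holding
    a letter_set.json plus PNG assets (layout assumed, not confirmed).
    """
    summary: dict[str, dict[str, str]] = {}
    for writer_dir in sorted(p for p in out_dir.iterdir() if p.is_dir()):
        manifest = writer_dir / "letter_set.json"
        if not manifest.is_file():
            continue  # not a writer directory
        # Parse the manifest only to confirm it is well-formed JSON.
        json.loads(manifest.read_text(encoding="utf-8"))
        summary[writer_dir.name] = {
            png.relative_to(writer_dir).as_posix(): sha256_hex(png)
            for png in sorted(writer_dir.rglob("*.png"))
        }
    return summary
```

Calling `summarize_output(Path("./out"))` after a `generate` run would give a quick inventory to compare against the checksums recorded in each `letter_set.json`; the authoritative check remains `hletterscriptgen validate`.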

## The `letter_set.v1` Contract

## The `letter_set.v1` contract
The bundled schema describes one writer's letter-glyph collection:

The bundled JSON Schema describes a per-writer letter set:
- `writer_id` identifies the writer set.
- `writer_provenance` records how the writer attribution was established.
- `upstream` pins the exact HASH revision used by the generation run.
- `letters` maps Hebrew letters and final forms to one or more glyph
variants.
- Each variant carries an asset path, checksum, image metadata, source
scan ID, bounding box, license, and rights evidence.
- `license_summary` summarizes the distinct variant-level licenses but
does not replace per-variant rights metadata.

- One document per **writer**.
- `letters` maps each Hebrew letter (base or final form, `U+05D0`–`U+05EA`)
to one or more **variants** extracted from upstream scans by that writer.
- Each variant carries an `asset_path`, a SHA-256 checksum, image metadata,
and **per-variant source rights**, so license evidence flows through from
upstream into any downstream composition.
Read the full contract in [`docs/letter_set_v1.md`](docs/letter_set_v1.md).

See [`docs/letter_set_v1.md`](docs/letter_set_v1.md) for the full
explanation and field-by-field notes.
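
To make the shape of the contract more concrete, here is a minimal sketch of a structural pre-check on a parsed letter-set document. It is illustrative only and does not replace the bundled JSON Schema or `hletterscriptgen validate`. The top-level field names come from the list above; treating all of them as required, treating `letters` keys as single Hebrew characters in `U+05D0`–`U+05EA`, and treating each value as a non-empty list of variants are assumptions, and `check_letter_set_shape` is a hypothetical name.

```python
# Top-level fields named in the README; assumed (not confirmed) to be required.
REQUIRED_TOP_LEVEL = (
    "writer_id",
    "writer_provenance",
    "upstream",
    "letters",
    "license_summary",
)


def is_hebrew_letter(key: str) -> bool:
    """True for a single character in U+05D0..U+05EA (alef through tav)."""
    return len(key) == 1 and "\u05d0" <= key <= "\u05ea"


def check_letter_set_shape(doc: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems: list[str] = []
    for name in REQUIRED_TOP_LEVEL:
        if name not in doc:
            problems.append(f"missing top-level field: {name}")
    for key, variants in doc.get("letters", {}).items():
        if not is_hebrew_letter(key):
            problems.append(f"not a Hebrew letter key: {key!r}")
        # Assumption: each letter maps to a non-empty list of variants.
        if not isinstance(variants, list) or not variants:
            problems.append(f"letter {key!r} has no variants")
    return problems
```

A check like this can be useful as a fast sanity pass in a notebook or script; anything it accepts should still go through the schema-backed validator.
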
## Validate Locally

```bash
python -m ruff check .
python -m mypy
python -m pytest
hletterscriptgen validate examples/letter_set/writer_example.json
```

## Licensing

- Code in this repository: MIT (see [`LICENSE`](LICENSE)).
- Generated letter sets: carry per-variant upstream rights — see
[`LICENSE-POLICY.md`](LICENSE-POLICY.md). The generator does not
relicense glyphs.
- Code in this repository is MIT licensed. See [`LICENSE`](LICENSE).
- Generated letter sets carry per-variant upstream rights. See
[`LICENSE-POLICY.md`](LICENSE-POLICY.md); the generator records rights
evidence but does not relicense glyphs.

## Contributing

See [`CONTRIBUTING.md`](CONTRIBUTING.md). For agent collaborators, see
[`AGENTS.md`](AGENTS.md).
See [`CONTRIBUTING.md`](CONTRIBUTING.md). Agent collaborators should also
read [`AGENTS.md`](AGENTS.md).

## Credits

Created by [Shay Palachy Affek](http://www.shaypalachy.com/) [[GitHub](https://github.com/shaypal5)]
8 changes: 7 additions & 1 deletion docs/README.md
@@ -1,5 +1,7 @@
# Documentation index

Created by [Shay Palachy Affek](http://www.shaypalachy.com/).

- [Repository scope](repository_scope.md) — what this repo owns and what it does not, plus the canonical ecosystem diagram.
- [Architecture](architecture.md) — code layout and the validation pipeline.
- [`letter_set.v1` contract](letter_set_v1.md) — output schema, field-by-field.
@@ -11,4 +13,8 @@

## Design drafts

- [Letter extraction pipeline (draft)](design/letter_extraction.md) — sketch of the future `generate` pipeline; nothing implemented yet.
- [Letter extraction pipeline](design/letter_extraction.md) — design notes for the `generate` pipeline and extraction workflow.

## Credits

Created by [Shay Palachy Affek](http://www.shaypalachy.com/) [[GitHub](https://github.com/shaypal5)]