4 changes: 2 additions & 2 deletions AGENTS.md
@@ -26,7 +26,7 @@ hletterscriptgen validate examples/letter_set/writer_example.json --format json
scans into per-writer letter-glyph image sets.
- `hletterscript` (separate repo) owns the **published letter-set datasets**.
Do not commit generated glyph images to this repo.
- `public-domain-hand-written-hebrew-scans` (separate repo) owns
- `hash` (separate repo) owns
**upstream scans** and their rights records.
- `hocrsyngen`, `hocrgen`, `HeOCR`, `HeOCRsynth` are downstream consumers.
Do not import them from `hletterscriptgen` and do not build their
@@ -44,7 +44,7 @@ hletterscriptgen validate examples/letter_set/writer_example.json --format json
## Rights-carryover rules

- Every variant must carry a `source.scan_entry_id` that resolves against
the upstream `public-domain-hand-written-hebrew-scans` index, plus a
the upstream `hash` index, plus a
`source.license` matching the upstream record. The generator never
invents, broadens, or relicenses upstream rights.
- `license_summary.licenses` must include every distinct license that
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -7,7 +7,7 @@ Thanks for considering a contribution to `hletterscriptgen`.
This repo holds the **code** that produces per-writer Hebrew letter-glyph
image sets. It does **not** host the letter-set images themselves (those
live in `HeOCR/hletterscript`), and it does **not** ingest upstream scans
(those live in `HeOCR/public-domain-hand-written-hebrew-scans`). Please
(those live in `HeOCR/hash`). Please
keep PRs aligned with that boundary; cross-repo concerns belong upstream
or downstream.

2 changes: 1 addition & 1 deletion LICENSE-POLICY.md
@@ -16,7 +16,7 @@ rules apply to each layer.
## 2. Generated letter-set datasets

The generator processes scans from
[`HeOCR/public-domain-hand-written-hebrew-scans`](https://github.com/HeOCR/public-domain-hand-written-hebrew-scans).
[`HeOCR/hash`](https://github.com/HeOCR/hash).
That upstream repository uses a compound licensing model with rights recorded
**per scan**. `hletterscriptgen` follows the same posture:

168 changes: 107 additions & 61 deletions README.md
@@ -1,95 +1,141 @@
# hletterscriptgen

Generator framework for **per-writer Hebrew letter-glyph image sets**, built
on rights-clean upstream scans of handwritten Hebrew documents.

`hletterscriptgen` is part of the [HeOCR](https://github.com/HeOCR) project.
It consumes scan-level records from
[`HeOCR/public-domain-hand-written-hebrew-scans`](https://github.com/HeOCR/public-domain-hand-written-hebrew-scans)
and produces letter-set datasets that land in
[`HeOCR/hletterscript`](https://github.com/HeOCR/hletterscript). Downstream,
[`HeOCR/hocrsyngen`](https://github.com/HeOCR/hocrsyngen) composes those
glyphs into synthetic Hebrew handwritten pages, which
[`HeOCR/hocrgen`](https://github.com/HeOCR/hocrgen) folds into
[`HeOCR/HeOCR`](https://github.com/HeOCR/HeOCR) and
[`HeOCR/HeOCRsynth`](https://github.com/HeOCR/HeOCRsynth).

## Repository scope

This repository contains the **code, schemas, and contracts** that produce
letter sets. It does **not** host the letter-set image data; that lives in
`HeOCR/hletterscript`.
[![CI](https://github.com/HeOCR/hletterscriptgen/actions/workflows/ci.yml/badge.svg)](https://github.com/HeOCR/hletterscriptgen/actions/workflows/ci.yml)
![License](https://img.shields.io/badge/license-MIT-green)
![Python](https://img.shields.io/badge/python-3.11%2B-blue)
![Schema](https://img.shields.io/badge/schema-letter__set.v1-blue)

What lives here:
Created by [Shay Palachy Affek](http://www.shaypalachy.com/).

Generator framework for per-writer Hebrew handwritten letter-glyph image
sets. It turns rights-clean HASH scan records plus human-reviewed glyph
annotations into deterministic `letter_set.v1` outputs for
[`HeOCR/hletterscript`](https://github.com/HeOCR/hletterscript).

- A Python package and CLI (`hletterscriptgen`).
- The `letter_set.v1` JSON Schema and a fixture example
(`examples/letter_set/writer_example.json`).
- Validation tooling (`hletterscriptgen validate`).
- CI that enforces schema and tooling invariants.
- Licensing policy and rights-carryover rules
([`LICENSE-POLICY.md`](LICENSE-POLICY.md)).
![hletterscriptgen output contract diagram](docs/assets/hletterscriptgen-output-contract.png)

What does **not** live here:
## At a Glance

- Actual extracted glyph images (→ `HeOCR/hletterscript`).
- Page-scan ingestion or rights curation (→
`HeOCR/public-domain-hand-written-hebrew-scans`).
- Document composition (→ `HeOCR/hocrsyngen`).
- Dataset orchestration, governance, release assembly, or publication (→
`HeOCR/hocrgen` / `HeOCR/HeOCR` / `HeOCR/HeOCRsynth`).
| Field | Value |
| --- | --- |
| Role | Generate and validate per-writer Hebrew letter-set outputs |
| Input | HASH scan metadata, scan image files, and generation profiles |
| Output | `letter_set.v1` JSON plus cropped glyph PNG assets |
| Public contract | [`docs/letter_set_v1.md`](docs/letter_set_v1.md) |
| Example fixture | [`examples/letter_set/writer_example.json`](examples/letter_set/writer_example.json) |
| Main CLI | `hletterscriptgen` |
| Code license | MIT |
| Generated glyph rights | Per-variant rights inherited from upstream scans |

## Position in the HeOCR system
## What This Repository Owns

`hletterscriptgen` reads rights-clean scans from
`public-domain-hand-written-hebrew-scans`, produces per-writer letter
sets that land in `hletterscript`, and ultimately feeds `hocrsyngen` /
`hocrgen` / `HeOCR` / `HeOCRsynth`. See
[`docs/repository_scope.md`](docs/repository_scope.md) for the full
diagram and per-repo responsibilities.
This repository contains the Python package, CLI, schema, and validation
contracts for creating writer-level Hebrew letter sets. It does not host
the published glyph dataset itself; generated and curated data belongs in
[`HeOCR/hletterscript`](https://github.com/HeOCR/hletterscript).

What lives here:

- `hletterscriptgen`, the Python package and command-line interface.
- The `letter_set.v1` JSON Schema and fixture example.
- Generation-profile parsing, upstream eligibility checks, glyph
extraction helpers, checksum calculation, and output validation.
- CI, release workflow, and rights-carryover policy.

What lives elsewhere:

- Page-scan ingestion and rights curation:
[`HeOCR/hash`](https://github.com/HeOCR/hash).
- Published letter-glyph image sets:
[`HeOCR/hletterscript`](https://github.com/HeOCR/hletterscript).
- Synthetic document composition:
[`HeOCR/hocrsyngen`](https://github.com/HeOCR/hocrsyngen).
- Dataset orchestration and release assembly:
[`HeOCR/hocrgen`](https://github.com/HeOCR/hocrgen),
[`HeOCR/HeOCR`](https://github.com/HeOCR/HeOCR), and
[`HeOCR/HeOCRsynth`](https://github.com/HeOCR/HeOCRsynth).

## Pipeline Position

```mermaid
flowchart LR
HASH["HeOCR/hash<br/>rights-clean scans"] --> PROFILE["generation profile<br/>writer + glyph bboxes"]
PROFILE --> GEN["hletterscriptgen<br/>crop, hash, dedupe, validate"]
GEN --> DATA["HeOCR/hletterscript<br/>letter_set.v1 + glyph PNGs"]
DATA --> SYN["HeOCR/hocrsyngen<br/>synthetic pages"]
SYN --> OCR["HeOCR / HeOCRsynth<br/>OCR and HTR datasets"]
```

See [`docs/repository_scope.md`](docs/repository_scope.md) for the full
ecosystem boundary map and per-repository responsibilities.
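
The `crop, hash, dedupe` step in the diagram can be illustrated with a minimal sketch. It assumes that deduplication means dropping crops whose bytes hash to an already-seen SHA-256 digest; the names `sha256_hex` and `dedupe_by_digest` are hypothetical and are not taken from the `hletterscriptgen` package.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the lowercase SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def dedupe_by_digest(crops: list[bytes]) -> list[tuple[str, bytes]]:
    """Keep the first crop for each distinct digest, preserving input order.

    Assumption: two crops count as duplicates only when byte-identical.
    """
    seen: set[str] = set()
    survivors: list[tuple[str, bytes]] = []
    for crop in crops:
        digest = sha256_hex(crop)
        if digest in seen:
            continue  # byte-identical to an earlier crop
        seen.add(digest)
        survivors.append((digest, crop))
    return survivors
```

Keeping the digest next to each surviving crop means the same value could be reused as the variant checksum, although the real pipeline may organize this differently.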

## Install

```bash
python -m pip install -e ".[test]"
```

Requires Python 3.11+.
For development:

```bash
python -m pip install -e ".[dev]"
```

Requires Python 3.11 or newer.

## CLI

```bash
hletterscriptgen version
hletterscriptgen schema --format json
hletterscriptgen validate examples/letter_set/writer_example.json
hletterscriptgen validate examples/letter_set/writer_example.json --format json
hletterscriptgen check-eligible path/to/hash/data/index/entries.jsonl
hletterscriptgen scan-blobs path/to/scan.png --format json
hletterscriptgen generate --profile generate_profile.json --output ./out
```

The `generate` subcommand is reserved for the upcoming extraction pipeline
and is not yet implemented; see [`docs/roadmap.md`](docs/roadmap.md).
The `generate` command expects a human-curated generation profile that
names writer IDs, upstream scan entries, and glyph bounding boxes. The
output is one directory per writer, each containing `letter_set.json` and
the surviving cropped glyph assets.
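
As a rough illustration of that output layout, the sketch below collects every `letter_set.json` under an output directory. It assumes each writer directory sits directly under the output root; `find_letter_sets` is a hypothetical helper and not part of the CLI.

```python
import json
from pathlib import Path


def find_letter_sets(output_root: Path) -> dict[str, dict]:
    """Map each writer directory name to its parsed letter_set.json.

    Assumption: the layout is <output_root>/<writer>/letter_set.json.
    """
    found: dict[str, dict] = {}
    for path in sorted(output_root.glob("*/letter_set.json")):
        with path.open(encoding="utf-8") as handle:
            found[path.parent.name] = json.load(handle)
    return found
```

Calling `find_letter_sets(Path("./out"))` after a run would give one parsed document per writer directory, which can then be passed to further checks.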

## The `letter_set.v1` Contract

## The `letter_set.v1` contract
The bundled schema describes one writer's letter-glyph collection:

The bundled JSON Schema describes a per-writer letter set:
- `writer_id` identifies the writer set.
- `writer_provenance` records how the writer attribution was established.
- `upstream` pins the exact HASH revision used by the generation run.
- `letters` maps Hebrew letters and final forms to one or more glyph
variants.
- Each variant carries an asset path, checksum, image metadata, source
scan ID, bounding box, license, and rights evidence.
- `license_summary` summarizes the distinct variant-level licenses but
does not replace per-variant rights metadata.

- One document per **writer**.
- `letters` maps each Hebrew letter (base or final form, `U+05D0`–`U+05EA`)
to one or more **variants** extracted from upstream scans by that writer.
- Each variant carries an `asset_path`, a SHA-256 checksum, image metadata,
and **per-variant source rights**, so license evidence flows through from
upstream into any downstream composition.
Read the full contract in [`docs/letter_set_v1.md`](docs/letter_set_v1.md).

See [`docs/letter_set_v1.md`](docs/letter_set_v1.md) for the full
explanation and field-by-field notes.
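
The relationship between per-variant licenses and `license_summary` can be sketched as follows. This assumes that `letters` maps each letter to a list of variant objects and that the summary stores its values under `licenses`; both shapes are assumptions made for illustration, and `docs/letter_set_v1.md` remains the authority.

```python
def distinct_licenses(letter_set: dict) -> list[str]:
    """Collect the distinct `source.license` values across all variants."""
    licenses: set[str] = set()
    # Assumption: each letter maps to a list of variant dicts.
    for variants in letter_set.get("letters", {}).values():
        for variant in variants:
            licenses.add(variant["source"]["license"])
    return sorted(licenses)


def summary_covers_variants(letter_set: dict) -> bool:
    """True when license_summary.licenses lists every variant-level license."""
    declared = set(letter_set.get("license_summary", {}).get("licenses", []))
    return set(distinct_licenses(letter_set)) <= declared
```

The summary is treated here as derived data: the per-variant `source.license` values stay the source of truth, and the summary is only checked against them.
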
## Validate Locally

```bash
python -m ruff check .
python -m mypy
python -m pytest
hletterscriptgen validate examples/letter_set/writer_example.json
```

## Licensing

- Code in this repository: MIT (see [`LICENSE`](LICENSE)).
- Generated letter sets: carry per-variant upstream rights — see
[`LICENSE-POLICY.md`](LICENSE-POLICY.md). The generator does not
relicense glyphs.
- Code in this repository is MIT licensed. See [`LICENSE`](LICENSE).
- Generated letter sets carry per-variant upstream rights. See
[`LICENSE-POLICY.md`](LICENSE-POLICY.md); the generator records rights
evidence but does not relicense glyphs.

## Contributing

See [`CONTRIBUTING.md`](CONTRIBUTING.md). For agent collaborators, see
[`AGENTS.md`](AGENTS.md).
See [`CONTRIBUTING.md`](CONTRIBUTING.md). Agent collaborators should also
read [`AGENTS.md`](AGENTS.md).

## Credits

Created by [Shay Palachy Affek](http://www.shaypalachy.com/) [[GitHub](https://github.com/shaypal5)]
4 changes: 2 additions & 2 deletions SECURITY.md
@@ -33,7 +33,7 @@ please report it. Two paths, in preference order:
privacy.

For takedown of an upstream scan, report directly in
[`HeOCR/public-domain-hand-written-hebrew-scans`](https://github.com/HeOCR/public-domain-hand-written-hebrew-scans);
[`HeOCR/hash`](https://github.com/HeOCR/hash);
once an upstream scan is removed or relicensed, regenerated letter
sets must drop or update the affected variants.
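
The drop rule above could look like the sketch below during regeneration. It assumes that `letters` maps each letter to a list of variants carrying `source.scan_entry_id`; `drop_removed_scans` is a hypothetical helper, not an existing `hletterscriptgen` function.

```python
import copy


def drop_removed_scans(letter_set: dict, removed_entry_ids: set[str]) -> dict:
    """Return a copy of the letter set without variants from removed scans."""
    pruned = copy.deepcopy(letter_set)  # leave the caller's document untouched
    letters = pruned.get("letters", {})
    for letter in list(letters):
        kept = [
            variant
            for variant in letters[letter]
            if variant["source"]["scan_entry_id"] not in removed_entry_ids
        ]
        if kept:
            letters[letter] = kept
        else:
            del letters[letter]  # no surviving variants for this letter
    return pruned
```

A real regeneration would likely also need to refresh `writer_provenance.source_entry_ids` and `license_summary` so that they stay consistent with the surviving variants.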

@@ -52,7 +52,7 @@ In scope:

Out of scope here (report to the relevant upstream / downstream repo):

- Rights records on upstream scans — `HeOCR/public-domain-hand-written-hebrew-scans`.
- Rights records on upstream scans — `HeOCR/hash`.
- Published letter-set datasets — `HeOCR/hletterscript`.
- Composed synthetic pages — `HeOCR/hocrsyngen`.
- Release-level governance — `HeOCR/hocrgen`.
10 changes: 8 additions & 2 deletions docs/README.md
@@ -1,14 +1,20 @@
# Documentation index

Created by [Shay Palachy Affek](http://www.shaypalachy.com/).

- [Repository scope](repository_scope.md) — what this repo owns and what it does not, plus the canonical ecosystem diagram.
- [Architecture](architecture.md) — code layout and the validation pipeline.
- [`letter_set.v1` contract](letter_set_v1.md) — output schema, field-by-field.
- [Upstream integration](upstream_integration.md) — how scans from
`public-domain-hand-written-hebrew-scans` feed in.
`hash` feed in.
- [Downstream handoff](downstream_handoff.md) — how outputs land in
`hletterscript` and onward.
- [Roadmap](roadmap.md) — staged milestones beyond the scaffolding.

## Design drafts

- [Letter extraction pipeline (draft)](design/letter_extraction.md) — sketch of the future `generate` pipeline; nothing implemented yet.
- [Letter extraction pipeline](design/letter_extraction.md) — design notes for the `generate` pipeline and extraction workflow.

## Credits

Created by [Shay Palachy Affek](http://www.shaypalachy.com/) [[GitHub](https://github.com/shaypal5)]
2 changes: 1 addition & 1 deletion docs/design/letter_extraction.md
@@ -10,7 +10,7 @@
## Goal

Turn rights-clean handwritten Hebrew page scans (upstream:
`HeOCR/public-domain-hand-written-hebrew-scans`) into per-writer
`HeOCR/hash`) into per-writer
`letter_set.v1` documents plus their referenced glyph image assets.

## Sketch
21 changes: 11 additions & 10 deletions docs/design/segmentation-approach.md
@@ -19,16 +19,17 @@ path from Option A to Option B/C.

## Evidence from the upstream corpus

Investigation target: `HeOCR/public-domain-hand-written-hebrew-scans` (GitHub, inspected via
`gh api` — local clone not present at time of spike).
Investigation target: `HeOCR/hash` — HASH (Hebrew Archive of Scanned Handwriting) (GitHub,
inspected via `gh api` at spike time; corpus has grown considerably since).

| Finding | Detail |
|---------|--------|
| Total entries in `data/index/entries.jsonl` | 60 |
| `transcription.status` distribution | `"none"`: 60 / 60 |
| Non-null `alto_path` | 0 / 60 |
| Non-null `hocr_path` | 0 / 60 |
| Non-null `text_path` | 0 / 60 |
| Total entries in `data/index/entries.jsonl` (spike) | 60 |
| Total entries (as of 2026-05) | 373 (111 sources, 48 unique creators) |
| `transcription.status` distribution | `"none"`: all entries at spike time |
| Non-null `alto_path` | 0 at spike time |
| Non-null `hocr_path` | 0 at spike time |
| Non-null `text_path` | 0 at spike time |
| Unique `files[].role` values across all entries | `"original"` only |
| Scan directories inspected (`data/scans/`) | `commons__begani_netatikha` (representative sample) — contains only the JPEG scan, no sidecars |
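
Counts like those in the table can be reproduced with a short script over `entries.jsonl`. This is a sketch, not the spike's actual tooling: it assumes `alto_path`, `hocr_path`, and `text_path` sit under each entry's `transcription` object, and `summarize_entries` is a hypothetical name.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_entries(lines: Iterable[str]) -> dict:
    """Tally transcription status, non-null sidecar paths, and file roles."""
    status: Counter = Counter()
    sidecars: Counter = Counter()
    roles: set[str] = set()
    total = 0
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the JSONL file
        entry = json.loads(line)
        total += 1
        # Assumption: the sidecar paths live under `transcription`.
        transcription = entry.get("transcription") or {}
        status[transcription.get("status")] += 1
        for key in ("alto_path", "hocr_path", "text_path"):
            if transcription.get(key) is not None:
                sidecars[key] += 1
        roles.update(f["role"] for f in entry.get("files", []) if "role" in f)
    return {
        "total": total,
        "status": dict(status),
        "sidecars": dict(sidecars),
        "roles": sorted(roles),
    }
```

Passing an open file handle for `data/index/entries.jsonl` works directly, since the function only iterates over lines.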

@@ -38,9 +39,9 @@ The file-role enum (`original`, `normalized`, `thumbnail`, `transcription`, `met
`transcription` role, but zero entries exercise it.

Source files consulted:
- `HeOCR/public-domain-hand-written-hebrew-scans/schemas/entry.schema.json`
- `HeOCR/public-domain-hand-written-hebrew-scans/data/index/entries.jsonl`
- `HeOCR/public-domain-hand-written-hebrew-scans/data/scans/commons__begani_netatikha/`
- `HeOCR/hash/schemas/entry.schema.json`
- `HeOCR/hash/data/index/entries.jsonl`
- `HeOCR/hash/data/scans/commons__begani_netatikha/`

---

2 changes: 1 addition & 1 deletion docs/design/writer_attribution.md
@@ -19,7 +19,7 @@ the local upstream checkout and declares one or more writer blocks.

```json
{
"upstream_path": "../public-domain-hand-written-hebrew-scans",
"upstream_path": "../hash",
"writers": [
{
"writer_id": "writer_bialik",
6 changes: 3 additions & 3 deletions docs/letter_set_v1.md
@@ -24,7 +24,7 @@ is exercised by CI and must remain valid.
},
"generated_at": "2026-05-12T00:00:00Z",
"upstream": {
"repo": "HeOCR/public-domain-hand-written-hebrew-scans",
"repo": "HeOCR/hash",
"revision": "<git commit sha>"
},
"letters": {
@@ -58,7 +58,7 @@ labels only. If in doubt, omit.

**Required.** Records how the writer identity was established and which
upstream scan entries are attributed to them. `source_repo` is normally
`HeOCR/public-domain-hand-written-hebrew-scans`; `source_entry_ids` are
`HeOCR/hash`; `source_entry_ids` are
the upstream `entries.jsonl` ids. `attribution_method` is a short tag
(e.g. `collection_metadata`, `manual_review`, `fixture`).

@@ -115,7 +115,7 @@ A mapping from a single Hebrew letter character (base or final form,
| `asset_path` | POSIX path relative to the letter-set root. No leading `/` (schema-enforced); no `..` segment (cross-field-enforced). |
| `checksum_sha256` | Lowercase SHA-256 hex digest of the asset bytes. Real letter sets must use real checksums; the example fixture's all-zero/all-one digests are intentional placeholders. |
| `image.{width_px,height_px,format}` | Image metadata. `format` ∈ `png`, `webp`, `tiff`. |
| `source.scan_entry_id` | Upstream entry id (resolves in `public-domain-hand-written-hebrew-scans`). Cross-field validator checks it appears in `writer_provenance.source_entry_ids`. |
| `source.scan_entry_id` | Upstream entry id (resolves in `hash`). Cross-field validator checks it appears in `writer_provenance.source_entry_ids`. |
| `source.scan_url` | Optional URL pointer to the source scan. RFC 3986 URI; checked when format-checking is enabled. |
| `source.license` | One of the accepted SPDX / `LicenseRef-*` identifiers (see `$defs.license_id` in the schema). Extending the allow-list requires a schema change. |
| `source.rights_evidence` | Optional free-form note or URL with rights evidence. |
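
The path, checksum, and provenance rules in the table can be illustrated with a small checker. It is a sketch under stated assumptions, not the repository's validator: `check_variant` is a hypothetical function, and the regular expression simply encodes "64 lowercase hex characters".

```python
import re
from pathlib import PurePosixPath

# A SHA-256 digest rendered as lowercase hex is exactly 64 characters long.
SHA256_HEX = re.compile(r"[0-9a-f]{64}")


def check_variant(variant: dict, source_entry_ids: set[str]) -> list[str]:
    """Return human-readable problems found in one variant (empty if none)."""
    problems: list[str] = []
    asset_path = variant["asset_path"]
    if asset_path.startswith("/"):
        problems.append("asset_path must not start with '/'")
    if ".." in PurePosixPath(asset_path).parts:
        problems.append("asset_path must not contain a '..' segment")
    if not SHA256_HEX.fullmatch(variant["checksum_sha256"]):
        problems.append("checksum_sha256 must be 64 lowercase hex characters")
    if variant["source"]["scan_entry_id"] not in source_entry_ids:
        problems.append(
            "scan_entry_id missing from writer_provenance.source_entry_ids"
        )
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a variant at once.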
4 changes: 2 additions & 2 deletions docs/repository_scope.md
@@ -7,7 +7,7 @@ intentionally narrow.
## Position in the HeOCR system (canonical)

```
public-domain-hand-written-hebrew-scans (full-page scans, PD / CC / CC-BY)
hash (full-page scans, PD / CC / CC-BY)
hletterscriptgen (code/framework — this repo)
@@ -39,7 +39,7 @@ copy it — only one diagram should ever rot.

| Concern | Where it lives |
| --- | --- |
| Hosting page scans and rights records | `HeOCR/public-domain-hand-written-hebrew-scans` |
| Hosting page scans and rights records | `HeOCR/hash` |
| Hosting per-writer letter-glyph datasets | `HeOCR/hletterscript` |
| Composing synthetic Hebrew handwritten pages | `HeOCR/hocrsyngen` |
| Dataset orchestration, governance, release assembly, publication | `HeOCR/hocrgen` |
2 changes: 1 addition & 1 deletion docs/upstream_integration.md
@@ -1,7 +1,7 @@
# Upstream integration

`hletterscriptgen` consumes scans from
[`HeOCR/public-domain-hand-written-hebrew-scans`](https://github.com/HeOCR/public-domain-hand-written-hebrew-scans).
[`HeOCR/hash`](https://github.com/HeOCR/hash) — HASH (Hebrew Archive of Scanned Handwriting).
That upstream repo holds the authoritative rights records; this repo
defers to them.
