diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 6b5e473..714894d 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -30,6 +30,10 @@ jobs: uses: actions/checkout@v4 with: repository: HeOCR/hash + # Pin to the latest upstream commit referenced by the current + # hletterscript indexes so cross-validation stays reproducible + # even if HASH later removes or renames source entries. + ref: 0e162a78d609a47858c460431be3d689b2e9e31e path: .upstream lfs: false - name: Install dev dependencies diff --git a/README.md b/README.md index 2a4f709..23c8441 100644 --- a/README.md +++ b/README.md @@ -1,23 +1,61 @@ # hletterscript -A dataset of **sets of per-letter images of handwritten Hebrew letters**. -Each set groups crops produced from documents written by the *same -writer*; each set typically contains several variants of the same letter -cut from different scans by that writer. - -This repository is the downstream of: +[![CI](https://github.com/HeOCR/hletterscript/actions/workflows/ci.yml/badge.svg)](https://github.com/HeOCR/hletterscript/actions/workflows/ci.yml) +![Metadata license](https://img.shields.io/badge/metadata-CC0--1.0-brightgreen) +![Letter crops](https://img.shields.io/badge/letter%20crops-48-blue) +![Writer sets](https://img.shields.io/badge/writer%20sets-2-blue) + +Created by [Shay Palachy Affek](http://www.shaypalachy.com/). + +Per-writer sets of cropped handwritten Hebrew letter images, extracted +from rights-clean page scans for synthetic handwriting generation and +OCR/HTR research. 
+ +![Sample grid of handwritten Hebrew letter crops](docs/assets/hletterscript-sample-grid.png) + +## At a Glance + +| Field | Value | +| --- | --- | +| Corpus shape | Writer-grouped Hebrew letter crops | +| Current seed corpus | 48 crops from 2 verified writers | +| Covered letter forms | 25 Hebrew letter-form slugs | +| Current writers | Chaim Nachman Bialik, Rachel Bluwstein | +| Canonical crop index | [`data/index/entries.jsonl`](data/index/entries.jsonl) | +| Canonical writer index | [`data/index/writers.jsonl`](data/index/writers.jsonl) | +| Image storage | [`data/letters///`](data/letters/) via Git LFS | +| Metadata license | CC0 1.0 | +| Per-image rights | Inherited per crop from the upstream scan | + +## What This Repository Contains + +`hletterscript` is a data repository for individual Hebrew letter crops, +grouped by the person who wrote them. Each entry links the crop back to +the source scan, bounding box, extraction method, file checksum, quality +flags, and inherited rights statement. + +The corpus is intentionally small and strict at this stage: it is a +validated seed dataset for the surrounding HeOCR tooling, not yet a +complete Hebrew handwriting corpus. + +## Pipeline Position + +```mermaid +flowchart LR + HASH["HeOCR/hash<br/>rights-clean page scans"] --> GEN["HeOCR/hletterscriptgen<br/>crop extraction"] + GEN --> THIS["HeOCR/hletterscript<br/>per-writer letter sets"] + THIS --> SYN["HeOCR/hocrsyngen<br/>synthetic documents"] + SYN --> OCR["HeOCR / HeOCRsynth<br/>OCR and HTR corpora"] +``` -- [HeOCR/hash][upstream] (HASH — Hebrew Archive of Scanned Handwriting) — the - canonical, permissively-licensed source of page-level scans. Every - entry here cites its upstream scan. -- [HeOCR/hletterscriptgen][gen] — the framework that turns page scans - into per-letter crops. Each entry records which version of that - framework produced it. +Related projects: -The intended downstream consumers are synthetic-document generators -([HeOCR/hocrsyngen][syngen]) and the synthetic / real Hebrew handwriting -corpora they feed into ([HeOCR/HeOCRsynth][heocrsynth], -[HeOCR/HeOCR][heocr]). +- [HeOCR/hash][upstream] is the page-scan source of truth. +- [HeOCR/hletterscriptgen][gen] produces the letter crops. +- [HeOCR/hocrsyngen][syngen] consumes this repository for synthetic + document generation. +- [HeOCR/HeOCRsynth][heocrsynth] and [HeOCR/HeOCR][heocr] are intended + downstream OCR/HTR targets. [upstream]: https://github.com/HeOCR/hash [gen]: https://github.com/HeOCR/hletterscriptgen @@ -27,84 +65,86 @@ corpora they feed into ([HeOCR/HeOCRsynth][heocrsynth], ## Dataset Layout -- `docs/dataset_structure.md` defines the repository layout and - ingestion model. -- `docs/letters.md` is the canonical Hebrew-letter enumeration - (27 forms — 22 base letters plus the 5 finals). -- `data/index/writers.jsonl` is the set-level catalog: one JSON object - per writer/scribe. -- `data/index/entries.jsonl` is the image-level catalog: one JSON - object per cropped letter image, with upstream provenance, - extraction provenance, file checksums, and inherited rights. -- `data/letters///` stores the image bytes. -- `schemas/writer.schema.json` and `schemas/entry.schema.json` define - the record contracts. -- `scripts/validate_indexes.py` validates JSONL records against the - schemas, enforces referential integrity, checks Hebrew-letter - codepoint/name/form consistency, pins the upstream repo URL, and - re-verifies image file checksums and sizes on disk. 
-- `scripts/generate_release_artifacts.py` regenerates `NOTICE.md`, - `CITATION.cff`, and `datapackage.json` deterministically from the +- [`docs/dataset_structure.md`](docs/dataset_structure.md) defines the + repository layout, index model, rights inheritance, and ingestion flow. +- [`docs/letters.md`](docs/letters.md) is the canonical Hebrew-letter + enumeration: 27 forms, covering the 22 base letters plus 5 finals. +- [`data/index/writers.jsonl`](data/index/writers.jsonl) is the set-level + catalog: one JSON object per writer or scribe. +- [`data/index/entries.jsonl`](data/index/entries.jsonl) is the crop-level + catalog: one JSON object per image, with upstream provenance, + extraction provenance, file checksums, quality flags, and inherited + rights. +- [`data/letters/`](data/letters/) stores the crop image bytes. +- [`schemas/writer.schema.json`](schemas/writer.schema.json) and + [`schemas/entry.schema.json`](schemas/entry.schema.json) define the + record contracts. +- [`scripts/validate_indexes.py`](scripts/validate_indexes.py) validates + JSONL records, referential integrity, Hebrew-letter consistency, + upstream URLs, image checksums, and file sizes. +- [`scripts/generate_release_artifacts.py`](scripts/generate_release_artifacts.py) + regenerates [`NOTICE.md`](NOTICE.md), [`CITATION.cff`](CITATION.cff), + and [`datapackage.json`](datapackage.json) deterministically from the indexes. -- `LICENSE.md` documents the compound licensing policy for - metadata and per-image inherited rights. -## Serialization Decision +## Licensing Model -The canonical editable indexes are newline-delimited JSON (`.jsonl`), -matching the upstream scans repo's convention. +This repository uses a compound licensing model: -JSONL is deliberately used instead of CSV because these records need -nested upstream references, bounding boxes, rights inheritance, -extraction provenance, and quality measurements. 
CSV/Parquet/SQLite -exports can be generated later as derived artefacts; the source of -truth stays line-oriented, diffable, streamable JSON. +- Repository-authored metadata, schemas, scripts, docs, and generated + metadata exports are dedicated to the public domain under CC0 1.0. +- Each crop carries its own inherited rights block from the upstream scan. + Current seed crops are public-domain compatible, but consumers should + read the per-entry rights metadata rather than assume a uniform image + license. + +See [`LICENSE`](LICENSE) for the repository metadata license and +[`LICENSE.md`](LICENSE.md) for the full per-image rights policy. ## Requirements -- **Python ≥ 3.11** (the validator uses `hashlib.file_digest`). - CI pins 3.12. -- **Git LFS** — image bytes under `data/letters/**` are tracked via - LFS (see `.gitattributes`). After cloning, run `git lfs install` - once, then `git lfs pull` to fetch the actual image bytes. +- Python 3.11 or newer. CI currently validates 3.11, 3.12, and 3.13. +- Git LFS for the image bytes under `data/letters/**`. -Run the current validation check with: +After cloning: ```bash -git lfs install && git lfs pull +git lfs install +git lfs pull python3 -m pip install -r requirements-dev.txt +``` + +## Validate Locally + +```bash python3 scripts/validate_indexes.py python3 scripts/generate_release_artifacts.py --check python3 -m pytest ``` +For the full CI-style upstream cross-check, place a checkout of +[`HeOCR/hash`](https://github.com/HeOCR/hash) at `.upstream` and run: + +```bash +python3 scripts/validate_indexes.py --upstream-path .upstream +``` + ## Current Status -`v0.0.0-rc` — **initial setup**. The repository ships with the -schemas, validation tooling, release-artifact generator, CI workflow, -and licensing policy in place. 
The per-letter image indexes -(`writers.jsonl`, `entries.jsonl`) are empty: actual letter-image -ingestion happens in subsequent PRs, produced by -[HeOCR/hletterscriptgen][gen] from scans in the upstream repo. - -The repository uses a compound licensing model: repository-authored -metadata is dedicated to the public domain under CC0 1.0 (see -[`LICENSE`](LICENSE)), while per-image rights are recorded individually -and inherited from each crop's upstream scan. See [`LICENSE.md`]\ -(LICENSE.md) for the full policy, including the CC BY-SA ShareAlike -caveat and the rules for remix-friendly release bundles. - -## How to use this repo - -- [`data/index/entries.jsonl`](data/index/entries.jsonl) is the source - of truth for the per-letter image corpus — one JSON object per crop, - with upstream citation, file checksums, and inherited rights. -- [`data/index/writers.jsonl`](data/index/writers.jsonl) catalogs the - writers, including candidate leads and rejected records. -- [`schemas/entry.schema.json`](schemas/entry.schema.json) and - [`schemas/writer.schema.json`](schemas/writer.schema.json) define the - record contracts; [`scripts/validate_indexes.py`]\ - (scripts/validate_indexes.py) enforces them in CI. -- Contributors adding new entries should start with - [`AGENTS.md`](AGENTS.md) for ingest rules, naming, and the pre-PR - checklist. +`v0.0.0-rc` is a validated seed corpus with 48 indexed letter crops from +2 verified writers. The repository has schema validation, deterministic +release-artifact generation, CI, and licensing policy in place. + +Future ingestion work expands writer coverage and fills missing Hebrew +letter forms through crops produced by [HeOCR/hletterscriptgen][gen] from +the upstream HASH scans. + +## Contributing + +Contributors adding or reviewing crop entries should start with +[`AGENTS.md`](AGENTS.md). It captures ingest rules, naming conventions, +rights constraints, and the pre-PR checklist for this data repository. 
+ +## Credits + +Created by [Shay Palachy Affek](http://www.shaypalachy.com/) [[GitHub](https://github.com/shaypal5)] diff --git a/docs/assets/hletterscript-sample-grid.png b/docs/assets/hletterscript-sample-grid.png new file mode 100644 index 0000000..510d53d Binary files /dev/null and b/docs/assets/hletterscript-sample-grid.png differ