Summary
Ingest per-letter crops from the medieval and early-modern manuscript scans already in the upstream repo. These represent a distinct script tier — Geniza-era semi-cursive, medieval Ashkenazi square, Yemenite, Provençal cursive, Rashi script — and should be developed as a self-contained batch after C02–C04 establish the modern-handwriting foundation.
Available upstream sources
| entry_id | license | Script tradition | Notes |
|---|---|---|---|
| commons__geniza_education_ts_k5_13__p0001 | PDM-1.0 | Geniza semi-cursive | Alphabet writing exercise; most structured, start here |
| commons__bodleian_geniza_ms_heb_d_41_4b__p0001 | PDM-1.0 | Geniza | High-res Bodleian scan |
| commons__bodleian_geniza_ms_heb_e_39_78b__p0001 | PDM-1.0 | Geniza codex leaf | |
| commons__chushiel_letter_geniza__p0001 | PDM-1.0 | Geniza | R. Hushiel ben Elhanan, c. 1027 |
| commons__halper113_midrash_david_colophon__p0001 | PDM-1.0 | Geniza | Dated colophon, 1299 |
| commons__halper462_exilarch_genealogy__p0001 | PDM-1.0 | Geniza | Exilarch genealogy document |
| commons__judah_halevi_letter_ts_8j18_5__p0001 | PDM-1.0 | Geniza | Attributed to Judah ha-Levi (d. 1141) |
| commons__maimonides_responsum_ts_ns_163_57__p0001 | PDM-1.0 | Geniza | Attributed to Maimonides' hand |
| commons__obadiah_proselyte_kaufmann_ms24__p0001–p0004 | PDM-1.0 | Geniza | 4 pages; 12th-century memoir scroll |
| commons__sefer_harazim_geniza__p0001–p0006 | PDM-1.0 | Geniza | 6 pages |
| commons__more_nevuchim_yemenite__p0001 | PDM-1.0 | Yemenite | 13th–14th century |
| commons__provencal_hebrew_cursive_17c__p0001 | PDM-1.0 | Provençal cursive | 17th century |
| commons__rashi_pentateuch_paseq_crop__p0001 | PDM-1.0 | Rashi script | 14th century |
| commons__mahzor_vitry_1204__p0001 | PDM-1.0 | Ashkenazi square | 1204 |
Pre-ingest decisions required before any data PR
The following policy questions must be resolved (via AGENTS.md update or a dedicated policy PR) before opening the ingest PR:
1. writer_id policy for anonymous Geniza scribes
The current schema requires every entry to reference a writers.jsonl row. Geniza scribes are typically unidentified. Options:
- A: One writer row per source (geniza_scribe_<source_id_slug>): maximum granularity, but writer rows convey no real biographical data
- B: Style-grouped rows (geniza_scribe_11c_sephardic, geniza_scribe_12c_mizrahi, etc.): requires script-tradition classification
- C: A single shared geniza_anonymous writer row with status: candidate; simplest, but loses provenance signal
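To make the trade-off between the three options concrete, here is a minimal sketch of how each would map a source to a writer_id. The helper names (`slugify`, `writer_id_for`), the slug rule, and the use of the full entry_id as the source identifier are assumptions for illustration; the issue does not define how `<source_id_slug>` is derived.

```python
import re
from typing import Optional


def slugify(source_id: str) -> str:
    """Hypothetical slug rule: lowercase, collapse non-alphanumerics to '_'."""
    return re.sub(r"[^a-z0-9]+", "_", source_id.lower()).strip("_")


def writer_id_for(source_id: str, option: str,
                  style_group: Optional[str] = None) -> str:
    """Return the writer_id that option A, B, or C would assign to a source."""
    if option == "A":
        # One writer row per source.
        return f"geniza_scribe_{slugify(source_id)}"
    if option == "B":
        # Style-grouped row; the caller must supply the classification.
        if not style_group:
            raise ValueError("option B needs a script-tradition style group")
        return f"geniza_scribe_{style_group}"
    if option == "C":
        # Single shared anonymous row.
        return "geniza_anonymous"
    raise ValueError(f"unknown option: {option!r}")


source = "commons__geniza_education_ts_k5_13__p0001"
for opt, group in (("A", None), ("B", "11c_sephardic"), ("C", None)):
    print(opt, writer_id_for(source, opt, group))
```

Option B is the only one that needs an extra input per source, which is the classification cost the list above refers to.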
2. Named historical scribes (Maimonides, Judah ha-Levi)
Attribution in Geniza scholarship is often "likely" or "attributed to." Recommended: create individual writer rows with status: candidate and document the scholarly consensus (or lack thereof) in ingest.agent_notes.
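As an illustration of that recommendation, the sketch below builds a candidate writer row and attaches an attribution note to an entry. Only `writer_id`, `status: candidate`, and `ingest.agent_notes` come from this issue; the `display_name` and `entry_id` fields, the `maimonides` identifier, and both helper functions are hypothetical and may not match the real schema.

```python
import json


def candidate_writer_row(writer_id: str, display_name: str) -> dict:
    """Build a writers.jsonl row for an attributed (not confirmed) scribe."""
    return {
        "writer_id": writer_id,
        "display_name": display_name,  # hypothetical field
        "status": "candidate",
    }


def with_attribution_note(entry: dict, note: str) -> dict:
    """Return a copy of the entry with ingest.agent_notes set to the note."""
    updated = dict(entry)
    ingest = dict(updated.get("ingest", {}))
    ingest["agent_notes"] = note
    updated["ingest"] = ingest
    return updated


row = candidate_writer_row("maimonides", "Maimonides")
entry = with_attribution_note(
    {
        "entry_id": "commons__maimonides_responsum_ts_ns_163_57__p0001",
        "writer_id": row["writer_id"],
    },
    "Attributed to Maimonides' hand",
)

# One JSON object per line, as in a .jsonl file.
print(json.dumps(row, ensure_ascii=False))
print(json.dumps(entry, ensure_ascii=False))
```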
3. Quality bar for damaged manuscripts
Many Geniza leaves have ink loss, water stains, and irregular surfaces. Define upfront whether quality.legibility: low crops are ingested at all, and what usable_for_htr / usable_for_syngen thresholds apply.
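One way such a quality bar could be expressed once it is decided is a single gate function applied before ingest. This is a sketch under assumptions: only the `low` legibility value appears in this issue, so the `medium` and `high` levels, their ordering, and the `passes_quality_bar` helper are hypothetical.

```python
# Assumed ordering of legibility levels; only "low" is named in the issue.
LEGIBILITY_ORDER = {"low": 0, "medium": 1, "high": 2}


def passes_quality_bar(crop: dict, min_legibility: str = "medium") -> bool:
    """Return True if the crop's quality.legibility meets the minimum level."""
    if min_legibility not in LEGIBILITY_ORDER:
        raise ValueError(f"unknown minimum level: {min_legibility!r}")
    level = crop.get("quality", {}).get("legibility")
    if level not in LEGIBILITY_ORDER:
        # Missing or unrecognised legibility is rejected rather than guessed.
        return False
    return LEGIBILITY_ORDER[level] >= LEGIBILITY_ORDER[min_legibility]


crops = [
    {"crop_id": "a", "quality": {"legibility": "low"}},
    {"crop_id": "b", "quality": {"legibility": "high"}},
    {"crop_id": "c"},
]
kept = [c["crop_id"] for c in crops if passes_quality_bar(c)]
print(kept)
```

Setting `min_legibility="low"` would admit damaged crops; the same pattern could carry separate minimums for usable_for_htr and usable_for_syngen if the policy distinguishes them.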
Suggested starting point
commons__geniza_education_ts_k5_13__p0001 — a Hebrew alphabet writing exercise — is the most structured source in this set and will produce the cleanest crops with the least ambiguity about letter identity. Start there to test the medieval-tier pipeline before moving to literary manuscripts.
Acceptance criteria
Deferred until the three pre-ingest decisions above are resolved. This issue should produce either:
- An AGENTS.md / docs/dataset_structure.md update PR addressing the anonymous-scribe policy, then an ingest PR, or
- A decision in comments to close this issue and replace it with per-tradition sub-issues (C05a Geniza, C05b Yemenite, etc.)