C05: Medieval manuscript tier: Cairo Geniza, Yemenite, Provençal (14 sources, 22 pages, PDM-1.0) #8

@shaypal5

Summary

Ingest per-letter crops from the medieval and early-modern manuscript scans already in the upstream repo. These represent a distinct script tier — Geniza-era semi-cursive, medieval Ashkenazi square, Yemenite, Provençal cursive, Rashi script — and should be developed as a self-contained batch after C02–C04 establish the modern-handwriting foundation.

Available upstream sources

| entry_id | license | Script tradition | Notes |
|---|---|---|---|
| commons__geniza_education_ts_k5_13__p0001 | PDM-1.0 | Geniza semi-cursive | Alphabet writing exercise; most structured, start here |
| commons__bodleian_geniza_ms_heb_d_41_4b__p0001 | PDM-1.0 | Geniza | High-res Bodleian scan |
| commons__bodleian_geniza_ms_heb_e_39_78b__p0001 | PDM-1.0 | Geniza | codex leaf |
| commons__chushiel_letter_geniza__p0001 | PDM-1.0 | Geniza | R. Hushiel ben Elhanan, c. 1027 |
| commons__halper113_midrash_david_colophon__p0001 | PDM-1.0 | Geniza | Dated colophon, 1299 |
| commons__halper462_exilarch_genealogy__p0001 | PDM-1.0 | Geniza | Exilarch genealogy document |
| commons__judah_halevi_letter_ts_8j18_5__p0001 | PDM-1.0 | Geniza | Attributed to Judah ha-Levi (d. 1141) |
| commons__maimonides_responsum_ts_ns_163_57__p0001 | PDM-1.0 | Geniza | Attributed to Maimonides' hand |
| commons__obadiah_proselyte_kaufmann_ms24__p0001–p0004 | PDM-1.0 | Geniza | 4 pages; 12th-century memoir scroll |
| commons__sefer_harazim_geniza__p0001–p0006 | PDM-1.0 | Geniza | 6 pages |
| commons__more_nevuchim_yemenite__p0001 | PDM-1.0 | Yemenite | 13th–14th century |
| commons__provencal_hebrew_cursive_17c__p0001 | PDM-1.0 | Provençal cursive | 17th century |
| commons__rashi_pentateuch_paseq_crop__p0001 | PDM-1.0 | Rashi script | 14th century |
| commons__mahzor_vitry_1204__p0001 | PDM-1.0 | Ashkenazi square | 1204 |
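
Two rows above use a page-range shorthand (p0001–p0004, p0001–p0006). As a minimal sketch of how that shorthand might expand into individual entry IDs, assuming the `__pNNNN` suffix is a zero-padded 1-based page number (the helper name `expand_pages` is hypothetical and not part of the repo):

```python
def expand_pages(base: str, first: int, last: int) -> list[str]:
    """Expand a page range into per-page entry IDs.

    Assumption: page suffixes are zero-padded to four digits, as in
    the IDs listed in the table.
    """
    return [f"{base}__p{n:04d}" for n in range(first, last + 1)]


# Hypothetical usage for the two multi-page sources.
obadiah_pages = expand_pages("commons__obadiah_proselyte_kaufmann_ms24", 1, 4)
harazim_pages = expand_pages("commons__sefer_harazim_geniza", 1, 6)
```

With the twelve single-page rows, these two expansions account for every page listed in the table.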

Pre-ingest decisions required before any data PR

The following policy questions must be resolved (via AGENTS.md update or a dedicated policy PR) before opening the ingest PR:

1. writer_id policy for anonymous Geniza scribes

The current schema requires every entry to reference a writers.jsonl row. Geniza scribes are typically unidentified. Options:

  • A: One writer row per source (geniza_scribe_<source_id_slug>) — maximum granularity, but writer rows convey no real biographical data
  • B: Style-grouped rows (geniza_scribe_11c_sephardic, geniza_scribe_12c_mizrahi, etc.) — requires script-tradition classification
  • C: A single shared geniza_anonymous writer row with status: candidate — simplest, but loses provenance signal
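
To compare the three options side by side, here is a minimal sketch of how each would map an entry to a writer_id. The function name, the slug derivation for option A, and the `style_group` argument are assumptions for illustration; only the ID patterns quoted in the options come from this issue.

```python
from typing import Optional


def writer_id_for(entry_id: str, option: str, style_group: Optional[str] = None) -> str:
    """Return a hypothetical writer_id for an anonymous Geniza source."""
    if option == "A":
        # Assumed slug: the entry_id without the "commons__" prefix
        # and without the trailing "__pNNNN" page suffix.
        slug = entry_id.removeprefix("commons__").rsplit("__p", 1)[0]
        return f"geniza_scribe_{slug}"
    if option == "B":
        # Option B depends on a script-tradition classification,
        # e.g. "11c_sephardic", which must be supplied by the caller.
        if style_group is None:
            raise ValueError("option B requires a style_group")
        return f"geniza_scribe_{style_group}"
    if option == "C":
        return "geniza_anonymous"
    raise ValueError(f"unknown option: {option!r}")
```

Under option A every source gets its own row, under B several sources collapse into one row per style group, and under C all anonymous sources share a single row.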

2. Named historical scribes (Maimonides, Judah ha-Levi)

Attribution in Geniza scholarship is often "likely" or "attributed to." Recommended: create individual writer rows with status: candidate and document the scholarly consensus (or lack thereof) in ingest.agent_notes.
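
A rough sketch of what this recommendation could look like in data, assuming `status` lives on the writer row and `ingest.agent_notes` lives on the entry record. Every other key, the `maimonides` writer_id, and the note text are placeholders, not the actual schema:

```python
import json


def candidate_writer_row(writer_id: str, display_name: str) -> dict:
    # "status: candidate" is from the recommendation; other keys are assumed.
    return {"writer_id": writer_id, "display_name": display_name, "status": "candidate"}


def entry_with_attribution(entry_id: str, writer_id: str, note: str) -> dict:
    # The attribution caveat is recorded in ingest.agent_notes.
    return {"entry_id": entry_id, "writer_id": writer_id, "ingest": {"agent_notes": note}}


writer = candidate_writer_row("maimonides", "Maimonides")
entry = entry_with_attribution(
    "commons__maimonides_responsum_ts_ns_163_57__p0001",
    writer["writer_id"],
    "Upstream notes describe this as attributed to Maimonides' hand.",
)

# One JSON object per line, as a writers.jsonl-style file would store it.
writer_line = json.dumps(writer, ensure_ascii=False)
entry_line = json.dumps(entry, ensure_ascii=False)
```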

3. Quality bar for damaged manuscripts

Many Geniza leaves have ink loss, water stains, and irregular surfaces. Define upfront whether quality.legibility: low crops are ingested at all, and what usable_for_htr / usable_for_syngen thresholds apply.
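
Once decided, the quality bar could be expressed as a small gate. The sketch below is illustrative only: the three-level legibility scale and the default thresholds are assumptions standing in for the decision this issue asks for, and only `low`, `usable_for_htr`, and `usable_for_syngen` are names taken from the text above.

```python
# Assumed ordering of legibility levels; only "low" appears in this issue.
LEGIBILITY_RANK = {"low": 0, "medium": 1, "high": 2}


def usability_flags(
    legibility: str,
    ingest_low: bool = False,   # placeholder policy: skip low-legibility crops
    htr_min: str = "medium",    # placeholder threshold for usable_for_htr
    syngen_min: str = "high",   # placeholder threshold for usable_for_syngen
) -> dict:
    """Derive ingest and usability flags from a crop's legibility level."""
    rank = LEGIBILITY_RANK[legibility]
    ingest = rank > LEGIBILITY_RANK["low"] or ingest_low
    return {
        "ingest": ingest,
        "usable_for_htr": ingest and rank >= LEGIBILITY_RANK[htr_min],
        "usable_for_syngen": ingest and rank >= LEGIBILITY_RANK[syngen_min],
    }
```

Keeping the thresholds as parameters means the policy PR can change the defaults without touching the gate logic.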

Suggested starting point

commons__geniza_education_ts_k5_13__p0001 — a Hebrew alphabet writing exercise — is the most structured source in this set and will produce the cleanest crops with the least ambiguity about letter identity. Start there to test the medieval-tier pipeline before moving to literary manuscripts.

Acceptance criteria

Deferred until the three pre-ingest decisions above are resolved. This issue should produce either:

  • An AGENTS.md / docs/dataset_structure.md update PR addressing the anonymous-scribe policy, then an ingest PR, or
  • A decision in comments to close this issue and replace it with per-tradition sub-issues (C05a Geniza, C05b Yemenite, etc.)
