
docs(sources): add HTR tooling survey — Kraken/eScriptorium, MiDRASH, HTR4PGP, Eynollah, dhSegment (#34)

Merged
shaypal5 merged 1 commit into main from docs/htr-tooling-survey on May 24, 2026


@shaypal5


What

Adds docs/sources/chatgpt_summary_htr_tools.md — a planning reference doc
capturing the open-source layout-analysis and HTR tooling landscape for
historical Hebrew manuscripts.

Why

The survey identified that several tools and pre-trained models are completely
absent from the project's planning docs, even though they directly apply to
material already in the corpus:

  • Kraken + eScriptorium — the gold-standard RTL-native HTR toolchain;
    produces ALTO/PAGE-XML that slots into the existing alto_path/hocr_path
    entry-schema fields.
  • MiDRASH Project and Princeton Geniza Project HTR4PGP — pre-trained
    Kraken models for Cairo Genizah fragments, usable today against the T-S /
    Halper items and the openn__cairo_genizah candidate.
  • Eynollah — multi-zone layout detection; best for Talmud + Rashi
    column layouts.
  • dhSegment — pixel-wise segmentation; most robust for degraded parchment.
  • HTR-United catalog — discovery resource for further Hebrew training sets.
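To illustrate the ALTO hand-off mentioned in the Kraken bullet, here is a minimal sketch of flattening an ALTO file into plain text lines. The function names and the sample document are hypothetical: the sample is hand-written for illustration and is not real Kraken output. Only the standard ALTO `TextLine` and `String` elements and the `CONTENT` attribute are assumed.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix so any ALTO schema version matches."""
    return tag.rsplit("}", 1)[-1]


def alto_to_lines(alto_xml: str) -> list[str]:
    """Return one string per ALTO TextLine, in document order."""
    root = ET.fromstring(alto_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Each String child carries its text in the CONTENT attribute.
        words = [
            child.get("CONTENT", "")
            for child in elem
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(w for w in words if w))
    return lines


# Hand-written sample for illustration only (not produced by Kraken).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="שלום"/><SP/><String CONTENT="עולם"/></TextLine>
    <TextLine><String CONTENT="בראשית"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

if __name__ == "__main__":
    for line in alto_to_lines(SAMPLE_ALTO):
        print(line)
```

Matching on local element names rather than a fixed namespace keeps the sketch independent of the ALTO schema version. Text is kept in logical order, so right-to-left display is left to the viewer.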

Changes

  • docs/sources/chatgpt_summary_htr_tools.md (new file, 117 lines)

No changes to data/index/, schemas, scripts, or tests — doc-only.

🤖 Generated with Claude Code

…, HTR4PGP, Eynollah

Captures the layout-analysis and HTR tool landscape for historical Hebrew
manuscripts. No data sources or source-index changes; doc-only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
shaypal5 added the documentation (Improvements or additions to documentation) and size:S (Small PR: single file or trivial change) labels on May 24, 2026
shaypal5 merged commit 085cc4a into main on May 24, 2026
1 check passed
shaypal5 deleted the docs/htr-tooling-survey branch on May 24, 2026 at 18:54
shaypal5 added a commit that referenced this pull request on May 31, 2026:
- scripts/vision_transcribe.py: new script — sends all scan images to
  GPT-4o (vision, high-detail) with a Hebrew/Yiddish paleographer prompt;
  saves plain-text transcripts to data/transcripts/ and stamps each entry
  in entries.jsonl with status='raw', created_by='ocr',
  verification_status='unverified'
- data/transcripts/: 186 new AI-draft .txt files covering NLI (Hannah
  Senesh diaries/notebook/pocket diary/speech), Wikimedia Commons letters,
  Library of Congress manuscripts, and the OPenn Zucker ketubah
- data/index/entries.jsonl: transcription field added for all 186 entries;
  validation passes (198 entries)
- scripts/review_app/app.py: four new routes — GET /transcripts (list with
  status filter tabs), GET /transcript/<id> (single-entry editor with
  prev/next nav), GET/POST /api/transcript/<id> (JSON read/write); saving
  promotes verification_status to primary_page_checked
- scripts/review_app/templates/transcript_list.html: card grid with
  filter tabs (all/raw/reviewed/aligned/rejected/none), thumbnail previews,
  RTL snippet, status badges
- scripts/review_app/templates/transcript_review.html: side-by-side scan
  viewer + RTL textarea editor; Ctrl+S saves, Ctrl+Enter approves,
  lightbox zoom, scan size toggle, reset-to-AI-draft; keyboard prev/next
- scripts/review_app/templates/base.html: add Transcripts nav link
- docs/transcripts_status.md: documents blockers and all sources tried

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
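
The stamping and promotion steps described in the commit message above can be sketched roughly as follows. This is not the code from scripts/vision_transcribe.py or scripts/review_app/app.py: the function names, the `id` key, and the nested layout of the `transcription` field are assumptions made for illustration. Only the three stamped values and the `primary_page_checked` promotion come from the commit message.

```python
import json


def stamp_transcription(entry: dict, transcript_path: str) -> dict:
    """Return a copy of the entry marked as an unverified AI draft."""
    stamped = dict(entry)
    # Assumed layout: the real schema may store these fields differently.
    stamped["transcription"] = {
        "path": transcript_path,
        "status": "raw",
        "created_by": "ocr",
        "verification_status": "unverified",
    }
    return stamped


def stamp_jsonl(jsonl_text: str, transcripts: dict[str, str]) -> str:
    """Stamp every JSONL entry whose id has a transcript; pass others through."""
    out = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        path = transcripts.get(entry.get("id"))
        if path is not None:
            entry = stamp_transcription(entry, path)
        # ensure_ascii=False keeps Hebrew/Yiddish text readable in the file.
        out.append(json.dumps(entry, ensure_ascii=False))
    return "\n".join(out) + "\n"


def promote_on_save(entry: dict) -> dict:
    """Return a copy whose transcription is marked as checked by a reviewer."""
    updated = dict(entry)
    transcription = dict(updated.get("transcription", {}))
    transcription["verification_status"] = "primary_page_checked"
    updated["transcription"] = transcription
    return updated


if __name__ == "__main__":
    source = '{"id": "a"}\n{"id": "b"}\n'
    print(stamp_jsonl(source, {"a": "data/transcripts/a.txt"}), end="")
```

Both helpers return copies rather than mutating their input, which keeps the AI draft state available for a reset-to-draft action like the one the review template describes.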