From c282ddad63e220ae3f8f397e37644f93a709be22 Mon Sep 17 00:00:00 2001
From: Shay Palachy
Date: Sun, 24 May 2026 22:15:52 +0300
Subject: [PATCH] data(sources): enrich candidate source list from ChatGPT survey 3
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds 3 new candidate sources and enriches 10 existing candidates with
specific URLs, scale data, and access notes surfaced by a prioritised
commercial-use source survey (chatgpt_summary_3.md).

New sources
───────────
- commons__hebrew_language_manuscripts: Wikimedia Commons parent category
  (17 subcats + ~105 files: Cairo Geniza, Bible MSS, illuminated MSS,
  Wellcome, Damascus Pentateuch)
- commons__hebrew_calligraphy: Wikimedia Commons (~74 files + subcats;
  illuminated MSS and ketubot)
- openn__judaica_collection_index: OPenn Judaica umbrella index
  (openn.library.upenn.edu/html/judaica_contents.html); covers Gaster
  Hebrew MSS and other sub-collections not yet individually tracked

Enriched sources
────────────────
- openn__bl_hebrew_manuscripts: landing URL (collection 0032), scale
  confirmed ~435,000 images, Polonsky Foundation ref added
- openn__cairo_genizah_fragments: landing URL (genizah_contents.html)
- openn__manchester_hebrew_manuscripts: landing URL (0021.html) + critical
  caveat: Manchester's own viewer is CC BY-NC; use the OPenn copy (CC BY 4.0)
- openn__katz_center_judaica: landing URL (0002.html)
- openn__zucker_ketubah_collection: landing URL (0051.html)
- leipzig__hebrew_manuscripts: corrected URL to Leipzig direct page
- nypl__hebrew_manuscripts_digital_collections: 1,174 results count added
- mdz__hebrew_manuscripts: landing URL + scale (~700 pcs incl. 183 fragments)
- archive__hebrew_manuscripts: named high-value items (Leningrad Codex,
  Aleppo Codex, Cervera Bible, Lailashi Codex, Haverford Masoretic Bible)
- huggingface__sivan22_hebrew_handwritten: CC BY 3.0 licence detail,
  5,093 rows / 28 classes, added policy-review note (CC-BY-3.0 not
  explicitly in AGENTS.md accepted list)

Validation: ok: 93 sources, 345 entries, 345 files verified, recipe ok
Tests: 80 passed

Co-Authored-By: Claude Sonnet 4.6
---
 README.md                         |   2 +-
 data/index/sources.jsonl          |  23 ++++---
 datapackage.json                  |   8 +--
 docs/sources/chatgpt_summary_3.md | 101 ++++++++++++++++++++++++++++++
 exports/sources.csv               |  23 ++++---
 5 files changed, 132 insertions(+), 25 deletions(-)
 create mode 100644 docs/sources/chatgpt_summary_3.md

diff --git a/README.md b/README.md
index da3f1fa..bfcd52d 100644
--- a/README.md
+++ b/README.md
@@ -61,7 +61,7 @@ make release
 
 ## Current Status
 
-The corpus currently contains 345 ingested scans drawn from 59 verified sources, totalling ~371.16 MiB on disk. The source-level index also tracks 13 candidate leads still being researched and 16 source records kept for provenance after being rejected as out of scope.
+The corpus currently contains 345 ingested scans drawn from 59 verified sources, totalling ~371.16 MiB on disk. The source-level index also tracks 16 candidate leads still being researched and 16 source records kept for provenance after being rejected as out of scope.
 
 License breakdown across the 345 entries:
 
diff --git a/data/index/sources.jsonl b/data/index/sources.jsonl
index 65212c2..a83c5ca 100644
--- a/data/index/sources.jsonl
+++ b/data/index/sources.jsonl
@@ -72,19 +72,22 @@
 {"source_id": "nli__nnl_archive_al997009912248505171", "record_type": "item", "status": "verified", "priority": "seed", "provider": "National Library of Israel", "title": "Handwritten Diary of Hannah Szenes in Hebrew and Draft of The Violin", "description": "Strong item-level seed: handwritten Hebrew diary plus play draft, dated 1941-1944.", "urls": {"canonical": "https://www.nli.org.il/en/archives/NNL_ARCHIVE_AL997009912248505171/NLI", "landing": null, "api": null, "download": null, "related": []}, "rights": {"rights_basis": "public_domain", "license_expression": "LicenseRef-Public-Domain-Israel", "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": false, "evidence_text": "Seed note says the item page says Any Use Permitted and Public Domain in Israel.", "terms_url": null, "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "1941-1944", "languages": ["he"], "document_types": ["diary", "draft"], "creator_names": ["Hannah Senesh"], "expected_handwriting": "yes", "estimated_scan_count": null}, "ingest": {"method": "manual_download", "access_notes": "Verify direct image/PDF access and split multi-page item into entries.", "agent_notes": "Good first page-level ingestion candidate.", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_1.md:16-31", "quote": "explicitly handwritten, Hebrew, post-1929; Any Use Permitted; Public Domain in Israel"}]}
 {"source_id": "nli__nnl_archive_al997009912248705171", "record_type": "item", "status": "verified", "priority": "high", "provider": "National Library of Israel", "title": "Handwritten Diary of Hannah Szenes in Hebrew and Hungarian", "description": "Mixed
Hebrew/Hungarian handwritten diary dated 1938-1941.", "urls": {"canonical": "https://www.nli.org.il/en/archives/NNL_ARCHIVE_AL997009912248705171/NLI", "landing": null, "api": null, "download": null, "related": []}, "rights": {"rights_basis": "public_domain", "license_expression": "LicenseRef-Public-Domain-Israel", "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": false, "evidence_text": "Seed note says Any Use Permitted and Public Domain in Israel, but remote access may need confirmation.", "terms_url": null, "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "1938-1941", "languages": ["he", "hu"], "document_types": ["diary"], "creator_names": ["Hannah Senesh"], "expected_handwriting": "yes", "estimated_scan_count": null}, "ingest": {"method": "manual_download", "access_notes": "Confirm whether online access is available outside the NLI building.", "agent_notes": "Tag Hebrew pages separately from Hungarian-heavy pages.", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_1.md:62-78", "quote": "Any Use Permitted; Public Domain in Israel; Online access from NLI building caveat"}]} {"source_id": "nli__shaul_tchernichovsky_archive_items", "record_type": "collection", "status": "rejected", "priority": "exclude", "provider": "National Library of Israel", "title": "Shaul Tchernichovsky handwritten archive items", "description": "NLI leads for receipts, literary drafts, notebooks, and memorandum drafts by Shaul Tchernichovsky.", "urls": {"canonical": "https://www.nli.org.il/he/archives/NNL_ARCHIVE_AL990035912210205171/NLI", "landing": null, "api": null, "download": null, "related": ["https://www.nli.org.il/he/archives/NNL_ARCHIVE_AL990035912380205171/NLI", "https://www.nli.org.il/he/archives/NNL_ARCHIVE_AL990035403420205171/NLI", "https://www.nli.org.il/he/archives/NNL_ARCHIVE_AL990035912230205171/NLI"]}, "rights": 
{"rights_basis": "unknown", "license_expression": null, "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": null, "evidence_text": "All 4 known NLI items checked 2026-05-23: rights are either \"research and study only\" or no redistribution allowed. Entire cluster out of scope for this dataset.", "terms_url": null, "verification_status": "primary_page_checked", "verified_at": "2026-05-23"}, "scope": {"date_range": "1930s-1943", "languages": ["he"], "document_types": ["receipt", "draft", "notebook", "other"], "creator_names": ["Shaul Tchernichovsky"], "expected_handwriting": "yes", "estimated_scan_count": null}, "ingest": {"method": "manual_download", "access_notes": "Expand each NLI item into its own source row before harvesting scans.", "agent_notes": "Useful for handwriting diversity beyond Senesh.", "blocked_reason": "rights_restriction"}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/gemini_summary_1.md:5-34", "quote": "Seven handwritten receipts; The Bandage; Hybrid Notebook; Draft of a Memorandum"}]} -{"source_id": "nypl__hebrew_manuscripts_digital_collections", "record_type": "collection", "status": "candidate", "priority": "high", "provider": "New York Public Library", "title": "NYPL Digital Collections — Hebrew Manuscripts, Ketubbot, and Letters", "description": "Hebrew Illuminated Manuscripts, historical Ketubbot (handwritten marriage contracts), and early modern letters. ~1,174 items in the Hebrew Illuminated Manuscripts sub-collection. 
Out-of-copyright materials are completely free for any use including commercial.", "urls": {"canonical": "https://digitalcollections.nypl.org/", "landing": "https://digitalcollections.nypl.org/collections/hebrew-illuminated-manuscripts", "api": null, "download": null, "related": []}, "rights": {"rights_basis": "public_domain", "license_expression": "PDM-1.0", "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": null, "evidence_text": "NYPL policy: out-of-copyright digital materials are completely free for any use including commercial, no permission required. Source: docs/sources/gemini_summary_2.md.", "terms_url": null, "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "medieval-modern", "languages": ["he", "yi", "other"], "document_types": ["manuscript", "ketubbah", "letter", "other"], "creator_names": [], "expected_handwriting": "yes", "estimated_scan_count": 1174}, "ingest": {"method": "api", "access_notes": "Filter by public domain on portal; public API available for programmatic download. Need to identify specific in-scope Hebrew handwriting items.", "agent_notes": "Ketubbot (marriage contracts) are particularly promising — high volume, post-1929 dates possible, standardized form but varied handwriting.", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/gemini_summary_2.md", "quote": "NYPL has a policy of making all of its out-of-copyright digital materials completely free for any use, including commercial, with no permission required."}]} -{"source_id": "openn__katz_center_judaica", "record_type": "collection", "status": "candidate", "priority": "high", "provider": "OPenn, University of Pennsylvania Libraries", "title": "Katz Center for Advanced Judaic Studies — Hebrew Manuscripts (OPenn)", "description": "Hundreds of digitized handwritten Hebrew manuscripts, codices, and historical documents. 
All OPenn content is CC0 1.0 Universal.", "urls": {"canonical": "https://openn.library.upenn.edu/", "landing": "https://openn.library.upenn.edu/ReadMe.html", "api": null, "download": null, "related": []}, "rights": {"rights_basis": "public_domain", "license_expression": "CC0-1.0", "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": null, "evidence_text": "OPenn platform policy: all content released under CC0 1.0 Universal (Public Domain Dedication). Source: docs/sources/gemini_summary_2.md.", "terms_url": "https://openn.library.upenn.edu/ReadMe.html", "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "medieval-modern", "languages": ["he", "yi", "ar", "other"], "document_types": ["manuscript", "codex", "letter", "other"], "creator_names": [], "expected_handwriting": "yes", "estimated_scan_count": null}, "ingest": {"method": "api", "access_notes": "Bulk rsync or FTP access (openn.library.upenn.edu). Need to identify specific Judaica sub-collections with Hebrew handwriting in scope, then filter by date range.", "agent_notes": "High-value lead: CC0 license, bulk access, and OPenn is explicitly designed for computational use.", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/gemini_summary_2.md", "quote": "All of it is released under a CC0 1.0 Universal (Public Domain Dedication). You can pull entire directories of high-res TIFFs/JPEGs and XML metadata via Rsync or direct FTP."}]} +{"source_id": "nypl__hebrew_manuscripts_digital_collections", "record_type": "collection", "status": "candidate", "priority": "high", "provider": "New York Public Library", "title": "NYPL Digital Collections — Hebrew Manuscripts, Ketubbot, and Letters", "description": "Hebrew Illuminated Manuscripts, historical Ketubbot (handwritten marriage contracts), and early modern letters. ~1,174 items in the Hebrew Illuminated Manuscripts sub-collection. 
Out-of-copyright materials are completely free for any use including commercial.", "urls": {"canonical": "https://digitalcollections.nypl.org/", "landing": "https://digitalcollections.nypl.org/collections/hebrew-illuminated-manuscripts", "api": null, "download": null, "related": []}, "rights": {"rights_basis": "public_domain", "license_expression": "PDM-1.0", "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": null, "evidence_text": "NYPL policy: out-of-copyright digital materials are completely free for any use including commercial, no permission required. Source: docs/sources/gemini_summary_2.md.", "terms_url": null, "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "medieval-modern", "languages": ["he", "yi", "other"], "document_types": ["manuscript", "ketubbah", "letter", "other"], "creator_names": [], "expected_handwriting": "yes", "estimated_scan_count": 1174}, "ingest": {"method": "api", "access_notes": "Filter by public domain on portal; public API available for programmatic download. Need to identify specific in-scope Hebrew handwriting items.", "agent_notes": "Hebrew Illuminated Manuscripts sub-collection has 1,174 results. Filter by public domain on portal; public API available for programmatic download. Ketubbot are particularly promising — high volume, post-1929 dates possible, standardised form but varied handwriting. 
Must filter to 'public domain' items only before downloading.", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/gemini_summary_2.md", "quote": "NYPL has a policy of making all of its out-of-copyright digital materials completely free for any use, including commercial, with no permission required."}, {"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_3.md", "quote": "NYPL search result shows 1,174 collection results; use only items marked public domain"}]} +{"source_id": "openn__katz_center_judaica", "record_type": "collection", "status": "candidate", "priority": "high", "provider": "OPenn, University of Pennsylvania Libraries", "title": "Katz Center for Advanced Judaic Studies — Hebrew Manuscripts (OPenn)", "description": "Hundreds of digitized handwritten Hebrew manuscripts, codices, and historical documents. All OPenn content is CC0 1.0 Universal.", "urls": {"canonical": "https://openn.library.upenn.edu/", "landing": "https://openn.library.upenn.edu/html/0002.html", "api": null, "download": null, "related": []}, "rights": {"rights_basis": "public_domain", "license_expression": "CC0-1.0", "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": null, "evidence_text": "OPenn platform policy: all content released under CC0 1.0 Universal (Public Domain Dedication). Source: docs/sources/gemini_summary_2.md.", "terms_url": "https://openn.library.upenn.edu/ReadMe.html", "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "medieval-modern", "languages": ["he", "yi", "ar", "other"], "document_types": ["manuscript", "codex", "letter", "other"], "creator_names": [], "expected_handwriting": "yes", "estimated_scan_count": null}, "ingest": {"method": "api", "access_notes": "Bulk rsync or FTP access (openn.library.upenn.edu). 
Need to identify specific Judaica sub-collections with Hebrew handwriting in scope, then filter by date range.", "agent_notes": "High-value lead: CC0 license, bulk access, and OPenn is explicitly designed for computational use.", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/gemini_summary_2.md", "quote": "All of it is released under a CC0 1.0 Universal (Public Domain Dedication). You can pull entire directories of high-res TIFFs/JPEGs and XML metadata via Rsync or direct FTP."}, {"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_3.md", "quote": "Penn manuscript holdings, including Judaica/Judeo-Arabic/Hebrew items; images are public domain / free of known copyright restrictions"}]} {"source_id": "wikimedia__handwritten_hebrew_letters", "record_type": "category", "status": "verified", "priority": "medium", "provider": "Wikimedia Commons", "title": "Category: Handwritten Hebrew letters", "description": "Commons category lead for freely licensed or public-domain handwritten Hebrew letter images and related media.", "urls": {"canonical": "https://commons.wikimedia.org/wiki/Category:Handwritten_Hebrew_letters", "landing": null, "api": null, "download": null, "related": ["https://commons.wikimedia.org/wiki/Category:Hebrew_handwriting_scripts"]}, "rights": {"rights_basis": "mixed", "license_expression": "PDM-1.0", "commercial_use_allowed": true, "derivatives_allowed": true, "scan_redistribution_allowed": true, "attribution_required": false, "evidence_text": "Per-file verification: 2 qualifying files ingested. File:Delacroix letter.png uses PD-old-100; File:Solitreo contract.jpg uses PD-Art|PD-old-70. Most other category files are SVG teaching samples, character-level crops, or CC-BY-SA 3.0 (excluded). 
Mixed rights overall; ingested items are all PDM-1.0.", "terms_url": null, "verification_status": "primary_page_checked", "verified_at": "2026-05-15"}, "scope": {"date_range": "mixed", "languages": ["he", "lad"], "document_types": ["letter", "other"], "creator_names": [], "expected_handwriting": "mixed", "estimated_scan_count": 2}, "ingest": {"method": "api", "access_notes": "Use MediaWiki API and file pages; exclude SVG teaching samples unless the dataset explicitly wants vector handwriting examples.", "agent_notes": "Ingested 2 qualifying handwritten scans from Category:Handwritten_Hebrew_letters and subcategory Category:Solitreo_script (under Category:Hebrew_handwriting_scripts). Most files in the main category were excluded: SVG teaching samples, tiny character crops (<55px), group photographs, and CC-BY-SA 3.0 licensed files. Files with {{Wrong license}} template were excluded. Two public-domain Solitreo script documents qualified.", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/gemini_summary_1.md:127-141", "quote": "community-moderated repository containing original photographic scans and SVG reproductions"}, {"kind": "repo_note", "citation": "docs/sources/gemini_report_1.md:204-210", "quote": "files generally mandated to be freely usable media"}]} -{"source_id": "openn__bl_hebrew_manuscripts", "record_type": "collection", "status": "candidate", "priority": "high", "provider": "OPenn / British Library", "title": "British Library — Hebrew Manuscripts (OPenn)", "description": "~1,300 Hebrew manuscripts from the British Library digitized and hosted on OPenn under CC0 1.0 Universal. 
Separate from bl.uk which restricts commercial use.", "urls": {"canonical": "https://openn.library.upenn.edu/", "landing": null, "api": null, "download": null, "related": []}, "rights": {"rights_basis": "cc0", "license_expression": "CC0-1.0", "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": null, "evidence_text": "OPenn platform policy: all content CC0 1.0 Universal. BL Hebrew MSS are hosted under this umbrella separately from bl.uk licensing. Source: docs/sources/chatgpt_summary_2.md.", "terms_url": "https://openn.library.upenn.edu/ReadMe.html", "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "medieval-modern", "languages": ["he", "ar", "other"], "document_types": ["manuscript", "codex", "other"], "creator_names": [], "expected_handwriting": "yes", "estimated_scan_count": 1300}, "ingest": {"method": "api", "access_notes": "Bulk rsync or FTP from openn.library.upenn.edu. Identify BL sub-collection directory path and filter by Hebrew handwriting.", "agent_notes": "High-value lead: CC0, ~1,300 MSS, separate from restrictive bl.uk terms. Confirm exact OPenn collection code for BL.", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_2.md", "quote": "British Library Hebrew MSS: ~1,300 MSS; BL digitized, hosted OPenn, separate from bl.uk (which restricts commercial)"}]} -{"source_id": "openn__cairo_genizah_fragments", "record_type": "collection", "status": "candidate", "priority": "high", "provider": "OPenn (multiple holding institutions)", "title": "Cairo Genizah Fragments — OPenn hosted collections", "description": "Genizah fragments contributed by multiple institutions and hosted on OPenn under CC0. 
Distinct from Cambridge Digital Library Genizah (which restricts commercial use).", "urls": {"canonical": "https://openn.library.upenn.edu/", "landing": null, "api": null, "download": null, "related": []}, "rights": {"rights_basis": "cc0", "license_expression": "CC0-1.0", "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": null, "evidence_text": "OPenn platform-wide CC0 policy applies. Source: docs/sources/chatgpt_summary_2.md.", "terms_url": "https://openn.library.upenn.edu/ReadMe.html", "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "medieval", "languages": ["he", "ar", "jrb", "other"], "document_types": ["manuscript", "fragment", "other"], "creator_names": [], "expected_handwriting": "yes", "estimated_scan_count": null}, "ingest": {"method": "api", "access_notes": "Identify Genizah sub-collection directories on OPenn; use rsync for bulk access.", "agent_notes": "Confirm which OPenn institution codes hold Genizah material; cross-check with cambridge__digital_library_hebrew_genizah (excluded — commercial restriction).", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_2.md", "quote": "Cairo Genizah fragments: several Genizah-holding institutions contribute to OPenn; verify by collection"}]} -{"source_id": "openn__manchester_hebrew_manuscripts", "record_type": "collection", "status": "candidate", "priority": "medium", "provider": "OPenn / University of Manchester (John Rylands Library)", "title": "John Rylands Library — Hebrew Manuscripts (OPenn)", "description": "Hebrew manuscripts from the University of Manchester John Rylands Library hosted on OPenn. 
JRL policy is CC BY 4.0 (not CC0 like most OPenn content) — verify per item.", "urls": {"canonical": "https://openn.library.upenn.edu/", "landing": null, "api": null, "download": null, "related": []}, "rights": {"rights_basis": "cc_by", "license_expression": "CC-BY-4.0", "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": null, "evidence_text": "JRL policy is CC BY 4.0 per the ChatGPT survey; not CC0 like most OPenn content. Source: docs/sources/chatgpt_summary_2.md.", "terms_url": "https://openn.library.upenn.edu/ReadMe.html", "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "medieval-modern", "languages": ["he", "ar", "other"], "document_types": ["manuscript", "codex", "other"], "creator_names": [], "expected_handwriting": "yes", "estimated_scan_count": null}, "ingest": {"method": "api", "access_notes": "Bulk rsync/FTP on OPenn. Identify Manchester/JRL collection code. Attribution required for CC BY 4.0.", "agent_notes": "Lower priority than CC0 OPenn collections due to attribution requirement, but still commercially usable.", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_2.md", "quote": "Manchester / John Rylands Library Hebrew: CC BY 4.0 per JRL policy (not CC0 like most OPenn) — verify per item"}]} -{"source_id": "openn__zucker_ketubah_collection", "record_type": "collection", "status": "candidate", "priority": "high", "provider": "OPenn / Zucker Manuscript Library", "title": "Zucker Ketubah Collection (OPenn)", "description": "249 handwritten marriage contracts (ketubot) hosted on OPenn under CC0. 
High-value: varied hands, standardized form with free-form decoration, spanning centuries.", "urls": {"canonical": "https://openn.library.upenn.edu/", "landing": null, "api": null, "download": null, "related": []}, "rights": {"rights_basis": "public_domain", "license_expression": "CC0-1.0", "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": null, "evidence_text": "OPenn platform CC0 policy. Source: docs/sources/chatgpt_summary_2.md.", "terms_url": "https://openn.library.upenn.edu/ReadMe.html", "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "medieval-modern", "languages": ["he", "ar"], "document_types": ["ketubbah"], "creator_names": [], "expected_handwriting": "yes", "estimated_scan_count": 249}, "ingest": {"method": "api", "access_notes": "Identify Zucker collection directory on OPenn; rsync bulk download.", "agent_notes": "249 ketubot is a well-scoped, manageable batch. Ketubot provide high handwriting diversity in a uniform document type — good for HTR training.", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_2.md", "quote": "Zucker Ketubah Collection: 249 marriage contracts, all PD"}]} -{"source_id": "leipzig__hebrew_manuscripts", "record_type": "collection", "status": "candidate", "priority": "medium", "provider": "Leipzig University Library", "title": "Leipzig University Library — Hebrew Manuscripts", "description": "~68 Hebrew manuscripts released under Public Domain Mark. 
Accessible via Manuscripta Mediaevalia portal.", "urls": {"canonical": "https://www.manuscripta-mediaevalia.de/", "landing": null, "api": null, "download": null, "related": []}, "rights": {"rights_basis": "public_domain", "license_expression": "PDM-1.0", "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": null, "evidence_text": "Digital images released under PDM per ChatGPT survey. Source: docs/sources/chatgpt_summary_2.md.", "terms_url": null, "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "medieval-modern", "languages": ["he", "ar", "other"], "document_types": ["manuscript", "codex", "other"], "creator_names": [], "expected_handwriting": "yes", "estimated_scan_count": 68}, "ingest": {"method": "manual_download", "access_notes": "Search Manuscripta Mediaevalia for Leipzig Hebrew items; download image series per manuscript.", "agent_notes": "Modest count (~68) makes this feasible as a manual batch. Verify per-item PDM status before ingesting.", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_2.md", "quote": "~68 Hebrew MSS; digital images released under PDM/PD."}]} -{"source_id": "mdz__hebrew_manuscripts", "record_type": "collection", "status": "candidate", "priority": "medium", "provider": "Bayerische Staatsbibliothek / MDZ", "title": "Bayerische Staatsbibliothek — Hebrew Manuscripts (MDZ)", "description": "Large digitized Hebrew collection on the Münchner Digitalisierungszentrum portal. 
PDM-marked items are safe for unrestricted use; some items carry additional licensing notes.", "urls": {"canonical": "https://www.digitale-sammlungen.de/", "landing": null, "api": null, "download": null, "related": []}, "rights": {"rights_basis": "mixed", "license_expression": null, "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": null, "evidence_text": "PDM-marked items are safe; other items may have scan-level licensing notes. Must filter to PDM-only before ingesting. Source: docs/sources/chatgpt_summary_2.md.", "terms_url": null, "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "medieval-modern", "languages": ["he", "ar", "other"], "document_types": ["manuscript", "codex", "other"], "creator_names": [], "expected_handwriting": "yes", "estimated_scan_count": null}, "ingest": {"method": "api", "access_notes": "MDZ has a public API (OAI-PMH and direct download). Filter to items with PDM tag; verify scan-page license before bulk download.", "agent_notes": "Mixed rights — need per-item check. Only ingest PDM-flagged items. Consider as a lower-priority batch after OPenn and Leipzig are exhausted.", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_2.md", "quote": "PDM-marked items are safe. Restrict to PDM-flagged items; some items have licensing notes on the scan page."}]} -{"source_id": "archive__hebrew_manuscripts", "record_type": "collection", "status": "candidate", "priority": "low", "provider": "Internet Archive", "title": "Internet Archive — Hebrew Manuscripts (PDM uploads)", "description": "PDM-tagged Hebrew manuscript uploads from various libraries mirrored on the Internet Archive. 
Quality and format vary; useful as overflow/discovery source.", "urls": {"canonical": "https://archive.org/", "landing": null, "api": null, "download": null, "related": []}, "rights": {"rights_basis": "public_domain", "license_expression": "PDM-1.0", "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": null, "evidence_text": "PDM-tagged items only. Internet Archive does not guarantee rights; verify each item's stated license before ingesting. Source: docs/sources/chatgpt_summary_2.md.", "terms_url": null, "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "medieval-modern", "languages": ["he", "yi", "other"], "document_types": ["manuscript", "codex", "other"], "creator_names": [], "expected_handwriting": "mixed", "estimated_scan_count": null}, "ingest": {"method": "api", "access_notes": "Internet Archive S3-like API for bulk download; filter by subject:hebrew AND mediatype:texts AND licenseurl:PDM.", "agent_notes": "Low priority — many IA items duplicate OPenn/LoC sources with lower image quality. Use as discovery/gap-fill after primary sources are exhausted.", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_2.md", "quote": "PDM-tagged uploads from various libraries. Quality and format vary. Useful as a discovery/overflow source."}]} -{"source_id": "huggingface__sivan22_hebrew_handwritten", "record_type": "dataset", "status": "candidate", "priority": "low", "provider": "HuggingFace / sivan22", "title": "sivan22/hebrew-handwritten-dataset (HuggingFace)", "description": "5,093 rows of isolated Hebrew character crops under CC BY 3.0. 
Not page-level scans; useful as external HTR reference but not corpus content.", "urls": {"canonical": "https://huggingface.co/datasets/sivan22/hebrew-handwritten-dataset", "landing": null, "api": null, "download": null, "related": []}, "rights": {"rights_basis": "cc_by", "license_expression": "CC-BY-3.0", "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": null, "evidence_text": "Dataset card on HuggingFace states CC BY 3.0. Note: tc11__hhd_v0 has conflicting CC BY-ND 3.0 label — these may share origin. Source: docs/sources/chatgpt_summary_2.md.", "terms_url": null, "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "modern", "languages": ["he"], "document_types": ["form", "other"], "creator_names": [], "expected_handwriting": "yes", "estimated_scan_count": 5093}, "ingest": {"method": "dataset_download", "access_notes": "HuggingFace datasets API or direct download.", "agent_notes": "Character-level crops only, not page scans. Cross-check license against tc11__hhd_v0 — may share origin with conflicting mirrors.", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_2.md", "quote": "5,093 rows of isolated Hebrew character crops; CC BY 3.0."}]} +{"source_id": "openn__bl_hebrew_manuscripts", "record_type": "collection", "status": "candidate", "priority": "high", "provider": "OPenn / British Library", "title": "British Library — Hebrew Manuscripts (OPenn)", "description": "~1,300 Hebrew manuscripts from the British Library digitized and hosted on OPenn under CC0 1.0 Universal. 
Separate from bl.uk which restricts commercial use.", "urls": {"canonical": "https://openn.library.upenn.edu/", "landing": "https://openn.library.upenn.edu/html/0032_contents.html", "api": null, "download": null, "related": ["https://polonskyfoundation.org/cultural-heritage-and-digitisation/polonsky-foundation-catalogue-of-digitised-hebrew-manuscripts/"]}, "rights": {"rights_basis": "cc0", "license_expression": "CC0-1.0", "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": null, "evidence_text": "OPenn platform policy: all content CC0 1.0 Universal. BL Hebrew MSS are hosted under this umbrella separately from bl.uk licensing. Source: docs/sources/chatgpt_summary_2.md.", "terms_url": "https://openn.library.upenn.edu/ReadMe.html", "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "medieval-modern", "languages": ["he", "ar", "other"], "document_types": ["manuscript", "codex", "other"], "creator_names": [], "expected_handwriting": "yes", "estimated_scan_count": 435000}, "ingest": {"method": "api", "access_notes": "Bulk rsync or FTP from openn.library.upenn.edu. Identify BL sub-collection directory path and filter by Hebrew handwriting.", "agent_notes": "High-value lead: CC0, ~1,300 MSS / ~435,000 images (Polonsky Foundation digitisation project), separate from restrictive bl.uk terms. OPenn collection code: 0032 (Polonsky BL Hebrew). 
Survey (chatgpt_summary_3.md) recommends using OPenn rather than the BL viewer.", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_2.md", "quote": "British Library Hebrew MSS: ~1,300 MSS; BL digitized, hosted OPenn, separate from bl.uk (which restricts commercial)"}, {"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_3.md", "quote": "Very large: BL project digitized ~1,300 manuscripts / ~435,000 images; OPenn states BL-hosted images are free of known copyright restrictions / public domain"}]} +{"source_id": "openn__cairo_genizah_fragments", "record_type": "collection", "status": "candidate", "priority": "high", "provider": "OPenn (multiple holding institutions)", "title": "Cairo Genizah Fragments — OPenn hosted collections", "description": "Genizah fragments contributed by multiple institutions and hosted on OPenn under CC0. Distinct from Cambridge Digital Library Genizah (which restricts commercial use).", "urls": {"canonical": "https://openn.library.upenn.edu/", "landing": "https://openn.library.upenn.edu/html/genizah_contents.html", "api": null, "download": null, "related": []}, "rights": {"rights_basis": "cc0", "license_expression": "CC0-1.0", "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": null, "evidence_text": "OPenn platform-wide CC0 policy applies. Source: docs/sources/chatgpt_summary_2.md.", "terms_url": "https://openn.library.upenn.edu/ReadMe.html", "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "medieval", "languages": ["he", "ar", "jrb", "other"], "document_types": ["manuscript", "fragment", "other"], "creator_names": [], "expected_handwriting": "yes", "estimated_scan_count": null}, "ingest": {"method": "api", "access_notes": "OPenn Genizah contents page: https://openn.library.upenn.edu/html/genizah_contents.html. 
Identify Genizah sub-collection directories on OPenn; use rsync for bulk access.", "agent_notes": "Confirm which OPenn institution codes hold Genizah material; cross-check with cambridge__digital_library_hebrew_genizah (excluded — commercial restriction).", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_2.md", "quote": "Cairo Genizah fragments: several Genizah-holding institutions contribute to OPenn; verify by collection"}, {"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_3.md", "quote": "OPenn states all images from the Cairo Genizah project are public domain / free of known copyright restrictions"}]} +{"source_id": "openn__manchester_hebrew_manuscripts", "record_type": "collection", "status": "candidate", "priority": "medium", "provider": "OPenn / University of Manchester (John Rylands Library)", "title": "John Rylands Library — Hebrew Manuscripts (OPenn)", "description": "Hebrew manuscripts from the University of Manchester John Rylands Library hosted on OPenn. JRL policy is CC BY 4.0 (not CC0 like most OPenn content) — verify per item.", "urls": {"canonical": "https://openn.library.upenn.edu/", "landing": "https://openn.library.upenn.edu/html/0021.html", "api": null, "download": null, "related": []}, "rights": {"rights_basis": "cc_by", "license_expression": "CC-BY-4.0", "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": null, "evidence_text": "JRL policy is CC BY 4.0 per the ChatGPT survey; not CC0 like most OPenn content. 
Source: docs/sources/chatgpt_summary_2.md.", "terms_url": "https://openn.library.upenn.edu/ReadMe.html", "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "medieval-modern", "languages": ["he", "ar", "other"], "document_types": ["manuscript", "codex", "other"], "creator_names": [], "expected_handwriting": "yes", "estimated_scan_count": null}, "ingest": {"method": "api", "access_notes": "Bulk rsync/FTP on OPenn. Identify Manchester/JRL collection code. Attribution required for CC BY 4.0.", "agent_notes": "CRITICAL: Manchester's own digital collections viewer typically shows CC BY-NC terms. Use ONLY the OPenn-hosted copy (collection 0021) where the license is CC BY 4.0. Attribution required. Lower priority than CC0 OPenn collections, but commercially usable via OPenn path.", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_2.md", "quote": "Manchester / John Rylands Library Hebrew: CC BY 4.0 per JRL policy (not CC0 like most OPenn) — verify per item"}, {"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_3.md", "quote": "Important caveat: Manchester's own digital collections often show NC terms; use OPenn-hosted items where the CC BY 4.0 statement applies"}]} +{"source_id": "openn__zucker_ketubah_collection", "record_type": "collection", "status": "candidate", "priority": "high", "provider": "OPenn / Zucker Manuscript Library", "title": "Zucker Ketubah Collection (OPenn)", "description": "249 handwritten marriage contracts (ketubot) hosted on OPenn under CC0. 
High-value: varied hands, standardized form with free-form decoration, spanning centuries.", "urls": {"canonical": "https://openn.library.upenn.edu/", "landing": "https://openn.library.upenn.edu/html/0051.html", "api": null, "download": null, "related": []}, "rights": {"rights_basis": "public_domain", "license_expression": "CC0-1.0", "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": null, "evidence_text": "OPenn platform CC0 policy. Source: docs/sources/chatgpt_summary_2.md.", "terms_url": "https://openn.library.upenn.edu/ReadMe.html", "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "medieval-modern", "languages": ["he", "ar"], "document_types": ["ketubbah"], "creator_names": [], "expected_handwriting": "yes", "estimated_scan_count": 249}, "ingest": {"method": "api", "access_notes": "Identify Zucker collection directory on OPenn; rsync bulk download.", "agent_notes": "249 ketubot confirmed (17th–20th c., many regions/scripts). Well-scoped, manageable batch. Ketubot provide high handwriting diversity in a uniform document type — good for HTR training.", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_2.md", "quote": "Zucker Ketubah Collection: 249 marriage contracts, all PD"}, {"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_3.md", "quote": "249 historical ketubot, 17th–20th c., many regions/scripts; images are public domain / free of known copyright restrictions"}]} +{"source_id": "leipzig__hebrew_manuscripts", "record_type": "collection", "status": "candidate", "priority": "medium", "provider": "Leipzig University Library", "title": "Leipzig University Library — Hebrew Manuscripts", "description": "~68 Hebrew manuscripts released under Public Domain Mark. 
Accessible via Manuscripta Mediaevalia portal.", "urls": {"canonical": "https://www.manuscripta-mediaevalia.de/", "landing": "https://www.ub.uni-leipzig.de/en/research-library/digital-collections/hebrew-manuscripts", "api": null, "download": null, "related": []}, "rights": {"rights_basis": "public_domain", "license_expression": "PDM-1.0", "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": null, "evidence_text": "Digital images released under PDM per ChatGPT survey. Source: docs/sources/chatgpt_summary_2.md.", "terms_url": null, "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "medieval-modern", "languages": ["he", "ar", "other"], "document_types": ["manuscript", "codex", "other"], "creator_names": [], "expected_handwriting": "yes", "estimated_scan_count": 68}, "ingest": {"method": "manual_download", "access_notes": "Leipzig direct page gives rights status (Public Domain). Access is also via Ktiv/NLI aggregator. ~68 complete book manuscripts, 2 scrolls, and fragments — feasible as a manual batch.", "agent_notes": "Modest count (~68) makes this feasible as a manual batch. Verify per-item PDM status before ingesting.", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_2.md", "quote": "~68 Hebrew MSS; digital images released under PDM/PD."}, {"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_3.md", "quote": "68 complete book manuscripts, 2 scrolls, fragments; Listed as Public Domain; Leipzig's page gives the rights status"}]} +{"source_id": "mdz__hebrew_manuscripts", "record_type": "collection", "status": "candidate", "priority": "medium", "provider": "Bayerische Staatsbibliothek / MDZ", "title": "Bayerische Staatsbibliothek — Hebrew Manuscripts (MDZ)", "description": "~700 Hebrew manuscript pieces including 183 fragments (12th–18th c.), codices, Torah scrolls, and Esther scrolls. 
Only PDM-flagged items are safe for commercial use.", "urls": {"canonical": "https://www.digitale-sammlungen.de/", "landing": "https://www.digitale-sammlungen.de/en/hebrew-manuscripts", "api": null, "download": null, "related": []}, "rights": {"rights_basis": "mixed", "license_expression": null, "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": null, "evidence_text": "PDM-marked items are safe; other items may have scan-level licensing notes. Must filter to PDM-only before ingesting. Source: docs/sources/chatgpt_summary_2.md.", "terms_url": null, "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "medieval-modern", "languages": ["he", "ar", "other"], "document_types": ["manuscript", "codex", "other"], "creator_names": [], "expected_handwriting": "yes", "estimated_scan_count": 700}, "ingest": {"method": "api", "access_notes": "MDZ has a public API (OAI-PMH and direct download). Filter to items with PDM tag; verify scan-page license before bulk download.", "agent_notes": "Scale: ~700 pieces including 183 fragments (12th–18th c.). Mixed rights — need per-item check. Only ingest PDM-flagged items. Consider as a lower-priority batch after OPenn and Leipzig are exhausted.", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_2.md", "quote": "PDM-marked items are safe. Restrict to PDM-flagged items; some items have licensing notes on the scan page."}, {"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_3.md", "quote": "~700 pieces incl. 
183 fragments, 12th–18th c.; codices, Torah scrolls, Esther scrolls; only PDM items safe for commercial use"}]} +{"source_id": "archive__hebrew_manuscripts", "record_type": "collection", "status": "candidate", "priority": "low", "provider": "Internet Archive", "title": "Internet Archive — Hebrew Manuscripts (PDM uploads)", "description": "PDM-tagged Hebrew manuscript uploads from various libraries mirrored on the Internet Archive. Quality and format vary; useful as overflow/discovery source.", "urls": {"canonical": "https://archive.org/", "landing": null, "api": null, "download": null, "related": []}, "rights": {"rights_basis": "public_domain", "license_expression": "PDM-1.0", "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": null, "evidence_text": "PDM-tagged items only. Internet Archive does not guarantee rights; verify each item's stated license before ingesting. Source: docs/sources/chatgpt_summary_2.md.", "terms_url": null, "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "medieval-modern", "languages": ["he", "yi", "other"], "document_types": ["manuscript", "codex", "other"], "creator_names": [], "expected_handwriting": "mixed", "estimated_scan_count": null}, "ingest": {"method": "api", "access_notes": "Internet Archive S3-like API for bulk download; filter by subject:hebrew AND mediatype:texts AND licenseurl:PDM.", "agent_notes": "Low priority — many IA items duplicate OPenn/LoC sources with lower image quality. Named high-value items to target: Leningrad Codex, Aleppo Codex, Cervera Bible, Lailashi Codex, Haverford Masoretic Bible. Use as discovery/gap-fill after primary sources are exhausted. For commercial use, verify against the holding institution when possible.", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_2.md", "quote": "PDM-tagged uploads from various libraries. 
Quality and format vary. Useful as a discovery/overflow source."}, {"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_3.md", "quote": "Examples include Leningrad Codex, Aleppo Codex, Cervera Bible, Lailashi Codex, Haverford Masoretic Bible; many items marked PDM 1.0"}]} +{"source_id": "huggingface__sivan22_hebrew_handwritten", "record_type": "dataset", "status": "candidate", "priority": "low", "provider": "HuggingFace / sivan22", "title": "sivan22/hebrew-handwritten-dataset (HuggingFace)", "description": "Modern isolated handwritten Hebrew characters: 5,093 rows, train/test split, 28 classes (22 base letters + final forms). Licensed CC BY 3.0.", "urls": {"canonical": "https://huggingface.co/datasets/sivan22/hebrew-handwritten-dataset", "landing": null, "api": null, "download": null, "related": []}, "rights": {"rights_basis": "cc_by", "license_expression": "CC-BY-3.0", "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": null, "evidence_text": "Dataset card on HuggingFace states CC BY 3.0. Note: tc11__hhd_v0 has conflicting CC BY-ND 3.0 label — these may share origin. Source: docs/sources/chatgpt_summary_2.md.", "terms_url": null, "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "modern", "languages": ["he"], "document_types": ["form", "other"], "creator_names": [], "expected_handwriting": "yes", "estimated_scan_count": 5093}, "ingest": {"method": "dataset_download", "access_notes": "HuggingFace datasets API or direct download.", "agent_notes": "Character-level crops only, not page scans. License is CC BY 3.0 — note the project's AGENTS.md accepted list includes CC-BY-4.0 but not explicitly CC-BY-3.0; policy review needed before ingesting. 
Cross-check against tc11__hhd_v0 (rejected, CC BY-ND 3.0) — may share origin.", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_2.md", "quote": "5,093 rows of isolated Hebrew character crops; CC BY 3.0."}, {"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_3.md", "quote": "5,093 rows, train/test split, 28 classes; CC BY 3.0; main clearly permissive modern Hebrew handwriting dataset found"}]} {"source_id": "picryl__hebrew_manuscripts", "record_type": "collection", "status": "needs_review", "priority": "low", "provider": "PICRYL", "title": "PICRYL — Hebrew Manuscripts (aggregator)", "description": "Aggregator that mirrors PD items from NYPL, LoC, Europeana, and others with direct download links. Not a primary source; use for discovery and cross-referencing only.", "urls": {"canonical": "https://picryl.com/", "landing": null, "api": null, "download": null, "related": []}, "rights": {"rights_basis": "unknown", "license_expression": null, "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": null, "evidence_text": "PICRYL aggregates items from primary PD sources; each item inherits the rights of its source. Do not treat PICRYL itself as a rights authority — always trace back to the original institution. Source: docs/sources/chatgpt_summary_2.md.", "terms_url": null, "verification_status": "unverified", "verified_at": null}, "scope": {"date_range": "medieval-modern", "languages": ["he", "other"], "document_types": ["manuscript", "other"], "creator_names": [], "expected_handwriting": "mixed", "estimated_scan_count": null}, "ingest": {"method": "scrape", "access_notes": "Use as discovery tool; trace all items back to primary source institution before ingesting.", "agent_notes": "Aggregator only — do not cite PICRYL as a rights source. 
Use for finding items, then ingest from LoC/NYPL/OPenn directly.", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_2.md", "quote": "Not a primary source; use as discovery/cross-reference tool only."}]} {"source_id": "bnf__gallica_hebrew_manuscripts", "record_type": "collection", "status": "rejected", "priority": "exclude", "provider": "Bibliothèque nationale de France (BnF)", "title": "BnF Gallica — Hebrew Manuscripts", "description": "Large digitized Hebrew manuscript collection on Gallica. Commercial use not permitted under Gallica terms of use.", "urls": {"canonical": "https://gallica.bnf.fr/", "landing": null, "api": null, "download": null, "related": []}, "rights": {"rights_basis": "restricted", "license_expression": null, "commercial_use_allowed": false, "derivatives_allowed": null, "scan_redistribution_allowed": false, "attribution_required": null, "evidence_text": "Commercial use not permitted under Gallica terms of use. Source: docs/sources/chatgpt_summary_2.md.", "terms_url": null, "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "medieval-modern", "languages": ["he", "other"], "document_types": ["manuscript", "codex", "other"], "creator_names": [], "expected_handwriting": "yes", "estimated_scan_count": null}, "ingest": {"method": "unknown", "access_notes": null, "agent_notes": "Excluded due to commercial use restriction.", "blocked_reason": "rights_restriction"}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_2.md", "quote": "Commercial use not permitted under Gallica terms"}]} {"source_id": "bodleian__hebrew_manuscripts", "record_type": "collection", "status": "rejected", "priority": "exclude", "provider": "Bodleian Libraries, University of Oxford", "title": "Bodleian Digital Library — Hebrew Manuscripts", "description": "Extensive Hebrew manuscript collection including Genizah fragments at the Bodleian. 
Non-commercial restriction on digital scan images.", "urls": {"canonical": "https://digital.bodleian.ox.ac.uk/", "landing": null, "api": null, "download": null, "related": []}, "rights": {"rights_basis": "restricted", "license_expression": null, "commercial_use_allowed": false, "derivatives_allowed": null, "scan_redistribution_allowed": false, "attribution_required": null, "evidence_text": "Non-commercial restriction applies to digital scan images. Source: docs/sources/chatgpt_summary_2.md.", "terms_url": null, "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "medieval-modern", "languages": ["he", "ar", "other"], "document_types": ["manuscript", "fragment", "other"], "creator_names": [], "expected_handwriting": "yes", "estimated_scan_count": null}, "ingest": {"method": "unknown", "access_notes": null, "agent_notes": "Excluded due to non-commercial restriction. Note: some Bodleian Genizah items appear on Wikimedia Commons under permissive licenses — those can be ingested individually as commons__ entries.", "blocked_reason": "rights_restriction"}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_2.md", "quote": "Bodleian Digital Library Hebrew: non-commercial restriction on digital scans"}]} {"source_id": "vatican__digivatlib_hebrew", "record_type": "collection", "status": "rejected", "priority": "exclude", "provider": "Biblioteca Apostolica Vaticana", "title": "DigiVatLib — Hebrew Manuscripts (Vatican)", "description": "Large Hebrew manuscript collection at the Vatican Apostolic Library digitized on DigiVatLib. 
Commercial use explicitly forbidden.", "urls": {"canonical": "https://digi.vatlib.it/", "landing": null, "api": null, "download": null, "related": []}, "rights": {"rights_basis": "restricted", "license_expression": null, "commercial_use_allowed": false, "derivatives_allowed": null, "scan_redistribution_allowed": false, "attribution_required": null, "evidence_text": "Commercial use explicitly forbidden per DigiVatLib terms. Source: docs/sources/chatgpt_summary_2.md.", "terms_url": null, "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "medieval-modern", "languages": ["he", "ar", "other"], "document_types": ["manuscript", "codex", "other"], "creator_names": [], "expected_handwriting": "yes", "estimated_scan_count": null}, "ingest": {"method": "unknown", "access_notes": null, "agent_notes": "Excluded due to commercial use prohibition.", "blocked_reason": "rights_restriction"}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_2.md", "quote": "Vatican DigiVatLib Hebrew: commercial use explicitly forbidden"}]} {"source_id": "github__hebhtr", "record_type": "dataset", "status": "rejected", "priority": "exclude", "provider": "GitHub (various contributors)", "title": "HebHTR — GitHub Hebrew HTR Dataset", "description": "Hebrew handwriting recognition dataset on GitHub. No clear permissive license on training data images.", "urls": {"canonical": "https://github.com/YontiLevin/HebHTR", "landing": null, "api": null, "download": null, "related": []}, "rights": {"rights_basis": "unknown", "license_expression": null, "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": null, "evidence_text": "No clear permissive license on the training image data (code license does not cover data). 
Source: docs/sources/chatgpt_summary_2.md.", "terms_url": null, "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "modern", "languages": ["he"], "document_types": ["form", "other"], "creator_names": [], "expected_handwriting": "yes", "estimated_scan_count": null}, "ingest": {"method": "unknown", "access_notes": null, "agent_notes": "Excluded: code repo license does not extend to training data images. Rights on underlying images unverified.", "blocked_reason": "unclear_license"}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_2.md", "quote": "GitHub HebHTR: no clear permissive license on training data"}]} +{"source_id": "commons__hebrew_language_manuscripts", "record_type": "category", "status": "candidate", "priority": "medium", "provider": "Wikimedia Commons", "title": "Wikimedia Commons — Category:Hebrew-language manuscripts", "description": "Broad parent Commons category covering Hebrew-language manuscript images: 17 subcategories and ~105 direct files including Cairo Geniza items, Hebrew Bible manuscripts, illuminated Hebrew MSS, Wellcome Collection items, Damascus Pentateuch, and more. Mixed PD / CC / CC BY-SA by file.", "urls": {"canonical": "https://commons.wikimedia.org/wiki/Category:Hebrew-language_manuscripts", "landing": null, "api": null, "download": null, "related": ["https://commons.wikimedia.org/wiki/Category:Handwritten_Hebrew_letters", "https://commons.wikimedia.org/wiki/Category:Hebrew_calligraphy"]}, "rights": {"rights_basis": "mixed", "license_expression": null, "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": null, "evidence_text": "Category-level; individual file rights vary. 
Check each file page.", "terms_url": null, "verification_status": "unverified", "verified_at": null}, "scope": {"date_range": "mixed", "languages": ["he", "ar", "other"], "document_types": ["manuscript", "other"], "creator_names": [], "expected_handwriting": "mixed", "estimated_scan_count": 105}, "ingest": {"method": "api", "access_notes": "Use MediaWiki API to enumerate files and sub-categories. Verify each file page before ingesting — rights vary per file.", "agent_notes": "Discovery source: 17 subcategories + ~105 direct files. Broader parent of commons__handwritten_hebrew_letters and commons__hebrew_calligraphy. Do not assume category-wide rights — check each file. Priority: explore subcategories for overlooked material not yet in the corpus.", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_3.md", "quote": "Commons category with 17 subcategories and 105 direct files, including Cairo Geniza, Hebrew Bible manuscripts, illuminated Hebrew MSS, Wellcome, Damascus Pentateuch"}]} +{"source_id": "commons__hebrew_calligraphy", "record_type": "category", "status": "candidate", "priority": "medium", "provider": "Wikimedia Commons", "title": "Wikimedia Commons — Category:Hebrew calligraphy", "description": "Commons category for Hebrew calligraphy images: ~74 files plus subcategories, including illuminated manuscripts and ketubot. Mixed PD / CC by file. 
Useful for calligraphy and letter-shape images, distinct from the handwriting-specific category.", "urls": {"canonical": "https://commons.wikimedia.org/wiki/Category:Hebrew_calligraphy", "landing": null, "api": null, "download": null, "related": ["https://commons.wikimedia.org/wiki/Category:Hebrew-language_manuscripts", "https://commons.wikimedia.org/wiki/Category:Handwritten_Hebrew_letters"]}, "rights": {"rights_basis": "mixed", "license_expression": null, "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": null, "evidence_text": "Category-level; individual file rights vary. Check each file page.", "terms_url": null, "verification_status": "unverified", "verified_at": null}, "scope": {"date_range": "mixed", "languages": ["he"], "document_types": ["manuscript", "other"], "creator_names": [], "expected_handwriting": "mixed", "estimated_scan_count": 74}, "ingest": {"method": "api", "access_notes": "Use MediaWiki API to enumerate files. Verify each file page before ingesting — rights vary per file.", "agent_notes": "~74 files plus subcategories. Good for illuminated MSS and ketubot. Distinct from the handwriting-focused categories — may surface decorative/scribal material. Do not assume category-wide rights — check each file.", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_3.md", "quote": "74 files plus subcategories, including illuminated manuscripts and ketubot; useful for calligraphy/letter-shape images"}]} +{"source_id": "openn__judaica_collection_index", "record_type": "collection", "status": "candidate", "priority": "high", "provider": "OPenn, University of Pennsylvania Libraries", "title": "OPenn — Judaica Collection Index", "description": "Umbrella discovery index listing all OPenn Judaica repositories with individual license statements. 
Covers sub-collections tracked separately (Katz Center, Cairo Genizah, Manchester, Zucker) plus additional repositories not yet individually tracked — notably the Gaster Hebrew MSS and further Genizah fragment holdings. Individual repository licenses range from PD / CC0 to CC BY 4.0 to CC BY-SA 2.0.", "urls": {"canonical": "https://openn.library.upenn.edu/html/judaica_contents.html", "landing": "https://openn.library.upenn.edu/html/judaica_contents.html", "api": null, "download": null, "related": ["https://openn.library.upenn.edu/html/0002.html", "https://openn.library.upenn.edu/html/genizah_contents.html", "https://openn.library.upenn.edu/html/0021.html", "https://openn.library.upenn.edu/html/0051.html"]}, "rights": {"rights_basis": "mixed", "license_expression": null, "commercial_use_allowed": null, "derivatives_allowed": null, "scan_redistribution_allowed": null, "attribution_required": null, "evidence_text": "Collection-level; each OPenn repository has its own license statement on its contents page. Most are PD or CC0; some are CC BY 4.0 or CC BY-SA 2.0.", "terms_url": "https://openn.library.upenn.edu/ReadMe.html", "verification_status": "source_note_only", "verified_at": null}, "scope": {"date_range": "medieval-modern", "languages": ["he", "ar", "other"], "document_types": ["manuscript", "codex", "letter", "other"], "creator_names": [], "expected_handwriting": "yes", "estimated_scan_count": null}, "ingest": {"method": "api", "access_notes": "Use this index page to discover all OPenn Judaica sub-collections. Each listed repository has its own contents page with item-level metadata and direct image downloads. Bulk rsync/FTP from openn.library.upenn.edu per sub-collection directory.", "agent_notes": "This is the discovery index for ALL OPenn Judaica repositories. Sub-collections already tracked individually: Katz Center (0002), Cairo Genizah, Manchester (0021), Zucker (0051). Additional collections to explore: Gaster Hebrew MSS, other Genizah fragment holdings. 
Always verify per-repository license before ingesting — most are CC0/PD but some are CC BY 4.0 or CC BY-SA 2.0.", "blocked_reason": null}, "evidence": [{"kind": "repo_note", "citation": "docs/sources/chatgpt_summary_3.md", "quote": "OPenn Collection of Judaica: Curated Judaica collection including Hebrew MSS, ketubot, Gaster Hebrew MSS, Genizah fragments; mixed but mostly useful individual repository licenses"}]}
diff --git a/datapackage.json b/datapackage.json
index d7c9d0a..0453c14 100644
--- a/datapackage.json
+++ b/datapackage.json
@@ -67,7 +67,7 @@
       "record_count": 345
     },
     {
-      "bytes": 180071,
+      "bytes": 190593,
       "description": "Source-level catalog. One JSON object per institution, collection, item, dataset, or source lead.",
       "encoding": "utf-8",
       "format": "jsonl",
@@ -75,7 +75,7 @@
       "name": "sources",
       "path": "data/index/sources.jsonl",
       "profile": "data-resource",
-      "record_count": 90
+      "record_count": 93
     }
   ],
   "schemas": {
@@ -106,9 +106,9 @@
     },
     "record_count": 345,
     "scan_byte_count": 389193522,
-    "source_record_count": 90,
+    "source_record_count": 93,
     "source_status_breakdown": {
-      "candidate": 13,
+      "candidate": 16,
       "needs_review": 1,
       "rejected": 16,
       "verified": 60
diff --git a/docs/sources/chatgpt_summary_3.md b/docs/sources/chatgpt_summary_3.md
new file mode 100644
index 0000000..92b7898
--- /dev/null
+++ b/docs/sources/chatgpt_summary_3.md
@@ -0,0 +1,101 @@
+# ChatGPT Research Survey 3 — Prioritised Commercial-Use Hebrew Manuscript Sources
+
+*Ingested 2026-05-24. Prompt: best reusable/commercial-safe sources for Hebrew manuscript scans,
+with explicit licence verification.*
+
+---
+
+## Key finding
+
+Most useful bulk sources are OPenn sub-collections (CC0 or CC BY 4.0) and the Library of Congress
+(public domain). Modern handwriting datasets (HHD) are non-commercial; the one exception is the
+HuggingFace sivan22 character-level dataset (CC BY 3.0 — needs policy decision).
+ +--- + +## Recommended sources (commercial-safe) + +### Bulk / highest-volume first + +| Source | Approx. size | Licence | OPenn/direct URL | +|--------|-------------|---------|-----------------| +| **OPenn — British Library Hebrew MSS** | ~1,300 MSS / ~435,000 images | CC0 1.0 (OPenn platform) | `openn.library.upenn.edu/html/0032_contents.html` | +| **Library of Congress — Hebraic Manuscripts** | ~230 MSS (Heb, Judeo-Arabic, Judeo-Persian, Yiddish) | PD / no known copyright restrictions | `loc.gov/collections/hebraic-manuscripts/about-this-collection/` | +| **OPenn — Cairo Genizah project** | Large (Bible fragments, Judeo-Arabic, ketubot, letters, legal docs, prayers) | PD / no known copyright restrictions | `openn.library.upenn.edu/html/genizah_contents.html` | +| **Leipzig University Library — Hebrew MSS** | 68 complete book MSS, 2 scrolls, fragments | Public Domain | `ub.uni-leipzig.de/en/research-library/digital-collections/hebrew-manuscripts` | +| **OPenn — University of Pennsylvania / Katz Center** | Penn Judaica holdings | PD / no known copyright restrictions | `openn.library.upenn.edu/html/0002.html` | +| **OPenn — Collection of Judaica (index)** | All OPenn Judaica repos incl. 
Gaster Hebrew MSS | Mixed: PD / CC0 / CC BY 4.0 / CC BY-SA 2.0 per repo | `openn.library.upenn.edu/html/judaica_contents.html` | +| **OPenn — University of Manchester Hebrew MSS** | Manchester / John Rylands Hebrew MSS | **CC BY 4.0 via OPenn** *(see caveat below)* | `openn.library.upenn.edu/html/0021.html` | +| **OPenn — Zucker Family Ketubah Collection** | 249 ketubot, 17th–20th c., many regions/scripts | PD / no known copyright restrictions | `openn.library.upenn.edu/html/0051.html` | +| **Wikimedia Commons — Category:Hebrew-language manuscripts** | ~105 direct files + 17 subcategories | Mixed PD / CC / CC BY-SA per file | `commons.wikimedia.org/wiki/Category:Hebrew-language_manuscripts` | +| **Wikimedia Commons — Category:Hebrew calligraphy** | ~74 files + subcategories | Mixed per file | `commons.wikimedia.org/wiki/Category:Hebrew_calligraphy` | +| **NYPL — Hebrew Illuminated Manuscripts** | **1,174 results** (PD-filtered subset) | PD items only | `digitalcollections.nypl.org/collections/hebrew-illuminated-manuscripts` | +| **MDZ / BSB — Hebrew Manuscripts** | **~700 pieces incl. 183 fragments** (12th–18th c.) | PDM items only | `digitale-sammlungen.de/en/hebrew-manuscripts` | +| **Internet Archive — PD Hebrew MSS** | Varies | PDM 1.0 (verify per upload) | `archive.org/` | + +### Modern character-level handwriting + +| Source | Size | Licence | Notes | +|--------|------|---------|-------| +| **HuggingFace — sivan22/hebrew-handwritten-dataset** | 5,093 rows, 28 classes | **CC BY 3.0** | Characters only, not pages. Needs policy decision — AGENTS.md lists CC-BY-4.0 but not explicitly CC-BY-3.0. | +| **TC-11 HHD_v0** | Isolated characters | CC BY-ND 3.0 | **Rejected per project policy** — CC-BY-ND is on the AGENTS.md reject list. | + +--- + +## Manchester caveat (important) + +Manchester's **own** digital collections viewer (`digitalcollections.manchester.ac.uk`) typically +shows CC BY-NC terms. 
The OPenn-hosted copy (`openn.library.upenn.edu/html/0021.html`) carries +**CC BY 4.0** — use only the OPenn path. Do not use Manchester's viewer as a rights source for +this project. + +--- + +## Sources to avoid / use with caution + +| Source | Reason | +|--------|--------| +| **BnF / Gallica** | Commercial reuse not permitted under Gallica terms of use | +| **Digital Bodleian** | NC restriction | +| **Vatican DigiVatLib** | Commercial use forbidden | +| **Manchester direct (own viewer)** | CC BY-NC — use OPenn copy instead | +| **HHD_gender / HHD_age** | Non-commercial research only / CC BY-NC-SA | +| **HebHTR (GitHub)** | No clear permissive data licence | +| **Ktiv / NLI directly** | Item-specific rights; some allow any use, others require approval — check each item page | + +--- + +## Recommended ingestion priority order + +From the survey, the recommended batch sequence for maximum usable volume: + +1. **OPenn BL** (~435K images, CC0, bulk rsync) +2. **LoC** (~230 MSS, PD, JSON API) +3. **OPenn Cairo Genizah** (large, CC0) +4. **Leipzig** (~68 MSS, PD, manageable manual batch) +5. **OPenn Judaica** (Penn/Katz + index → discover Gaster MSS etc.) +6. **OPenn Manchester** (CC BY 4.0 via OPenn only) +7. **OPenn Zucker** (249 ketubot, CC0) +8. **Commons** (per-file, category exploration) +9. **NYPL** (PD-filtered subset of 1,174 results) +10. **MDZ** (PDM-only items from ~700 pieces) +11. **Internet Archive** (PDM mirrors, gap-fill only) +12. 
**HuggingFace sivan22** (CC BY 3.0, character-level, policy decision needed) + +--- + +## Updates applied to sources.jsonl from this survey + +- `openn__bl_hebrew_manuscripts`: added landing URL, scale (435K images), Polonsky Foundation link +- `openn__cairo_genizah_fragments`: added specific contents URL +- `openn__manchester_hebrew_manuscripts`: added specific URL, added Manchester NC vs OPenn CC BY 4.0 caveat +- `openn__katz_center_judaica`: added specific URL +- `openn__zucker_ketubah_collection`: added specific URL, confirmed 249 ketubot +- `leipzig__hebrew_manuscripts`: updated URL to direct Leipzig page +- `nypl__hebrew_manuscripts_digital_collections`: added 1,174 count to notes +- `mdz__hebrew_manuscripts`: added specific URL, scale (~700 pieces incl. 183 fragments) +- `archive__hebrew_manuscripts`: added named high-value items to notes +- `huggingface__sivan22_hebrew_handwritten`: added CC BY 3.0 licence, 5,093 / 28-class details, policy note +- **New**: `commons__hebrew_language_manuscripts` (17 subcats + 105 files) +- **New**: `commons__hebrew_calligraphy` (74 files + subcats) +- **New**: `openn__judaica_collection_index` (umbrella discovery page) diff --git a/exports/sources.csv b/exports/sources.csv index 205501b..a4174c5 100644 --- a/exports/sources.csv +++ b/exports/sources.csv @@ -1,5 +1,5 @@ source_id,record_type,status,priority,provider,title,description,urls_canonical,urls_landing,urls_api,urls_download,urls_related,rights_basis,license_expression,commercial_use_allowed,derivatives_allowed,scan_redistribution_allowed,attribution_required,rights_evidence_text,rights_terms_url,rights_verification_status,rights_verified_at,scope_date_range,scope_languages,scope_document_types,scope_creator_names,scope_expected_handwriting,scope_estimated_scan_count,ingest_method,ingest_access_notes,ingest_agent_notes,ingest_blocked_reason,evidence_count -archive__hebrew_manuscripts,collection,candidate,low,Internet Archive,Internet Archive — Hebrew Manuscripts (PDM 
uploads),PDM-tagged Hebrew manuscript uploads from various libraries mirrored on the Internet Archive. Quality and format vary; useful as overflow/discovery source.,https://archive.org/,,,,,public_domain,PDM-1.0,,,,,PDM-tagged items only. Internet Archive does not guarantee rights; verify each item's stated license before ingesting. Source: docs/sources/chatgpt_summary_2.md.,,source_note_only,,medieval-modern,he; yi; other,manuscript; codex; other,,mixed,,api,Internet Archive S3-like API for bulk download; filter by subject:hebrew AND mediatype:texts AND licenseurl:PDM.,Low priority — many IA items duplicate OPenn/LoC sources with lower image quality. Use as discovery/gap-fill after primary sources are exhausted.,,1 +archive__hebrew_manuscripts,collection,candidate,low,Internet Archive,Internet Archive — Hebrew Manuscripts (PDM uploads),PDM-tagged Hebrew manuscript uploads from various libraries mirrored on the Internet Archive. Quality and format vary; useful as overflow/discovery source.,https://archive.org/,,,,,public_domain,PDM-1.0,,,,,PDM-tagged items only. Internet Archive does not guarantee rights; verify each item's stated license before ingesting. Source: docs/sources/chatgpt_summary_2.md.,,source_note_only,,medieval-modern,he; yi; other,manuscript; codex; other,,mixed,,api,Internet Archive S3-like API for bulk download; filter by subject:hebrew AND mediatype:texts AND licenseurl:PDM.,"Low priority — many IA items duplicate OPenn/LoC sources with lower image quality. Named high-value items to target: Leningrad Codex, Aleppo Codex, Cervera Bible, Lailashi Codex, Haverford Masoretic Bible. Use as discovery/gap-fill after primary sources are exhausted. For commercial use, verify against the holding institution when possible.",,2 bl__hebrew_collection,collection,rejected,exclude,British Library,British Library — Hebrew Manuscripts and Cairo Genizah Collection,Extensive Hebrew manuscripts and Cairo Genizah fragments. 
Terms frequently restrict commercial reuse or require paid permissions for publication.,https://www.bl.uk/hebrew-manuscripts,,,,,restricted,,false,,false,,Terms frequently restrict commercial reuse or require paid permissions for publication. Source: docs/sources/gemini_summary_2.md.,,source_note_only,,medieval-modern,he,manuscript; other,,yes,,manual_download,Rights must be verified per item before any ingestion.,Flagged as avoid unless specific items with confirmed permissive rights are identified.,rights_restriction,1 bnf__gallica_hebrew_manuscripts,collection,rejected,exclude,Bibliothèque nationale de France (BnF),BnF Gallica — Hebrew Manuscripts,Large digitized Hebrew manuscript collection on Gallica. Commercial use not permitted under Gallica terms of use.,https://gallica.bnf.fr/,,,,,restricted,,false,,false,,Commercial use not permitted under Gallica terms of use. Source: docs/sources/chatgpt_summary_2.md.,,source_note_only,,medieval-modern,he; other,manuscript; codex; other,,yes,,unknown,,Excluded due to commercial use restriction.,rights_restriction,1 bodleian__hebrew_manuscripts,collection,rejected,exclude,"Bodleian Libraries, University of Oxford",Bodleian Digital Library — Hebrew Manuscripts,Extensive Hebrew manuscript collection including Genizah fragments at the Bodleian. Non-commercial restriction on digital scan images.,https://digital.bodleian.ox.ac.uk/,,,,,restricted,,false,,false,,Non-commercial restriction applies to digital scan images. Source: docs/sources/chatgpt_summary_2.md.,,source_note_only,,medieval-modern,he; ar; other,manuscript; fragment; other,,yes,,unknown,,Excluded due to non-commercial restriction. 
Note: some Bodleian Genizah items appear on Wikimedia Commons under permissive licenses — those can be ingested individually as commons__ entries.,rights_restriction,1 @@ -23,6 +23,8 @@ commons__grodzinski_letter_about_kook,item,verified,high,Wikimedia Commons,Lette commons__halper113_midrash_david_colophon,item,verified,high,Wikimedia Commons,Halper 113 Cairo Geniza colophon: Midrash David on Genesis (1299),"Cairo Geniza leaf containing the dated colophon (1299) of Midrash David on Genesis, signed by David ben Abraham Maimuni (1222/3-1300), grandson of Maimonides. Penn Halper 113.","https://commons.wikimedia.org/wiki/File:Halper_113_Midrash_David_on_Genesis,_colophon,_Cairo_Geniza.jpg","https://commons.wikimedia.org/wiki/File:Halper_113_Midrash_David_on_Genesis,_colophon,_Cairo_Geniza.jpg",,,,public_domain,PDM-1.0,true,true,true,false,This work is in the public domain in its country of origin and other countries and areas where the copyright term is the author's life plus 100 years or fewer.,https://creativecommons.org/publicdomain/mark/1.0/,primary_page_checked,2026-05-11,1299,he,marginalia,David ben Abraham Maimuni,yes,1,manual_download,"Downloaded the original file from upload.wikimedia.org with a descriptive User-Agent and ~2s spacing between requests, per the Commons etiquette in AGENTS.md.",License-template text on the Commons file page checked on 2026-05-11 and quoted verbatim in rights.evidence_text.,,1 commons__halper462_exilarch_genealogy,item,verified,medium,Wikimedia Commons,Halper 462 Cairo Geniza genealogy of the Exilarchs (12th century),"Cairo Geniza leaf compiling the genealogy of the Babylonian Exilarchs back to King David and Adam, written by Avraham ben Tamim al-Rahbi in the 12th century. 
Penn Halper 462.","https://commons.wikimedia.org/wiki/File:Halper_462_Genealogy_of_the_Exilarchs_to_David_and_Adam,_Cairo_Geniza.jpg","https://commons.wikimedia.org/wiki/File:Halper_462_Genealogy_of_the_Exilarchs_to_David_and_Adam,_Cairo_Geniza.jpg",,,,public_domain,PDM-1.0,true,true,true,false,This work is in the public domain in its country of origin and other countries and areas where the copyright term is the author's life plus 100 years or fewer.,https://creativecommons.org/publicdomain/mark/1.0/,primary_page_checked,2026-05-11,12th century,he,other,Avraham ben Tamim al-Rahbi,yes,1,manual_download,"Downloaded the original file from upload.wikimedia.org with a descriptive User-Agent and ~2s spacing between requests, per the Commons etiquette in AGENTS.md.",License-template text on the Commons file page checked on 2026-05-11 and quoted verbatim in rights.evidence_text.,,1 commons__hatikvah_imber_manuscript,item,verified,high,Wikimedia Commons,Naftali Herz Imber's handwritten 'Hatikvah' manuscript (1907-1908),"Handwritten Hebrew manuscript of the first stanza and chorus of 'Hatikvah' by Naftali Herz Imber (d. 
1909), with author signature, dated September 1907 - September 1908.",https://commons.wikimedia.org/wiki/File:Hatikvah.jpg,,https://commons.wikimedia.org/w/api.php?action=query&titles=File:Hatikvah.jpg&prop=imageinfo&iiprop=url%7Csize%7Cmime&format=json,https://upload.wikimedia.org/wikipedia/commons/5/59/Hatikvah.jpg,,public_domain,PDM-1.0,true,true,true,false,"The author died in 1909, so this work is in the public domain in its country of origin and other countries and areas where the copyright term is the author's life plus 100 years or fewer.",https://creativecommons.org/publicdomain/mark/1.0/,primary_page_checked,2026-05-09,1907-1908,he,poem,Naftali Herz Imber,yes,1,manual_download,Downloaded original Commons JPEG.,Short excerpt (first stanza + chorus + signature); flag length in quality.notes.,,1 +commons__hebrew_calligraphy,category,candidate,medium,Wikimedia Commons,Wikimedia Commons — Category:Hebrew calligraphy,"Commons category for Hebrew calligraphy images: ~74 files plus subcategories, including illuminated manuscripts and ketubot. Mixed PD / CC by file. Useful for calligraphy and letter-shape images, distinct from the handwriting-specific category.",https://commons.wikimedia.org/wiki/Category:Hebrew_calligraphy,,,,https://commons.wikimedia.org/wiki/Category:Hebrew-language_manuscripts; https://commons.wikimedia.org/wiki/Category:Handwritten_Hebrew_letters,mixed,,,,,,Category-level; individual file rights vary. Check each file page.,,unverified,,mixed,he,manuscript; other,,mixed,74,api,Use MediaWiki API to enumerate files. Verify each file page before ingesting — rights vary per file.,~74 files plus subcategories. Good for illuminated MSS and ketubot. Distinct from the handwriting-focused categories — may surface decorative/scribal material. 
Do not assume category-wide rights — check each file.,,1 +commons__hebrew_language_manuscripts,category,candidate,medium,Wikimedia Commons,Wikimedia Commons — Category:Hebrew-language manuscripts,"Broad parent Commons category covering Hebrew-language manuscript images: 17 subcategories and ~105 direct files including Cairo Geniza items, Hebrew Bible manuscripts, illuminated Hebrew MSS, Wellcome Collection items, Damascus Pentateuch, and more. Mixed PD / CC / CC BY-SA by file.",https://commons.wikimedia.org/wiki/Category:Hebrew-language_manuscripts,,,,https://commons.wikimedia.org/wiki/Category:Handwritten_Hebrew_letters; https://commons.wikimedia.org/wiki/Category:Hebrew_calligraphy,mixed,,,,,,Category-level; individual file rights vary. Check each file page.,,unverified,,mixed,he; ar; other,manuscript; other,,mixed,105,api,Use MediaWiki API to enumerate files and sub-categories. Verify each file page before ingesting — rights vary per file.,Discovery source: 17 subcategories + ~105 direct files. Broader parent of commons__handwritten_hebrew_letters and commons__hebrew_calligraphy. Do not assume category-wide rights — check each file. Priority: explore subcategories for overlooked material not yet in the corpus.,,1 commons__hirsch_torah_letter_1878,item,verified,medium,Wikimedia Commons,S. R. Hirsch handwritten Hebrew Torah-thoughts letter (1878),"Three-sheet handwritten Hebrew letter on Torah thoughts and fundamentals by Rabbi Samson Raphael Hirsch (1808-1888), written in Frankfurt, dated 1878. 
Adds 19th-century German rabbinical Hebrew hand to the corpus.","https://commons.wikimedia.org/wiki/File:Letters_with_Torah_thoughts_and_fundamentals_of_Torah_in_Hebrew,_from_Rabbi_Samson_Refael_Hirsch._1878.jpg",,,https://upload.wikimedia.org/wikipedia/commons/4/43/Letters_with_Torah_thoughts_and_fundamentals_of_Torah_in_Hebrew%2C_from_Rabbi_Samson_Refael_Hirsch._1878.jpg,"https://commons.wikimedia.org/wiki/File:Letters_with_Torah_thoughts_and_fundamentals_of_Torah_in_Hebrew,_from_Rabbi_Samson_Refael_Hirsch._1878.II.jpg; https://commons.wikimedia.org/wiki/File:Letters_with_Torah_thoughts_and_fundamentals_of_Torah_in_Hebrew,_from_Rabbi_Samson_Refael_Hirsch._1878.III.jpg",public_domain,PDM-1.0,true,true,true,false,"The author died in 1888, so this work is in the public domain in its country of origin and other countries and areas where the copyright term is the author's life plus 100 years or fewer.",https://creativecommons.org/publicdomain/mark/1.0/,primary_page_checked,2026-05-11,1878,he,letter,Samson Raphael Hirsch,yes,3,manual_download,"Downloaded all three Commons JPEG sheets (I, II, III).",Frankfurt provenance; 19th-century German rabbinical Hebrew hand.,,1 commons__judah_halevi_letter_ts_8j18_5,item,verified,high,Wikimedia Commons,Cairo Geniza letter attributed to Judah ha-Levi (T-S 8J18.5),"Cairo Geniza letter attributed to the poet and philosopher Judah ha-Levi (d. 
1141), Cambridge Taylor-Schechter shelf-mark T-S 8J18.5.",https://commons.wikimedia.org/wiki/File:Letter_(T-S_8J18.5).jpg,https://commons.wikimedia.org/wiki/File:Letter_(T-S_8J18.5).jpg,,,,public_domain,PDM-1.0,true,true,true,false,This work is in the public domain in its country of origin and other countries and areas where the copyright term is the author's life plus 70 years or fewer.,https://creativecommons.org/publicdomain/mark/1.0/,primary_page_checked,2026-05-11,circa 1140,he,letter,Judah ha-Levi,yes,1,manual_download,"Downloaded the original file from upload.wikimedia.org with a descriptive User-Agent and ~2s spacing between requests, per the Commons etiquette in AGENTS.md.",License-template text on the Commons file page checked on 2026-05-11 and quoted verbatim in rights.evidence_text.,,1 commons__kafka_hebrew_writings,item,verified,high,Wikimedia Commons,Franz Kafka — Hebrew language exercise notebook spread (Max Brod estate),"Open-notebook spread of Franz Kafka's late-life Hebrew language exercises, from the Max Brod literary estate at the National Library of Israel. 
The Hebrew here is Kafka's own hand; scholars have noted his Hebrew teacher's hand also appears in this notebook elsewhere.",https://commons.wikimedia.org/wiki/File:Franz_Kafka_-_Hebrew_writings_-_Literary_Estate_Max_Brod_-_National_Library_of_Israel.jpg,https://commons.wikimedia.org/wiki/File:Franz_Kafka_-_Hebrew_writings_-_Literary_Estate_Max_Brod_-_National_Library_of_Israel.jpg,,,,public_domain,PDM-1.0,true,true,true,false,This work is in the public domain in its country of origin and other countries and areas where the copyright term is the author's life plus 70 years or fewer.,https://creativecommons.org/publicdomain/mark/1.0/,primary_page_checked,2026-05-11,before 1925,he,notebook,Franz Kafka,yes,1,manual_download,"Downloaded the original file from upload.wikimedia.org with a descriptive User-Agent and ~2s spacing between requests, per the Commons etiquette in AGENTS.md.",License-template text on the Commons file page checked on 2026-05-11 and quoted verbatim in rights.evidence_text.,,1 @@ -63,11 +65,11 @@ harvard__judaica_hollis_archival_discovery,collection,candidate,medium,Harvard L hhd__age_kaggle,dataset,rejected,exclude,Kaggle / HHD,HHD_age,Hebrew Handwritten Dataset age subset; useful reference but not compatible with remix-friendly corpus goals.,https://www.kaggle.com/datasets/liorabergel/hhd-age,,,,,restricted,CC-BY-NC-SA-4.0,false,,,true,Seed notes report non-commercial/share-alike terms and research-purpose restrictions.,,source_note_only,,modern,he,form,,yes,850,dataset_download,Do not include in remix-friendly release bundles unless relicensed.,May remain as a negative/restricted source record for search completeness.,Non-commercial restriction conflicts with downstream remix and commercial use.,2 hhd__gender_zenodo,dataset,rejected,exclude,Zenodo / HHD,HHD_gender,Hebrew Handwritten Dataset gender subset; source notes identify research-only or non-commercial restrictions.,https://zenodo.org/records/4729908,,,,,restricted,,false,,,,Seed notes 
report non-commercial academic/research-only constraints.,,source_note_only,,modern,he,form,,yes,819,dataset_download,Do not include in remix-friendly release bundles unless relicensed.,Keep as excluded lead to prevent accidental ingestion.,Research-only or non-commercial use conflicts with dataset goals.,1 hhd__v0_tc11,dataset,rejected,exclude,TC11 / HHD,HHD_v0 isolated characters,Isolated Hebrew character dataset; licensing notes include no-derivatives or conflicting terms.,https://tc11.cvc.uab.es/datasets/HHD_v0_1,https://huggingface.co/datasets/sivan22/hebrew-handwritten-dataset,,,,restricted,CC-BY-ND-3.0,,false,,true,Seed notes report CC BY-ND 3.0 / conflicting mirrors; no-derivatives conflicts with substantial transformation.,,source_note_only,,modern,he,form; other,,yes,,dataset_download,Do not include in remix-friendly release bundles unless primary authors grant compatible terms.,"Useful only as external reference for HTR, not as corpus content.",No-derivatives/conflicting license terms conflict with dataset goals.,2 -huggingface__sivan22_hebrew_handwritten,dataset,candidate,low,HuggingFace / sivan22,sivan22/hebrew-handwritten-dataset (HuggingFace),"5,093 rows of isolated Hebrew character crops under CC BY 3.0. Not page-level scans; useful as external HTR reference but not corpus content.",https://huggingface.co/datasets/sivan22/hebrew-handwritten-dataset,,,,,cc_by,CC-BY-3.0,,,,,Dataset card on HuggingFace states CC BY 3.0. Note: tc11__hhd_v0 has conflicting CC BY-ND 3.0 label — these may share origin. Source: docs/sources/chatgpt_summary_2.md.,,source_note_only,,modern,he,form; other,,yes,5093,dataset_download,HuggingFace datasets API or direct download.,"Character-level crops only, not page scans. 
Cross-check license against tc11__hhd_v0 — may share origin with conflicting mirrors.",,1 +huggingface__sivan22_hebrew_handwritten,dataset,candidate,low,HuggingFace / sivan22,sivan22/hebrew-handwritten-dataset (HuggingFace),"Modern isolated handwritten Hebrew characters: 5,093 rows, train/test split, 28 classes (22 base letters + final forms). Licensed CC BY 3.0.",https://huggingface.co/datasets/sivan22/hebrew-handwritten-dataset,,,,,cc_by,CC-BY-3.0,,,,,Dataset card on HuggingFace states CC BY 3.0. Note: tc11__hhd_v0 has conflicting CC BY-ND 3.0 label — these may share origin. Source: docs/sources/chatgpt_summary_2.md.,,source_note_only,,modern,he,form; other,,yes,5093,dataset_download,HuggingFace datasets API or direct download.,"Character-level crops only, not page scans. License is CC BY 3.0 — note the project's AGENTS.md accepted list includes CC-BY-4.0 but not explicitly CC-BY-3.0; policy review needed before ingesting. Cross-check against tc11__hhd_v0 (rejected, CC BY-ND 3.0) — may share origin.",,2 jabotinsky__zeev_jabotinsky_archive,collection,rejected,exclude,Jabotinsky Institute Archive,Ze'ev Jabotinsky handwritten archive items,"Archive leads for handwritten notes and drafts by Ze'ev Jabotinsky, including Hebrew Accent and Population Exchange notes.",https://en.jabotinsky.org/archive/search-archive/item/?itemId=115024,https://en.jabotinsky.org/archive/catalog-of-files/?section=A&arc=9704&page=78,,,https://en.jabotinsky.org/archive/search-archive/item/?itemId=115421,restricted,,false,false,false,,"Terms of Use (en.jabotinsky.org/about-us/terms-of-use/, verified 2026-05-23): content restricted to personal/educational/non-commercial use only; commercial publication and exploitation explicitly prohibited; no modifications permitted; commercial use requires prior written permission and possible usage fee. 
ML/HTR dataset use is out of scope.",https://en.jabotinsky.org/about-us/terms-of-use/,primary_page_checked,2026-05-23,1930-1939,he; yi; de,draft; speech; letter; other,Ze'ev Jabotinsky,mixed,,manual_download,Verify downloadable PDFs and terms before copying scans into repo.,1928 speech notes are out of post-1929 scope but may inform handwriting style; do not include as entry unless scope changes.,rights_restriction,2 -leipzig__hebrew_manuscripts,collection,candidate,medium,Leipzig University Library,Leipzig University Library — Hebrew Manuscripts,~68 Hebrew manuscripts released under Public Domain Mark. Accessible via Manuscripta Mediaevalia portal.,https://www.manuscripta-mediaevalia.de/,,,,,public_domain,PDM-1.0,,,,,Digital images released under PDM per ChatGPT survey. Source: docs/sources/chatgpt_summary_2.md.,,source_note_only,,medieval-modern,he; ar; other,manuscript; codex; other,,yes,68,manual_download,Search Manuscripta Mediaevalia for Leipzig Hebrew items; download image series per manuscript.,Modest count (~68) makes this feasible as a manual batch. Verify per-item PDM status before ingesting.,,1 +leipzig__hebrew_manuscripts,collection,candidate,medium,Leipzig University Library,Leipzig University Library — Hebrew Manuscripts,~68 Hebrew manuscripts released under Public Domain Mark. Accessible via Manuscripta Mediaevalia portal.,https://www.manuscripta-mediaevalia.de/,https://www.ub.uni-leipzig.de/en/research-library/digital-collections/hebrew-manuscripts,,,,public_domain,PDM-1.0,,,,,Digital images released under PDM per ChatGPT survey. Source: docs/sources/chatgpt_summary_2.md.,,source_note_only,,medieval-modern,he; ar; other,manuscript; codex; other,,yes,68,manual_download,"Leipzig direct page gives rights status (Public Domain). Access is also via Ktiv/NLI aggregator. ~68 complete book manuscripts, 2 scrolls, and fragments — feasible as a manual batch.",Modest count (~68) makes this feasible as a manual batch. 
Verify per-item PDM status before ingesting.,,2 loc__hebrew_manuscripts_collection,collection,candidate,high,Library of Congress,Library of Congress — Hebrew Manuscripts Collection,"Handwritten texts, religious commentaries, and drafts spanning centuries. US government entity; items lacking known copyright restrictions are free for general use. Robust JSON API for programmatic access.",https://www.loc.gov/collections/hebraic-manuscripts/about-this-collection/,,,,,public_domain,PDM-1.0,,,,,LoC is a US government entity; items lacking known copyright restrictions are free for general use. Source: docs/sources/gemini_summary_2.md.,,source_note_only,,medieval-modern,he; yi; other,manuscript; draft; commentary; other,,yes,,api,LoC JSON API available; query Hebrew Manuscripts collection and download JPEG/TIFF programmatically.,"Need to verify per-item rights, as some items may have added restrictions from donor institutions.",,1 -mdz__hebrew_manuscripts,collection,candidate,medium,Bayerische Staatsbibliothek / MDZ,Bayerische Staatsbibliothek — Hebrew Manuscripts (MDZ),Large digitized Hebrew collection on the Münchner Digitalisierungszentrum portal. PDM-marked items are safe for unrestricted use; some items carry additional licensing notes.,https://www.digitale-sammlungen.de/,,,,,mixed,,,,,,PDM-marked items are safe; other items may have scan-level licensing notes. Must filter to PDM-only before ingesting. Source: docs/sources/chatgpt_summary_2.md.,,source_note_only,,medieval-modern,he; ar; other,manuscript; codex; other,,yes,,api,MDZ has a public API (OAI-PMH and direct download). Filter to items with PDM tag; verify scan-page license before bulk download.,Mixed rights — need per-item check. Only ingest PDM-flagged items. 
Consider as a lower-priority batch after OPenn and Leipzig are exhausted.,,1 +mdz__hebrew_manuscripts,collection,candidate,medium,Bayerische Staatsbibliothek / MDZ,Bayerische Staatsbibliothek — Hebrew Manuscripts (MDZ),"~700 Hebrew manuscript pieces including 183 fragments (12th–18th c.), codices, Torah scrolls, and Esther scrolls. Only PDM-flagged items are safe for commercial use.",https://www.digitale-sammlungen.de/,https://www.digitale-sammlungen.de/en/hebrew-manuscripts,,,,mixed,,,,,,PDM-marked items are safe; other items may have scan-level licensing notes. Must filter to PDM-only before ingesting. Source: docs/sources/chatgpt_summary_2.md.,,source_note_only,,medieval-modern,he; ar; other,manuscript; codex; other,,yes,700,api,MDZ has a public API (OAI-PMH and direct download). Filter to items with PDM tag; verify scan-page license before bulk download.,Scale: ~700 pieces including 183 fragments (12th–18th c.). Mixed rights — need per-item check. Only ingest PDM-flagged items. Consider as a lower-priority batch after OPenn and Leipzig are exhausted.,,2 nli__hannah_senesh_archive,collection,verified,seed,National Library of Israel,Hannah Senesh Archive,"Collection-level lead for Hannah Senesh diaries, manuscripts, correspondence, and related handwritten materials.",https://www.nli.org.il/en/archives/nnl_archive_al997009165988705171/NLI,https://www.nli.org.il/en/at-your-service/announcements/hannah-szenes-archive,,,,public_domain,LicenseRef-Public-Domain-Israel,,,,false,Seed notes report item pages marked Any Use Permitted and Public Domain in Israel.,,source_note_only,,1936-1944,he; hu,diary; notebook; draft; poem; letter; other,Hannah Senesh,yes,,manual_download,Prioritize item-level NLI archive records with explicit download access and rights labels.,Promote only item pages with primary-page rights evidence; mixed Hebrew/Hungarian pages need page-level language tagging.,,2 nli__nnl_aleph990025684880205171,item,candidate,medium,National Library of Israel,יומן 
מהשואה by Elimelech Bash,Hebrew-script manuscript diary lead with public-domain rights claim; creation date needs verification.,https://www.nli.org.il/en/manuscripts/NNL_ALEPH990025684880205171/NLI,,,,,public_domain,LicenseRef-Public-Domain-Israel,,,,false,Seed note says Any Use Permitted and Public Domain in Israel.,,source_note_only,,,he,diary,Elimelech Bash,yes,,manual_download,Verify date written is after 1929 before inclusion.,Promising non-Senesh NLI seed.,,1 nli__nnl_archive_al990035403420205171,item,rejected,exclude,National Library of Israel,"Hybrid Notebook (מחברת-שעטנז), Shaul Tchernichovsky",A mixture of notes and observations in handwritten Hebrew.,https://www.nli.org.il/he/archives/NNL_ARCHIVE_AL990035403420205171/NLI,,,,,restricted,,false,,false,,"NLI item page rights checked 2026-05-23: 3 of 4 items marked ""permitted for research and study purposes only""; 1 item does not permit redistribution. None meet the dataset requirement of free redistribution and transformation for ML/downstream use.",,primary_page_checked,2026-05-23,1930s-1943,he,notebook,Shaul Tchernichovsky,yes,,manual_download,NLI Cloudflare blocks automated access; requires manual browser download.,Download as nli_tchern_notebook.zip via NLI download button.,rights_restriction,1 @@ -80,12 +82,13 @@ nli__nnl_archive_al997009912248405171,item,verified,high,National Library of Isr nli__nnl_archive_al997009912248505171,item,verified,seed,National Library of Israel,Handwritten Diary of Hannah Szenes in Hebrew and Draft of The Violin,"Strong item-level seed: handwritten Hebrew diary plus play draft, dated 1941-1944.",https://www.nli.org.il/en/archives/NNL_ARCHIVE_AL997009912248505171/NLI,,,,,public_domain,LicenseRef-Public-Domain-Israel,,,,false,Seed note says the item page says Any Use Permitted and Public Domain in Israel.,,source_note_only,,1941-1944,he,diary; draft,Hannah Senesh,yes,,manual_download,Verify direct image/PDF access and split multi-page item into entries.,Good first 
page-level ingestion candidate.,,1 nli__nnl_archive_al997009912248705171,item,verified,high,National Library of Israel,Handwritten Diary of Hannah Szenes in Hebrew and Hungarian,Mixed Hebrew/Hungarian handwritten diary dated 1938-1941.,https://www.nli.org.il/en/archives/NNL_ARCHIVE_AL997009912248705171/NLI,,,,,public_domain,LicenseRef-Public-Domain-Israel,,,,false,"Seed note says Any Use Permitted and Public Domain in Israel, but remote access may need confirmation.",,source_note_only,,1938-1941,he; hu,diary,Hannah Senesh,yes,,manual_download,Confirm whether online access is available outside the NLI building.,Tag Hebrew pages separately from Hungarian-heavy pages.,,1 nli__shaul_tchernichovsky_archive_items,collection,rejected,exclude,National Library of Israel,Shaul Tchernichovsky handwritten archive items,"NLI leads for receipts, literary drafts, notebooks, and memorandum drafts by Shaul Tchernichovsky.",https://www.nli.org.il/he/archives/NNL_ARCHIVE_AL990035912210205171/NLI,,,,https://www.nli.org.il/he/archives/NNL_ARCHIVE_AL990035912380205171/NLI; https://www.nli.org.il/he/archives/NNL_ARCHIVE_AL990035403420205171/NLI; https://www.nli.org.il/he/archives/NNL_ARCHIVE_AL990035912230205171/NLI,unknown,,,,,,"All 4 known NLI items checked 2026-05-23: rights are either ""research and study only"" or no redistribution allowed. Entire cluster out of scope for this dataset.",,primary_page_checked,2026-05-23,1930s-1943,he,receipt; draft; notebook; other,Shaul Tchernichovsky,yes,,manual_download,Expand each NLI item into its own source row before harvesting scans.,Useful for handwriting diversity beyond Senesh.,rights_restriction,1 -nypl__hebrew_manuscripts_digital_collections,collection,candidate,high,New York Public Library,"NYPL Digital Collections — Hebrew Manuscripts, Ketubbot, and Letters","Hebrew Illuminated Manuscripts, historical Ketubbot (handwritten marriage contracts), and early modern letters. ~1,174 items in the Hebrew Illuminated Manuscripts sub-collection. 
Out-of-copyright materials are completely free for any use including commercial.",https://digitalcollections.nypl.org/,https://digitalcollections.nypl.org/collections/hebrew-illuminated-manuscripts,,,,public_domain,PDM-1.0,,,,,"NYPL policy: out-of-copyright digital materials are completely free for any use including commercial, no permission required. Source: docs/sources/gemini_summary_2.md.",,source_note_only,,medieval-modern,he; yi; other,manuscript; ketubbah; letter; other,,yes,1174,api,Filter by public domain on portal; public API available for programmatic download. Need to identify specific in-scope Hebrew handwriting items.,"Ketubbot (marriage contracts) are particularly promising — high volume, post-1929 dates possible, standardized form but varied handwriting.",,1 -openn__bl_hebrew_manuscripts,collection,candidate,high,OPenn / British Library,British Library — Hebrew Manuscripts (OPenn),"~1,300 Hebrew manuscripts from the British Library digitized and hosted on OPenn under CC0 1.0 Universal. Separate from bl.uk which restricts commercial use.",https://openn.library.upenn.edu/,,,,,cc0,CC0-1.0,,,,,OPenn platform policy: all content CC0 1.0 Universal. BL Hebrew MSS are hosted under this umbrella separately from bl.uk licensing. Source: docs/sources/chatgpt_summary_2.md.,https://openn.library.upenn.edu/ReadMe.html,source_note_only,,medieval-modern,he; ar; other,manuscript; codex; other,,yes,1300,api,Bulk rsync or FTP from openn.library.upenn.edu. Identify BL sub-collection directory path and filter by Hebrew handwriting.,"High-value lead: CC0, ~1,300 MSS, separate from restrictive bl.uk terms. Confirm exact OPenn collection code for BL.",,1 -openn__cairo_genizah_fragments,collection,candidate,high,OPenn (multiple holding institutions),Cairo Genizah Fragments — OPenn hosted collections,Genizah fragments contributed by multiple institutions and hosted on OPenn under CC0. 
Distinct from Cambridge Digital Library Genizah (which restricts commercial use).,https://openn.library.upenn.edu/,,,,,cc0,CC0-1.0,,,,,OPenn platform-wide CC0 policy applies. Source: docs/sources/chatgpt_summary_2.md.,https://openn.library.upenn.edu/ReadMe.html,source_note_only,,medieval,he; ar; jrb; other,manuscript; fragment; other,,yes,,api,Identify Genizah sub-collection directories on OPenn; use rsync for bulk access.,Confirm which OPenn institution codes hold Genizah material; cross-check with cambridge__digital_library_hebrew_genizah (excluded — commercial restriction).,,1 -openn__katz_center_judaica,collection,candidate,high,"OPenn, University of Pennsylvania Libraries",Katz Center for Advanced Judaic Studies — Hebrew Manuscripts (OPenn),"Hundreds of digitized handwritten Hebrew manuscripts, codices, and historical documents. All OPenn content is CC0 1.0 Universal.",https://openn.library.upenn.edu/,https://openn.library.upenn.edu/ReadMe.html,,,,public_domain,CC0-1.0,,,,,OPenn platform policy: all content released under CC0 1.0 Universal (Public Domain Dedication). Source: docs/sources/gemini_summary_2.md.,https://openn.library.upenn.edu/ReadMe.html,source_note_only,,medieval-modern,he; yi; ar; other,manuscript; codex; letter; other,,yes,,api,"Bulk rsync or FTP access (openn.library.upenn.edu). Need to identify specific Judaica sub-collections with Hebrew handwriting in scope, then filter by date range.","High-value lead: CC0 license, bulk access, and OPenn is explicitly designed for computational use.",,1 -openn__manchester_hebrew_manuscripts,collection,candidate,medium,OPenn / University of Manchester (John Rylands Library),John Rylands Library — Hebrew Manuscripts (OPenn),Hebrew manuscripts from the University of Manchester John Rylands Library hosted on OPenn. 
JRL policy is CC BY 4.0 (not CC0 like most OPenn content) — verify per item.,https://openn.library.upenn.edu/,,,,,cc_by,CC-BY-4.0,,,,,JRL policy is CC BY 4.0 per the ChatGPT survey; not CC0 like most OPenn content. Source: docs/sources/chatgpt_summary_2.md.,https://openn.library.upenn.edu/ReadMe.html,source_note_only,,medieval-modern,he; ar; other,manuscript; codex; other,,yes,,api,Bulk rsync/FTP on OPenn. Identify Manchester/JRL collection code. Attribution required for CC BY 4.0.,"Lower priority than CC0 OPenn collections due to attribution requirement, but still commercially usable.",,1 -openn__zucker_ketubah_collection,collection,candidate,high,OPenn / Zucker Manuscript Library,Zucker Ketubah Collection (OPenn),"249 handwritten marriage contracts (ketubot) hosted on OPenn under CC0. High-value: varied hands, standardized form with free-form decoration, spanning centuries.",https://openn.library.upenn.edu/,,,,,public_domain,CC0-1.0,,,,,OPenn platform CC0 policy. Source: docs/sources/chatgpt_summary_2.md.,https://openn.library.upenn.edu/ReadMe.html,source_note_only,,medieval-modern,he; ar,ketubbah,,yes,249,api,Identify Zucker collection directory on OPenn; rsync bulk download.,"249 ketubot is a well-scoped, manageable batch. Ketubot provide high handwriting diversity in a uniform document type — good for HTR training.",,1 +nypl__hebrew_manuscripts_digital_collections,collection,candidate,high,New York Public Library,"NYPL Digital Collections — Hebrew Manuscripts, Ketubbot, and Letters","Hebrew Illuminated Manuscripts, historical Ketubbot (handwritten marriage contracts), and early modern letters. ~1,174 items in the Hebrew Illuminated Manuscripts sub-collection. 
Out-of-copyright materials are completely free for any use including commercial.",https://digitalcollections.nypl.org/,https://digitalcollections.nypl.org/collections/hebrew-illuminated-manuscripts,,,,public_domain,PDM-1.0,,,,,"NYPL policy: out-of-copyright digital materials are completely free for any use including commercial, no permission required. Source: docs/sources/gemini_summary_2.md.",,source_note_only,,medieval-modern,he; yi; other,manuscript; ketubbah; letter; other,,yes,1174,api,Filter by public domain on portal; public API available for programmatic download. Need to identify specific in-scope Hebrew handwriting items.,"Hebrew Illuminated Manuscripts sub-collection has 1,174 results. Filter by public domain on portal; public API available for programmatic download. Ketubbot are particularly promising — high volume, post-1929 dates possible, standardised form but varied handwriting. Must filter to 'public domain' items only before downloading.",,2 +openn__bl_hebrew_manuscripts,collection,candidate,high,OPenn / British Library,British Library — Hebrew Manuscripts (OPenn),"~1,300 Hebrew manuscripts from the British Library digitized and hosted on OPenn under CC0 1.0 Universal. Separate from bl.uk which restricts commercial use.",https://openn.library.upenn.edu/,https://openn.library.upenn.edu/html/0032_contents.html,,,https://polonskyfoundation.org/cultural-heritage-and-digitisation/polonsky-foundation-catalogue-of-digitised-hebrew-manuscripts/,cc0,CC0-1.0,,,,,OPenn platform policy: all content CC0 1.0 Universal. BL Hebrew MSS are hosted under this umbrella separately from bl.uk licensing. Source: docs/sources/chatgpt_summary_2.md.,https://openn.library.upenn.edu/ReadMe.html,source_note_only,,medieval-modern,he; ar; other,manuscript; codex; other,,yes,435000,api,Bulk rsync or FTP from openn.library.upenn.edu. 
Identify BL sub-collection directory path and filter by Hebrew handwriting.,"High-value lead: CC0, ~1,300 MSS / ~435,000 images (Polonsky Foundation digitisation project), separate from restrictive bl.uk terms. OPenn collection code: 0032 (Polonsky BL Hebrew). Survey (chatgpt_summary_3.md) recommends using OPenn rather than the BL viewer.",,2 +openn__cairo_genizah_fragments,collection,candidate,high,OPenn (multiple holding institutions),Cairo Genizah Fragments — OPenn hosted collections,Genizah fragments contributed by multiple institutions and hosted on OPenn under CC0. Distinct from Cambridge Digital Library Genizah (which restricts commercial use).,https://openn.library.upenn.edu/,https://openn.library.upenn.edu/html/genizah_contents.html,,,,cc0,CC0-1.0,,,,,OPenn platform-wide CC0 policy applies. Source: docs/sources/chatgpt_summary_2.md.,https://openn.library.upenn.edu/ReadMe.html,source_note_only,,medieval,he; ar; jrb; other,manuscript; fragment; other,,yes,,api,OPenn Genizah contents page: https://openn.library.upenn.edu/html/genizah_contents.html. Identify Genizah sub-collection directories on OPenn; use rsync for bulk access.,Confirm which OPenn institution codes hold Genizah material; cross-check with cambridge__digital_library_hebrew_genizah (excluded — commercial restriction).,,2 +openn__judaica_collection_index,collection,candidate,high,"OPenn, University of Pennsylvania Libraries",OPenn — Judaica Collection Index,"Umbrella discovery index listing all OPenn Judaica repositories with individual license statements. Covers sub-collections tracked separately (Katz Center, Cairo Genizah, Manchester, Zucker) plus additional repositories not yet individually tracked — notably the Gaster Hebrew MSS and further Genizah fragment holdings. 
Individual repository licenses range from PD / CC0 to CC BY 4.0 to CC BY-SA 2.0.",https://openn.library.upenn.edu/html/judaica_contents.html,https://openn.library.upenn.edu/html/judaica_contents.html,,,https://openn.library.upenn.edu/html/0002.html; https://openn.library.upenn.edu/html/genizah_contents.html; https://openn.library.upenn.edu/html/0021.html; https://openn.library.upenn.edu/html/0051.html,mixed,,,,,,Collection-level; each OPenn repository has its own license statement on its contents page. Most are PD or CC0; some are CC BY 4.0 or CC BY-SA 2.0.,https://openn.library.upenn.edu/ReadMe.html,source_note_only,,medieval-modern,he; ar; other,manuscript; codex; letter; other,,yes,,api,Use this index page to discover all OPenn Judaica sub-collections. Each listed repository has its own contents page with item-level metadata and direct image downloads. Bulk rsync/FTP from openn.library.upenn.edu per sub-collection directory.,"This is the discovery index for ALL OPenn Judaica repositories. Sub-collections already tracked individually: Katz Center (0002), Cairo Genizah, Manchester (0021), Zucker (0051). Additional collections to explore: Gaster Hebrew MSS, other Genizah fragment holdings. Always verify per-repository license before ingesting — most are CC0/PD but some are CC BY 4.0 or CC BY-SA 2.0.",,1 +openn__katz_center_judaica,collection,candidate,high,"OPenn, University of Pennsylvania Libraries",Katz Center for Advanced Judaic Studies — Hebrew Manuscripts (OPenn),"Hundreds of digitized handwritten Hebrew manuscripts, codices, and historical documents. All OPenn content is CC0 1.0 Universal.",https://openn.library.upenn.edu/,https://openn.library.upenn.edu/html/0002.html,,,,public_domain,CC0-1.0,,,,,OPenn platform policy: all content released under CC0 1.0 Universal (Public Domain Dedication). 
Source: docs/sources/gemini_summary_2.md.,https://openn.library.upenn.edu/ReadMe.html,source_note_only,,medieval-modern,he; yi; ar; other,manuscript; codex; letter; other,,yes,,api,"Bulk rsync or FTP access (openn.library.upenn.edu). Need to identify specific Judaica sub-collections with Hebrew handwriting in scope, then filter by date range.","High-value lead: CC0 license, bulk access, and OPenn is explicitly designed for computational use.",,2 +openn__manchester_hebrew_manuscripts,collection,candidate,medium,OPenn / University of Manchester (John Rylands Library),John Rylands Library — Hebrew Manuscripts (OPenn),Hebrew manuscripts from the University of Manchester John Rylands Library hosted on OPenn. JRL policy is CC BY 4.0 (not CC0 like most OPenn content) — verify per item.,https://openn.library.upenn.edu/,https://openn.library.upenn.edu/html/0021.html,,,,cc_by,CC-BY-4.0,,,,,JRL policy is CC BY 4.0 per the ChatGPT survey; not CC0 like most OPenn content. Source: docs/sources/chatgpt_summary_2.md.,https://openn.library.upenn.edu/ReadMe.html,source_note_only,,medieval-modern,he; ar; other,manuscript; codex; other,,yes,,api,Bulk rsync/FTP on OPenn. Identify Manchester/JRL collection code. Attribution required for CC BY 4.0.,"CRITICAL: Manchester's own digital collections viewer typically shows CC BY-NC terms. Use ONLY the OPenn-hosted copy (collection 0021) where the license is CC BY 4.0. Attribution required. Lower priority than CC0 OPenn collections, but commercially usable via OPenn path.",,2 +openn__zucker_ketubah_collection,collection,candidate,high,OPenn / Zucker Manuscript Library,Zucker Ketubah Collection (OPenn),"249 handwritten marriage contracts (ketubot) hosted on OPenn under CC0. High-value: varied hands, standardized form with free-form decoration, spanning centuries.",https://openn.library.upenn.edu/,https://openn.library.upenn.edu/html/0051.html,,,,public_domain,CC0-1.0,,,,,OPenn platform CC0 policy. 
Source: docs/sources/chatgpt_summary_2.md.,https://openn.library.upenn.edu/ReadMe.html,source_note_only,,medieval-modern,he; ar,ketubbah,,yes,249,api,Identify Zucker collection directory on OPenn; rsync bulk download.,"249 ketubot confirmed (17th–20th c., many regions/scripts). Well-scoped, manageable batch. Ketubot provide high handwriting diversity in a uniform document type — good for HTR training.",,2 picryl__hebrew_manuscripts,collection,needs_review,low,PICRYL,PICRYL — Hebrew Manuscripts (aggregator),"Aggregator that mirrors PD items from NYPL, LoC, Europeana, and others with direct download links. Not a primary source; use for discovery and cross-referencing only.",https://picryl.com/,,,,,unknown,,,,,,PICRYL aggregates items from primary PD sources; each item inherits the rights of its source. Do not treat PICRYL itself as a rights authority — always trace back to the original institution. Source: docs/sources/chatgpt_summary_2.md.,,unverified,,medieval-modern,he; other,manuscript; other,,mixed,,scrape,Use as discovery tool; trace all items back to primary source institution before ingesting.,"Aggregator only — do not cite PICRYL as a rights source. Use for finding items, then ingest from LoC/NYPL/OPenn directly.",,1 vatican__digivatlib_hebrew,collection,rejected,exclude,Biblioteca Apostolica Vaticana,DigiVatLib — Hebrew Manuscripts (Vatican),Large Hebrew manuscript collection at the Vatican Apostolic Library digitized on DigiVatLib. Commercial use explicitly forbidden.,https://digi.vatlib.it/,,,,,restricted,,false,,false,,Commercial use explicitly forbidden per DigiVatLib terms. 
Source: docs/sources/chatgpt_summary_2.md.,,source_note_only,,medieval-modern,he; ar; other,manuscript; codex; other,,yes,,unknown,,Excluded due to commercial use prohibition.,rights_restriction,1 wikimedia__handwritten_hebrew_letters,category,verified,medium,Wikimedia Commons,Category: Handwritten Hebrew letters,Commons category lead for freely licensed or public-domain handwritten Hebrew letter images and related media.,https://commons.wikimedia.org/wiki/Category:Handwritten_Hebrew_letters,,,,https://commons.wikimedia.org/wiki/Category:Hebrew_handwriting_scripts,mixed,PDM-1.0,true,true,true,false,"Per-file verification: 2 qualifying files ingested. File:Delacroix letter.png uses PD-old-100; File:Solitreo contract.jpg uses PD-Art|PD-old-70. Most other category files are SVG teaching samples, character-level crops, or CC-BY-SA 3.0 (excluded). Mixed rights overall; ingested items are all PDM-1.0.",,primary_page_checked,2026-05-15,mixed,he; lad,letter; other,,mixed,2,api,Use MediaWiki API and file pages; exclude SVG teaching samples unless the dataset explicitly wants vector handwriting examples.,"Ingested 2 qualifying handwritten scans from Category:Handwritten_Hebrew_letters and subcategory Category:Solitreo_script (under Category:Hebrew_handwriting_scripts). Most files in the main category were excluded: SVG teaching samples, tiny character crops (<55px), group photographs, and CC-BY-SA 3.0 licensed files. Files with {{Wrong license}} template were excluded. Two public-domain Solitreo script documents qualified.",,2