decide whether source.scan_url is required, and define its dereference semantics #2

@shaypal5

Spun out of the senior-dev review of #1.

Background

letter_set.v1 currently makes source.scan_url optional. Variants are anchored by source.scan_entry_id, which resolves against the upstream entries.jsonl index in HeOCR/public-domain-hand-written-hebrew-scans.

That works for downstream consumers that re-resolve the upstream index, but it leaves a gap for consumers that want to fetch the source image directly without re-resolving.
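The re-resolution step described above could look roughly like the following. This is a minimal sketch, not the project's actual resolver: the `id` and `scan_url` field names on each `entries.jsonl` record, the `resolve_scan_url` helper, and the sample records are all assumptions made for illustration; the real index layout is defined upstream.

```python
import json
from typing import Iterable, Optional


def resolve_scan_url(scan_entry_id: str, index_lines: Iterable[str]) -> Optional[str]:
    """Look up a scan entry in an entries.jsonl-style index.

    Assumes (hypothetically) that each record carries "id" and "scan_url" keys.
    Returns None when no record matches.
    """
    for line in index_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        record = json.loads(line)
        if record.get("id") == scan_entry_id:
            return record.get("scan_url")
    return None


# Hypothetical index content, inlined so the example is self-contained.
sample_index = [
    '{"id": "entry-001", "scan_url": "https://example.org/scans/entry-001.png"}',
    '{"id": "entry-002", "scan_url": "https://example.org/scans/entry-002.png"}',
]
resolved = resolve_scan_url("entry-002", sample_index)
```

A consumer that wants to skip this lookup is exactly the case the optional `source.scan_url` field does not currently cover.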

Decision needed

Either:

  1. Keep optional, document the dereference path. Add a note in docs/letter_set_v1.md saying consumers must dereference via the upstream index pinned in the document's upstream.repo / upstream.revision, and that source.scan_url is a non-authoritative convenience copy.
  2. Make required. We would then also need to decide whether scan_url must point at the canonical upstream URL or may point at a mirror, and whether the URL must remain stable across upstream revisions.
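To make option 1 concrete, here is a hedged sketch of what "dereference via the pinned upstream index" might mean for a consumer. It assumes that `upstream.repo` holds an `owner/name` GitHub slug, that `upstream.revision` is a commit SHA or tag, and that `entries.jsonl` sits at the repository root; none of these is confirmed by this issue. The helper names and the placeholder revision `0123abc` are hypothetical.

```python
from typing import Optional

RAW_HOST = "https://raw.githubusercontent.com"


def pinned_index_url(upstream: dict, index_path: str = "entries.jsonl") -> str:
    """Build the URL of the index at the exact revision the document pins.

    Assumes upstream["repo"] is an "owner/name" slug and that the index
    lives at index_path inside that repository (both are assumptions).
    """
    return f"{RAW_HOST}/{upstream['repo']}/{upstream['revision']}/{index_path}"


def scan_url_disagrees(source: dict, resolved: Optional[str]) -> bool:
    """Report whether the convenience copy differs from the index result.

    Under option 1 the index result is authoritative, so a mismatch only
    signals a stale copy; it never overrides the resolved URL.
    """
    copy = source.get("scan_url")
    return copy is not None and resolved is not None and copy != resolved


upstream = {
    "repo": "HeOCR/public-domain-hand-written-hebrew-scans",
    "revision": "0123abc",  # placeholder, not a real revision
}
index_url = pinned_index_url(upstream)
```

Under option 2 the same mismatch check would become a validation question instead: the answer depends on whether mirrors are allowed and whether URLs are expected to survive upstream revisions.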

Scope

  • Update the schema (src/hletterscriptgen/schemas/letter_set.schema.json) if option 2 is chosen.
  • Update docs/letter_set_v1.md and docs/upstream_integration.md either way.
  • Add tests under tests/test_validation.py if the schema shape changes.
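As a rough illustration of the schema and test changes, the sketch below contrasts the two options for the `source` object. The fragments are assumptions, not copies of `letter_set.schema.json`: in particular, that `scan_entry_id` is already required and that both fields are strings is inferred, not confirmed. `missing_required` is a hypothetical stand-in for the project's real validator and checks only the `required` keyword.

```python
# Option 1: scan_url stays optional (assumed shape of the current schema).
SOURCE_SCHEMA_OPTION_1 = {
    "type": "object",
    "required": ["scan_entry_id"],
    "properties": {
        "scan_entry_id": {"type": "string"},
        "scan_url": {"type": "string", "format": "uri"},
    },
}

# Option 2: same properties, but scan_url joins the required list.
SOURCE_SCHEMA_OPTION_2 = {
    **SOURCE_SCHEMA_OPTION_1,
    "required": ["scan_entry_id", "scan_url"],
}


def missing_required(instance: dict, schema: dict) -> list:
    """Return the required keys absent from instance, in schema order."""
    return [key for key in schema.get("required", []) if key not in instance]


def test_scan_url_optional_under_option_1():
    source = {"scan_entry_id": "entry-001"}
    assert missing_required(source, SOURCE_SCHEMA_OPTION_1) == []


def test_scan_url_required_under_option_2():
    source = {"scan_entry_id": "entry-001"}
    assert missing_required(source, SOURCE_SCHEMA_OPTION_2) == ["scan_url"]
```

Tests of this shape would pin the chosen option in `tests/test_validation.py`, so that a later change to the `required` list shows up as a test failure instead of a silent contract break.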

Impact

This is a contract surface. Resolve before any consumer outside the HeOCR org takes a dependency on letter_set.v1.

Labels

  • area:rights-policy: LICENSE-POLICY, rights carryover, eligibility
  • area:schema: letter_set.v1 JSON Schema and contract
  • status:needs-discussion: Design discussion needed before work proceeds
  • type:refactor: Code restructuring without behavior change
