create_missing_features: opt-in / configurable strictness for synthesized parent rows #65

@iskandr

Observation

pyensembl calls gtfparse.create_missing_features to backfill transcript and gene rows that the GTF omits (legitimate when the upstream only emits exon / CDS). The synthesizer derives the parent feature from the child rows it sees, which is usually correct — but when the input GTF deviates in ways the synthesizer doesn't anticipate, downstream consumers have to clean up.

Two recent pyensembl issues are downstream of this:

In both cases the inconsistency is structural to the input GTF rather than to gtfparse's logic, but a caller-side knob would help.

Proposal

Add a strict / policy kwarg to create_missing_features:

  • strict="raise" — when a child row lacks attributes the synthesizer would need to faithfully build the parent, raise.
  • strict="warn" — log + skip the synthesized row.
  • strict="best_effort" — current behavior; synthesize whatever we can, fill missing attrs with None.
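To make the three policies concrete, here is a minimal sketch on plain dict rows. It does not use gtfparse's actual API or internals: `synthesize_parent`, its parameters, and the row layout are all hypothetical and only illustrate how the policy could branch.

```python
import logging
from typing import Iterable, Optional

logger = logging.getLogger(__name__)

VALID_POLICIES = ("raise", "warn", "best_effort")


def synthesize_parent(
    child_rows: Iterable[dict],
    parent_feature: str,
    needed_attributes: Iterable[str],
    strict: str = "best_effort",
) -> Optional[dict]:
    """Hypothetical helper: build one parent row from its child rows."""
    if strict not in VALID_POLICIES:
        raise ValueError(f"strict must be one of {VALID_POLICIES}, got {strict!r}")
    rows = list(child_rows)
    if not rows:
        raise ValueError("need at least one child row")
    needed = list(needed_attributes)

    # Attributes that at least one child row does not carry.
    missing = sorted({a for a in needed for r in rows if r.get(a) is None})

    if missing and strict == "raise":
        raise ValueError(f"cannot synthesize {parent_feature}: missing {missing}")
    if missing and strict == "warn":
        logger.warning("skipping %s row: missing %s", parent_feature, missing)
        return None

    # best_effort (or nothing missing): span the children, fill gaps with None.
    parent = {
        "feature": parent_feature,
        "seqname": rows[0].get("seqname"),
        "strand": rows[0].get("strand"),
        "start": min(r["start"] for r in rows),
        "end": max(r["end"] for r in rows),
    }
    for a in needed:
        parent[a] = next((r[a] for r in rows if r.get(a) is not None), None)
    return parent
```

With the default `strict="best_effort"` the sketch never refuses to build a row, which matches the requirement that existing behavior stays unchanged.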

Or more granularly: a required_attributes_per_feature: dict[str, set[str]] knob so callers can pin "if you're going to synthesize a transcript row, only do so when the child exon carries exon_id".
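A rough sketch of how the granular knob might gate synthesis, again on plain dict rows. `can_synthesize` is a hypothetical helper, not an existing gtfparse function, and treating both `None` and the empty string as "missing" is an assumption made for illustration.

```python
from typing import Iterable, Mapping, Set


def can_synthesize(
    parent_feature: str,
    child_rows: Iterable[Mapping[str, object]],
    required_attributes_per_feature: Mapping[str, Set[str]],
) -> bool:
    """Hypothetical check: True if every child row carries every attribute
    required before a row of `parent_feature` may be synthesized."""
    required = required_attributes_per_feature.get(parent_feature, set())
    # Assumption: None and "" both mean the attribute is absent.
    return all(
        row.get(attr) not in (None, "")
        for row in child_rows
        for attr in required
    )


# Only synthesize a transcript row when the child exons carry exon_id.
required = {"transcript": {"exon_id"}}
exons = [
    {"feature": "exon", "exon_id": "E1", "transcript_id": "T1"},
    {"feature": "exon", "exon_id": "", "transcript_id": "T1"},
]
print(can_synthesize("transcript", exons, required))
print(can_synthesize("gene", exons, required))
```

A feature with no entry in the mapping has no requirements, so callers only pay for the checks they ask for and the default stays permissive.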

Why it'd help

The two issues above both look like "pyensembl handles partial GTFs", but the upstream signal that something is off (exon_id missing, start_codon missing) gets swallowed by the synthesizer and only resurfaces later as a crash at query time.

Scope guard

  • Existing default behavior should stay best-effort. This is purely additive.
  • Bacteria/archaea GFFs with their own synthesis quirks are out of scope for this issue (separate parser ecosystem).
