Observation
pyensembl calls gtfparse.create_missing_features to backfill transcript and gene rows that the GTF omits (legitimate when the upstream only emits exon / CDS). The synthesizer derives each parent feature from the child rows it sees, which is usually correct. But when the input GTF deviates in ways the synthesizer doesn't anticipate (a missing exon_id attribute, absent start_codon rows), downstream consumers have to clean up.
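To make the mechanism concrete, here is a minimal sketch of parent-row synthesis in plain Python. It is not gtfparse's implementation: the function name synthesize_transcripts and the row layout (dicts with feature, transcript_id, start, end) are assumptions chosen for illustration only.

```python
def synthesize_transcripts(rows):
    """Toy stand-in for parent-row synthesis; NOT gtfparse's actual code.

    Builds one 'transcript' row per transcript_id from its exon rows,
    spanning min(start)..max(end). Attributes absent on the children
    are filled with None, mirroring best-effort behavior.
    """
    existing = {r["transcript_id"] for r in rows if r["feature"] == "transcript"}
    grouped = {}
    for r in rows:
        if r["feature"] == "exon" and r["transcript_id"] not in existing:
            grouped.setdefault(r["transcript_id"], []).append(r)

    synthesized = []
    for transcript_id, exons in grouped.items():
        synthesized.append({
            "feature": "transcript",
            "transcript_id": transcript_id,
            "start": min(e["start"] for e in exons),
            "end": max(e["end"] for e in exons),
            # Copied from the first child if present, otherwise None.
            "gene_id": exons[0].get("gene_id"),
        })
    return synthesized


exons = [
    {"feature": "exon", "transcript_id": "T1", "start": 100, "end": 200, "gene_id": "G1"},
    {"feature": "exon", "transcript_id": "T1", "start": 300, "end": 450, "gene_id": "G1"},
    {"feature": "exon", "transcript_id": "T2", "start": 10, "end": 50},  # no gene_id
]
transcripts = synthesize_transcripts(exons)
```

In this toy version the T2 transcript row is produced with gene_id set to None and no signal to the caller, which is the kind of silent gap this issue is about.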
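Two recent pyensembl issues are downstream of this:
- A GTF whose exon rows lack exon_id. create_missing_features synthesizes transcript rows that don't carry exon_id either, and pyensembl's Transcript.exons then crashed with sqlite3.OperationalError: no such column: exon_id. The workaround lives in pyensembl: a column_exists("exon", "exon_id") check at transcript.py:149.
- ".complete method of Transcript class return a wrong result for some transcript ids" (pyensembl#252 / v2.6.4). TAIR chr_patch_hapl_scaff fragments contain CDS rows without matching start_codon / stop_codon. After feature synthesis the transcript row exists, but coding_sequence couldn't compute its endpoints. Workaround in pyensembl: Transcript.coding_sequence returns None rather than raising.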
In both cases the inconsistency comes from the input GTF rather than from gtfparse's logic, but a caller-side knob would let callers decide how such gaps are handled.
Proposal
Add a strict / policy kwarg to create_missing_features:
- strict="raise": when a child row lacks attributes the synthesizer would need to faithfully build the parent, raise.
- strict="warn": log and skip the synthesized row.
- strict="best_effort": current behavior; synthesize whatever we can, fill missing attrs with None.
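A rough sketch of how the three policies could behave, written as a standalone function rather than a patch to gtfparse. The name synthesize_parent, the needed_attrs parameter, and the choice of ValueError are all hypothetical; only the three strict values come from the proposal.

```python
import logging

logger = logging.getLogger("gtf_synthesis_sketch")

POLICIES = ("raise", "warn", "best_effort")


def synthesize_parent(children, needed_attrs, strict="best_effort"):
    """Hypothetical policy handling; not an existing gtfparse API.

    children: list of dicts (child rows sharing one parent).
    needed_attrs: attribute names the parent row should inherit.
    Returns the synthesized parent dict, or None if the row was skipped.
    """
    if strict not in POLICIES:
        raise ValueError(f"unknown strict policy: {strict!r}")

    # Attributes that at least one child row does not carry.
    missing = sorted(
        attr for attr in needed_attrs
        if any(child.get(attr) is None for child in children)
    )
    if missing and strict == "raise":
        raise ValueError(f"child rows lack attributes: {missing}")
    if missing and strict == "warn":
        logger.warning("skipping synthesized row; missing %s", missing)
        return None

    # best_effort (or nothing missing): inherit what exists, None otherwise.
    parent = {attr: children[0].get(attr) for attr in needed_attrs}
    parent["start"] = min(c["start"] for c in children)
    parent["end"] = max(c["end"] for c in children)
    return parent


children = [{"start": 5, "end": 20, "transcript_id": "T1"}]  # no exon_id
best_effort_row = synthesize_parent(children, ["transcript_id", "exon_id"])
warned_row = synthesize_parent(children, ["transcript_id", "exon_id"], strict="warn")
```

With the default, the caller gets a row whose exon_id is None; with strict="warn" the row is dropped and logged; with strict="raise" the same input raises immediately at synthesis time.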
Or more granularly: a required_attributes_per_feature: dict[str, set[str]] knob so callers can pin "if you're going to synthesize a transcript row, only do so when the child exon carries exon_id".
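One possible reading of the granular knob, sketched as a standalone gate. The helper name can_synthesize is hypothetical, and so is the interpretation that the dict is keyed by the parent feature being synthesized, with values naming attributes every child row must carry.

```python
def can_synthesize(parent_feature, children, required_attributes_per_feature):
    """Hypothetical gate: True only if every child carries the required attributes."""
    required = required_attributes_per_feature.get(parent_feature, set())
    return all(
        child.get(attr) is not None
        for child in children
        for attr in required
    )


# Keyed by the parent feature to synthesize (an assumption of this sketch).
required = {"transcript": {"exon_id"}, "gene": {"gene_id"}}

complete_exons = [{"feature": "exon", "exon_id": "E1", "gene_id": "G1"}]
partial_exons = [{"feature": "exon", "gene_id": "G1"}]  # exon_id absent

ok_transcript = can_synthesize("transcript", complete_exons, required)
blocked_transcript = can_synthesize("transcript", partial_exons, required)
ok_gene = can_synthesize("gene", partial_exons, required)
```

Here the partial exons still allow a gene row to be synthesized but block the transcript row, so the requirement can be tightened per feature without changing the default for everything else.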
Why it'd help
The two issues above both look like "pyensembl handles partial GTFs", but the upstream signal that something's off (exon_id missing, start_codon missing) gets swallowed by the synthesizer and resurfaces later as a crash or a wrong result at query time.
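The late-failure pattern can be reproduced with the standard library alone. The sketch below builds an in-memory SQLite exon table that never received an exon_id column and shows that the problem only appears when a query asks for it. The table layout and the column_exists helper are written for this illustration and are not pyensembl's code.

```python
import sqlite3


def column_exists(connection, table, column):
    """Illustrative helper: check a table's columns via PRAGMA table_info."""
    # table is interpolated directly, so pass only trusted table names.
    rows = connection.execute(f"PRAGMA table_info({table})").fetchall()
    return any(row[1] == column for row in rows)  # row[1] is the column name


conn = sqlite3.connect(":memory:")
# An exon table built from a GTF whose rows never carried exon_id.
conn.execute(
    "CREATE TABLE exon (transcript_id TEXT, start_pos INTEGER, end_pos INTEGER)"
)
conn.execute("INSERT INTO exon VALUES ('T1', 100, 200)")

# Building the table succeeded; the failure only shows up at query time.
try:
    conn.execute("SELECT exon_id FROM exon WHERE transcript_id = 'T1'")
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)

# A caller-side guard has to probe the schema before every such query.
has_exon_id = column_exists(conn, "exon", "exon_id")
```

A strict policy at synthesis time would move this discovery to parse time, where the offending input rows are still in hand.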
Scope guard
- Existing default behavior should stay best-effort. This is purely additive.
- Bacteria/archaea GFFs with their own synthesis quirks are out of scope for this issue (separate parser ecosystem).