create_missing_features: opt-in / configurable strictness for synthesized parent rows #65

## Observation

pyensembl calls `gtfparse.create_missing_features` to backfill `transcript` and `gene` rows that the GTF omits (legitimate when the upstream only emits `exon` / `CDS`). The synthesizer derives the parent feature from the child rows it sees, which is usually correct — but when the input GTF deviates in ways the synthesizer doesn't anticipate, downstream consumers have to clean up.

Two recent pyensembl issues are downstream of this:

- **openvax/pyensembl#331 / v2.6.4** — older Ensembl releases (e.g. release 54) omit `exon_id`. `create_missing_features` synthesizes `transcript` rows that don't carry `exon_id` either, and pyensembl's `Transcript.exons` then crashed with `sqlite3.OperationalError: no such column: exon_id`. Workaround lives in pyensembl: a `column_exists("exon", "exon_id")` check at `transcript.py:149`.
- **openvax/pyensembl#252 / v2.6.4** — TAIR `chr_patch_hapl_scaff` fragments contain `CDS` rows without matching `start_codon`/`stop_codon`. After feature synthesis the transcript row exists but `coding_sequence` couldn't compute its endpoints. Workaround in pyensembl: `Transcript.coding_sequence` returns `None` rather than raising.

In both cases the inconsistency is structural to the input GTF rather than to gtfparse's logic, but a caller-side knob would help.
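The first failure mode can be reproduced without pyensembl or gtfparse. The sketch below uses only the standard-library `sqlite3` module and a toy `exon` table. The table layout and the `column_exists` helper are illustrative assumptions, not pyensembl's actual schema or implementation.

```python
import sqlite3


def column_exists(conn: sqlite3.Connection, table: str, column: str) -> bool:
    """Return True if `table` has a column named `column`.

    Hypothetical helper written for this example; it only mirrors the
    idea of the pyensembl workaround, not its code.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return any(row[1] == column for row in rows)


conn = sqlite3.connect(":memory:")

# Toy "exon" table built from a GTF that never carried exon_id.
conn.execute(
    "CREATE TABLE exon (transcript_id TEXT, exon_start INTEGER, exon_end INTEGER)"
)
conn.execute("INSERT INTO exon VALUES ('T1', 100, 200)")

# Querying the absent column fails only at query time.
try:
    conn.execute("SELECT exon_id FROM exon WHERE transcript_id = 'T1'")
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)

# The caller-side guard: check for the column before querying it.
has_exon_id = column_exists(conn, "exon", "exon_id")
print(error_message, has_exon_id)
```

The guard works, but it has to be repeated at every query site that touches a column the GTF may have omitted.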

## Proposal

Add a `strict` / `policy` kwarg to `create_missing_features`:

- `strict="raise"` — when a child row lacks attributes the synthesizer would need to faithfully build the parent, raise.
- `strict="warn"` — log + skip the synthesized row.
- `strict="best_effort"` — current behavior; synthesize whatever we can, fill missing attrs with None.
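As a rough illustration of the three modes, here is a minimal sketch over plain dictionaries. `synthesize_transcripts`, its `needed` parameter, and the row layout are hypothetical stand-ins invented for this example; they are not gtfparse's API or its current signature.

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)

Row = dict[str, Any]


def synthesize_transcripts(
    exon_rows: list[Row],
    needed: tuple[str, ...] = ("gene_id", "exon_id"),
    strict: str = "best_effort",
) -> list[Row]:
    """Toy stand-in for parent-row synthesis: one transcript row per transcript_id."""
    if strict not in ("raise", "warn", "best_effort"):
        raise ValueError(f"unknown strict mode: {strict!r}")

    # Group child rows by the transcript they belong to.
    by_transcript: dict[str, list[Row]] = {}
    for row in exon_rows:
        by_transcript.setdefault(row["transcript_id"], []).append(row)

    synthesized: list[Row] = []
    for transcript_id, children in by_transcript.items():
        # Attributes that at least one child row lacks.
        missing = sorted(
            attr for attr in needed if any(c.get(attr) is None for c in children)
        )
        if missing and strict == "raise":
            raise ValueError(
                f"cannot synthesize transcript {transcript_id}: children lack {missing}"
            )
        if missing and strict == "warn":
            logger.warning(
                "skipping transcript %s: children lack %s", transcript_id, missing
            )
            continue
        # best_effort (or nothing missing): build the row, leaving gaps as None.
        synthesized.append(
            {
                "feature": "transcript",
                "transcript_id": transcript_id,
                "gene_id": children[0].get("gene_id"),
                "start": min(c["start"] for c in children),
                "end": max(c["end"] for c in children),
            }
        )
    return synthesized


exons = [
    {"transcript_id": "T1", "gene_id": "G1", "exon_id": "E1", "start": 100, "end": 200},
    {"transcript_id": "T1", "gene_id": "G1", "exon_id": "E2", "start": 300, "end": 400},
    {"transcript_id": "T2", "gene_id": "G2", "start": 50, "end": 90},  # no exon_id
]

best_effort_rows = synthesize_transcripts(exons, strict="best_effort")
warn_rows = synthesize_transcripts(exons, strict="warn")
```

With this input, `best_effort` yields rows for both `T1` and `T2`, `warn` yields only `T1`, and `raise` fails on `T2` with a `ValueError`.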

Or more granularly: a `required_attributes_per_feature: dict[str, set[str]]` knob so callers can pin "if you're going to synthesize a `transcript` row, only do so when the child `exon` carries `exon_id`".
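One possible reading of the granular knob, sketched under the assumption that the dict is keyed by the parent feature being synthesized and maps to the attributes every child row must carry. `can_synthesize` is a hypothetical helper for illustration only.

```python
from typing import Any

Row = dict[str, Any]


def can_synthesize(
    parent_feature: str,
    children: list[Row],
    required_attributes_per_feature: dict[str, set[str]],
) -> bool:
    """Return True if every child carries all attributes required for this parent."""
    # Features with no entry have no requirements, so synthesis is always allowed.
    required = required_attributes_per_feature.get(parent_feature, set())
    return all(
        child.get(attr) is not None for child in children for attr in required
    )


# Only synthesize a transcript row when its exons carry exon_id.
required = {"transcript": {"exon_id"}}

exons_with_ids = [{"feature": "exon", "transcript_id": "T1", "exon_id": "E1"}]
exons_without_ids = [{"feature": "exon", "transcript_id": "T2"}]

ok_with_ids = can_synthesize("transcript", exons_with_ids, required)
ok_without_ids = can_synthesize("transcript", exons_without_ids, required)
ok_gene = can_synthesize("gene", exons_without_ids, required)
```

Keying by parent feature keeps the default (an empty dict) equivalent to today's best-effort behavior.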

## Why it'd help

Both issues above look like "pyensembl handles partial GTFs", but the upstream signal that something is off (`exon_id` missing, `start_codon` missing) gets swallowed by the synthesizer and only resurfaces later as a crash at query time. A strictness knob would let callers surface that signal when the rows are synthesized instead.

## Scope guard

- Existing default behavior should stay best-effort. This is purely additive.
- Bacteria/archaea GFFs with their own synthesis quirks are out of scope for this issue (separate parser ecosystem).
