Background
GTF files don't enforce that a protein-coding transcript has both a start_codon and a stop_codon row — partial assemblies (e.g. TAIR chr_patch_hapl_scaff fragments, openvax/pyensembl#252) drop one or both even though CDS rows are present. Downstream consumers (pyensembl in particular) discover this at query time — Transcript.coding_sequence had to learn to return None rather than raise.
If gtfparse optionally surfaced this incompleteness at parse time, downstream tooling could decide its policy up front instead of catching KeyError / ValueError per-transcript.
Proposal
A transcript_completeness (or required_features_complete) DataFrame column, opt-in via a kwarg:
df = read_gtf(path, flag_incomplete_transcripts=True)
# adds boolean columns 'has_start_codon' and 'has_stop_codon', aggregated over
# rows sharing a transcript_id, or a single 'transcript_complete' boolean.
Or returned as a separate companion DataFrame indexed by transcript_id, so it doesn't bloat the main GTF DataFrame for callers that don't ask.
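As a rough illustration of the companion-DataFrame variant, the aggregation could look like the sketch below. It assumes a pandas DataFrame with the usual `feature` and `transcript_id` columns; `transcript_completeness` is a hypothetical helper name, not an existing gtfparse function, and the tiny input table is made up for the example.

```python
import pandas as pd


def transcript_completeness(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper (not part of gtfparse): one row per coding transcript."""
    # Gene-level rows carry no transcript_id (empty string or null), so skip them.
    rows = df[df["transcript_id"].notna() & (df["transcript_id"] != "")]
    feature = rows["feature"]
    flags = (
        rows.assign(
            has_cds=feature.eq("CDS"),
            has_start_codon=feature.eq("start_codon"),
            has_stop_codon=feature.eq("stop_codon"),
        )
        .groupby("transcript_id")[["has_cds", "has_start_codon", "has_stop_codon"]]
        .any()
    )
    flags["transcript_complete"] = flags["has_start_codon"] & flags["has_stop_codon"]
    # Only transcripts with CDS rows are judged; non-coding ones are left out.
    return flags[flags["has_cds"]].drop(columns="has_cds")


# Made-up rows: T1 has both codon features, T2 is missing its stop_codon.
gtf = pd.DataFrame(
    {
        "feature": ["gene", "transcript", "CDS", "start_codon", "stop_codon",
                    "transcript", "CDS", "start_codon"],
        "transcript_id": ["", "T1", "T1", "T1", "T1", "T2", "T2", "T2"],
    }
)
completeness = transcript_completeness(gtf)
```

The result is indexed by `transcript_id`, so a caller who wants the flags on the main table can join it back, and a caller who doesn't ask pays nothing.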
Scope guard
- Arguably out of scope for a pure parser — "what's complete" is downstream policy. But the parse already scans every row to expand attributes; adding an aggregated boolean is cheap.
- The exact set of "required" features is opinionated. The proposal here is the common case (CDS rows imply you need start_codon and stop_codon for that transcript to be translatable).
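To make both points concrete, here is a minimal single-pass sketch in plain Python. It assumes rows are available as `(feature, transcript_id)` pairs during the scan; the function name `incomplete_transcripts` and the `required` parameter are hypothetical, chosen only to show how the "required" set could stay a caller-supplied policy.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Set, Tuple

# Common-case policy from the proposal; callers could pass their own set.
DEFAULT_REQUIRED: FrozenSet[str] = frozenset({"start_codon", "stop_codon"})


def incomplete_transcripts(
    rows: Iterable[Tuple[str, str]],
    required: FrozenSet[str] = DEFAULT_REQUIRED,
) -> Dict[str, FrozenSet[str]]:
    """Hypothetical helper: map transcript_id -> required features it lacks."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    # One pass over the rows: remember which feature types each transcript has.
    for feature, transcript_id in rows:
        if transcript_id:
            seen[transcript_id].add(feature)
    # Only transcripts with CDS rows are expected to be translatable.
    return {
        tid: required - feats
        for tid, feats in seen.items()
        if "CDS" in feats and not required <= feats
    }


rows = [
    ("CDS", "T1"), ("start_codon", "T1"), ("stop_codon", "T1"),
    ("CDS", "T2"), ("start_codon", "T2"),
    ("exon", "T3"),
]
missing = incomplete_transcripts(rows)
lenient = incomplete_transcripts(rows, required=frozenset({"start_codon"}))
```

The extra state is one small set per transcript, which is why the aggregation adds little to a scan that already touches every row.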
Related
- ".complete method of Transcript class return a wrong result for some transcript ids." pyensembl#252 (TAIR fragments)
- coding_sequence_position_ranges initially didn't include stop codon