Optional incompleteness flag for required features (start_codon / stop_codon / CDS pairs) #66

@iskandr

Background

GTF files don't enforce that a protein-coding transcript has both a start_codon and a stop_codon row — partial assemblies (e.g. TAIR chr_patch_hapl_scaff fragments, openvax/pyensembl#252) drop one or both even though CDS rows are present. Downstream consumers (pyensembl in particular) discover this at query time — Transcript.coding_sequence had to learn to return None rather than raise.

If gtfparse optionally surfaced this incompleteness at parse time, downstream tooling could decide its policy up front instead of catching KeyError / ValueError per-transcript.

Proposal

A transcript_completeness (or required_features_complete) DataFrame column, opt-in via a kwarg:

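from gtfparse import read_gtf

# flag_incomplete_transcripts is the kwarg proposed here; it does not exist yet
df = read_gtf(path, flag_incomplete_transcripts=True)
# adds 'has_start_codon' and 'has_stop_codon' columns, aggregated over rows
# sharing a transcript_id, or a single 'transcript_complete' boolean.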

Alternatively, the flags could be returned as a separate companion DataFrame indexed by transcript_id, so they don't bloat the main GTF DataFrame for callers that don't ask.
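
For illustration, here is a rough sketch of what the companion DataFrame could look like, computed with plain pandas on an already-parsed table. It assumes a pandas DataFrame with `feature` and `transcript_id` columns and that rows without a transcript carry an empty string or NaN. The helper name `transcript_completeness` and the `has_cds` column are made up for this sketch and are not part of gtfparse.

```python
import pandas as pd


def transcript_completeness(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per transcript_id with boolean feature flags."""
    # Skip rows that are not tied to a transcript (e.g. gene rows).
    rows = df[df["transcript_id"].notna() & (df["transcript_id"] != "")]
    feature = rows["feature"]
    flags = pd.DataFrame(
        {
            "has_cds": feature.eq("CDS"),
            "has_start_codon": feature.eq("start_codon"),
            "has_stop_codon": feature.eq("stop_codon"),
        }
    )
    # A flag is True for a transcript if any of its rows has that feature.
    out = flags.groupby(rows["transcript_id"]).any()
    # Assumed rule: a transcript is incomplete only if it has CDS rows
    # but lacks a start_codon or a stop_codon row.
    out["transcript_complete"] = ~out["has_cds"] | (
        out["has_start_codon"] & out["has_stop_codon"]
    )
    return out


# Toy data: T1 is complete, T2 lacks a stop_codon, T3 is non-coding.
toy = pd.DataFrame(
    {
        "feature": ["gene", "transcript", "CDS", "start_codon", "stop_codon",
                    "transcript", "CDS", "start_codon", "transcript", "exon"],
        "transcript_id": ["", "T1", "T1", "T1", "T1",
                          "T2", "T2", "T2", "T3", "T3"],
    }
)
completeness = transcript_completeness(toy)
```

Keeping this as a separate table means callers who never ask for it pay nothing, and those who do can join it back on transcript_id.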

Scope guard

  • Arguably out of scope for a pure parser — "what's complete" is downstream policy. But the parse already scans every row to expand attributes; adding an aggregated boolean is cheap.
  • The exact set of "required" features is opinionated. The proposal here is the common case (CDS rows imply you need start_codon and stop_codon for that transcript to be translatable).
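
One way to keep the "required" set from being baked in would be to make it a parameter. The sketch below is only an illustration of that idea: `missing_required_features`, `trigger`, and `required` are hypothetical names, and the input is assumed to be a pandas DataFrame with `feature` and `transcript_id` columns.

```python
import pandas as pd


def missing_required_features(df, trigger="CDS",
                              required=("start_codon", "stop_codon")):
    """Hypothetical helper: map transcript_id -> sorted missing required features.

    Only transcripts that contain the trigger feature are checked.
    """
    needed = set(required)
    # Collect the set of feature types seen for each transcript.
    present = df.groupby("transcript_id")["feature"].agg(set)
    return {
        tid: sorted(needed - feats)
        for tid, feats in present.items()
        if trigger in feats and not needed <= feats
    }


# Toy data: T1 is complete, T2 has CDS and start_codon only, T3 is non-coding.
toy_rows = pd.DataFrame(
    {
        "feature": ["CDS", "start_codon", "stop_codon",
                    "CDS", "start_codon", "exon"],
        "transcript_id": ["T1", "T1", "T1", "T2", "T2", "T3"],
    }
)
default_policy = missing_required_features(toy_rows)
start_only_policy = missing_required_features(toy_rows, required=("start_codon",))
```

With the default arguments this implements the common case described above; a caller with a different opinion passes a different `required` tuple.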
