Fix #169: three-tier protein-coding biotype ontology #357

Merged
iskandr merged 1 commit into main from fix-169-extended-protein-coding-flags
May 13, 2026
@iskandr iskandr commented May 13, 2026


Summary

Resolves #169. Adds a layered set of biotype flags for "does this transcript make a polypeptide?":

  • is_protein_coding — unchanged. Strict canonical protein_coding only. Preserves the contract that downstream effect predictors (e.g. varcode) depend on.
  • is_protein_coding_extended — widens to IG/TR gene segments and translated pseudogenes:
    • IG_{C,D,J,V}_gene, TR_{C,D,J,V}_gene — produce protein after V(D)J recombination
    • polymorphic_pseudogene — codes for protein in some individuals
    • translated_{processed,unprocessed}_pseudogene — pseudogenes with translation evidence
    • Excludes nonsense_mediated_decay (NMD) and non_stop_decay (NSD) because those products are targeted for degradation rather than stable accumulation.
  • is_translated — widest tier. Adds nonsense_mediated_decay and non_stop_decay on top. Useful when picking a top variant effect or doing RNA-seq peptide analysis where ribosome occupancy matters more than stable expression.

The invariant strict ⊂ extended ⊂ translated holds and is tested.

The set constants PROTEIN_CODING_BIOTYPES, EXTENDED_PROTEIN_CODING_BIOTYPES, and TRANSLATED_BIOTYPES are exported from pyensembl.locus_with_genome for callers who want to derive their own categorization.
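
As a rough illustration of how the three tiers nest, the sets can be reconstructed from the biotype names listed above. This is a minimal sketch built only from this description, not a copy of the definitions in pyensembl.locus_with_genome; the constant names come from the PR, while the `classify` helper is hypothetical and exists only for this example.

```python
# Sketch reconstructed from the PR description; the real constants live in
# pyensembl.locus_with_genome and may be defined differently.
PROTEIN_CODING_BIOTYPES = frozenset({"protein_coding"})

EXTENDED_PROTEIN_CODING_BIOTYPES = PROTEIN_CODING_BIOTYPES | frozenset(
    # IG_{C,D,J,V}_gene and TR_{C,D,J,V}_gene
    [f"{locus}_{segment}_gene" for locus in ("IG", "TR") for segment in "CDJV"]
    + [
        "polymorphic_pseudogene",
        "translated_processed_pseudogene",
        "translated_unprocessed_pseudogene",
    ]
)

TRANSLATED_BIOTYPES = EXTENDED_PROTEIN_CODING_BIOTYPES | frozenset(
    {"nonsense_mediated_decay", "non_stop_decay"}
)


def classify(biotype):
    """Hypothetical helper: (strict, extended, translated) flags for a biotype.

    A biotype of None is a member of no set, so all three flags are False.
    """
    return (
        biotype in PROTEIN_CODING_BIOTYPES,
        biotype in EXTENDED_PROTEIN_CODING_BIOTYPES,
        biotype in TRANSLATED_BIOTYPES,
    )


# The invariant strict ⊂ extended ⊂ translated as proper-subset checks.
assert PROTEIN_CODING_BIOTYPES < EXTENDED_PROTEIN_CODING_BIOTYPES < TRANSLATED_BIOTYPES
```

Building each wider set as a union of the narrower one makes the subset invariant hold by construction, so a test for it mainly guards against later edits that redefine a tier independently.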

Test plan

  • New tests/test_extended_protein_coding.py covers:
    • monotonic subset invariants between the three sets
    • exact membership for the extras at each tier (IG/TR/translated pseudogenes; NMD/NSD)
    • per-biotype property dispatch across all three tiers, including biotype=None
    • back-compat with the existing mouse partial fixture
  • pytest passes locally (56 passed)
  • ruff check clean
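
The per-biotype dispatch check in the test plan could look roughly like the sketch below. It does not reproduce tests/test_extended_protein_coding.py: `FakeLocus` is a hypothetical stand-in for a pyensembl object, the short set names are local to the example, and only the three property names are taken from the PR.

```python
# Hypothetical, self-contained sketch; the real tests exercise pyensembl
# objects rather than this stand-in class.
STRICT = frozenset({"protein_coding"})
EXTENDED = STRICT | {
    "IG_C_gene", "IG_D_gene", "IG_J_gene", "IG_V_gene",
    "TR_C_gene", "TR_D_gene", "TR_J_gene", "TR_V_gene",
    "polymorphic_pseudogene",
    "translated_processed_pseudogene",
    "translated_unprocessed_pseudogene",
}
TRANSLATED = EXTENDED | {"nonsense_mediated_decay", "non_stop_decay"}


class FakeLocus:
    """Stand-in exposing the three flags described in the PR."""

    def __init__(self, biotype):
        self.biotype = biotype

    @property
    def is_protein_coding(self):
        return self.biotype in STRICT

    @property
    def is_protein_coding_extended(self):
        return self.biotype in EXTENDED

    @property
    def is_translated(self):
        return self.biotype in TRANSLATED


# Expected (strict, extended, translated) flags, including biotype=None.
EXPECTED = {
    "protein_coding": (True, True, True),
    "TR_J_gene": (False, True, True),
    "non_stop_decay": (False, False, True),
    "lncRNA": (False, False, False),
    None: (False, False, False),
}


def test_subset_invariant():
    assert STRICT < EXTENDED < TRANSLATED


def test_property_dispatch():
    for biotype, expected in EXPECTED.items():
        locus = FakeLocus(biotype)
        actual = (
            locus.is_protein_coding,
            locus.is_protein_coding_extended,
            locus.is_translated,
        )
        assert actual == expected, biotype
```

The functions use plain `assert` statements and `test_` prefixes, so pytest would collect them without any extra imports.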

Introduces a layered set of flags for "does this transcript make a
polypeptide?":

* `is_protein_coding` - unchanged, strict canonical `protein_coding`
  only. Preserves the contract downstream effect predictors (varcode)
  depend on.
* `is_protein_coding_extended` - widens to IG/TR gene segments and
  translated pseudogenes (`polymorphic_pseudogene`,
  `translated_{processed,unprocessed}_pseudogene`). Excludes NMD/NSD
  because those products are targeted for degradation rather than
  stable accumulation.
* `is_translated` - widest tier. Adds `nonsense_mediated_decay` and
  `non_stop_decay` on top of the extended set. Useful when picking the
  top variant effect or doing RNA-seq peptide analysis where ribosome
  occupancy matters more than stable expression.

Set constants `PROTEIN_CODING_BIOTYPES`,
`EXTENDED_PROTEIN_CODING_BIOTYPES`, and `TRANSLATED_BIOTYPES` are
exposed on `pyensembl.locus_with_genome` so callers can derive their
own categorization if needed.
@coveralls

Coverage Status

coverage: 85.494% (+0.07%) from 85.426% — fix-169-extended-protein-coding-flags into main

@iskandr iskandr merged commit bcfbeaa into main May 13, 2026
10 checks passed
@iskandr iskandr deleted the fix-169-extended-protein-coding-flags branch May 13, 2026 18:37
Development

Successfully merging this pull request may close these issues.

Support all protein coding biotypes