Fix #335 (part 1): wire GENCODE_BIOTYPE_ALIASES into read_gtf call #356

Merged: iskandr merged 1 commit into main from fix-335-part1-gencode-biotype-aliases on May 13, 2026


@iskandr iskandr commented May 13, 2026

Summary

Closes the GTF half of #335. With both this PR and #350 / v2.9.6 (versioned protein-ID FASTA matching) deployed, the original repro from the issue (Variant(...).effects() returning NoncodingTranscript for every transcript when wired to a GENCODE genome) finally produces correct Substitution effects.

Mechanism

gtfparse 2.7.0 (just released, openvax/gtfparse#67) introduced an attribute_aliases kwarg on read_gtf that renames source columns onto canonical names at parse time, plus a GENCODE_BIOTYPE_ALIASES constant for the two relevant GENCODE → Ensembl renames (gene_type → gene_biotype, transcript_type → transcript_biotype). pyensembl now passes that constant unconditionally:

  • Vanilla Ensembl GTFs: the alias source columns don't exist, so the rename is a no-op (gtfparse silently skips missing alias source columns).
  • GENCODE GTFs: gene_type / transcript_type get renamed to gene_biotype / transcript_biotype before they hit pyensembl's sqlite schema.
  • Mixed (both column variants present in same GTF): canonical wins, gtfparse logs a warning.
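
The three cases above can be sketched in plain Python. This is only an illustration of the described semantics, not gtfparse's implementation: `apply_aliases`, the `BIOTYPE_ALIASES` dict, and the dict-of-lists column layout are hypothetical stand-ins, and leaving the alias source column in place in the mixed case is an assumption (the description only says that canonical wins and a warning is logged).

```python
import warnings

# Hypothetical stand-in for the constant described above; the real
# GENCODE_BIOTYPE_ALIASES lives in gtfparse and may be shaped differently.
BIOTYPE_ALIASES = {
    "gene_type": "gene_biotype",
    "transcript_type": "transcript_biotype",
}


def apply_aliases(columns, aliases):
    """Rename alias source columns onto their canonical names.

    columns: dict mapping column name -> list of values
    aliases: dict mapping source column name -> canonical column name
    Returns a new dict; the input is not modified.
    """
    result = dict(columns)
    for source, canonical in aliases.items():
        if source not in result:
            # Vanilla Ensembl case: nothing to rename, skip silently.
            continue
        if canonical in result:
            # Mixed case: the canonical column wins and we only warn.
            # Assumption: the source column is left untouched.
            warnings.warn(
                f"both {source!r} and {canonical!r} present; keeping {canonical!r}"
            )
            continue
        # GENCODE case: move the values onto the canonical name.
        result[canonical] = result.pop(source)
    return result


gencode = {"gene_id": ["G1"], "gene_type": ["protein_coding"]}
renamed = apply_aliases(gencode, BIOTYPE_ALIASES)
print(sorted(renamed))
```

Passing the aliases unconditionally is safe under these semantics because the Ensembl case never touches the data, which is why no GENCODE-vs-Ensembl detection step is needed.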

Dependency

Bumps gtfparse dep floor from >=2.6.0 to >=2.7.0, since GENCODE_BIOTYPE_ALIASES and attribute_aliases= are new in 2.7.0.

Out of scope

This is the biotype half of #335. The versioned-ID half was fixed in #350 / v2.9.6 (and refined again in #355, which is in flight). Together they close #335 in full.

Test plan

  • New tests/test_gencode_biotype_aliases.py covers:
    • GENCODE-style GTF (with gene_type / transcript_type) lands in the sqlite DB with biotype columns populated
    • Genome.genes(biotype="protein_coding") and transcripts(biotype="protein_coding") filter correctly against the renamed columns
    • Transcript.is_protein_coding returns True for coding rows and False for lncRNA rows, the exact thing the #335 user repro was failing on
    • Vanilla Ensembl GTF (mouse partial fixture) is unaffected — alias pass-through is a no-op
  • pytest tests/test_gencode_biotype_aliases.py tests/test_mouse.py tests/test_tair10_complete.py tests/test_biotype_filter.py tests/test_versions.py tests/test_versioned_protein_fasta.py — 24 passed locally
  • ./lint.sh
  • CI on PR

Bumps version to 2.9.8.

GENCODE GTFs use gene_type / transcript_type where Ensembl uses
gene_biotype / transcript_biotype. Before this fix, pointing pyensembl
at a vanilla GENCODE GTF produced a sqlite database with empty biotype
columns, so Transcript.is_protein_coding returned False for every
transcript and downstream tools like varcode reported NoncodingTranscript
for every variant effect (the original #335 user repro).
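
A minimal sketch of that failure mode, using an in-memory sqlite table. The table name, column layout, and transcript IDs here are made up for illustration and are not pyensembl's actual schema; the point is only that a biotype filter matches nothing when the column was never populated.

```python
import sqlite3


def count_protein_coding(rows):
    """Count rows whose biotype is 'protein_coding'.

    rows: list of (transcript_id, transcript_biotype) tuples.
    The single-table layout is hypothetical, not pyensembl's schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transcript (transcript_id TEXT, transcript_biotype TEXT)"
    )
    conn.executemany("INSERT INTO transcript VALUES (?, ?)", rows)
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM transcript "
        "WHERE transcript_biotype = 'protein_coding'"
    ).fetchone()
    conn.close()
    return n


# Before the fix: GENCODE's transcript_type never reached the biotype
# column, so every row has an empty biotype.
before = [("ENST0001.1", ""), ("ENST0002.1", "")]

# After the fix: the values arrive under the canonical column name.
after = [("ENST0001.1", "protein_coding"), ("ENST0002.1", "lncRNA")]

print(count_protein_coding(before), count_protein_coding(after))
```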

gtfparse 2.7.0 introduced an attribute_aliases kwarg on read_gtf that
renames source columns onto canonical names at parse time, plus a
GENCODE_BIOTYPE_ALIASES constant for the two GENCODE->Ensembl renames.
pyensembl now passes that constant unconditionally:

* Pure Ensembl GTFs: alias sources don't exist, the rename is a no-op
  (gtfparse skips missing alias source columns).
* GENCODE GTFs: gene_type / transcript_type get renamed to
  gene_biotype / transcript_biotype before they hit the sqlite schema.
* Mixed (both column variants present): canonical wins, gtfparse warns.

Bumps gtfparse dep floor to >=2.7.0.

Companion to #350 / v2.9.6 (versioned protein-ID FASTA matching). With
both shipped, the original #335 GENCODE genome example finally produces
Substitution effects instead of NoncodingTranscript.

Bump version to 2.9.8.
@iskandr iskandr merged commit 339f07d into main May 13, 2026
10 checks passed
@iskandr iskandr deleted the fix-335-part1-gencode-biotype-aliases branch May 13, 2026 15:59
@coveralls

Coverage Status

coverage: 84.971%. remained the same — fix-335-part1-gencode-biotype-aliases into main
