Fix #335 (part 1): wire GENCODE_BIOTYPE_ALIASES into read_gtf call #356
Merged
GENCODE GTFs use gene_type / transcript_type where Ensembl uses gene_biotype / transcript_biotype. Before this fix, pointing pyensembl at a vanilla GENCODE GTF produced a sqlite database with empty biotype columns, so Transcript.is_protein_coding returned False for every transcript and downstream tools like varcode reported NoncodingTranscript for every variant effect (the original #335 user repro).

gtfparse 2.7.0 introduced an attribute_aliases kwarg on read_gtf that renames source columns onto canonical names at parse time, plus a GENCODE_BIOTYPE_ALIASES constant for the two GENCODE->Ensembl renames. pyensembl now passes that constant unconditionally:

* Pure Ensembl GTFs: alias sources don't exist, the rename is a no-op (gtfparse skips missing alias source columns).
* GENCODE GTFs: gene_type / transcript_type get renamed to gene_biotype / transcript_biotype before they hit the sqlite schema.
* Mixed (both column variants present): canonical wins, gtfparse warns.

Bumps gtfparse dep floor to >=2.7.0.

Companion to #350 / v2.9.6 (versioned protein-ID FASTA matching). With both shipped, the original #335 GENCODE genome example finally produces Substitution effects instead of NoncodingTranscript.

Bump version to 2.9.8.
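
The three cases described above (pure Ensembl, GENCODE, mixed) can be pictured with a small stand-alone sketch. This is not gtfparse's implementation: the function rename_aliased_columns and the local ALIASES dict are hypothetical, the real GENCODE_BIOTYPE_ALIASES constant may be shaped differently, and leaving the alias column untouched in the mixed case is an assumption made only for illustration.

```python
import warnings

# Stand-in for the two GENCODE -> Ensembl renames named in this PR.
# Assumption: a plain {source_name: canonical_name} mapping; the real
# gtfparse constant may use a different shape.
ALIASES = {
    "gene_type": "gene_biotype",
    "transcript_type": "transcript_biotype",
}


def rename_aliased_columns(columns, aliases=ALIASES):
    """Return a copy of `columns` with alias names mapped onto canonical names."""
    result = dict(columns)
    for source, canonical in aliases.items():
        if source not in result:
            # Pure Ensembl GTF: the alias source is absent, nothing to do.
            continue
        if canonical in result:
            # Mixed file: the canonical column wins and a warning is raised.
            # Assumption: the alias column is simply left in place.
            warnings.warn(
                f"both {source!r} and {canonical!r} present; keeping {canonical!r}"
            )
            continue
        # GENCODE GTF: move the values under the canonical name.
        result[canonical] = result.pop(source)
    return result


ensembl = rename_aliased_columns({"gene_biotype": ["protein_coding"]})
gencode = rename_aliased_columns(
    {"gene_type": ["lncRNA"], "transcript_type": ["lncRNA"]}
)
```

In this sketch the Ensembl-style input comes back unchanged, while the GENCODE-style input ends up with only gene_biotype / transcript_biotype keys, which are the names the sqlite schema expects.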
Summary
Closes the GTF half of #335. With both this PR and #350 / v2.9.6 (versioned protein-ID FASTA matching) deployed, the original repro from the issue (Variant(...).effects() returning NoncodingTranscript for every transcript when wired to a GENCODE genome) finally produces correct Substitution effects.
Mechanism
gtfparse 2.7.0 (just released, openvax/gtfparse#67) introduced an attribute_aliases kwarg on read_gtf that renames source columns onto canonical names at parse time, plus a GENCODE_BIOTYPE_ALIASES constant for the two relevant GENCODE → Ensembl renames (gene_type → gene_biotype, transcript_type → transcript_biotype). pyensembl now passes that constant unconditionally, so gene_type / transcript_type get renamed to gene_biotype / transcript_biotype before they hit pyensembl's sqlite schema.
Dependency
Bumps gtfparse dep floor from >=2.6.0 to >=2.7.0: GENCODE_BIOTYPE_ALIASES and attribute_aliases= are new in 2.7.0.
Out of scope
This is the biotype half of #335. The versioned-ID half was fixed in #350 / v2.9.6 (and refined again in #355, which is in flight). Together they close the issue in full.
Test plan
* tests/test_gencode_biotype_aliases.py covers:
  * A GENCODE-style GTF (gene_type / transcript_type) lands in the sqlite DB with biotype columns populated
  * Genome.genes(biotype="protein_coding") and transcripts(biotype="protein_coding") filter correctly against the renamed columns
  * Transcript.is_protein_coding returns True for coding rows and False for lncRNA rows, the exact thing the #335 user repro (ID version handling and GENCODE compatibility) was failing on
* pytest tests/test_gencode_biotype_aliases.py tests/test_mouse.py tests/test_tair10_complete.py tests/test_biotype_filter.py tests/test_versions.py tests/test_versioned_protein_fasta.py: 24 passed locally
* ./lint.sh

Bumps version to 2.9.8.
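
To make the test plan concrete, here is a rough sketch of why an empty biotype column breaks both the biotype filter and the protein-coding check. It uses an in-memory sqlite table with a made-up schema and made-up IDs (T1, T2, T3); the helpers transcript_ids_with_biotype and is_protein_coding are hypothetical stand-ins, and the assumption that the check reduces to comparing the stored biotype against "protein_coding" is illustrative rather than a statement about pyensembl's code.

```python
import sqlite3

# Hypothetical, simplified table; pyensembl's real schema has more columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transcript (transcript_id TEXT, transcript_biotype TEXT)")
conn.executemany(
    "INSERT INTO transcript VALUES (?, ?)",
    [
        ("T1", "protein_coding"),
        ("T2", "lncRNA"),
        ("T3", ""),  # an empty biotype, as every GENCODE row had before the fix
    ],
)


def transcript_ids_with_biotype(biotype):
    """Filter transcripts by biotype, analogous to transcripts(biotype=...)."""
    query = (
        "SELECT transcript_id FROM transcript "
        "WHERE transcript_biotype = ? ORDER BY transcript_id"
    )
    return [row[0] for row in conn.execute(query, (biotype,))]


def is_protein_coding(transcript_id):
    """Assumed check: the stored biotype equals 'protein_coding'."""
    row = conn.execute(
        "SELECT transcript_biotype FROM transcript WHERE transcript_id = ?",
        (transcript_id,),
    ).fetchone()
    return row is not None and row[0] == "protein_coding"


coding_ids = transcript_ids_with_biotype("protein_coding")
```

A row whose biotype was never populated (T3 here) can never match the filter or pass the check, which is consistent with every transcript being reported as noncoding before the rename was wired in.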