Thanks for making & maintaining this wonderful tool. I encountered a large in-frame deletion (StopLoss) at 16:30128111, which caused varcode to raise a ValueError that no amino acids were added.
Steps to reproduce
Environment:
- Python v3.12
- Varcode v6.0.0
- Ensembl release 75 (GRCh37)
pyensembl install --release 75
>>> from pyensembl import ensembl_grch37
>>> import varcode
>>> variant = varcode.Variant(contig='16', start=30128150, ref='GGGATGCCTACGTGCCCCC', alt='G', ensembl=ensembl_grch37)
>>> variant.effects()
INFO:varcode.effects.effect_prediction:Mutation in variant Variant(contig='16', start=30128152, ref='GGATGCCTACGT', alt='', reference_name='GRCh37') starts before exon Exon(exon_id='ENSE00001520895', gene_id='ENSG00000102882', gene_name='MAPK3', contig='16', start=30128158, end=30128324, strand='-')
INFO:pyensembl.sequence_data:Loaded sequence dictionary from cache/pyensembl/GRCh37/ensembl75/Homo_sapiens.GRCh37.75.pep.all.fa.gz.pickle
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "python3.12/site-packages/varcode/variant.py", line 513, in effects
effects = predict_variant_effects(
^^^^^^^^^^^^^^^^^^^^^^^^
File "python3.12/site-packages/varcode/effects/effect_prediction.py", line 170, in predict_variant_effects
effect = annotate(variant, transcript)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "python3.12/site-packages/varcode/annotators/protein_diff.py", line 104, in annotate_on_transcript
fast_effect = FastEffectAnnotator().annotate_on_transcript(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "python3.12/site-packages/varcode/annotators/fast.py", line 57, in annotate_on_transcript
return _predict_variant_effect_on_transcript_raw(variant, transcript)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "python3.12/site-packages/varcode/effects/effect_prediction.py", line 323, in _predict_variant_effect_on_transcript_raw
exonic_effect_annotation = exonic_transcript_effect(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "python3.12/site-packages/varcode/effects/effect_prediction.py", line 552, in exonic_transcript_effect
coding_effect_annotation = predict_variant_coding_effect_on_transcript(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File python3.12/site-packages/varcode/effects/effect_prediction_coding.py", line 114, in predict_variant_coding_effect_on_transcript
return predict_in_frame_coding_effect(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "python3.12/site-packages/varcode/effects/effect_prediction_coding_in_frame.py", line 289, in predict_in_frame_coding_effect
return StopLoss(
^^^^^^^^^
File "python3.12/site-packages/varcode/effects/effect_classes.py", line 1254, in __init__
raise ValueError(
ValueError: If no amino acids added by StopLoss then it should be Silent
Investigation and discussion
When varcode gets this StopLoss variant of transcript ENST00000395199, it tries to fetch its 3' UTR sequence. But for this particular annotation, it doesn't have a 3' UTR sequence (empty as transcript.three_prime_utr_sequence == ''), so variant crashes because it tries to predict aa change using the empty UTR seq.
After some code tracing, I believe the relevant code are:
Relevant code snippets
# In varcode/effects/effect_classes.py
class StopLoss(KnownAminoAcidChange):
def __init__(
self,
variant,
transcript,
aa_ref,
aa_alt):
# StopLoss assumes that we deleted some codons ending with a
# stop codon
if "*" in aa_ref:
raise ValueError(
"StopLoss aa_ref '%s' should not contain '*'" % (
aa_ref,))
if len(aa_alt) == 0:
raise ValueError(
"If no amino acids added by StopLoss then it should be Silent")
# In varcode/effects/effect_prediction_coding_in_frame.py
def predict_in_frame_coding_effect(...):
if ... :
...
elif using_three_prime_utr:
# if non-silent mutation is at the end of the protein then
# should be a stop-loss
return StopLoss(
variant,
transcript,
aa_ref=aa_ref,
aa_alt=aa_alt)
elif n_aa_alt == 0:
return Deletion(
variant,
transcript,
aa_mutation_start_offset=aa_mutation_start_offset,
aa_ref=aa_ref)
In this situation, maybe this variant effect could be a Deletion() instead? A naive check is to create a Deletion() if 3' UTR is empty, though it's theoretically possible to have a very short UTR and still fails the len(aa_alt) == 0 assertion.
See also
A similar issue (and variant) has been reported in #246 but the proposed PR didn't address the potential scenario of an empty 3' UTR sequence.
Thanks for making & maintaining this wonderful tool. I encountered a large in-frame deletion (StopLoss) at 16:30128111, which caused varcode to raise a ValueError that no amino acids were added.
Steps to reproduce
Environment:
pyensembl install --release 75Investigation and discussion
When varcode gets this StopLoss variant of transcript
ENST00000395199, it tries to fetch its 3' UTR sequence. But for this particular annotation, it doesn't have a 3' UTR sequence (empty astranscript.three_prime_utr_sequence == ''), so variant crashes because it tries to predict aa change using the empty UTR seq.After some code tracing, I believe the relevant code are:
Relevant code snippets
In this situation, maybe this variant effect could be a
Deletion()instead? A naive check is to create aDeletion()if 3' UTR is empty, though it's theoretically possible to have a very short UTR and still fails thelen(aa_alt) == 0assertion.See also
A similar issue (and variant) has been reported in #246 but the proposed PR didn't address the potential scenario of an empty 3' UTR sequence.