Skip to content

Error with StopLoss variant effect with an empty 3' UTR #394

@ccwang002

Description

@ccwang002

Thanks for making & maintaining this wonderful tool. I encountered a large in-frame deletion (StopLoss) at 16:30128111, which caused varcode to raise a ValueError that no amino acids were added.

Steps to reproduce

Environment:

  • Python v3.12
  • Varcode v6.0.0
  • Ensembl release 75 (GRCh37) pyensembl install --release 75
>>> from pyensembl import ensembl_grch37 
>>> import varcode
>>> variant = varcode.Variant(contig='16', start=30128150, ref='GGGATGCCTACGTGCCCCC', alt='G', ensembl=ensembl_grch37)
>>> variant.effects()
INFO:varcode.effects.effect_prediction:Mutation in variant Variant(contig='16', start=30128152, ref='GGATGCCTACGT', alt='', reference_name='GRCh37') starts before exon Exon(exon_id='ENSE00001520895', gene_id='ENSG00000102882', gene_name='MAPK3', contig='16', start=30128158, end=30128324, strand='-')
INFO:pyensembl.sequence_data:Loaded sequence dictionary from cache/pyensembl/GRCh37/ensembl75/Homo_sapiens.GRCh37.75.pep.all.fa.gz.pickle
Traceback (most recent call last):                                                            
  File "<stdin>", line 1, in <module>
  File "python3.12/site-packages/varcode/variant.py", line 513, in effects
    effects = predict_variant_effects(                                                        
              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "python3.12/site-packages/varcode/effects/effect_prediction.py", line 170, in predict_variant_effects    
    effect = annotate(variant, transcript)          
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                  
  File "python3.12/site-packages/varcode/annotators/protein_diff.py", line 104, in annotate_on_transcript       
    fast_effect = FastEffectAnnotator().annotate_on_transcript(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                             
  File "python3.12/site-packages/varcode/annotators/fast.py", line 57, in annotate_on_transcript                           
    return _predict_variant_effect_on_transcript_raw(variant, transcript)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                   
  File "python3.12/site-packages/varcode/effects/effect_prediction.py", line 323, in _predict_variant_effect_on_transcript_raw
    exonic_effect_annotation = exonic_transcript_effect(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "python3.12/site-packages/varcode/effects/effect_prediction.py", line 552, in exonic_transcript_effect
    coding_effect_annotation = predict_variant_coding_effect_on_transcript(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File python3.12/site-packages/varcode/effects/effect_prediction_coding.py", line 114, in predict_variant_coding_effect_on_transcript
    return predict_in_frame_coding_effect(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "python3.12/site-packages/varcode/effects/effect_prediction_coding_in_frame.py", line 289, in predict_in_frame_coding_effect
    return StopLoss(
           ^^^^^^^^^
  File "python3.12/site-packages/varcode/effects/effect_classes.py", line 1254, in __init__
    raise ValueError(
ValueError: If no amino acids added by StopLoss then it should be Silent

Investigation and discussion

When varcode gets this StopLoss variant of transcript ENST00000395199, it tries to fetch its 3' UTR sequence. But for this particular annotation, it doesn't have a 3' UTR sequence (empty as transcript.three_prime_utr_sequence == ''), so variant crashes because it tries to predict aa change using the empty UTR seq.

After some code tracing, I believe the relevant code are:

Relevant code snippets
# In varcode/effects/effect_classes.py
class StopLoss(KnownAminoAcidChange):
    def __init__(
            self,
            variant,
            transcript,
            aa_ref,
            aa_alt):
        # StopLoss assumes that we deleted some codons ending with a
        # stop codon
        if "*" in aa_ref:
            raise ValueError(
                "StopLoss aa_ref '%s' should not contain '*'" % (
                    aa_ref,))
        if len(aa_alt) == 0:
            raise ValueError(
                "If no amino acids added by StopLoss then it should be Silent")
# In varcode/effects/effect_prediction_coding_in_frame.py
def predict_in_frame_coding_effect(...):
    if ... :
        ...
    elif using_three_prime_utr:
        # if non-silent mutation is at the end of the protein then
        # should be a stop-loss
        return StopLoss(
            variant,
            transcript,
            aa_ref=aa_ref,
            aa_alt=aa_alt)
    elif n_aa_alt == 0:
        return Deletion(
            variant,
            transcript,
            aa_mutation_start_offset=aa_mutation_start_offset,
            aa_ref=aa_ref)

In this situation, maybe this variant effect could be a Deletion() instead? A naive check is to create a Deletion() if 3' UTR is empty, though it's theoretically possible to have a very short UTR and still fails the len(aa_alt) == 0 assertion.

See also

A similar issue (and variant) has been reported in #246 but the proposed PR didn't address the potential scenario of an empty 3' UTR sequence.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions