Support VCF symbolic alleles and breakends as extended variant types #264

@iskandr

Background

VCF 4.0+ allows alternate alleles that are not literal nucleotide strings:

Symbolic alleles — placeholders inside angle brackets, with detail supplied in INFO:

  • <DEL> — deletion
  • <DUP> — duplication (tandem, dispersed)
  • <INS> — insertion of unspecified sequence
  • <INV> — inversion
  • <CN0>, <CN1>, <CN2>, <CN3>, ... — copy number states
  • <INS:ME:ALU>, <INS:ME:LINE1>, <INS:ME:SVA> — mobile element insertions
  • <DEL:ME:ALU> — deletion of a mobile element

Breakend (BND) notation — two-ended rearrangements joining distant loci:

  • G]17:198982] — G joined to position 17:198982, orientations encoded by bracket direction
  • ]17:198982]G, [13:123456[T, T[13:123456[ — variants for different orientations
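To make the bracket forms concrete, here is a minimal sketch of a breakend ALT parser. The names `Breakend` and `parse_breakend` are hypothetical (they are not part of the current codebase), and the sketch only covers the four paired forms listed above, not single breakends.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches t[p[, t]p], ]p]t and [p[t, where p is "chrom:pos".
# The contig part is matched greedily so names containing ':' still work.
_BND_RE = re.compile(
    r"^(?P<lead>[ACGTNacgtn]*)"
    r"(?P<open>[\[\]])"
    r"(?P<chrom>.+):(?P<pos>\d+)"
    r"(?P<close>[\[\]])"
    r"(?P<trail>[ACGTNacgtn]*)$"
)


@dataclass(frozen=True)
class Breakend:  # hypothetical container, for illustration only
    bases: str            # local bases (the "t" in the VCF spec)
    mate_chrom: str
    mate_pos: int
    bases_first: bool     # True for t[p[ and t]p], False for ]p]t and [p[t
    bracket: str          # "[": mate piece extends right of p; "]": extends left


def parse_breakend(alt: str) -> Optional[Breakend]:
    """Return a Breakend for a paired-breakend ALT string, else None."""
    m = _BND_RE.match(alt)
    if m is None or m["open"] != m["close"]:
        return None
    lead, trail = m["lead"], m["trail"]
    # Exactly one side of the brackets must carry the local bases.
    if bool(lead) == bool(trail):
        return None
    return Breakend(
        bases=lead or trail,
        mate_chrom=m["chrom"],
        mate_pos=int(m["pos"]),
        bases_first=bool(lead),
        bracket=m["open"],
    )
```

For example, `parse_breakend("G]17:198982]")` yields a mate at contig `17`, position `198982`, with the local base `G` written before the brackets; a plain allele such as `"A"` returns `None`.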

Spanning deletion placeholder:

  • * — indicates an allele deleted by an upstream variant
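The three categories above, plus ordinary nucleotide alleles, could be told apart with a small classifier that runs before any variant object is built. This is only a sketch: `AltKind` and `classify_alt` are hypothetical names, and anything that fits none of the categories (for example a lone `.`) falls into an `UNKNOWN` bucket rather than being guessed at.

```python
import re
from enum import Enum


class AltKind(Enum):  # hypothetical enum, for illustration only
    LITERAL = "literal"
    SYMBOLIC = "symbolic"
    BREAKEND = "breakend"
    SPANNING_DELETION = "spanning_deletion"
    UNKNOWN = "unknown"


_LITERAL_RE = re.compile(r"[ACGTNacgtn]+")


def classify_alt(alt: str) -> AltKind:
    """Classify a single ALT string from a VCF record."""
    if alt == "*":
        return AltKind.SPANNING_DELETION
    # Symbolic alleles are wrapped in angle brackets, e.g. <DEL> or <INS:ME:ALU>.
    if len(alt) > 2 and alt.startswith("<") and alt.endswith(">"):
        return AltKind.SYMBOLIC
    # Paired breakends always contain a square bracket.
    if "[" in alt or "]" in alt:
        return AltKind.BREAKEND
    if _LITERAL_RE.fullmatch(alt):
        return AltKind.LITERAL
    return AltKind.UNKNOWN
```

Checking symbolic alleles before breakends matters because a breakend mate may itself name a contig in angle brackets, but such a string never both starts with `<` and ends with `>`.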

Current state (after #88 is fixed in PR #XXX)

load_vcf() now detects these alleles and skips them with a visible warning instead of crashing. This preserves the rest of the VCF but still drops real variants that a user might care about.

Desired state

Represent these alleles as first-class variant types so downstream code (effect prediction, filtering, annotation export) can reason about them. This ties directly to the in-progress structural variant work (#257).

Implementation sketch

  1. Parse INFO fields for SV context: SVTYPE, END, SVLEN, CIPOS, CIEND, MATEID, CHR2/POS2, INSSEQ, etc.
  2. Dispatch table mapping each symbolic allele (<DEL>, <DUP>, <INS>, <INV>, <CN*>, ...) to a variant type.
  3. Breakend pairing: match MATEID pairs and construct a single Translocation/Breakpoint from the pair rather than two half-breakends (see #257, "Add structural variant types (translocations, inversions, duplications, breakpoints)").
  4. Spanning deletion *: represent as a reference to the variant that consumed the position, or skip with metadata (it's a placeholder, not an independent variant).
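Steps 2 and 3 might look roughly like the sketch below. Everything here is an assumption for discussion: the type names in `SYMBOLIC_DISPATCH` are placeholder strings rather than the classes from #257, records are plain dicts with `id` and `mateid` keys, and `MATEID` is treated as a single string.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical mapping from the leading symbolic token to a variant-type name.
SYMBOLIC_DISPATCH: Dict[str, str] = {
    "DEL": "Deletion",
    "DUP": "Duplication",
    "INS": "Insertion",
    "INV": "Inversion",
}


def dispatch_symbolic(alt: str) -> Optional[Tuple[str, Optional[str]]]:
    """Map '<DEL:ME:ALU>'-style ALTs to (type_name, subtype), else None."""
    if not (alt.startswith("<") and alt.endswith(">")):
        return None
    head, _, subtype = alt[1:-1].partition(":")
    # Copy number states such as <CN0>, <CN3> carry the count in the token.
    if re.fullmatch(r"CN\d+", head):
        return ("CopyNumber", head[2:])
    type_name = SYMBOLIC_DISPATCH.get(head)
    if type_name is None:
        return None
    return (type_name, subtype or None)


def pair_breakends(records: List[dict]) -> Tuple[List[Tuple[dict, dict]], List[dict]]:
    """Group breakend records into reciprocal MATEID pairs.

    Returns (pairs, unpaired). A record is paired only when its mate exists
    in the input and points back at it.
    """
    by_id = {r["id"]: r for r in records}
    pairs: List[Tuple[dict, dict]] = []
    unpaired: List[dict] = []
    seen = set()
    for rec in records:
        if rec["id"] in seen:
            continue
        mate = by_id.get(rec.get("mateid"))
        if (
            mate is not None
            and mate["id"] != rec["id"]
            and mate["id"] not in seen
            and mate.get("mateid") == rec["id"]
        ):
            pairs.append((rec, mate))
            seen.update((rec["id"], mate["id"]))
        else:
            unpaired.append(rec)
            seen.add(rec["id"])
    return pairs, unpaired
```

Returning unpaired breakends separately, instead of dropping them, leaves room for a later decision on whether a half-breakend should become its own variant or be reported with a warning.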

Relation to existing issues