46 changes: 45 additions & 1 deletion CHANGELOG.md
@@ -1,6 +1,6 @@
# Change Log

## [Unreleased]
## [v6.0.0](https://github.com/openvax/varcode/tree/v6.0.0) (2026-05-26)

**Fixed**
- `apply_variants_to_transcript` now refuses an insertion abutting
@@ -26,8 +26,52 @@
through `MolecularPhaseResolver(source)` and lives behind the optional
`varcode[rna]` / `pysam` dependency. `ReadPhaseResolver` remains as
the varcode 5.0 compatibility name.
- `SpliceOutcomeSet.effect_if_splicing_unchanged` — the canonical
"alternative outcome" accessor: the coding consequence that applies
if splicing proceeds normally (the `NormalSplicing` candidate's
`coding_effect`). Unlike the legacy `ExonicSpliceSite.alternate_effect`,
it works for intronic splice disruptions — returning `None` when
the nucleotide change leaves the protein untouched (i.e. there is
no coding consequence to attach). Sits alongside `most_likely_effect`
and `candidates` as the three-accessor surface on `SpliceOutcomeSet`
([#391](https://github.com/openvax/varcode/issues/391)).

**Breaking**
- `SpliceOutcomeSet` is now always-on for splice-disrupting variants
([#391](https://github.com/openvax/varcode/issues/391)). Every variant
that lands in the canonical splice window — `SpliceDonor`,
`SpliceAcceptor`, `ExonicSpliceSite`, `IntronicSpliceSite` — is
wrapped in a `SpliceOutcomeSet` carrying the candidate mechanisms.
Specifically:
- The `splice_outcomes=True` flag on `Variant.effects()` /
`VariantCollection.effects()` / `predict_variant_effects()` is
**removed**. Callers passing it explicitly get a `TypeError`.
Migration: drop the keyword — wrapping is unconditional.
- `Variant.effect_on_transcript(transcript)` and the
`FastEffectAnnotator` / `ProteinDiffEffectAnnotator` per-transcript
paths return a `SpliceOutcomeSet` for splice-disrupting variants
instead of the raw `ExonicSpliceSite` / `SpliceDonor` / etc. class.
Migration: replace `isinstance(effect, ExonicSpliceSite)` with
`isinstance(effect, SpliceOutcomeSet) and effect.disrupted_signal_class is ExonicSpliceSite`.
`effect.alternate_effect` still works as a back-compat alias for
`effect.effect_if_splicing_unchanged`, so attribute access keeps
working through the wrapper.
- `SpliceOutcomeSet.modifies_protein_sequence` is hardcoded to
`True` (a splice disruption is always *potentially* protein-
modifying via a non-`NormalSplicing` candidate). Closes a long-
standing filter bug where `drop_silent_and_noncoding()` silently
dropped exonic-splice-site variants whose `NormalSplicing.coding_effect`
happened to be `Silent`.
- Candidate construction is lazy: only the cheap `NormalSplicing`
candidate is built eagerly; `ExonSkipping`, `IntronRetention`,
`CrypticDonor` / `CrypticAcceptor` materialise on first
`.candidates` access. Pipelines that filter on
`modifies_protein_sequence` / `effect_priority` and never read
`.candidates` pay only the eager cost.
- `SpliceOutcomeSet` is now a `TranscriptMutationEffect` subclass
(alongside `MultiOutcomeEffect`), so it carries `gene` /
`transcript` and matches the standard `isinstance(effect, TranscriptMutationEffect)`
filter used by downstream consumers.
- Unified the multi-outcome machinery: `SpliceCandidate` deleted;
`MultiOutcomeEffect.outcomes` accessor + `_with_extra_outcomes`
helper + `_extra_outcomes` slot removed
31 changes: 17 additions & 14 deletions README.md
@@ -170,30 +170,33 @@ and

### Splice-site disruption — *where* the signal was hit

DNA-level locations: these effects say a variant landed on or near a
splice signal, but **don't themselves carry a protein consequence** —
they say nothing about how the spliceosome responds (see the next
table for that). All four share the
DNA-level locations: these classes identify *where* a variant landed
in the canonical splice window. They no longer appear as top-level
effects — every splice-disrupting variant is wrapped in a
`SpliceOutcomeSet` (varcode 6.0+), and these classes survive as the
wrapper's `disrupted_signal_class` (a type) and as the
`splice_signal` reference on each candidate mechanism (an instance).
All four share the
[`SpliceSite`](https://github.com/openvax/varcode/blob/main/varcode/effects/effect_classes.py#:~:text=class%20SpliceSite%28)
base, so `from varcode import SpliceSite; isinstance(effect, SpliceSite)`
matches any of them. (The four leaf classes are exported from the
package root too.)
base.

| Effect type | Description |
| --- | --- |
| [`SpliceDonor`](https://github.com/openvax/varcode/blob/main/varcode/effects/effect_classes.py#:~:text=class%20SpliceDonor%28) | Mutation in the first two nucleotides of an intron, likely to affect splicing. |
| [`SpliceAcceptor`](https://github.com/openvax/varcode/blob/main/varcode/effects/effect_classes.py#:~:text=class%20SpliceAcceptor%28) | Mutation in the last two nucleotides of an intron, likely to affect splicing. |
| [`IntronicSpliceSite`](https://github.com/openvax/varcode/blob/main/varcode/effects/effect_classes.py#:~:text=class%20IntronicSpliceSite%28) | Mutation near the beginning or end of an intron but less likely to affect splicing than donor/acceptor mutations. |
| [`ExonicSpliceSite`](https://github.com/openvax/varcode/blob/main/varcode/effects/effect_classes.py#:~:text=class%20ExonicSpliceSite%28) | Mutation at the beginning or end of an exon, may affect splicing; itself a `MultiOutcomeEffect` wrapping the alternate exonic coding effect alongside the splice candidates. |
| [`SpliceDonor`](https://github.com/openvax/varcode/blob/main/varcode/effects/effect_classes.py#:~:text=class%20SpliceDonor%28) | Mutation at canonical donor `GT` (intronic +1/+2). |
| [`SpliceAcceptor`](https://github.com/openvax/varcode/blob/main/varcode/effects/effect_classes.py#:~:text=class%20SpliceAcceptor%28) | Mutation at canonical acceptor `AG` (intronic -2/-1). |
| [`IntronicSpliceSite`](https://github.com/openvax/varcode/blob/main/varcode/effects/effect_classes.py#:~:text=class%20IntronicSpliceSite%28) | Other intronic positions in the splice window (+3..+6 donor, -3 acceptor; also +1/+2 or -1/-2 when the reference signal isn't canonical). |
| [`ExonicSpliceSite`](https://github.com/openvax/varcode/blob/main/varcode/effects/effect_classes.py#:~:text=class%20ExonicSpliceSite%28) | Last 3 bases of an exon (donor side) or first base of an exon (acceptor side); changes a codon *and* disrupts the splice signal. Carries `alternate_effect` (the coding consequence if splicing proceeds). |

### Splice mechanism — *what the spliceosome does* in response

**These are the splice effects that carry a protein consequence.** The
protein-level outcome of a splice-signal hit is not deterministic from
DNA alone, so (when you opt in with `splice_outcomes=True`) varcode
emits these as candidates inside a
DNA alone, so varcode wraps every splice-signal disruption in a
[`SpliceOutcomeSet`](https://github.com/openvax/varcode/blob/main/varcode/splice_outcomes.py#:~:text=class%20SpliceOutcomeSet%28)
(a `MultiOutcomeEffect`). Each mechanism carries the originating
(a `MultiOutcomeEffect`) carrying these mechanisms as candidates.
Wrapping is always-on as of varcode 6.0 and lazy — only the cheap
`NormalSplicing` candidate is built eagerly; the rest materialise on
`.candidates` access. Each mechanism carries the originating
disruption on its `.splice_signal` attribute (a `SpliceSite`
*instance*), so you can always recover *where* the hit was off any
mechanism. The set also records the disruption's *class* on
245 changes: 185 additions & 60 deletions docs/effect_annotation.md
@@ -47,57 +47,124 @@ an `EffectCollection`. Each element is a `MutationEffect`
subclass — `Substitution`, `Silent`, `PrematureStop`, and so
on.

## Splice-disrupting variants: two representations
## Splice-disrupting variants

When a variant sits in the canonical splice window (last 3
exonic bases, first 3–6 intronic, canonical donor/acceptor),
varcode recognizes it as splice-disrupting. Two ways the
effect is expressed, at different richness levels.
A single nucleotide change near an exon-intron boundary can hit
the splice signal *and* the coding sequence at the same time.
The splice surface captures both possibilities, gives every
splice-disrupting variant a uniform candidate-set shape, and
exposes accessors for the "what if splicing still proceeds?"
question.

### Default: lightweight 2-outcome form
### When splice disruption is in play

```python
variant = Variant("17", 43082575 - 5, "C", "T", "GRCh38")
effect = variant.effect_on_transcript(transcript)
# ExonicSpliceSite(...)
# .alternate_effect -> Substitution(...) # if splicing proceeds
```
The classifier is **position-based**: it fires when a variant
lands in the canonical splice window around an exon-intron
boundary. The window is asymmetric — the donor consensus
(`MAG|GURAGU`) is wider on both sides than the acceptor
consensus (`YAG|R`):

`ExonicSpliceSite` carries `alternate_effect`: the coding
consequence that applies *if splicing still works*. Exactly
two outcomes, represented as a primary effect + one
alternate field. Cheap. Ships unconditionally.
- **exonic side**: the last 3 bases of an exon (donor side) or
the first base of the next exon (acceptor side)
- **intronic side**: positions +1..+6 of the intron (donor side)
and positions -3..-1 (acceptor side), including the canonical
`GT` at +1/+2 and `AG` at -2/-1

`SpliceDonor`, `SpliceAcceptor`, and `IntronicSpliceSite`
don't expose `alternate_effect` today because the variant
is intronic — there's no coding consequence to attach.
Four classes record *where* in this window the variant landed:

### Opt-in: full possibility set
| Class | Position |
|---|---|
| `ExonicSpliceSite` | Last 3 bases of an exon (donor side) or the first base of the next exon (acceptor side) |
| `SpliceDonor` | Canonical `GT` at intronic +1 / +2 |
| `SpliceAcceptor` | Canonical `AG` at intronic -2 / -1 |
| `IntronicSpliceSite` | Intronic +3..+6 (donor side) or -3 (acceptor side); also `+1/+2` or `-1/-2` when the reference base isn't the canonical `GT` / `AG` |

Variants outside this window are **not** flagged as
splice-disrupting, even when they may affect splicing
biologically — ESE/ESS motifs mid-exon, branch points ~20–50 bp
upstream of the acceptor, deep intronic cryptic activation.
Detecting those requires ML predictors or direct RNA evidence;
see [Limitations](#limitations).
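
The position rules in the table above can be sketched as a small standalone function. This is an illustration only, not varcode's implementation: the function name, the `side` / `offset` coordinate convention, and the `ref_dinucleotide` argument are all hypothetical, and the class names are returned as plain strings so the sketch runs without varcode installed.

```python
from typing import Optional


def classify_splice_position(
    side: str, offset: int, ref_dinucleotide: str = ""
) -> Optional[str]:
    """Name the splice-site class for a position, or None if outside the window.

    Hypothetical convention (not varcode's API):
      side == "donor":    offsets -3..-1 are the last exonic bases,
                          offsets +1..+6 are the first intronic bases.
      side == "acceptor": offsets -3..-1 are the last intronic bases,
                          offset +1 is the first exonic base.
    ref_dinucleotide is the reference sequence at the canonical two
    intronic positions (expected "GT" for donors, "AG" for acceptors).
    """
    ref = ref_dinucleotide.upper()
    if side == "donor":
        if -3 <= offset <= -1:
            return "ExonicSpliceSite"
        if offset in (1, 2):
            # Canonical GT hit; a non-canonical reference falls back
            # to the weaker intronic class.
            return "SpliceDonor" if ref == "GT" else "IntronicSpliceSite"
        if 3 <= offset <= 6:
            return "IntronicSpliceSite"
    elif side == "acceptor":
        if offset == 1:
            return "ExonicSpliceSite"
        if offset in (-2, -1):
            return "SpliceAcceptor" if ref == "AG" else "IntronicSpliceSite"
        if offset == -3:
            return "IntronicSpliceSite"
    else:
        raise ValueError("side must be 'donor' or 'acceptor'")
    # Anything else (mid-exon, branch point, deep intronic) is not flagged.
    return None
```

Under these assumptions, a donor-side offset of `+7` or an acceptor-side offset of `-20` returns `None`, which mirrors the statement that branch points and deep intronic positions are not flagged.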

### Splice and coding effects can co-occur

A variant in an exon sits on a coding base by definition — it
rewrites a codon. If that same exonic base is **also** in the
splice window (the exonic positions in the table above), the
same nucleotide change disrupts the splice signal *and* changes
the protein. varcode represents this duality as
**`ExonicSpliceSite`**:

- on the default 2-outcome shape, splice disruption is the
primary effect; the coding consequence (a `Substitution`,
`Silent`, etc.) hangs off `.alternate_effect`
- on the opt-in `SpliceOutcomeSet` shape, the same coding
consequence is the `coding_effect` of the `NormalSplicing`
candidate, reachable through
`splice_set.effect_if_splicing_unchanged`

For purely **intronic** disruptions (`SpliceDonor`,
`SpliceAcceptor`, `IntronicSpliceSite`), there is no codon to
rewrite — the variant doesn't change a coding base. The default
shape doesn't expose `alternate_effect` on these classes; the
opt-in shape's `effect_if_splicing_unchanged` returns `None`.

For coding variants **outside** the splice window, varcode emits
a plain coding effect (`Substitution`, `Silent`, `FrameShift`,
…) with no splice annotation attached. The variant may still
disrupt splicing through a non-canonical mechanism, but varcode
won't flag it — see Limitations.
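
A rough sketch of the accessor semantics described here, using stand-in dataclasses rather than varcode's real classes. `CodingEffect`, `SpliceSetSketch`, and the example values are hypothetical, and the real `candidates` tuple holds `EffectCandidate` wrappers (with the mechanism on `.effect`), which this sketch flattens for brevity.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CodingEffect:
    """Stand-in for Substitution / Silent / PrematureStop / ..."""
    description: str


@dataclass(frozen=True)
class NormalSplicing:
    # None when the nucleotide change does not touch a coding base.
    coding_effect: Optional[CodingEffect]


@dataclass(frozen=True)
class ExonSkipping:
    affected_exon: str


@dataclass(frozen=True)
class SpliceSetSketch:
    disrupted_signal: str
    candidates: Tuple[object, ...]

    @property
    def effect_if_splicing_unchanged(self) -> Optional[CodingEffect]:
        # The coding consequence lives on the NormalSplicing candidate.
        for effect in self.candidates:
            if isinstance(effect, NormalSplicing):
                return effect.coding_effect
        return None


# Exonic hit: the same base change also rewrites a codon.
exonic = SpliceSetSketch(
    "ExonicSpliceSite",
    (NormalSplicing(CodingEffect("hypothetical substitution")), ExonSkipping("exon 5")),
)
# Intronic hit: no coding base changed, so nothing to report.
intronic = SpliceSetSketch(
    "SpliceDonor",
    (NormalSplicing(None), ExonSkipping("exon 5")),
)
```

With these stand-ins, `exonic.effect_if_splicing_unchanged` yields the coding effect and `intronic.effect_if_splicing_unchanged` yields `None`, matching the exonic versus intronic distinction drawn above.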

### The `SpliceOutcomeSet` shape

Every splice-disrupting variant emits a `SpliceOutcomeSet` — there
is no "bare splice class" path at the user-facing API as of
varcode 6.0.

```python
effects = variant.effects(splice_outcomes=True)
# SpliceOutcomeSet(...) replaces the splice effect.
# .candidates is a tuple[EffectCandidate, ...], in producer order.
variant = Variant("17", 43082575 - 5, "C", "T", "GRCh38")
splice_set = variant.effect_on_transcript(transcript)
# SpliceOutcomeSet(disrupted_signal_class=ExonicSpliceSite, ...)
# .candidates is a tuple[EffectCandidate, ...] in producer order.
# Each candidate's .effect is a SpliceMechanismEffect subclass:
# EffectCandidate(effect=NormalSplicing(coding_effect=Substitution(...)))
# EffectCandidate(effect=ExonSkipping(affected_exon=..., in_frame=True,
# aa_ref="KGYK...", ...))
# EffectCandidate(effect=IntronRetention(retained_intron_start=...,
# side="donor", ...))
# EffectCandidate(effect=CrypticDonor(affected_exon=..., ...))
# EffectCandidate(effect=NormalSplicing(coding_effect=Substitution(...)))
```

`SpliceOutcomeSet` replaces the splice effect with a set of
candidate mechanisms. Class identity = mechanism — `NormalSplicing`,
`ExonSkipping`, `IntronRetention`, `CrypticDonor`, `CrypticAcceptor`.
Each is a `SpliceMechanismEffect` subclass that carries its own
protein vocab on the instance (`aa_ref`, `aa_alt`,
`mutant_protein_sequence`, `mutant_transcript`); these are `None`
when the protein math couldn't resolve (e.g. intron retention
without a genomic-sequence provider), populated otherwise. Each
mechanism also carries `splice_signal` — the underlying
`SpliceDonor` / `SpliceAcceptor` / `IntronicSpliceSite` /
`ExonicSpliceSite` effect describing *where* the disruption was.
`SpliceOutcomeSet` carries:

- `disrupted_signal_class` — the `SpliceSite` subclass (`SpliceDonor`,
`SpliceAcceptor`, `ExonicSpliceSite`, or `IntronicSpliceSite`)
identifying where in the splice window the variant landed
- `candidates` — a tuple of `EffectCandidate` objects in producer
order, one per plausible mechanism
- `effect_if_splicing_unchanged` — the coding consequence that
applies if the spliceosome still splices normally (the
`NormalSplicing` candidate's `coding_effect`), or `None` for
purely intronic disruptions where the nucleotide change doesn't
touch a coding base. Also exposed as `alternate_effect` for
back-compat with code that read `ExonicSpliceSite.alternate_effect`

Each candidate's `.effect` is a `SpliceMechanismEffect` subclass
that carries its own protein vocab on the instance (`aa_ref`,
`aa_alt`, `mutant_protein_sequence`, `mutant_transcript`). Fields
are `None` when the protein math couldn't resolve (e.g. intron
retention without a `genomic_sequence` provider), populated
otherwise. Each mechanism also exposes `splice_signal` — the
underlying raw `SpliceDonor` / `SpliceAcceptor` /
`IntronicSpliceSite` / `ExonicSpliceSite` effect describing *where*
the disruption was.

**Lazy construction.** Only the cheap `NormalSplicing` candidate
is built eagerly when the set is constructed; `ExonSkipping`,
`IntronRetention`, and `CrypticDonor`/`CrypticAcceptor` materialise
on first `.candidates` access and are cached. Filter pipelines
that drop variants early via `modifies_protein_sequence` /
`effect_priority` never trigger the expensive candidates.
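
One way to picture the lazy behaviour is with `functools.cached_property`. This is a sketch under assumptions, not the actual varcode code: `LazySpliceSetSketch`, `build_expensive`, and the string placeholders for candidates are hypothetical names introduced for illustration.

```python
from functools import cached_property


class LazySpliceSetSketch:
    # Mirrors the hardcoded value described in the changelog.
    modifies_protein_sequence = True

    def __init__(self, normal_splicing, build_expensive):
        # The cheap candidate is supplied eagerly at construction time.
        self.normal_splicing = normal_splicing
        # Callable producing the remaining candidates; not invoked yet.
        self._build_expensive = build_expensive

    @cached_property
    def candidates(self):
        # Runs once, on first access; the tuple is cached on the instance.
        return (self.normal_splicing, *self._build_expensive())


build_calls = []


def build_expensive():
    build_calls.append(1)  # record each invocation of the costly path
    return ("ExonSkipping", "IntronRetention", "CrypticDonor")


splice_set = LazySpliceSetSketch("NormalSplicing", build_expensive)

# A filter that only reads modifies_protein_sequence never builds candidates.
keep = splice_set.modifies_protein_sequence
calls_before_access = len(build_calls)

first = splice_set.candidates   # triggers the expensive build
second = splice_set.candidates  # served from the cache
```

In this sketch the expensive builder runs only when `.candidates` is first read, and repeated reads return the same cached tuple.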

Downstream consumers dispatch by class:

@@ -109,27 +176,87 @@ for c in splice_set.candidates:
print(c.effect.side, c.effect.retained_intron_start)
```

When you opt in, `SpliceDonor` / `SpliceAcceptor` /
`IntronicSpliceSite` also get wrapped, so every splice-
disrupting variant produces a `SpliceOutcomeSet`.
### Common questions

### Relationship between the two
A cheat sheet for the simple splice use cases. `splice_set` is a
`SpliceOutcomeSet` (every splice-disrupting variant produces one).

| # candidates | Class |
|---|---|
| 1 | plain `Substitution` / `Silent` / etc. — not wrapped |
| 2 | `ExonicSpliceSite` with `alternate_effect` |
| N | `SpliceOutcomeSet` (opt-in via `splice_outcomes=True`) |

Both `ExonicSpliceSite` and `SpliceOutcomeSet` are `MultiOutcomeEffect`
subclasses, so consumers iterate `.candidates` (a tuple of
`EffectCandidate` objects) uniformly without caring about which form
they're holding. `alternate_effect` works on both: on
`ExonicSpliceSite` it's the splicing-proceeds outcome directly; on
`SpliceOutcomeSet` it resolves to the inner effect of the
`NormalSplicing` candidate (or `None` when that candidate is just
a placeholder). `candidate.effect.short_description` is uniform
across both forms.
**Is this variant splice-disrupting?**

```python
from varcode import MultiOutcomeEffect, SpliceOutcomeSet

# Splice-specific check:
isinstance(effect, SpliceOutcomeSet)

# Or by disrupted signal class:
isinstance(effect, SpliceOutcomeSet) and effect.disrupted_signal_class is SpliceDonor

# Broader: any multi-outcome effect, including SV outcomes
# (LargeDeletion, GeneFusion, ...) — use when you want one
# uniform handler for splice + SV ambiguity.
isinstance(effect, MultiOutcomeEffect)
```

**What coding consequence applies if splicing still proceeds?**

```python
coding = splice_set.effect_if_splicing_unchanged # canonical
coding = splice_set.alternate_effect # back-compat alias

# Either returns the NormalSplicing candidate's coding_effect (a
# Substitution / Silent / PrematureStop / ...), or None for purely
# intronic disruptions where the variant doesn't change a coding base.
```

**What's the most likely splice mechanism?**

```python
splice_set.most_likely_effect # SpliceMechanismEffect
splice_set.most_likely_candidate # EffectCandidate (.effect + .source/.evidence)
```

**What are all candidate outcomes?**

```python
for candidate in splice_set.candidates:
candidate.effect # SpliceMechanismEffect (ExonSkipping, IntronRetention, ...)
candidate.source # producer name
candidate.evidence # opaque dict of provenance fields
```

**Which outcome is the most disruptive?**

```python
splice_set.highest_priority_effect # most protein-disruptive
splice_set.highest_priority_candidate
```

Use this for clinical / functional filtering ("flag if any
candidate is at least a frameshift") — a disruptive candidate
ranked below a less-disruptive primary should still light up.
See [Picking a single candidate](#picking-a-single-candidate)
for the "most likely" vs "most disruptive" distinction.

**What protein sequences could result?**

```python
splice_set.candidate_proteins # {ExonSkipping: "MA...", IntronRetention: "", ...}
splice_set.mutant_protein_sequences # set[str] of distinct non-empty sequences
```

Empty string means the mechanism's protein math couldn't resolve
(typically: no `genomic_sequence` provider, so `IntronRetention`
and `CrypticDonor` stay predicted-only).

**Where on the transcript is the splice signal?**

```python
for candidate in splice_set.candidates:
candidate.effect.splice_signal # SpliceDonor / SpliceAcceptor / IntronicSpliceSite / ExonicSpliceSite
```

### RNA evidence reconciliation

With RNA evidence, splice sets are reconciled rather than merely
extended. `SpliceOutcomeSet.with_rna_evidence(...)` returns a new set
@@ -173,12 +300,10 @@ primary candidate should still light up.

### Limitations

The splice classifier is **position-based** — it fires on the
canonical window (last 3 exonic, first 3-6 intronic, donor/acceptor)
and nothing else. Sequence-based signals are not flagged: exonic
splicing enhancer/silencer disruption mid-exon (~6-10nt SR-protein
motifs), branch points (~20-50nt upstream of the acceptor), deep
intronic cryptic sites. Detecting these needs ML predictors (SpliceAI,
Sequence-based splice signals are not flagged: exonic splicing
enhancer/silencer disruption mid-exon (~6-10nt SR-protein motifs),
branch points (~20-50nt upstream of the acceptor), deep intronic
cryptic sites. Detecting these needs ML predictors (SpliceAI,
Pangolin, MMSplice, SpliceTransformer) or direct RNA evidence;
tracked in [#297][i297].
