
[codex] Add pepsickle subprocess isolation#201

Merged
iskandr merged 2 commits into master from pepsickle-subprocess-isolation on May 4, 2026

@iskandr iskandr commented May 3, 2026

Contributor

Summary

Adds opt-in subprocess isolation for mhctools.Pepsickle via Pepsickle(isolate_subprocess=True). The isolated path sends the unique input sequences, plus the existing human_only and threshold options, to a short-lived Python subprocess over JSON stdin/stdout, so pepsickle can load torch without sharing the parent process's already-loaded OpenMP runtimes.

The processing predictor base now has a cleavage_probs_many hook, used by predict, predict_proteins, and predict_cleavage_sites, so isolated pepsickle calls batch unique sequences into one subprocess invocation while preserving the existing cleavage_probs API.
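
The JSON stdin/stdout round trip and the batching of unique sequences can be sketched as follows. This is a minimal illustration, not the mhctools implementation: ``run_isolated``, ``CHILD_SOURCE``, and the payload keys are hypothetical names, and the child returns a dummy value per residue as a stand-in for real pepsickle output.

```python
import json
import subprocess
import sys

# Hypothetical child program: reads one JSON request from stdin and writes one
# JSON response to stdout. A real child would import pepsickle / torch here.
CHILD_SOURCE = r"""
import json, sys
request = json.load(sys.stdin)
# Stand-in scoring: one dummy probability per residue.
result = {seq: [request["threshold"]] * len(seq) for seq in request["sequences"]}
json.dump(result, sys.stdout)
"""


def run_isolated(sequences, human_only=False, threshold=0.5):
    """Score sequences in one short-lived subprocess (illustrative only)."""
    unique = sorted(set(sequences))  # each distinct sequence is sent once
    payload = {"sequences": unique, "human_only": human_only, "threshold": threshold}
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        check=True,  # raise if the child exits non-zero
    )
    by_sequence = json.loads(completed.stdout)
    # Fan results back out in the caller's order, duplicates included.
    return [by_sequence[seq] for seq in sequences]
```

Because the parent only exchanges JSON text with the child, nothing the child imports is ever loaded into the parent process.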

Also bumps mhctools.__version__ from 3.13.2 to 3.13.3.

Fixes #200.

Validation

  • ./lint.sh
  • ./test.sh (397 passed, 9 skipped, 2 xfailed)

@iskandr iskandr marked this pull request as ready for review May 4, 2026 13:02
@iskandr iskandr merged commit 7823207 into master May 4, 2026
4 checks passed
@iskandr iskandr deleted the pepsickle-subprocess-isolation branch May 4, 2026 13:03
iskandr added a commit to openvax/vaxrank that referenced this pull request May 4, 2026
…tion

mhctools 3.13.3 (openvax/mhctools#201) ships built-in
``Pepsickle(isolate_subprocess=True)`` so the libomp-clash workaround
collapses from a vaxrank-side launcher + bespoke subprocess module
to a single kwarg.

Removed:
- ``vaxrank/_pepsickle_subprocess.py`` (the pure-mhctools subprocess
  body lifted to ``openvax/mhctools/_pepsickle_subprocess.py``).
- ``processing._cleavage_probs_via_subprocess`` and
  ``_SUBPROCESS_MODULE`` (was launching the deleted module).
- The two unit tests that exercised the local module's I/O contract.

Added:
- ``processing._load_default_predictor(human_only, threshold)`` —
  thin builder that constructs ``Pepsickle(isolate_subprocess=True)``
  and degrades to None with a logged warning if mhctools isn't
  importable. ``annotate_processing`` now resolves a default
  predictor through this helper and calls ``cleavage_probs`` directly
  on it (same loop body that the test-seam ``predictor=`` arg has
  always used).
- ``test_load_default_predictor_returns_none_when_mhctools_missing``
  pins the graceful-degrade path.

Bumped:
- ``requirements.txt``: mhctools >=3.13.3 (was >=3.13.1).

End-to-end on Pt02 LENS: 2105/2113 EpitopePredictions annotated
across 489 unique sources — same numbers as before the refactor,
confirming the subprocess invocation moved upstream without changing
results.

Test count: 674 (was 675; net −1 from removing the two
local-subprocess tests + adding one default-predictor smoke).
iskandr added a commit to openvax/vaxrank that referenced this pull request May 5, 2026
…otation (#262)

* Pepsickle credibility tagging (#249) + LENS field-format fixes (#259, #260)

#249 — Pepsickle credibility tagging
------------------------------------
mhcflurry-presentation already includes a flank-aware processing prior,
so a per-k-mer "presentation score" implicitly captures whether the
ligand would be cleaved out and presented. What it can't tell us:
whether the proteasome would cut INSIDE the ligand (destroying it
before MHC) or whether the C-terminal cut at the ligand's boundary is
clean. Pepsickle (Weeder et al., Bioinformatics 2021) gives
per-position cleavage probabilities — we use that as a credibility
filter on existing MHC ligand predictions, not as a separate score
axis.

Each EpitopePrediction gains three optional fields, populated when
``--processing-aware-annotation`` is enabled (default: on):
  - c_term_cleavage_prob: prob of clean release at the C-terminus
  - max_internal_cut_prob: peak cleavage prob strictly inside the
    ligand (high → ligand likely destroyed before MHC)
  - processing_score: composite c_term × (1 − max_internal); 1.0 ideal
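
A sketch of how the three fields relate, under an assumed indexing convention that the source does not specify: ``probs[i]`` is taken to be the cleavage probability immediately after residue ``i`` of the source sequence. ``processing_fields`` is a hypothetical helper, not the vaxrank function.

```python
def processing_fields(probs, offset, length):
    """Return (c_term, max_internal, score) for a ligand inside a source.

    Assumed convention: probs[i] is the cleavage probability immediately
    after residue i of the source sequence.
    """
    end = offset + length
    if offset < 0 or length < 1 or end > len(probs):
        return None, None, None  # ligand does not fit inside the source
    c_term = probs[end - 1]            # cut right after the last ligand residue
    internal = probs[offset:end - 1]   # cuts strictly inside the ligand
    max_internal = max(internal) if internal else 0.0
    score = c_term * (1.0 - max_internal)
    return c_term, max_internal, score
```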

Annotation is purely additive — never alters ranking. Surfaced in
HTML / PDF / ASCII per-epitope tables as three extra columns when
populated; absent reports keep their original 6-column shape.

New module: vaxrank/processing.py
  - annotate_processing(predictions, predictor=None) — runs pepsickle
    once per unique source_sequence (not once per peptide); the
    predictor is loaded lazily and degrades gracefully if unavailable
  - _per_position_processing / _composite_processing_score helpers

CLI flag: --processing-aware-annotation / --no-processing-aware-annotation
(default ON).

#259 + #260 — LENS field-format bug fixes
------------------------------------------
Surfaced while running on a real LENS v1.9-dev file (Hugo IPRES Pt10):

#259: variant_coords parser only handled 4-part chr:pos:ref:alt; real
LENS files emit 2-part chr:pos (no ref/alt), NaN (non-SNV antigen
rows: splice / fusion / intron retention), and the literal string
"nan". Pre-fix: 3000+ warnings, 0 ranked entries. Post-fix: parses
all real-LENS forms, NaN rows skipped silently (info-level summary
instead of per-row warning), unparseable rows aggregated into one
warning. Placeholder ref/alt nucleotides ('A'/'C') used when LENS
omits them — varcode rejects 'N' and empty strings.

#260: write_neoepitope_report raised ValueError on duplicate
(peptide, allele) rows. Real LENS files emit duplicates (same
neoepitope from multiple transcripts / homologous regions). The
score-merge already broadcasts the same score across duplicate rows
correctly; the strict-uniqueness check was overly strict. Replaced
with an info-level note documenting the duplicate count.

Tests (+19 new)
---------------
- test_processing.py: 14 tests covering helpers, annotation, ranking
  invariance, multi-source one-pass-per-source, off-by-one offset
  re-location, predictor failure degradation, and CLI flag default.
- test_external_input.py: 5 new tests covering the relaxed
  variant_coords parser (2-part, 3-part, NaN, 'N' rejection) and a
  regression test against the real-LENS-v1.9 fixture.
- test_epitope_io.py: replaced the strict-duplicates rejection test
  with one that pins score-broadcasting across duplicate rows.

Real-world end-to-end: a real Hugo IPRES Pt10 LENS file now produces
both --output-csv and --output-neoepitope-report XLSX without errors
(was crashing pre-fix).

Bumps 2.15.0 → 2.16.0.

Subsumes #259, #260.

* Review fixes: HTML header drift, placeholder-genotype provenance,
closest-occurrence re-location, dedup contract, pepsickle CLI passthrough

#1 (correctness) HTML/PDF/ASCII per-epitope table header drift
--------------------------------------------------------------
Pre-fix: _epitope_data emitted a 6-key dict for unannotated predictions
and a 9-key dict for annotated ones. The HTML template builds the
header from p.epitopes[0].keys() — if the first epitope happened to be
unannotated and a later one was annotated, the header had 6 columns
but rows had 9 → malformed table. This was reachable any time pepsickle
annotation succeeded for some sources but failed for others (graceful
per-source skip is a documented behavior).

Fix: _epitope_data takes an ``include_processing`` parameter; the
caller computes ``any(c_term is not None for p in mutant_predictions)``
per VaccinePeptide and passes the result. When True, every row gets
the 9-column shape (with '—' fallback for unannotated predictions),
matching the header. When False, the legacy 6-column shape is
returned unchanged so unannotated reports don't change.
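
The fix can be pictured with the sketch below. The six base column names are placeholders (the source does not list them), predictions are modeled as plain dicts, and ``epitope_row`` / ``rows_for_peptide`` are hypothetical stand-ins for ``_epitope_data`` and its caller.

```python
# Placeholder names for the legacy 6 columns; the real ones are not in the source.
BASE_KEYS = ("sequence", "allele", "ic50", "percentile", "start", "end")
PROCESSING_KEYS = (
    "c_term_cleavage_prob",
    "max_internal_cut_prob",
    "processing_score",
)
FALLBACK = "\u2014"  # dash shown for unannotated predictions


def epitope_row(pred, include_processing):
    """Build one table row; the key set depends only on include_processing."""
    row = {key: pred[key] for key in BASE_KEYS}
    if include_processing:
        for key in PROCESSING_KEYS:
            value = pred.get(key)
            row[key] = FALLBACK if value is None else value
    return row


def rows_for_peptide(predictions):
    """Decide the shape once per peptide so header and rows always agree."""
    include = any(p.get("c_term_cleavage_prob") is not None for p in predictions)
    return [epitope_row(p, include) for p in predictions]
```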

#2 (provenance) ref_alt_fallback placeholder genotype
-----------------------------------------------------
_parse_variant_coords now returns ``(Variant, alleles_real)`` so
callers can detect when the genotype is fictional (placeholder
ref/alt because LENS only emitted chr:pos). When alleles_real is
False the synthesized Variant must NOT be fed to varcode-effect
annotation — its genotype isn't real biology. Documented in the
docstring + an info-level summary log line counts how many such
placeholders were used per LENS run. Construct assembly only uses
(chr, pos) and is unaffected.

#3 (correctness) Re-location picks the occurrence closest to the
declared offset, not the first match
-------------------------------------------------------------------
When the declared offset of a peptide doesn't match the source at
that position, processing.py re-locates by substring search. Pre-fix
this used source.find() — always returning the first occurrence even
if the declared offset pointed at a later match (e.g. homopolymer
tracts with repeated short ligands). New _closest_occurrence helper
scans all matches and picks the one with smallest |pos - declared|.
A drift > 3 positions emits a warning so upstream loaders with
broken offset accounting are flagged.
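
A minimal sketch of the closest-match scan described above. ``closest_occurrence`` here is an illustrative reimplementation, not the ``_closest_occurrence`` helper itself; resolving ties in favor of the earlier match is an assumption.

```python
def closest_occurrence(source, peptide, declared_offset):
    """Return the start of the match nearest declared_offset, or None."""
    if not peptide:
        return None
    best = None
    start = source.find(peptide)
    while start != -1:
        if best is None or abs(start - declared_offset) < abs(best - declared_offset):
            best = start
        # Advance by one so overlapping matches (homopolymer tracts) are seen.
        start = source.find(peptide, start + 1)
    return best
```

With ``source.find()`` alone, a declared offset of 8 in ``"AAAAGGGAAAA"`` would snap to position 0; the scan above returns 8.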

#4 (UX) CLI passthrough for pepsickle parameters
-------------------------------------------------
New flags: --pepsickle-human-only (use the human-only-trained model)
and --pepsickle-threshold N (cleavage probability threshold). Threaded
through annotate_processing → _load_default_predictor.

#6 (clarity) Dup-rows log message
----------------------------------
Was "from multi-source LENS / pVACseq input"; pipeline-mode duplicates
(if a future loader bug introduces them) would be misattributed. Now
"from multi-source input."

#7 (debuggability) _load_default_predictor warns + debug-traceback
-------------------------------------------------------------------
Pepsickle import / instantiation failures previously logged a single
warning line. Now also logger.debug(exc_info=True) so genuine bugs
(bad CUDA libs, torch version mismatch) surface at DEBUG without
spamming WARNING.
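
The warn-plus-debug-traceback pattern looks roughly like this. ``load_optional_predictor`` is a hypothetical generic helper written for illustration; it is not ``_load_default_predictor``.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def load_optional_predictor(module_name, class_name, **kwargs):
    """Import and instantiate a predictor, or return None if that fails."""
    try:
        module = importlib.import_module(module_name)
        return getattr(module, class_name)(**kwargs)
    except Exception as exc:  # import errors, bad native libs, bad kwargs
        # One short line at WARNING so normal runs stay readable.
        logger.warning("%s.%s unavailable (%s)", module_name, class_name, exc)
        # Full traceback only at DEBUG, for diagnosing genuine bugs.
        logger.debug("predictor load failure", exc_info=True)
        return None
```

A caller would then write something like ``load_optional_predictor("mhctools", "Pepsickle")`` and skip annotation when the result is None.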

#8 (robustness) Dedup contract in _annotate_predictions_with_processing
------------------------------------------------------------------------
Pre-fix dedup keyed on id(p), which assumes the same EpitopePrediction
object is reachable from both the ranked-vaccine-peptides intermediate
and the LENS predictions list. That's true for the current external_input
loader but a future loader that copies predictions into VaccinePeptides
would break the assumption and double-annotate. Belt-and-suspenders:
also dedup on the content tuple (peptide_sequence, allele,
source_sequence, offset). Both checks are O(1).
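
A sketch of the two-key dedup, using a hypothetical ``Prediction`` dataclass in place of EpitopePrediction. Both membership checks are set lookups, hence O(1).

```python
from dataclasses import dataclass


@dataclass
class Prediction:  # hypothetical stand-in for EpitopePrediction
    peptide_sequence: str
    allele: str
    source_sequence: str
    offset: int


def unique_predictions(predictions):
    """Drop repeats by object identity and by content tuple."""
    seen_ids = set()
    seen_keys = set()
    unique = []
    for p in predictions:
        key = (p.peptide_sequence, p.allele, p.source_sequence, p.offset)
        if id(p) in seen_ids or key in seen_keys:
            continue  # same object, or a copy with identical content
        seen_ids.add(id(p))
        seen_keys.add(key)
        unique.append(p)
    return unique
```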

Tests (+5 new)
--------------
- test_epitope_data_header_consistent_when_some_predictions_unannotated:
  pins #1 — annotated and unannotated rows produce identical key sets
  when include_processing=True
- test_re_location_picks_closest_to_declared_offset: pins #3 —
  homopolymer source where declared offset points at the second
  occurrence; re-location must snap there, not to the first
- test_re_location_warns_on_large_offset_drift: pins #3's >3aa warning
- test_dedup_by_content_when_duplicate_objects: pins #8 — two distinct
  Python objects with the same (peptide, allele, source, offset)
  collapse to a single annotation pass
- test_pepsickle_cli_param_passthrough: pins #4 — --pepsickle-human-only
  / --pepsickle-threshold parse correctly

Plus updated tests for the (Variant, alleles_real) tuple return shape
of _parse_variant_coords (5 existing tests adapted; one new alleles_real
assertion per case).

667 tests pass; 93% coverage.

* Stop-codon truncation, placeholder-genotype propagation, parametrized real-LENS CI, more review fixes

LENS data quality (real-world Pt02 file from Hugo IPRES 2016)
---------------------------------------------------------------
Pt02 surfaced a third LENS bug: pep_context can contain '*' (stop
codon) for stop-loss / readthrough variants. Pre-fix this crashed
manufacturability scoring with KeyError: '*'.

Fix: new helpers in vaccine_library.py:
  - truncate_at_stop_codon(aa) — stop at first '*'; everything after
    doesn't exist as protein. Translation halts at the stop.
  - has_only_standard_amino_acids(aa) — guard against selenocysteine
    'U', pyrrolysine 'O', ambiguous 'X' / 'B' / 'Z' / 'J', etc.
external_input.ranked_from_lens_predictions now truncates pep_context
+ peptide at first stop and drops rows where the result has any
non-standard residue, with a warning.
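
The two helpers can be sketched as below. These are illustrative reimplementations, not the code in vaccine_library.py; treating an empty string as "not standard" is an assumption.

```python
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def truncate_at_stop_codon(aa):
    """Keep everything before the first '*'; unchanged if there is none."""
    return aa.split("*", 1)[0]


def has_only_standard_amino_acids(aa):
    """True when aa is non-empty and uses only the 20 standard residues."""
    return bool(aa) and set(aa) <= STANDARD_AMINO_ACIDS
```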

New fixture: tests/data/epitope_fixtures/real_lens_subsets/
  lens_v1.9_with_stop_codons.tsv (52 rows from a real Pt02 dump)
covers stop-codon, NaN variant_coords, multi-row-per-variant, and
2-part chr:pos shapes in one file.

Parametrized end-to-end CI (no more "shouldn't have to run manually")
----------------------------------------------------------------------
test_real_lens_fixture_runs_end_to_end_via_cli is parametrized over
every TSV in tests/data/epitope_fixtures/real_lens_subsets/. Each
fixture drives main() with --input-lens / --output-csv /
--output-neoepitope-report end-to-end. Future LENS shapes get
covered by dropping a fixture in that directory; the test grows
automatically. test_real_lens_fixtures_present pins the expected set.

Review fixes (continuing from PR review)
-----------------------------------------
#1 alleles_real propagation. Added MutantProteinFragment.placeholder_alleles
(default False). LENS path sets it True when pep_context came from a
2-part chr:pos row (no real ref/alt). Documented that downstream
varcode-effect annotation MUST detect this flag rather than silently
do the wrong thing on the placeholder genotype.

#2 pVACseq tuple consistency. _parse_pvacseq_id now returns
(Variant, alleles_real) matching the LENS-path shape. pVACseq IDs
always carry real ref/alt so alleles_real is True on success;
the symmetric return shape lets future code handle both paths
uniformly.

#3 Empty peptide_sequence dedup degeneration. _maybe_add now skips
content-key dedup for empty-peptide predictions (passes them
through; the annotation step skips them on its own).

#4 Drift threshold scales with source length: max(3, 5% × len).
Typical 25-aa SLP sources keep the absolute threshold; full-protein
sources allow more drift before warning.
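
As a worked instance of the scaled threshold (``drift_threshold`` is a hypothetical name, and whether the real code rounds the 5% term is not stated in the source):

```python
def drift_threshold(source_length):
    """Allowed re-location drift: at least 3, else 5% of the source length."""
    return max(3, source_length * 5 / 100)
```

A 25-aa source gives ``max(3, 1.25)``, so the absolute threshold of 3 still applies; a 400-aa source allows a drift of 20 before warning.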

#5 annotate_processing warns when caller passes a pre-built
predictor AND non-default human_only / threshold (the params are
ignored — caller already configured the predictor).

#7 Test refactor: monkeypatch fixture replaces manual try/finally
in test_dedup_by_content_when_duplicate_objects so cleanup is
automatic on test failure.

Documentation
-------------
README gains a Data Model section explaining VaccinePeptide /
MutantProteinFragment / EpitopePrediction structure, the
variant → multiple VPs → multiple constructs fan-out, and where
each per-VP report renders. Removes the "what's a VP?" reviewer
trip-up.

Tests (+10 new)
---------------
- 4× parametrized end-to-end real-LENS CLI (one per fixture)
- test_real_lens_fixtures_present (coverage gate)
- test_lens_pep_context_with_stop_codon_truncates
- test_truncate_at_stop_codon_helper
- test_has_only_standard_amino_acids_helper
- test_annotate_processing_warns_when_predictor_and_params_supplied
- test_drift_threshold_scales_with_source_length

677 tests pass; 93% coverage. Real Pt02 LENS file produces both
CSV + XLSX outputs end-to-end (was crashing pre-fix on stop codons).

* LENS: use real snv_*_allele / indel_*_allele columns instead of placeholder ref/alt

The previous fix substituted A/C placeholder nucleotides when LENS
variant_coords arrived as 2-part chr:pos — but LENS files actually
DO carry real ref/alt, just in dedicated per-source columns:

  SNV rows   →  snv_ref_allele, snv_alt_allele       ('C', '[T]')
  INDEL rows →  indel_ref_allele, indel_alt_allele   ('CA', '[C]')

The brackets are LENS's notation (multi-allelic-ready, though we've
seen only single-allele cases in real files). The dedicated columns
are populated for every SNV / INDEL row in the patient datasets we
have, so there's no need to invent placeholder genotypes. Inventing
a fictional A→C variant was a hack and would have caused varcode-
effect annotation downstream to do the wrong thing silently.

Architecture
------------
- ``_parse_variant_coords(coords)`` now extracts only ``(contig, pos)``
  — a focused chr:pos parser. NaN / empty / malformed → None.
- New ``_strip_lens_allele(value)`` strips the bracket notation
  (``'[T]'`` → ``'T'``).
- New ``_variant_from_lens_row(row, genome=None)`` is the row-aware
  helper: extracts (contig, pos) from variant_coords AND looks up the
  real ref/alt from snv_*_allele or indel_*_allele based on
  ``antigen_source``. Returns the real-genotype Variant or None.
- ``ranked_from_lens_predictions`` now calls ``_variant_from_lens_row``
  on the representative row of each variant_coords group; placeholder
  genotypes are never produced.
- ``MutantProteinFragment.placeholder_alleles`` is now always False
  on the LENS path. Field retained for hypothetical future loaders
  that genuinely lack alleles.
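
The two small parsers might look like the sketch below. These are illustrative reimplementations rather than the vaxrank functions; in particular, silently ignoring any fields after ``chr:pos`` is an assumption.

```python
import math


def strip_lens_allele(value):
    """'[T]' -> 'T'; None, NaN, 'nan' and empty values -> None."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    text = str(value).strip()
    if text.startswith("[") and text.endswith("]"):
        text = text[1:-1].strip()
    if not text or text.lower() == "nan":
        return None
    return text


def parse_variant_coords(coords):
    """'chr6:56459264' -> ('chr6', 56459264); anything malformed -> None."""
    if not isinstance(coords, str):
        return None  # covers float NaN and None
    parts = coords.strip().split(":")
    if len(parts) < 2:
        return None
    contig, pos = parts[0], parts[1]
    if not contig or not pos.isdigit():
        return None
    return contig, int(pos)
```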

Real Pt02 file (Hugo IPRES 2016) end-to-end
--------------------------------------------
Before: 0 ranked entries (parser couldn't fabricate alleles cleanly).
After:  409 ranked entries with REAL ref/alt:
  chr3:150742446 A>      SIAH2   (CA → C frameshift, varcode-normalized)
  chr17:29502684 T>      TAOK1
  chr6:56459264  G>A     DST     (missense SNV)

Test fixtures
-------------
Added snv_ref_allele / snv_alt_allele / indel_ref_allele /
indel_alt_allele columns to the toy LENS fixtures (lens_example.tsv,
lens_multi_row_per_variant.tsv, lens_v1.4_with_stability.tsv,
lens_v1.9_mhcflurry_only.tsv). The 4 real-LENS subset fixtures
already had these columns — no changes needed.

Tests (+5 new, 4 updated)
-------------------------
- test_parse_variant_coords_extracts_chr_pos: pins the new (chr, pos)-
  only return shape.
- test_parse_variant_coords_returns_none_on_missing_or_malformed:
  NaN / 'nan' / '' / garbage all return None.
- test_strip_lens_allele_handles_bracket_form: '[T]' / '[CA]' / 'T' /
  None / 'nan' edge cases.
- test_variant_from_lens_row_uses_real_snv_alleles: SNV path uses
  snv_ref_allele / snv_alt_allele.
- test_variant_from_lens_row_uses_real_indel_alleles: INDEL path uses
  indel_*_allele; pins the varcode normalization (CA → C becomes
  ref='A' alt='' at start+1).
- test_variant_from_lens_row_skips_non_snv_indel: SPLICE / FUSION /
  CTA-SELF / ERV → None (caller skipped earlier on NaN coords anyway).
- test_variant_from_lens_row_skips_when_alleles_missing: defends the
  hypothetical future case of an SNV row missing its allele columns.

Removed: 5 tests that asserted the placeholder-genotype behavior
(pre-fix). Replaced with the new shape + real-allele tests.

678 tests pass; 93% coverage. Real Pt02 LENS file produces 409
ranked entries with real biological ref/alt nucleotides.

* LENS / pVACseq: stop inventing data — read every field from real columns

Audit found six places where vaxrank was fabricating values rather
than reading what's actually in the LENS / pVACseq files. Each fix
below replaces an invented value with a lookup against the real
column. Verified end-to-end on a real Hugo IPRES Pt02 LENS file:
ranked entries now carry real ref/alt, gene names, RNA-read counts,
transcript IDs, mutation spans, and peptide offsets — no
fabrications.

Inventions removed
------------------
1. ``load_lens`` set ``offset=0`` for every prediction. Real fix:
   compute via ``pep_context.find(peptide)``. (LENS centers the
   peptide within pep_context — it's not at position 0.)

2. ``_coerce_n_alt_reads(tpm)`` used TPM as a stand-in for alt-read
   count. TPM ≠ read count — different units, different meanings.
   Real fix: read ``rna_reads_covering_genomic_origin`` (total at
   locus) and ``rna_reads_covering_genomic_origin_with_peptide_cds``
   (alt-supporting). New helper ``_read_counts_from_lens_row`` does
   the lookup. ``_coerce_n_alt_reads`` deleted.

3. ``_mut_offsets_in_context`` falsely claimed the entire pep_context
   was the mutation span when the peptide couldn't be located in it.
   That told downstream code "every residue is mutated." Real fix:
   return ``(None, None)`` so the caller drops the row instead.

4. ``gene_name = ... or 'unknown'`` (LENS + pVACseq paths). Real fix:
   preserve empty string when LENS / pVACseq doesn't supply a name —
   the codebase convention for "not known" — and let the existing
   ``iter_named_antigens`` fallback handle the display.

5. pVACseq path hardcoded ``n_overlapping_reads=1, n_alt_reads=1,
   n_ref_reads=0``. Real fix: read ``RNA Depth`` (total coverage)
   and ``RNA VAF`` (variant allele frequency); compute
   ``n_alt = round(depth × vaf)``, ``n_ref = depth - n_alt``.

6. LENS path passed ``supporting_reference_transcripts=[]`` (empty).
   Real fix: pull ``transcript_id`` and split
   ``all_transcript_ids_encoding_peptide`` into a real list. Future
   varcode-effect annotation now has the actual transcript context.
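
Item 5's arithmetic, sketched under two assumptions: VAF is a fraction in [0, 1], and missing inputs fall back to zero counts. ``read_counts_from_depth_and_vaf`` is a hypothetical helper name.

```python
import math


def _missing(value):
    return value is None or (isinstance(value, float) and math.isnan(value))


def read_counts_from_depth_and_vaf(depth, vaf):
    """Return (n_overlapping, n_alt, n_ref); all zero when data is missing."""
    if _missing(depth) or _missing(vaf):
        return 0, 0, 0
    depth = int(depth)
    n_alt = round(depth * float(vaf))
    n_alt = max(0, min(depth, n_alt))  # keep n_alt within [0, depth]
    return depth, n_alt, depth - n_alt
```

Note that Python's ``round`` rounds exact halves to the nearest even integer, which only matters when ``depth × vaf`` lands exactly on .5.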

Honest "no signal" defaults
---------------------------
When a real column is genuinely missing (NaN / None), default to 0
for read counts (honest "no signal") instead of 1 (a fabricated
weak signal). Users with LENS files lacking RNA columns will see
``combined_score`` collapse to expression-independent ranking; that's
correct behavior — we don't have read data for those rows.

Documentation
-------------
``overlaps_mutation=True`` and ``occurs_in_reference=False`` in
``load_lens`` are now documented as STRUCTURAL ASSUMPTIONS implied
by the fact that a row exists in a LENS report (LENS pre-curates
neoepitopes and pre-filters reference matches). Not invented — the
report's existence implies them.

Real Pt02 file end-to-end (post-fix)
-------------------------------------
406 ranked entries, each with real fields:
  chr3:150742446  A>''  SIAH2    reads=181/25  transcripts=1  mut_span=89-98
  chr17:29502684  T>''  TAOK1    reads=32/2    transcripts=1  mut_span=59-68
  chr6:56459264   G>'A' DST      reads=84/21   transcripts=1  mut_span=29-38

Pre-fix offset=0 → post-fix offset=133 (the actual position of the
first peptide in its 165-aa pep_context).

Tests
-----
- test_mut_offsets_in_context_returns_none_when_not_found replaces
  test_mut_offsets_in_context_falls_back_to_full_window. Pins the
  drop-the-row behavior.

678 tests pass; 92% coverage. Nothing in the LENS / pVACseq loaders
fabricates a value when the real column exists.

* LENS: derive wt_ic50 from mhcflurry_agretopicity (don't leave it None)

Audit of the LENS loader found that wt_ic50 was always None, even
though LENS does emit ``mhcflurry_agretopicity`` (defined as
MT_IC50 / WT_IC50). Real biological wildtype affinity can be
recovered as WT_IC50 = MT_IC50 / agretopicity.

Computed once per row (shared across detected predictors) and
applied only to the mhcflurry-affinity prediction (the value's
scale matches mhcflurry's IC50). Other predictors leave wt_ic50
None — the agretopicity is mhcflurry-specific.
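
The recovery step is a single division. A sketch, with ``wt_ic50_from_agretopicity`` as a hypothetical name and the non-positive-ratio guard as an added assumption:

```python
import math


def wt_ic50_from_agretopicity(mt_ic50, agretopicity):
    """WT_IC50 = MT_IC50 / agretopicity; None when either input is unusable."""
    for value in (mt_ic50, agretopicity):
        if value is None or math.isnan(value):
            return None
    if agretopicity <= 0:
        return None  # a ratio of IC50 values cannot be zero or negative
    return mt_ic50 / agretopicity
```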

Real Pt02 LENS file: 786 / 2153 predictions now carry real
wt_ic50 (the rest are non-SNV antigens — ERV / SPLICE / FUSION /
CTA-SELF — where mhcflurry_agretopicity is NaN). Sample:

  SPAMIPKDWPL  ic50=123.43  wt_ic50=14319.52  (strong neoepitope)
  KYWHIILGGGR  ic50=180.59  wt_ic50=20814.20  (strong neoepitope)
  GRFYGLDL     ic50=179.45  wt_ic50=  179.45  (silent — no improvement)

Test: existing toy fixture's mhcflurry_agretopicity=0.020 + IC50=95.4
→ wt_ic50 = 4770. Test updated to assert this rather than None.

Follow-up issues filed for bigger gaps surfaced in the audit
-------------------------------------------------------------
- #263: Surface non-SNV antigens (SPLICE / FUSION / CTA-SELF / ERV)
  from LENS — currently dropping ~30% of antigens silently because
  variant_coords is NaN for those antigen sources.
- #264: HLA LOH + B2M / TAP / antigen-presentation pathway integrity
  from LENS — clinically essential info that determines whether
  predictions for an allele or pathway are even meaningful.
- #265: Surface mhcflurry pres_score / pres_perc + proc_score from
  LENS — currently running pepsickle to compute what LENS already
  emits, and ignoring presentation score entirely.

678 tests pass; 92% coverage.

* LENS: warn on missing essential / important per-row data; correct #265 scope

Per-load summary
----------------
load_lens / ranked_from_lens_predictions now emit a per-load
summary of missing data so users know what's degraded:

  WARNING: N / total rows lack pep_context — antigens degenerate
  to bare neoepitope (~9 aa instead of ~25 aa SLP).
  INFO:    N / total rows lack gene_name (typical for ERV).
  INFO:    N / total rows lack transcript_id (no transcript context
  for downstream varcode-effect annotation).
  WARNING: N / total rows lack RNA-read counts — combined-score
  ranking collapses to epitope-only for those rows.

On a real Pt02 LENS file:
  337 / 2153 lack gene_name + transcript_id (ERV antigens)
  128 / 2153 lack RNA-read counts

Issue updates
-------------
- #263 (non-SNV antigens) — concrete plan added: fusion via
  varcode.StructuralVariant (already in varcode 4.x as BND), viral /
  CTA / ERV / splice as Antigen-subclass types. Land fusion first,
  then antigen-source tagging on MutantProteinFragment, then full
  Antigen abstraction.
- #264 (HLA LOH + APC pathway) — confirmed vaxrank has zero existing
  input format for this. Two-channel proposal: (A) extract from LENS
  rows automatically, (B) new --patient-context JSON/YAML for the
  VCF/BAM path. Both populate the same PatientInfo fields.
- #265 (mhcflurry pres / proc) — corrected: mhcflurry's per-peptide
  proc_score and pepsickle's per-position cleavage scores are
  COMPLEMENTARY, not substitutes. Surface both.

678 tests pass; 92% coverage.

* Pepsickle: default off (libomp clash on macOS); correct length scope

The macOS abort
---------------
On macOS, pepsickle's torch dependency ships its own libomp.dylib.
By the time vaxrank tries to load pepsickle, pandas / numpy /
pyarrow have already loaded a different libomp via OpenBLAS / MKL.
The second OpenMP runtime to init either aborts the process at
init time (loud, OMP Error #15) or — with KMP_DUPLICATE_LIB_OK=TRUE
set — segfaults later when the two runtimes collide during inference
(silent, exit 139).

KMP_DUPLICATE_LIB_OK=TRUE quiets the init-time abort but doesn't
prevent the inference-time segfault. So:

- Default --processing-aware-annotation flipped to False.
- The flag's help text documents the libomp issue and points users
  at typical Linux installs (single OpenMP) as the green path.
- Issue #266 filed for the robust fix (subprocess isolation, or
  ONNX-runtime alternative).

Test pinning the default flipped from on → off.

Length scope correction
-----------------------
Earlier comment on issue #265 / processing.py docstring said
"one number per 9-mer" for mhcflurry's processing score. Wrong:
mhcflurry handles MHC-I peptides 8-15 aa, and pepsickle's per-
position scoring is length-agnostic. Updated docstring to "one
number per peptide (8-15 aa for MHC-I)" and added a footnote on
``c_term × (1 - max_internal)`` being a *conservative approximation*
(strict version is ``Π(1 - p_i)``; max heuristic dominates because
proteasomes cut roughly once per substrate molecule).
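
To make the two "ligand survives internal cuts" terms concrete, here is a small comparison. The function names are hypothetical, and ``internal_probs`` is assumed to hold the cleavage probabilities at positions strictly inside the ligand.

```python
import math


def survival_max(internal_probs):
    """Heuristic term: 1 - max(p_i)."""
    return 1.0 - max(internal_probs, default=0.0)


def survival_product(internal_probs):
    """Strict term: product of (1 - p_i), treating the cuts as independent."""
    return math.prod(1.0 - p for p in internal_probs)
```

Since every factor ``(1 - p_i)`` is at most 1, the product can never exceed the max-based term; the two agree when at most one internal position has non-zero probability.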

Plus the safe-wins fixes from the earlier audit
------------------------------------------------
- LENS wt_ic50 derived from mhcflurry_agretopicity (= MT/WT ratio)
  for mhcflurry-affinity predictions. 786/2153 predictions in
  Pt02 now carry real wildtype affinity.
- Per-load summary warnings for missing essential / important data:
  * pep_context absent → SLP collapses to bare neoepitope (warning)
  * gene_name absent → typical for ERV (info)
  * transcript_id absent → no transcript context (info)
  * RNA-read counts absent → combined-score collapses to epitope-only
    (warning, suggests --combined-score-mode=epitope_only)

Pt02 produces 20 peptide constructs end-to-end with default flags;
no segfaults, no crashes.

Issues filed
------------
- #263 (non-SNV antigens): concrete plan with varcode.StructuralVariant
  for fusions, Antigen-subtype hierarchy for viral / CTA / ERV / splice
- #264 (HLA LOH + B2M / TAP integrity): two-channel input-format design
  (extract from LENS rows automatically + new --patient-context flag
  for VCF/BAM path)
- #265 (mhcflurry pres_score / proc_score): corrected scope —
  complementary to pepsickle, not a substitute
- #266 (libomp clash): subprocess isolation as the robust fix

678 tests pass; 92% coverage.

* Pepsickle: run in isolated subprocess; default back to on (#266)

Root-cause fix for the macOS libomp clash: torch's bundled libomp
collides with the parent's pandas / numpy / pyarrow OpenMP runtime,
which manifested as either OMP Error #15 (abort) or a quiet exit-139
segfault during pepsickle inference. Running pepsickle in a fresh
Python subprocess (only torch + pepsickle + json imported, no pandas)
gives torch a clean libomp load and no clash.

processing.py is a clean rewrite:
  - one linear flow in annotate_processing (group by source → score
    → apply per-peptide), no parallel paths
  - _cleavage_probs_via_subprocess returns {} on any failure and
    never raises; the caller treats missing keys as "no signal"
  - obsolete in-process _load_default_predictor removed
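
The "returns {} on any failure and never raises" contract can be sketched like this. ``call_json_subprocess`` is a hypothetical generic helper, not ``_cleavage_probs_via_subprocess``; the timeout value is arbitrary.

```python
import json
import subprocess
import sys


def call_json_subprocess(source, payload, timeout=60):
    """Run `source` in a fresh interpreter; return {} on any failure."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", source],
            input=json.dumps(payload),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        if completed.returncode != 0:
            return {}  # child crashed or exited with an error
        result = json.loads(completed.stdout)
    except (OSError, subprocess.SubprocessError, ValueError, TypeError):
        # Launch failure, timeout, malformed JSON, or unserializable payload.
        return {}
    return result if isinstance(result, dict) else {}
```

The caller can then treat a missing key and a failed run identically: both mean "no signal" for that source sequence.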

With subprocess isolation in place:
  - --processing-aware-annotation default flips back to True
  - the --no-processing-aware-annotation opt-outs added to the b16,
    pVACseq, LENS, and real-LENS-fixture pytest paths are gone —
    those workarounds existed only to dodge the OMP segfault under
    pytest+coverage, which the subprocess no longer triggers
  - test_processing_aware_annotation_default_on flipped accordingly

Verified end-to-end on Pt02: 2105 / 2113 EpitopePredictions
annotated across 489 unique source sequences in ~3s, no segfault;
666 / 666 pytest pass with pepsickle on by default.

* Pepsickle: prefix annotation fields with predictor name

Three columns previously named ``c_term_cleavage_prob``,
``max_internal_cut_prob``, ``processing_score`` are renamed to
``pepsickle_c_term_cleavage_prob``, ``pepsickle_max_internal_cut_prob``,
``pepsickle_processing_score``. Report column headers (HTML / PDF /
ASCII / JSON) get the matching ``Pepsickle`` prefix.

A reader looking at a CSV column needed to know — out of band — that
those values came from pepsickle; the names didn't say so. With the
prefix the predictor is unambiguous in data, log lines, and report
headers, and a future per-position cleavage predictor (NetChop,
PAProC, …) can land alongside without name collision.

No behavior change beyond the rename; all 666 tests pass with
pepsickle on by default.

* Fail fast when no --output-* flag is set (LENS / pVACseq path too)

``check_args`` already validated that at least one primary output
path was set, but it ran only on the VCF/BAM pipeline branch — the
external-input branch (--input-lens / --input-pvacseq) skipped it.
Result: a user could run a long LENS import to completion and end
up with nothing on disk, the only signal a quiet
``wrote=['(none)']`` log line late in the run.

Hoist the check above the input-dispatch branch so it fires for both
paths, before any input loading or prediction runs. Replace the
ad-hoc string-concatenated error with a table-driven message that
lists each --output-* flag and its purpose, sourced from a single
``_PRIMARY_OUTPUT_FLAGS`` tuple so the help text and the validation
list can't drift apart.
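
A sketch of the table-driven check. The tuple contents and purpose strings below are illustrative guesses, not the actual ``_PRIMARY_OUTPUT_FLAGS`` value, and ``check_primary_outputs`` is a hypothetical name.

```python
import argparse

# (argparse dest, CLI flag, purpose). Entries are illustrative only.
PRIMARY_OUTPUT_FLAGS = (
    ("output_csv", "--output-csv", "ranked results as CSV"),
    ("output_neoepitope_report", "--output-neoepitope-report", "neoepitope report"),
    ("output_json_file", "--output-json-file", "results as JSON"),
)


def check_primary_outputs(args):
    """Raise ValueError unless at least one primary output flag is set."""
    if any(getattr(args, dest, None) for dest, _, _ in PRIMARY_OUTPUT_FLAGS):
        return
    lines = ["No output requested; pass at least one of:"]
    # The same table drives both the check above and the message below.
    lines += [f"  {flag:<28} {purpose}" for _, flag, purpose in PRIMARY_OUTPUT_FLAGS]
    raise ValueError("\n".join(lines))
```

Calling this before any input is loaded makes the failure immediate rather than a quiet log line at the end of a long run.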

* Pepsickle: defer per-peptide arithmetic to mhctools (closes #267)

vaxrank's ``_per_position_processing`` and
``_composite_processing_score`` were reimplementing
``ProcessingPredictor.c_term_prob`` / ``max_internal_prob`` and
``score_cterm_anti_max_internal``, which already live in mhctools
and are byte-identical to the hand-rolled versions. Drop the local
helpers and call the mhctools ones directly. Net result: ~30 lines
of arithmetic gone from vaxrank, single source of truth in mhctools,
and a future scoring fix or length-edge-case tweak in mhctools flows
here automatically.

The unit tests for the two helpers go away with them — that
arithmetic is mhctools' responsibility to test, not vaxrank's. One
new test pins our local out-of-range guard around the slice helpers
(peptide span past source → ``(None, None)`` so the caller drops
the row).

Subprocess isolation for the macOS libomp clash stays here for now
— ``mhctools.Pepsickle`` doesn't yet offer it built-in. Tracked in
openvax/mhctools#200; once that ships, the subprocess wrapper here
collapses to ``Pepsickle(..., isolate_subprocess=True)``.

Imports of ``mhctools.processing_predictor`` at module level pull in
only the wrapper module — torch stays unloaded in the parent, so
subprocess isolation still works as designed.

* Reject pipeline-only reports on --input-lens / --input-pvacseq path

Passing ``--output-ascii-report`` (or ``--output-html-report`` /
``--output-pdf-report``) with ``--input-lens`` ran to completion and
silently wrote nothing: the template-report block in ``main()`` was
gated on ``source == 'pipeline'`` and short-circuited on the external
path.

The deeper problem is that ``TemplateDataCreator`` walks
``mutant_protein_fragment.predicted_effect()``, which calls
``variant.effect_on_transcript`` and expects pyensembl
``Transcript`` objects; LENS / pVACseq carry transcript IDs as
strings only. Variant-counting metadata (somatic / coding /
RNA-supported) is also only produced from VCF + BAM. Making the
external path produce template reports is its own structural
project (filed as #268) — for this PR, just reject the combination
in ``check_args`` with an explicit message that points the user at
the outputs that ARE reachable from external input
(``--output-csv`` / ``--output-xlsx-report`` /
``--output-neoepitope-report`` / ``--output-json-file``).

Pinned by ``test_lens_cli_errors_when_paired_with_pipeline_only_report``.

* Reference vaxrank#268 from check_args docstring

* Unify vaccine output flags: --vaccine-type single-valued + --vaccine-output

The old design used a multi-valued ``--vaccine-type peptide mrna``
plus separate ``--output-peptide`` / ``--output-mrna`` flags whose
names looked like data-format outputs (peer to ``--output-csv``)
but actually drove vaccine-construct *design*. That mismatch made
the flags easy to miss — passing ``--input-lens FILE`` alone left
``vaccine_type=['peptide']`` active but no peptide writer fired,
producing a quiet ``wrote=['(none)']`` log line and zero files on
disk.

CLI rework — one mode per run:

- ``--vaccine-type {peptide,mrna}`` — single-valued, default
  ``peptide``
- ``--vaccine-output PATH`` — destination, interpreted by the
  active type (peptide → FASTA file, mrna → directory)
- ``--vaccine-manifest`` — JSON manifest, shared schema across types
- ``--vaccine-order-form`` — peptide-only; ``check_args`` errors
  if --vaccine-type=mrna
- ``--vaccine-csv`` / ``--vaccine-csv-no-full-rows`` — mrna-only;
  ``check_args`` errors if --vaccine-type=peptide

Removed: ``--output-peptide``, ``--output-peptide-manifest``,
``--output-peptide-order-form``, ``--output-mrna``,
``--output-mrna-manifest``, ``--output-mrna-csv``,
``--output-mrna-csv-no-full-rows``.

Internal cleanup: ``_resolve_vaccine_types`` (returning a set) →
``_resolve_vaccine_type`` (returning a string); the dispatch-table
loop in ``_emit_outputs`` collapses to a single lookup; the
mismatched-flag warning machinery goes away because the new
companion flags are statically tied to one type.

Verified end-to-end on Pt02 LENS:
- ``--vaccine-type peptide --vaccine-output peptides.fasta``
  → 20 SLPs written
- ``--vaccine-type mrna --vaccine-output mrna_dir/``
  → cds / no_polyA / full FASTAs written

Test churn: ``test_resolve_vaccine_types_*`` collapses to one
single-value test; ``test_emit_outputs_warns_when_output_path_lacks_matching_type``
and ``test_output_path_without_matching_type_does_not_write`` are
no longer applicable (companion-flag mismatches are now hard
errors); ``_mrna_args`` test helper centralizes the mRNA-design
defaults so individual tests stay focused. 661 tests pass.

* Make ASCII / HTML / PDF reports work on --input-lens / --input-pvacseq (closes #268)

LENS and pVACseq aggregates carry every field the template reports
need — ``transcript_id`` (per-row Ensembl ID), ``variant_coords`` +
ref/alt allele columns, ``rna_reads_covering_genomic_origin``,
``gene_name``. Earlier the external-input path stored those as
strings on ``MutantProteinFragment.supporting_reference_transcripts``
and the template renderer crashed deep inside
``variant.effect_on_transcript`` (TypeError: expected Transcript,
got str), so the reports were short-circuited with
``source != 'pipeline'`` and ``check_args`` rejected the flags.

This change resolves the data so reports render end-to-end:

- ``_resolve_transcripts`` (vaxrank/external_input.py) — turns
  Ensembl-ID strings into pyensembl ``Transcript`` objects via the
  ``--ensembl-release``-configured genome. Strips version suffixes
  (``ENST00000312960.4`` → ``ENST00000312960``) since pyensembl 2.x
  doesn't auto-strip. Unresolvable IDs are dropped silently
  (DEBUG-logged) so a release mismatch downgrades rather than
  crashes.
- ``_parse_variant_coords`` strips the ``chr`` prefix LENS emits.
  Without this the variant short_description rendered as
  ``chrchr3 …`` and ``effect_on_transcript`` found no transcripts
  (pyensembl uses bare contigs).
- ``_patient_info_from_external`` synthesizes a ``PatientInfo`` from
  the loaded ranked output: ``num_somatic_variants`` =
  unique-variants-with-antigens, ``num_coding_effect_variants`` =
  unique-variants-with-resolved-Transcript,
  ``num_variants_with_rna_support`` = unique-variants-with-non-zero
  reads. ``load_external_ranked`` now returns a 4-tuple including
  this.
- ``MutantProteinFragment.predicted_effect()`` returns ``None`` when
  no transcripts resolve OR when varcode rejects every transcript
  (e.g. ``ReferenceMismatchError`` when LENS / pVACseq was called
  against a different reference than the configured pyensembl
  release). Empty list → None and a list of failures → None — the
  template renderer tolerates both.
- ``TemplateDataCreator._effect_data`` /
  ``_query_cancer_hotspots`` / ``compute_template_data``
  render ``"—"`` placeholders when the effect is None.
- ``external_input_arg_parser`` now exposes ``--ensembl-release``
  and pulls in ``add_optional_output_args`` /
  ``add_supplemental_report_args`` so the namespace shape matches
  the pipeline parser.
- The ``check_args`` rejection for "pipeline-only reports on LENS"
  is gone — those reports work now.
- ``main()`` drops the ``source != 'pipeline'`` short-circuit;
  template reports run on both paths through one code path.
- pVACseq's "Best Transcript" column is now consumed (was
  hardcoded to []).
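
The strip-then-lookup behavior described for ``_resolve_transcripts``
can be sketched as below. This is a standalone approximation, not the
vaxrank source: ``resolve_transcripts`` and ``StubGenome`` are
illustrative names, and the stub assumes an unknown ID raises
``ValueError`` from ``transcript_by_id``.

```python
import logging

logger = logging.getLogger(__name__)


class StubGenome:
    """Stand-in for a pyensembl genome; only transcript_by_id is modeled."""

    def __init__(self, known):
        self._known = dict(known)

    def transcript_by_id(self, transcript_id):
        if transcript_id not in self._known:
            raise ValueError(f"Transcript not found: {transcript_id}")
        return self._known[transcript_id]


def strip_version(transcript_id):
    """ENST00000312960.4 -> ENST00000312960."""
    return transcript_id.split(".", 1)[0]


def resolve_transcripts(transcript_ids, genome):
    """Resolve ID strings to transcript objects, dropping unresolvable ones."""
    if genome is None:  # no --ensembl-release configured
        return []
    resolved = []
    for raw_id in transcript_ids:
        try:
            resolved.append(genome.transcript_by_id(strip_version(raw_id)))
        except (ValueError, KeyError):
            # A release mismatch downgrades the report instead of crashing.
            logger.debug("Could not resolve transcript %s", raw_id)
    return resolved
```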

Verified end-to-end on Pt02 LENS (HugoLo IPRES 2016, 2153
predictions / 489 source proteins) with ``--ensembl-release 75``:
ASCII report writes, 384 / 406 variants render with full effect
annotation (Effect type / Transcript name / ID / description); the
remaining 22 are reference-allele mismatches between LENS's
upstream caller and pyensembl 75's reference.

Test churn: ``test_parse_variant_coords_extracts_chr_pos`` updated
to assert the bare-contig form; ``_variant_from_lens_row`` SNV /
INDEL tests likewise; new
``test_lens_cli_writes_ascii_report`` pins the end-to-end behavior.

* Review fixes: stale docstring, specific exceptions, dedup, test gaps

Address every blocking finding from the /review on PR #262:

1. Stale docstring in ``_resolve_transcripts`` claimed pyensembl
   strips version suffixes internally; that's exactly what we
   discovered it *doesn't* do. Corrected to match the actual
   strip-then-lookup behavior.

2. Replace bare ``except Exception`` with specific exception types:
   - ``_resolve_transcripts``: ``(ValueError, KeyError)`` —
     pyensembl's not-found shape.
   - ``MutantProteinFragment.predicted_effect``: lazy-import
     ``varcode.errors.ReferenceMismatchError`` and catch
     ``(ReferenceMismatchError, ValueError, KeyError)``.
   Other exceptions propagate so genuine bugs aren't swallowed.

3. Aggregate INFO/WARN summary for transcript-resolution failures
   in ``ranked_from_lens_predictions`` and
   ``ranked_from_pvacseq_predictions``: a single end-of-load WARN
   covers (a) "had IDs but no --ensembl-release" and (b) "had IDs
   but release mismatch — N/M didn't resolve." The per-row debug
   logs stay so the individual IDs are still inspectable. End-to-end
   on Pt02 LENS without ``--ensembl-release`` now emits one clear
   WARN pointing the user at the fix instead of silently rendering
   "—" everywhere.

4. ``_KNOWN_VACCINE_TYPES`` set deleted; ``_resolve_vaccine_type``
   now derives known types from ``_VACCINE_TYPE_DISPATCH`` directly.
   Adding a new vaccine writer is one registration, not two.

5. Dead intermediate (``variants = [v for v, _ in ranked]; n_total
   = len(variants)``) replaced with ``len(ranked)``.
   ``num_variants_with_vaccine_peptides`` rolled into the same loop
   instead of a duplicate ``sum(...)``. Comment pinned that the
   "first VP carries the transcript" assumption holds because LENS /
   pVACseq loaders emit one VP per variant — flag if that ever
   changes.

6. ``_cleavage_probs_via_subprocess`` docstring now notes the ~1–2s
   per-call torch-import cost (one launch per
   ``annotate_processing``, batched across all sources) and points
   at openvax/mhctools#200 for the upstream fix.

7. New test coverage:
   - ``_resolve_transcripts``: version-suffix stripping, drop-on-
     unresolvable, no-genome short-circuit (3 tests, ``_StubGenome``
     fixture).
   - ``_patient_info_from_external``: proxy-count semantics + empty-
     ranked path.
   - ``MutantProteinFragment.predicted_effect``: returns None on
     empty transcripts AND when varcode raises
     ``ReferenceMismatchError`` for every transcript.
   - ``--ensembl-release`` reachable through the external-input
     parser (regression guard).

669 tests pass (was 661; +8 from this commit).

* pVACseq: strip chr prefix in _parse_pvacseq_id (matches LENS fix)

The earlier ``chr``-prefix fix on the LENS path was missing its
symmetric pVACseq counterpart. ``_parse_pvacseq_id`` builds Variants
with ``normalize_contig_names=False``, so a pVACseq ID like
``chr1-100000-100001-A-T`` would land on the Variant as
``contig='chr1'`` and silently break ``effect_on_transcript``
against pyensembl (which uses bare contigs).

Same one-liner strip as ``_parse_variant_coords``; tests updated to
assert the bare-contig invariant on both parsers. Also fix the
``check_args`` error wording — it claimed the listed flags were
"--output-* flags," but the list now includes ``--vaccine-output``
which doesn't carry the prefix.
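
A small sketch of the bare-contig invariant, using the ID shape from
the example above. The function names are illustrative, and the field
meaning (contig, start, stop, ref, alt) is assumed from that one
example rather than from the pVACseq format documentation.

```python
def strip_chr_prefix(contig):
    """chr1 -> 1; bare contigs pass through unchanged."""
    return contig[3:] if contig.startswith("chr") else contig


def parse_pvacseq_id(variant_id):
    """Split an ID like chr1-100000-100001-A-T into five fields.

    Assumes exactly five dash-separated fields; IDs whose alleles
    themselves contain a dash are not handled by this sketch.
    """
    contig, start, stop, ref, alt = variant_id.split("-")
    return strip_chr_prefix(contig), int(start), int(stop), ref, alt
```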

669 / 669 pass.

* Pre-flight ensembl-release hint, subprocess module, README rewrite, 2.17.0

Round-tripping the /review output:

1. Subprocess body extracted from inline ``_SUBPROCESS_SCRIPT`` into
   importable ``vaxrank/_pepsickle_subprocess.py`` with a ``main()``
   entry. The body is unchanged (still pure ``mhctools.Pepsickle``
   I/O), but it's now testable without spawning a subprocess —
   ``test_pepsickle_subprocess_main_io_contract`` exercises the
   JSON in/out contract via patched mhctools, and
   ``test_pepsickle_subprocess_main_returns_2_when_mhctools_missing``
   pins the import-error exit-code path. Wrapper exists only because
   openvax/mhctools#200 hasn't shipped subprocess isolation yet.

2. Ensembl-release inference + pre-flight hint. New
   ``infer_genome_build_from_lens`` reads the LENS file's
   ``origin_descriptor`` column and returns ``GRCh37`` / ``GRCh38``
   / None based on Hsap37 / Hsap38 markers (LENS ERV-row format).
   ``check_args`` calls it at startup when external input is paired
   with template-report flags but no ``--ensembl-release`` is set,
   and emits a single WARN that names the inferred build and a
   plausible release. Verified live on Pt02: WARN now fires before
   any LENS load, naming GRCh38 + release 102 specifically.

3. Doc + version updates:
   - README.md: every ``--output-peptide`` / ``--output-mrna``
     example replaced with the new ``--vaccine-output`` form;
     ``--vaccine-type`` documented as single-valued (was
     multi-valued); LENS-driven full-report example added with
     ``--ensembl-release``.
   - vaxrank/version.py: 2.16.0 → 2.17.0 (additive features +
     breaking CLI: removed --output-peptide / --output-mrna /
     --output-peptide-manifest / --output-peptide-order-form /
     --output-mrna-manifest / --output-mrna-csv /
     --output-mrna-csv-no-full-rows).

4. New tests:
   - subprocess I/O contract (2 tests, no pepsickle dependency).
   - ``infer_genome_build_from_lens``: GRCh38 / GRCh37 / unknown
     (3 tests).
   - ``args.genome`` plumbing on the external path: ``parse +
     _resolve_ensembl_release → args.genome is EnsemblRelease(75)``.
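
The build inference in item 2 amounts to a marker scan over one
column. A rough sketch, assuming a tab-separated file with a header
row; ``infer_genome_build`` is an illustrative name and the example
descriptor strings are made up, since the real ``origin_descriptor``
layout is not shown here.

```python
import csv
import io


def infer_genome_build(lines):
    """Return 'GRCh37', 'GRCh38', or None from origin_descriptor values.

    `lines` is any iterable of TSV lines, header first. The first row
    carrying a marker decides the answer.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        descriptor = row.get("origin_descriptor") or ""
        if "Hsap38" in descriptor:
            return "GRCh38"
        if "Hsap37" in descriptor:
            return "GRCh37"
    return None


if __name__ == "__main__":
    # Made-up row; real descriptor strings will look different.
    example = "peptide\torigin_descriptor\nSIINFEKL\tHsap38.chr1.12345\n"
    print(infer_genome_build(io.StringIO(example)))
```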

675 / 675 tests pass.

* Pepsickle: drop in-tree subprocess wrapper; use mhctools 3.13.3 isolation

mhctools 3.13.3 (openvax/mhctools#201) ships built-in
``Pepsickle(isolate_subprocess=True)`` so the libomp-clash workaround
collapses from a vaxrank-side launcher + bespoke subprocess module
to a single kwarg.

Removed:
- ``vaxrank/_pepsickle_subprocess.py`` (the pure-mhctools subprocess
  body lifted to ``openvax/mhctools/_pepsickle_subprocess.py``).
- ``processing._cleavage_probs_via_subprocess`` and
  ``_SUBPROCESS_MODULE`` (was launching the deleted module).
- The two unit tests that exercised the local module's I/O contract.

Added:
- ``processing._load_default_predictor(human_only, threshold)`` —
  thin builder that constructs ``Pepsickle(isolate_subprocess=True)``
  and degrades to None with a logged warning if mhctools isn't
  importable. ``annotate_processing`` now resolves a default
  predictor through this helper and calls ``cleavage_probs`` directly
  on it (same loop body that the test-seam ``predictor=`` arg has
  always used).
- ``test_load_default_predictor_returns_none_when_mhctools_missing``
  pins the graceful-degrade path.
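
A hedged sketch of the graceful-degrade builder. The
``isolate_subprocess=True`` kwarg comes from mhctools#201; passing
``human_only`` and ``threshold`` as constructor keywords is an
assumption based on that PR's description, and the ``module_name``
parameter exists only so the missing-module path can be exercised
without uninstalling anything.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def load_default_predictor(human_only, threshold, module_name="mhctools"):
    """Build an isolated Pepsickle predictor, or return None if unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        logger.warning(
            "%s is not importable; skipping processing annotation", module_name
        )
        return None
    # Keyword names below are assumed, not checked against mhctools.
    return module.Pepsickle(
        isolate_subprocess=True,
        human_only=human_only,
        threshold=threshold,
    )
```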

Bumped:
- ``requirements.txt``: mhctools >=3.13.3 (was >=3.13.1).

End-to-end on Pt02 LENS: 2105/2113 EpitopePredictions annotated
across 489 unique sources — same numbers as before the refactor,
confirming the subprocess invocation moved upstream without changing
results.

Test count: 674 (was 675; net −1 from removing the two
local-subprocess tests + adding one default-predictor smoke).

* Fix pepsickle perf cliff; promote ensembl-release WARN to pre-flight error; verify antigen kinds in skip logs

Four independent improvements that all came out of running PR #262
on real Pt02 data with auto-mode logging.

1. Perf cliff: ``annotate_processing`` was calling
   ``predictor.cleavage_probs(source)`` per unique source. Under
   ``Pepsickle(isolate_subprocess=True)`` that's a fresh subprocess
   PER call (mhctools 3.13.3's ``cleavage_probs(s)`` delegates to
   ``cleavage_probs_many([s])`` internally) — ~1-2s startup ×
   489 sources turned a 30s run into ~15 minutes "stuck." Switch
   to ``cleavage_probs_many`` for one batched subprocess
   invocation. End-to-end on Pt02: 3.2s total (was effectively
   hung). Stub predictors that only implement ``cleavage_probs``
   still work via the fallback per-source loop.

2. ``--ensembl-release`` requirement promoted from late warning to
   early error. External input + template-report flag (ASCII /
   HTML / PDF) without ``--ensembl-release`` would silently render
   a degraded report with empty effect annotations everywhere.
   ``check_args`` now raises ValueError pre-flight with the
   build-inferred release suggestion (``--ensembl-release 102``
   when ``origin_descriptor`` says GRCh38, etc.). The now-redundant
   loader-level WARN for a missing ``--ensembl-release`` is gone; the
   release-mismatch case is still logged from the loader, since
   pre-flight can't predict resolution success.

3. Empty-coords / no-gene / no-transcript skip logs now show the
   actual ``antigen_source`` breakdown instead of asserting
   "typical for X". Pt02 reveals the empty-coords skip is
   genuinely composed of CTA/SELF=276, ERV=209, SPLICE=126,
   FUSION=2 — every kind that's expected to lack genome coords.
   When SNV / INDEL rows lack these fields, a separate WARN
   surfaces it as a likely upstream LENS bug rather than burying
   it in the breakdown.

4. ``annotate_processing`` now emits a single aggregate WARN
   summarizing peptides skipped because they aren't substrings of
   their pep_context source (real LENS-data anomaly: 8/2113 on
   Pt02 — peptide and pep_context came from different
   isoform/annotation snapshots; see prior analysis). The WARN
   includes a representative (peptide, pep_context) pair the user
   can paste into an upstream bug report.
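
The batching fix in item 1 can be sketched as follows. It assumes
``cleavage_probs_many`` returns one result per input sequence, in
input order; that return shape is an assumption, and the stub classes
stand in for real predictors.

```python
def cleavage_probs_for_sources(predictor, sources):
    """Return {source: per-position probabilities} for unique sources."""
    unique = list(dict.fromkeys(sources))  # dedup, keep first-seen order
    many = getattr(predictor, "cleavage_probs_many", None)
    if callable(many):
        # One call, so one subprocess launch when isolation is on.
        results = many(unique)
    else:
        # Fallback for stubs that only implement cleavage_probs.
        results = [predictor.cleavage_probs(source) for source in unique]
    return dict(zip(unique, results))


class PerSourceStub:
    """Predictor stub exposing only the per-sequence method."""

    def __init__(self):
        self.calls = 0

    def cleavage_probs(self, source):
        self.calls += 1
        return [0.5] * len(source)


class BatchedStub(PerSourceStub):
    """Predictor stub that also exposes the batched method."""

    def __init__(self):
        super().__init__()
        self.batch_calls = 0

    def cleavage_probs_many(self, sources):
        self.batch_calls += 1
        return [[0.5] * len(source) for source in sources]
```

With the batched stub, any number of sources costs one call; with the
per-source stub, the cost is one call per unique source.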

Test count: 676 (was 675; +2 new tests for the early-error and
peptide-not-in-context paths, −1 fixed test that pinned the now-
deleted late warning).

* Pre-flight ensembl-release suggestion: list installed releases instead of hardcoding 102

Previously the pre-flight error message hardcoded a suggestion of
``--ensembl-release 102`` for GRCh38 inputs. That number was
arbitrary — picked once during development as "stable mid-2020
default" — and ``origin_descriptor`` only tells us the *build*
(GRCh37 vs GRCh38), not the release. GRCh38 spans Ensembl 76
through ~114+.

Replace the hardcoded mapping with ``installed_ensembl_releases_for_build``,
which walks the pyensembl cache (``<platformdirs>/pyensembl/<build>/ensembl<N>``)
to enumerate releases the user actually has on disk. The pre-flight
error now lists those releases and suggests the latest, with an
explicit "origin_descriptor doesn't pin a specific release" caveat
so the user knows the suggestion is best-effort, not derived from
the file.

Three branches:
- Local cache has releases for this build → list them, suggest the
  latest as a quick-start.
- Cache empty AND build is GRCh37 → 75 is canonical (final mainline
  release); suggest installing it.
- Cache empty AND build is GRCh38 → no canonical release;
  acknowledge that and ask the user to pick one matching the LENS
  file's source build, with an install command.
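
A sketch of the cache walk and the three branches, assuming the
directory layout quoted above. ``installed_releases`` and
``suggest_release`` are illustrative names, and the cache root is
passed in explicitly rather than looked up through platformdirs.

```python
import re
import tempfile
from pathlib import Path

_RELEASE_DIR = re.compile(r"^ensembl(\d+)$")


def installed_releases(cache_root, build):
    """List releases found under <cache_root>/pyensembl/<build>/ensembl<N>."""
    build_dir = Path(cache_root) / "pyensembl" / build
    if not build_dir.is_dir():
        return []
    releases = []
    for child in build_dir.iterdir():
        match = _RELEASE_DIR.match(child.name)
        if child.is_dir() and match:
            releases.append(int(match.group(1)))
    return sorted(releases)


def suggest_release(build, installed):
    """Pick a release to suggest, following the three branches above."""
    if installed:
        return max(installed)  # latest release already on disk
    if build == "GRCh37":
        return 75  # final GRCh37 release
    return None  # GRCh38 with an empty cache: no canonical release


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for number in (102, 113):
            path = Path(root) / "pyensembl" / "GRCh38" / f"ensembl{number}"
            path.mkdir(parents=True)
        found = installed_releases(root, "GRCh38")
        print(found, suggest_release("GRCh38", found))
```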

Live verification on Pt02 (29 locally-installed GRCh38 releases):
the message now says "you have these GRCh38 releases installed:
[77, 78, ..., 113, 114]; try --ensembl-release 114" — concrete,
correct, and points at something the user can actually run without
first downloading data.

677 / 677 tests pass; +1 test pinning the cache-walk helper.

* LENS report quality: inputs section, DNA VAF, dedup epitopes, SLP-windowed fragment

Four fixes from running the LENS-driven ASCII report on real Pt02:

1. **Inputs section**: ``PatientInfo`` gains an ``inputs:
   list[(label, path)]`` field rendered verbatim by the report
   builder. The external (LENS / pVACseq) path now labels its file
   "LENS report: …" or "pVACseq report: …" instead of misleadingly
   stuffing it into ``vcf_paths`` and rendering as "VCF (somatic
   variants) path(s)". The legacy ``vcf_paths``/``bam_path`` fields
   stay for backward compat but are no longer overloaded by external
   paths.

2. **DNA VAF**: read LENS's ``vaf`` column (sits between
   ``mhcflurry_agretopicity`` and the DNA-clonal columns
   ``totcopynum`` / ``multiplicity`` / ``ccf``; tracks DNA VAF on
   real Pt02 data — 0.141 vs computed RNA VAF 0.138). Plumb through
   ``dna_vaf_by_variant`` so the report's "DNA VAF" line populates
   instead of "n/a". pVACseq path likewise reads its explicit
   ``DNA VAF`` column (separate from ``RNA VAF``).
   ``ranked_from_*_predictions`` now returns
   ``(ranked, dna_vaf_by_variant)`` so the dispatcher can thread
   it through.

3. **Dedup epitope rows**: real LENS files have multiple rows per
   ``(peptide, allele)`` when the same peptide is encoded by
   several transcripts (~5% of Pt02 rows). The loader was emitting
   one ``EpitopePrediction`` per row and ``ranked_from_lens_predictions``
   ``extend``ed them all into the VP, producing duplicate rows in
   the per-VP epitope table (one annotated by pepsickle, one not,
   because annotation iterates by id()). Dedup in the loader by a
   content key (peptide, allele, ic50, percentile_rank,
   prediction_method_name) before assembling the VP.

4. **Vaccine peptide too long → SLP window**: the LENS path was
   using ``pep_context`` directly as ``MutantProteinFragment.amino_acids``,
   but LENS sometimes emits a 100+ aa protein-prefix as
   pep_context. Vaccine peptides rendered as 100+ aa instead of
   the canonical 25mer. Add
   ``MutantProteinFragment.slp_window_around_mutation(...)`` —
   centers a target-length window on the mutation span — and use
   it from the LENS loader. The pipeline path gets correctly-sized
   fragments by construction (Isovar emits exactly the requested
   width); the LENS path now lands on the same shape via this
   shared helper instead of inheriting LENS's variable
   pep_context length. ``--vaccine-peptide-length`` (default 25)
   threads through ``load_external_ranked``.
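
The windowing in fix 4 can be sketched as a clamp-and-slice. This is
an approximation of what ``slp_window_around_mutation`` is described
as doing, not its source; the half-open offset convention and the
handling of mutation spans longer than the target are assumptions.

```python
def slp_window(sequence, mutation_start, mutation_end, target_length=25):
    """Return (window, new_start, new_end) centered on the mutation span.

    mutation_start / mutation_end are half-open offsets into `sequence`.
    """
    if len(sequence) <= target_length:
        return sequence, mutation_start, mutation_end
    span = mutation_end - mutation_start
    # Split the remaining length roughly evenly between the two flanks.
    left_flank = max((target_length - span) // 2, 0)
    window_start = mutation_start - left_flank
    # Clamp so the window stays inside the sequence.
    window_start = max(0, min(window_start, len(sequence) - target_length))
    window_end = window_start + target_length
    return (
        sequence[window_start:window_end],
        mutation_start - window_start,
        # A span longer than the target is cut off at the window edge.
        min(mutation_end - window_start, target_length),
    )
```

A 103 aa context with the mutation in the middle comes back as a 25 aa
window with the mutated residue at offset 12; mutations near either
end shift the window rather than shortening it.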

End-to-end on Pt02 (``--ensembl-release 113``):
- Patient header: "LENS report: …"
- DNA VAF: 0.141, RNA VAF: 0.138 (distinct)
- Vaccine peptide: 25aa (was 103aa)
- One row per (peptide, allele) in the epitope table
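
The one-row-per-(peptide, allele) result comes from the content-key
dedup in fix 3. A minimal sketch using the key fields listed there;
``EpitopeRow`` is a stand-in record type, not vaxrank's
``EpitopePrediction``.

```python
from collections import namedtuple

# Stand-in row type; transcript_id is the field that differs between
# otherwise identical rows.
EpitopeRow = namedtuple(
    "EpitopeRow",
    "peptide allele ic50 percentile_rank prediction_method_name transcript_id",
)


def dedup_rows(rows):
    """Keep the first row for each content key, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        key = (
            row.peptide,
            row.allele,
            row.ic50,
            row.percentile_rank,
            row.prediction_method_name,
        )
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

Keying on content rather than object identity means two rows that
differ only in their source transcript collapse into one.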

677 / 677 tests pass.

* PatientInfo unification: pipeline path also uses inputs; LENS infers MHC alleles

Two improvements that share more code between the VCF/BAM and
external (LENS / pVACseq) paths, plus one error-message tweak:

1. Pipeline path now populates ``PatientInfo.inputs`` with
   ``[("VCF (somatic variants)", path), ("BAM (RNAseq reads)", bam)]``
   so both code paths feed the same renderer with the same shape.
   The legacy ``vcf_paths`` / ``bam_path`` fields stay populated for
   backward-compat with previously-saved JSON; new code reads
   ``inputs`` exclusively.

2. ``_patient_info_from_external`` now takes ``predictions`` and
   infers the unique MHC alleles from them — LENS / pVACseq don't
   carry an explicit alleles header, but the alleles are implicit in
   the per-row predictions. The patient-info block now shows them
   instead of leaving "MHC alleles:" blank, with an explicit
   "(inferred from report)" suffix so the reader knows the source.

3. Dropped the "— origin_descriptor doesn't pin a specific release"
   qualifier from the pre-flight error per user feedback; the
   "(or whichever matches the build LENS used)" is sufficient.

677 / 677 tests pass.

* Report quality: name predictor, processing-agnostic columns, geometric-mean composite, modality-aware manufacturability

Five report-quality changes from real-Pt02 review:

1. **MHC predictor visible per row**: new ``Predictor`` column in
   the per-VP epitope table (mhcflurry / netmhcpan / etc.). LENS
   files can be multi-predictor; pipeline runs are usually
   single-predictor but consistent rendering doesn't hurt.

2. **Score column header renamed for clarity**: "Score" → "Score
   (affinity, logistic IC50)". The score is a logistic transform of
   IC50, not mhcflurry's presentation_score / EL — vaxrank computes
   it locally so it's comparable across predictors. The longer
   header eliminates the ambiguity.

3. **Predictor-agnostic processing columns**: column headers
   "Pepsickle C-term cut" / "Pepsickle max internal cut" /
   "Pepsickle processing score" → "Processing: C-term" /
   "Processing: max internal" / "Processing: combined". The
   underlying field names stay ``pepsickle_*`` so a future per-
   position predictor (NetChop, PAProC) can land alongside without
   collision; the active predictor is named once in the patient-info
   header as "Processing predictor: pepsickle".

4. **Composite score → geometric mean**: ``pepsickle_processing_score``
   was ``c_term * (1 - max_internal)`` (mhctools' canonical
   ``score_cterm_anti_max_internal``). Switch to ``sqrt(c_term *
   (1 - max_internal))`` — the geometric mean of the two factors.
   Less-aggressive penalty when both factors are mid-range; a row
   with c_term=0.6 and (1 - max_internal)=0.6 now scores 0.6
   instead of 0.36, which matches how downstream readers interpret
   the combined score as a "credibility tag" rather than a strict
   joint probability. Pt02 example: c_term=0.84, max_internal=0.38 →
   was 0.52, now 0.72.

5. **Modality-aware manufacturability default**: the GRAVY / Cys-
   content / N-terminal-Q manufacturability section in template
   reports defaults on for ``--vaccine-type=peptide`` (relevant to
   peptide synthesis) and off for ``--vaccine-type=mrna`` (those
   features don't apply to mRNA constructs). Explicit
   ``--include-manufacturability-in-report`` /
   ``--no-manufacturability-in-report`` still override.
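
The composite change in item 4 is small enough to check by hand. The
two functions below restate the formulas from the text; the function
names are illustrative.

```python
import math


def product_score(c_term, max_internal):
    """Old composite: c_term * (1 - max_internal)."""
    return c_term * (1.0 - max_internal)


def geometric_mean_score(c_term, max_internal):
    """New composite: geometric mean of c_term and (1 - max_internal)."""
    return math.sqrt(c_term * (1.0 - max_internal))


if __name__ == "__main__":
    # Pt02 example from the text: c_term=0.84, max_internal=0.38.
    print(round(product_score(0.84, 0.38), 2))
    print(round(geometric_mean_score(0.84, 0.38), 2))
```

Both scores stay in [0, 1] for inputs in [0, 1], and the geometric
mean is never smaller than the product on that range.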

677 / 677 tests pass.

* Generic, modality-aware reports: drop peptide-coded language; surface vaccine type

Cleanup of the report-generation surface so the same flags
(``--output-ascii-report`` / ``--output-html-report`` /
``--output-pdf-report`` / ``--output-csv`` / ``--output-xlsx-report``
/ ``--output-json-file`` / ``--output-neoepitope-report``) work
across every ``--vaccine-type`` and adapt their content to the
active modality, rather than always rendering peptide-specific
material.

Visible changes on Pt02 LENS:
- Patient header now opens with "Vaccine type: peptide" /
  "Vaccine type: mrna" so the rendered modality is explicit.
- Section heading "Vaccine Peptides:" → "Vaccine antigens:" —
  same content, modality-neutral name.
- Manufacturability section (GRAVY, Cys content, N-terminal Q/E/C,
  Asp-Pro bonds) renders by default for ``--vaccine-type=peptide``
  and is suppressed for ``--vaccine-type=mrna`` (those metrics are
  about peptide synthesis; mRNA constructs translate in vivo).
  Explicit ``--include-manufacturability-in-report`` /
  ``--no-manufacturability-in-report`` still override.

Help-text rewrites: every "vaccine peptide report" reference in
``arg_parser.py`` is now "summary report" (or similar) with an
explicit "antigen-centric; same flag for all --vaccine-type modes"
note. Future modalities (DNA, etc.) plug in by extending
``--vaccine-type`` choices and registering their writer in
``_VACCINE_TYPE_DISPATCH``; no rename of any output flag is needed.

Filed openvax/vaxrank#269 for the related but separate work of
HLA-balanced antigen packing across multi-construct mRNA vaccines —
not included here because it changes assembly semantics and needs
its own evaluation against the existing greedy strategy.

677 / 677 tests pass.

* Multi-mode --vaccine-type + --output-dir; canonical filenames per modality

Reworked the vaccine-type / output API to support designing
multiple modalities in one run while keeping the single-mode case
flat:

CLI changes (breaking):

- ``--vaccine-type`` is multi-valued again (``nargs='+'``,
  default ``['peptide']``). Previous single-mode runs still work
  (one value); multi-mode runs accept e.g. ``--vaccine-type
  peptide mrna``.

- Removed ``--vaccine-output``, ``--vaccine-manifest``,
  ``--vaccine-order-form``, ``--vaccine-csv``,
  ``--vaccine-csv-no-full-rows``. The path-companion flags went
  away because they couldn't disambiguate in multi-mode (which
  manifest? peptide's or mrna's?).

- Added ``--output-dir DIR``: always a directory, never a file
  path, no extension-based mode switching. Vaxrank picks
  canonical filenames inside.

- Added ``--mrna-csv-no-full-rows`` (the boolean knob from the
  removed ``--vaccine-csv-no-full-rows``; now lives in the mrna
  knob group).

Output layout:

- Single-mode (one --vaccine-type): canonical files land directly
  in ``--output-dir``.
    peptide → vaccine.fasta + manifest.json + order_form.csv
    mrna    → cds.fasta + no_polyA.fasta + full.fasta +
              manifest.json + layers.csv
- Multi-mode (≥2 types): per-modality subdirs so canonical
  filenames don't collide on ``manifest.json``.
    DIR/peptide/...
    DIR/mrna/...

Internals:

- ``_resolve_vaccine_type`` → ``_resolve_vaccine_types`` (returns
  ordered, deduplicated list). Bare-string callers still tolerated.
- New ``_vaccine_target_dir(output_dir, vtype, all_active)`` —
  shared "where do my files go" helper. Single-mode flat,
  multi-mode subdir.
- ``_emit_outputs`` iterates active types; each writer takes
  ``(args, ranked, target_dir)`` and uses canonical filenames
  inside.
- Manufacturability default tightened: was "on for
  vaccine_type=peptide", now "on iff 'peptide' in active types".
  Mixed peptide+mrna runs include manufacturability (peptide is
  active); mrna-only runs omit it.
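
A sketch of the two helpers as described above. The real
``_resolve_vaccine_types`` takes the parsed args; here the value is
passed directly, and both function names drop the leading underscore
to mark them as illustrations.

```python
from pathlib import Path


def resolve_vaccine_types(value):
    """Normalize to an ordered, deduplicated list; tolerate a bare string."""
    if isinstance(value, str):
        value = [value]
    return list(dict.fromkeys(value))  # first occurrence wins


def vaccine_target_dir(output_dir, vtype, all_active):
    """Flat layout for one active type, per-type subdirectory otherwise."""
    base = Path(output_dir)
    return base if len(all_active) == 1 else base / vtype
```

Every writer asks the same helper where its files go, so two
``manifest.json`` files can never land in the same directory.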

Verified end-to-end on Pt02 LENS:
- ``--vaccine-type peptide --output-dir out/`` → flat layout with
  vaccine.fasta + manifest.json + order_form.csv
- ``--vaccine-type peptide mrna --output-dir out/`` → subdir
  layout: out/peptide/{...} + out/mrna/{...}, both written
  successfully

Test churn: ``_mrna_args`` rebuilt around the new fields;
``_resolve_vaccine_types`` test pins multi-mode + dedup +
first-occurrence order; new ``_vaccine_target_dir`` test pins
the single-vs-multi branching; new
``test_lens_with_multi_vaccine_type_uses_subdirs`` end-to-end
test; old ``test_emit_outputs_skips_writer_when_no_vaccine_output``
renamed to ``..._no_output_dir``. README + arg_parser docstrings
updated. 679 / 679 tests pass.

* Default --vaccine-type to 'peptide mrna'; address /review findings

Default change: ``--vaccine-type`` now defaults to
``['peptide', 'mrna']`` (was ``['peptide']``). Both modalities
share the same ranking, pepsickle annotation, and load work; only
the construct-assembly step differs. Designing both by default
matches what users typically want and leaves single-modality runs
to an explicit ``--vaccine-type peptide`` (or mrna).

/review findings addressed:

#2 Reject existing-file paths and FASTA-like suffixes in
   ``--output-dir`` early via ``check_args`` → new
   ``_reject_output_dir_misuse`` helper. Catches ``--output-dir
   out.fasta`` / ``--output-dir existing_file.txt`` before any
   ranking work happens, with an actionable error message.

#3 Multi-mode partial failures: ``_emit_outputs`` now wraps each
   writer in try/except and logs ERROR + continues, so an mrna
   assembly bug doesn't silently abort a peptide run (or vice
   versa). Other modalities still get attempted; final dispatch
   line records what fired.

#4 Cleaner dispatch log: ``wrote=[]`` instead of ``wrote=['(none)']``.
   Anyone parsing the line gets a real Python list instead of a
   list-with-sentinel.

#5 Manufacturability resolution dedups the shape-tolerance code:
   ``args.manufacturability = 'peptide' in _resolve_vaccine_types(args)``.
   One source of truth for the type-list parsing.

#9 ``--output-csv`` help text now spells out the format duality:
   pipeline path emits per-variant rows; LENS / pVACseq emit
   per-(peptide, allele) rows. Antigen-centric in both cases;
   doesn't change with --vaccine-type.

#10 New ``test_manufacturability_default_on_for_peptide_only``
    pins the modality-aware default for each combination of
    --vaccine-type, plus
    ``test_manufacturability_explicit_flag_overrides_default``
    confirms ``--include-manufacturability-in-report`` /
    ``--no-manufacturability-in-report`` still win.

#7 Test-fixture refactor: ``_mrna_args`` → ``_vaccine_args``
   covering both peptide and mrna knobs (alias for backward-compat
   in tests that already say ``_mrna_args``). The multi-mode test
   collapses from 18 lines of manual attribute injection to one
   call.
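
The partial-failure handling from finding #3, together with the plain
list from finding #4, can be sketched as below. The writer signature
is simplified to a zero-argument callable for illustration; the real
writers take ``(args, ranked, target_dir)``.

```python
import logging

logger = logging.getLogger(__name__)


def emit_outputs(writers, active_types):
    """Run each active writer; log and continue on failure.

    Returns the list of types whose writer completed.
    """
    wrote = []
    for vtype in active_types:
        try:
            writers[vtype]()
        except Exception:
            # Logged at ERROR with traceback; other modalities still run.
            logger.exception("%s vaccine writer failed; continuing", vtype)
        else:
            wrote.append(vtype)
    logger.info("wrote=%r", wrote)  # an empty run logs wrote=[]
    return wrote
```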

End-to-end on Pt02: default ``vaccine_type=['peptide', 'mrna']``
with ``--output-dir vaccines/`` writes vaccines/peptide/{...} and
vaccines/mrna/{...}; without ``--output-dir`` the dispatch line
shows ``wrote=[]`` and only requested analysis reports get written.
681 / 681 tests pass.

* YAML config: vaccine_constructs.{shared,peptide,mrna} with CLI override

New top-level YAML section ``vaccine_constructs`` that drives
construct-assembly knobs. Layout:

    vaccine_constructs:
      shared:    # cross-modality defaults (length bounds, antigen_content, …)
      peptide:   # peptide-specific (mode, linker, n-term acetyl, …)
      mrna:      # mRNA-specific (signal_peptide, UTRs, polyA, junctions, …)

Resolution order (highest precedence first):

    explicit CLI flag  >  per-modality config  >  shared config  >  built-in default
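
The chain above can be sketched with plain dicts and argparse. This is an illustration under assumptions, not the vaxrank code: ``resolve_knob``, the ``--max-constructs`` flag name, and the parser default ``GGGS`` are all hypothetical, and "explicit" is detected by comparing against a snapshot of parser defaults:

```python
import argparse


def resolve_knob(args, parser_defaults, cli_attr, yaml_key, modality_cfg, shared_cfg):
    """Resolve one knob: explicit CLI > per-modality > shared > built-in default."""
    cli_value = getattr(args, cli_attr)
    default = parser_defaults[cli_attr]
    if cli_value != default:
        # Value differs from the snapshot, so treat the flag as user-supplied.
        return cli_value
    for layer in (modality_cfg, shared_cfg):
        if layer.get(yaml_key) is not None:
            return layer[yaml_key]
    return default


parser = argparse.ArgumentParser()
parser.add_argument("--peptide-linker", default="GGGS")  # hypothetical default
parser.add_argument("--max-constructs", type=int, default=1)  # hypothetical flag

# Snapshot of defaults: what parsing yields when no flags are passed.
defaults = vars(parser.parse_args([]))

no_flags = parser.parse_args([])
with_flag = parser.parse_args(["--peptide-linker", "EAAAK"])

from_yaml = resolve_knob(no_flags, defaults, "peptide_linker", "linker",
                         {"linker": "AAY"}, {"linker": "GS"})
from_cli = resolve_knob(with_flag, defaults, "peptide_linker", "linker",
                        {"linker": "AAY"}, {"linker": "GS"})
from_shared = resolve_knob(no_flags, defaults, "max_constructs", "max_constructs",
                           {}, {"max_constructs": 4})
fallback = resolve_knob(no_flags, defaults, "peptide_linker", "linker", {}, {})
```

One limitation of comparing against defaults: a user who explicitly passes the default value is indistinguishable from one who passed nothing, so config would still win in that case.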

Implementation:

- ``vaxrank/config/schema.py``: new ``VaccineConstructsConfigSchema``
  with ``_SharedConstructConfigSchema``, ``_PeptideConstructConfigSchema``,
  ``_MrnaConstructConfigSchema`` subschemas. msgspec validates at
  YAML-load time so unknown keys (typos) error fast. Knobs whose
  defaults differ between modalities (linker,
  antigens_per_construct, max_constructs) live in the per-modality
  subsections; only knobs whose defaults are the same go in shared.

- ``vaxrank/config/loader.py``: new
  ``extract_construct_kwargs(config, modality)`` merges
  ``vaccine_constructs.shared`` with
  ``vaccine_constructs.<modality>``; modality-specific values
  override shared values. None entries are dropped so callers can
  use ``dict.get(key, fallback)`` cleanly.

- ``parse_vaxrank_args`` now snapshots parser defaults into
  ``parsed._parser_defaults`` so downstream code can detect "user
  explicitly passed this flag" by comparing ``args.X`` to the
  snapshot.

- ``_emit_peptide_constructs`` / ``_emit_mrna_constructs`` consume
  YAML config via a small ``cfg(cli_attr, yaml_key)`` closure that
  resolves through the precedence chain. Every CLI flag they take
  is now config-overridable; CLI still wins when explicit.

- ``vaxrank/config/default.yaml``: new section with a complete
  commented-out template showing every knob and the layered
  ``shared`` / ``peptide`` / ``mrna`` structure. Default config is
  still a no-op — uncomment knobs to change behavior.
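
The merge step can be sketched on plain dicts. The real ``extract_construct_kwargs`` takes the loaded config object; ``extract_construct_kwargs_sketch`` below is a hypothetical dict-based stand-in, and it assumes a None in the modality block leaves the shared value in place:

```python
def extract_construct_kwargs_sketch(config, modality):
    """Merge ``shared`` with ``<modality>``; modality wins, None is dropped."""
    section = config.get("vaccine_constructs") or {}
    merged = {}
    # Later layers override earlier ones, so shared goes first.
    for layer in (section.get("shared") or {}, section.get(modality) or {}):
        merged.update({key: value for key, value in layer.items() if value is not None})
    return merged


config = {
    "vaccine_constructs": {
        "shared": {"antigen_content": "full", "linker": None},
        "peptide": {"linker": "AAY", "antigens_per_construct": 3},
        "mrna": {"antigen_content": None},
    }
}

peptide_kwargs = extract_construct_kwargs_sketch(config, "peptide")
mrna_kwargs = extract_construct_kwargs_sketch(config, "mrna")
empty_kwargs = extract_construct_kwargs_sketch({}, "peptide")

# Dropped None entries mean a caller-side fallback applies cleanly.
mrna_linker = mrna_kwargs.get("linker", "fallback-linker")
```

The ``antigen_content`` value ``"full"`` and the fallback string are placeholders; only ``linker: AAY`` and ``antigens_per_construct: 3`` come from the commit message.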

Tests:
- Four new ``test_extract_construct_kwargs_*`` tests pin the merge
  semantics (shared-only, modality-override, None-dropping,
  absent-section).
- ``test_construct_config_overrides_cli_default`` checks end-to-end
  that ``vaccine_constructs.peptide.linker: AAY`` drives the writer
  when the user runs without ``--peptide-linker``, and that CLI
  ``--peptide-linker EAAAK`` overrides the YAML.

Live verification on lens_example.tsv: a config setting
``vaccine_constructs.peptide.linker: AAY`` plus
``antigens_per_construct: 3`` packs TP53 + BRAF + KRAS antigens
into one construct with AAY linkers visible in the FASTA output —
neither value passed on the CLI.

686 / 686 tests pass.

* Multi-mode dispatch: any writer failure aborts the whole run

Reverted the try/except + continue from the prior /review fix.
Partial vaccine output is worse than no vaccine output: ending up
with a peptide FASTA on disk but no mRNA directory (or vice versa)
is the kind of half-state that quietly ships to a downstream
reviewer. ``_emit_outputs`` now lets writer exceptions propagate so
the run fails loud.

The pre-raise log line names which modality crashed and which
already wrote — enough operator context to decide whether to clean
up partial files before retrying. Single-modality runs are
unaffected (one writer; failure already bubbled).
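
A sketch of the fail-loud shape, assuming writers are zero-argument callables keyed by modality. ``emit_outputs_sketch`` and the log wording are hypothetical; the real ``_emit_outputs`` signature is not shown in this message:

```python
import logging

logger = logging.getLogger("dispatch_sketch")


def emit_outputs_sketch(writers):
    """Run each modality writer in order; any failure aborts the whole run."""
    wrote = []
    for modality, write in writers.items():
        try:
            write()
        except Exception:
            # Log operator context, then re-raise instead of continuing.
            logger.error("writer %r failed; already wrote=%r", modality, wrote)
            raise
        wrote.append(modality)
    logger.info("dispatch wrote=%r", wrote)
    return wrote


calls = []


def write_peptide():
    calls.append("peptide")


def write_mrna():
    raise RuntimeError("simulated mRNA writer failure")


try:
    emit_outputs_sketch({"peptide": write_peptide, "mrna": write_mrna})
    aborted = False
except RuntimeError:
    aborted = True
```

The bare ``raise`` keeps the original traceback, so the failure surfaces exactly as the writer raised it while the log line still records which modality had already written.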

686 / 686 tests pass.

* Rename vaccine_constructs.shared → defaults

``shared`` reads weirdly in single-modality runs ("shared with
what?"). ``defaults`` reads cleanly at any --vaccine-type count:
"default values; per-modality blocks override them." Same merge
semantics as before:

    vaccine_constructs:
      defaults:  # values applied to every modality not overriding them
      peptide:   # peptide-specific overrides
      mrna:      # mRNA-specific overrides

Resolution order unchanged:

    explicit CLI flag  >  per-modality config  >  defaults config  >  built-in default

Schema, loader, default.yaml, tests all renamed in lockstep. No
behavior change. 686 / 686 tests pass.

* Flatten vaccine_constructs: cross-modality knobs at section top level

User's point: ``defaults:`` was a misnomer. Those values aren't
"defaults" the user is overriding — they're the actual values for
the run. Drop the subsection wrapper; cross-modality knobs go
directly under ``vaccine_constructs:``. Modality subsections
(``peptide:`` / ``mrna:``) are only for modality-specific overrides
or modality-only knobs.

Before:
    vaccine_constructs:
      defaults:
        max_antigen_length_aa…
Development

Successfully merging this pull request may close these issues.

Pepsickle: optional subprocess isolation to avoid torch / pandas libomp clash on macOS
