[codex] Add pepsickle subprocess isolation by iskandr · Pull Request #201 · openvax/mhctools

iskandr · 2026-05-03T02:14:53Z

Summary

Adds opt-in subprocess isolation for mhctools.Pepsickle via Pepsickle(isolate_subprocess=True). The isolated path sends unique input sequences plus the existing human_only and threshold options to a short-lived Python subprocess over JSON stdin/stdout, so pepsickle can load torch without sharing the parent process's already-loaded OpenMP runtimes.
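A minimal sketch of the JSON stdin/stdout round trip described above, not the code in this PR. The child program, the request keys, and the function name `cleavage_probs_isolated` are hypothetical; the child returns a constant per residue instead of importing pepsickle so the example runs anywhere.

```python
import json
import subprocess
import sys

# Hypothetical child program. A real child would import pepsickle (and thus
# torch) here, in a process that has no other OpenMP runtime loaded.
# This stand-in returns `threshold` for every residue.
CHILD_SOURCE = r"""
import json, sys
request = json.load(sys.stdin)
value = request["threshold"]
result = {seq: [value] * len(seq) for seq in request["sequences"]}
json.dump(result, sys.stdout)
"""


def cleavage_probs_isolated(sequences, human_only=False, threshold=0.5, timeout=60):
    """Send unique sequences to a short-lived child and parse its JSON reply."""
    unique = sorted(set(sequences))  # each sequence is scored once
    payload = json.dumps(
        {"sequences": unique, "human_only": human_only, "threshold": threshold}
    )
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        input=payload,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # a non-zero exit raises CalledProcessError
    )
    return json.loads(completed.stdout)
```

Because the child is a fresh interpreter, nothing the parent has already loaded (pandas, numpy, pyarrow) is present in its address space.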

The processing predictor base now has a cleavage_probs_many hook, used by predict, predict_proteins, and predict_cleavage_sites, so isolated pepsickle calls batch unique sequences into one subprocess invocation while preserving the existing cleavage_probs API.
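One way to picture the batching hook, as a sketch under assumptions rather than the mhctools implementation: the class names below are hypothetical, the scoring rule is a toy, and the 0-based site indices are an illustrative choice.

```python
class ProcessingPredictorSketch:
    """Hypothetical base class: public methods route through one batch hook."""

    def cleavage_probs(self, sequence):
        raise NotImplementedError

    def cleavage_probs_many(self, sequences):
        # Default hook: score each unique sequence with the single-sequence API.
        return {s: self.cleavage_probs(s) for s in dict.fromkeys(sequences)}

    def predict_cleavage_sites(self, sequences, threshold=0.5):
        probs = self.cleavage_probs_many(sequences)  # one hook call per request
        return {
            s: [i for i, p in enumerate(per_pos) if p >= threshold]
            for s, per_pos in probs.items()
        }


class CountingBatchPredictor(ProcessingPredictorSketch):
    """Overrides the hook so a whole batch costs one (simulated) invocation."""

    def __init__(self):
        self.batch_calls = 0

    def cleavage_probs_many(self, sequences):
        self.batch_calls += 1  # stands in for one subprocess launch
        unique = list(dict.fromkeys(sequences))
        # Toy scores: pretend L and K are likely cleavage positions.
        return {s: [0.9 if aa in "LK" else 0.1 for aa in s] for s in unique}

    def cleavage_probs(self, sequence):
        # The single-sequence API is preserved by delegating to the batch hook.
        return self.cleavage_probs_many([sequence])[sequence]
```

With this shape, a subclass that needs isolation only overrides `cleavage_probs_many`; callers of `cleavage_probs` see no change.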

Also bumps mhctools.__version__ from 3.13.2 to 3.13.3.

Fixes #200.

Validation

./lint.sh
./test.sh (397 passed, 9 skipped, 2 xfailed)

…tion

mhctools 3.13.3 (openvax/mhctools#201) ships built-in ``Pepsickle(isolate_subprocess=True)`` so the libomp-clash workaround collapses from a vaxrank-side launcher + bespoke subprocess module to a single kwarg.

Removed:
- ``vaxrank/_pepsickle_subprocess.py`` (the pure-mhctools subprocess body lifted to ``openvax/mhctools/_pepsickle_subprocess.py``).
- ``processing._cleavage_probs_via_subprocess`` and ``_SUBPROCESS_MODULE`` (was launching the deleted module).
- The two unit tests that exercised the local module's I/O contract.

Added:
- ``processing._load_default_predictor(human_only, threshold)``: thin builder that constructs ``Pepsickle(isolate_subprocess=True)`` and degrades to None with a logged warning if mhctools isn't importable. ``annotate_processing`` now resolves a default predictor through this helper and calls ``cleavage_probs`` directly on it (same loop body that the test-seam ``predictor=`` arg has always used).
- ``test_load_default_predictor_returns_none_when_mhctools_missing`` pins the graceful-degrade path.

Bumped:
- ``requirements.txt``: mhctools >=3.13.3 (was >=3.13.1).

End-to-end on Pt02 LENS: 2105/2113 EpitopePredictions annotated across 489 unique sources, the same numbers as before the refactor, confirming the subprocess invocation moved upstream without changing results.

Test count: 674 (was 675; net −1 from removing the two local-subprocess tests + adding one default-predictor smoke).
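The graceful-degrade builder could look roughly like the following. This is a sketch, not vaxrank's actual helper: the `module_name` parameter exists only to make the failure path easy to exercise, and passing `human_only` / `threshold` as keyword arguments with a 0.5 default is an assumption about the constructor.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def load_default_predictor(human_only=False, threshold=0.5, module_name="mhctools"):
    """Return an isolated Pepsickle predictor, or None if it cannot be built."""
    try:
        module = importlib.import_module(module_name)
        # Assumed keyword names; isolate_subprocess is the kwarg added in 3.13.3.
        return module.Pepsickle(
            human_only=human_only,
            threshold=threshold,
            isolate_subprocess=True,
        )
    except Exception as exc:  # import error, old mhctools, bad install, ...
        logger.warning("pepsickle unavailable (%s); skipping annotation", exc)
        logger.debug("predictor load failure", exc_info=True)
        return None
```

A caller treats `None` as "annotation off for this run" and leaves the processing fields unset.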

…otation (#262)

* Pepsickle credibility tagging (#249) + LENS field-format fixes (#259, #260)

#249: Pepsickle credibility tagging
-----------------------------------
mhcflurry-presentation already includes a flank-aware processing prior, so a per-k-mer "presentation score" implicitly captures whether the ligand would be cleaved out and presented. What it can't tell us: whether the proteasome would cut INSIDE the ligand (destroying it before MHC) or whether the C-terminal cut at the ligand's boundary is clean. Pepsickle (Weeder et al., Bioinformatics 2021) gives per-position cleavage probabilities; we use that as a credibility filter on existing MHC ligand predictions, not as a separate score axis.

Each EpitopePrediction gains three optional fields, populated when ``--processing-aware-annotation`` is enabled (default: on):
- c_term_cleavage_prob: prob of clean release at the C-terminus
- max_internal_cut_prob: peak cleavage prob strictly inside the ligand (high → ligand likely destroyed before MHC)
- processing_score: composite c_term × (1 − max_internal); 1.0 ideal

Annotation is purely additive and never alters ranking. Surfaced in HTML / PDF / ASCII per-epitope tables as three extra columns when populated; reports without annotations keep their original 6-column shape.

New module: vaxrank/processing.py
- annotate_processing(predictions, predictor=None): runs pepsickle once per unique source_sequence (not once per peptide); the predictor is loaded lazily and degrades gracefully if unavailable
- _per_position_processing / _composite_processing_score helpers

CLI flag: --processing-aware-annotation / --no-processing-aware-annotation (default ON).
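As a worked illustration of the three fields, here is a small sketch. The function name is hypothetical, and the indexing convention is an assumption: `probs[i]` is read as the probability of a cut immediately after residue `i` of the source sequence, so the C-terminal cut of a peptide is the entry at its last residue.

```python
def processing_fields(probs, offset, length):
    """Return (c_term, max_internal, score) for the peptide at source[offset:offset+length].

    Returns None when the peptide span does not fit inside `probs`.
    """
    end = offset + length
    if offset < 0 or length < 1 or end > len(probs):
        return None
    c_term = probs[end - 1]           # cut that releases the C-terminus
    internal = probs[offset:end - 1]  # cuts strictly inside the ligand
    max_internal = max(internal) if internal else 0.0
    score = c_term * (1.0 - max_internal)  # 1.0 is the ideal value
    return c_term, max_internal, score
```

For per-position probabilities `[0.1, 0.2, 0.05, 0.9, 0.3]` and a 3-residue peptide at offset 1, this gives c_term 0.9, max_internal 0.2, and a composite of 0.9 × 0.8 = 0.72.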
#259 + #260 — LENS field-format bug fixes ------------------------------------------ Surfaced while running on a real LENS v1.9-dev file (Hugo IPRES Pt10): #259: variant_coords parser only handled 4-part chr:pos:ref:alt; real LENS files emit 2-part chr:pos (no ref/alt), NaN (non-SNV antigen rows: splice / fusion / intron retention), and the literal string "nan". Pre-fix: 3000+ warnings, 0 ranked entries. Post-fix: parses all real-LENS forms, NaN rows skipped silently (info-level summary instead of per-row warning), unparseable rows aggregated into one warning. Placeholder ref/alt nucleotides ('A'/'C') used when LENS omits them — varcode rejects 'N' and empty strings. #260: write_neoepitope_report raised ValueError on duplicate (peptide, allele) rows. Real LENS files emit duplicates (same neoepitope from multiple transcripts / homologous regions). The score-merge already broadcasts the same score across duplicate rows correctly; the strict-uniqueness check was overly strict. Replaced with an info-level note documenting the duplicate count. Tests (+19 new) --------------- - test_processing.py: 14 tests covering helpers, annotation, ranking invariance, multi-source one-pass-per-source, off-by-one offset re-location, predictor failure degradation, and CLI flag default. - test_external_input.py: 5 new tests covering the relaxed variant_coords parser (2-part, 3-part, NaN, 'N' rejection) and a regression test against the real-LENS-v1.9 fixture. - test_epitope_io.py: replaced the strict-duplicates rejection test with one that pins score-broadcasting across duplicate rows. Real-world end-to-end: a real Hugo IPRES Pt10 LENS file now produces both --output-csv and --output-neoepitope-report XLSX without errors (was crashing pre-fix). Bumps 2.15.0 → 2.16.0. Subsumes #259, #260. 
* Review fixes: HTML header drift, placeholder-genotype provenance, closest-occurrence re-location, dedup contract, pepsickle CLI passthrough #1 (correctness) HTML/PDF/ASCII per-epitope table header drift -------------------------------------------------------------- Pre-fix: _epitope_data emitted a 6-key dict for unannotated predictions and a 9-key dict for annotated ones. The HTML template builds the header from p.epitopes[0].keys() — if the first epitope happened to be unannotated and a later one was annotated, the header had 6 columns but rows had 9 → malformed table. This was reachable any time pepsickle annotation succeeded for some sources but failed for others (graceful per-source skip is a documented behavior). Fix: _epitope_data takes an ``include_processing`` parameter; the caller computes ``any(c_term is not None for p in mutant_predictions)`` per VaccinePeptide and passes the result. When True, every row gets the 9-column shape (with '—' fallback for unannotated predictions), matching the header. When False, the legacy 6-column shape is returned unchanged so unannotated reports don't change. #2 (provenance) ref_alt_fallback placeholder genotype ----------------------------------------------------- _parse_variant_coords now returns ``(Variant, alleles_real)`` so callers can detect when the genotype is fictional (placeholder ref/alt because LENS only emitted chr:pos). When alleles_real is False the synthesized Variant must NOT be fed to varcode-effect annotation — its genotype isn't real biology. Documented in the docstring + an info-level summary log line counts how many such placeholders were used per LENS run. Construct assembly only uses (chr, pos) and is unaffected. 
#3 (correctness) Re-location picks the occurrence closest to the declared offset, not the first match ------------------------------------------------------------------- When the declared offset of a peptide doesn't match the source at that position, processing.py re-locates by substring search. Pre-fix this used source.find() — always returning the first occurrence even if the declared offset pointed at a later match (e.g. homopolymer tracts with repeated short ligands). New _closest_occurrence helper scans all matches and picks the one with smallest |pos - declared|. A drift > 3 positions emits a warning so upstream loaders with broken offset accounting are flagged. #4 (UX) CLI passthrough for pepsickle parameters ------------------------------------------------- New flags: --pepsickle-human-only (use the human-only-trained model) and --pepsickle-threshold N (cleavage probability threshold). Threaded through annotate_processing → _load_default_predictor. #6 (clarity) Dup-rows log message ---------------------------------- Was "from multi-source LENS / pVACseq input"; pipeline-mode duplicates (if a future loader bug introduces them) would be misattributed. Now "from multi-source input." #7 (debuggability) _load_default_predictor warns + debug-traceback ------------------------------------------------------------------- Pepsickle import / instantiation failures previously logged a single warning line. Now also logger.debug(exc_info=True) so genuine bugs (bad CUDA libs, torch version mismatch) surface at DEBUG without spamming WARNING. #8 (robustness) Dedup contract in _annotate_predictions_with_processing ------------------------------------------------------------------------ Pre-fix dedup keyed on id(p), which assumes the same EpitopePrediction object is reachable from both the ranked-vaccine-peptides intermediate and the LENS predictions list. 
That's true for the current external_input loader but a future loader that copies predictions into VaccinePeptides would break the assumption and double-annotate. Belt-and-suspenders: also dedup on the content tuple (peptide_sequence, allele, source_sequence, offset). Both checks are O(1). Tests (+5 new) -------------- - test_epitope_data_header_consistent_when_some_predictions_unannotated: pins #1 — annotated and unannotated rows produce identical key sets when include_processing=True - test_re_location_picks_closest_to_declared_offset: pins #3 — homopolymer source where declared offset points at the second occurrence; re-location must snap there, not to the first - test_re_location_warns_on_large_offset_drift: pins #3's >3aa warning - test_dedup_by_content_when_duplicate_objects: pins #8 — two distinct Python objects with the same (peptide, allele, source, offset) collapse to a single annotation pass - test_pepsickle_cli_param_passthrough: pins #4 — --pepsickle-human-only / --pepsickle-threshold parse correctly Plus updated tests for the (Variant, alleles_real) tuple return shape of _parse_variant_coords (5 existing tests adapted; one new alleles_real assertion per case). 667 tests pass; 93% coverage. * Stop-codon truncation, placeholder-genotype propagation, parametrized real-LENS CI, more review fixes LENS data quality (real-world Pt02 file from HugoLo IPRES 2016) --------------------------------------------------------------- Pt02 surfaced a third LENS bug: pep_context can contain '*' (stop codon) for stop-loss / readthrough variants. Pre-fix this crashed manufacturability scoring with KeyError: '*'. Fix: new helpers in vaccine_library.py: - truncate_at_stop_codon(aa) — stop at first '*'; everything after doesn't exist as protein. Translation halts at the stop. - has_only_standard_amino_acids(aa) — guard against selenocysteine 'U', pyrrolysine 'O', ambiguous 'X' / 'B' / 'Z' / 'J', etc. 
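The two helpers named above can be sketched in a few lines. This is an illustration of the described behavior, not the code in vaccine_library.py; treating an empty string as non-standard is an assumption made here so that fully truncated sequences get dropped.

```python
# The 20 standard amino acids, one-letter codes.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def truncate_at_stop_codon(aa):
    """Keep only the residues before the first '*'; translation halts there."""
    return aa.split("*", 1)[0]


def has_only_standard_amino_acids(aa):
    """False for U, O, X, B, Z, J, '*', or any other non-standard symbol."""
    return len(aa) > 0 and set(aa) <= STANDARD_AMINO_ACIDS
```

Applying the truncation first and the guard second matches the order described for the LENS loader: `truncate_at_stop_codon("MKT*LLV")` yields `"MKT"`, which then passes the standard-residue check.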
external_input.ranked_from_lens_predictions now truncates pep_context + peptide at first stop and drops rows where the result has any non-standard residue, with a warning. New fixture: tests/data/epitope_fixtures/real_lens_subsets/ lens_v1.9_with_stop_codons.tsv (52 rows from a real Pt02 dump) covers stop-codon, NaN variant_coords, multi-row-per-variant, and 2-part chr:pos shapes in one file. Parametrized end-to-end CI (no more "shouldn't have to run manually") ---------------------------------------------------------------------- test_real_lens_fixture_runs_end_to_end_via_cli is parametrized over every TSV in tests/data/epitope_fixtures/real_lens_subsets/. Each fixture drives main() with --input-lens / --output-csv / --output-neoepitope-report end-to-end. Future LENS shapes get covered by dropping a fixture in that directory; the test grows automatically. test_real_lens_fixtures_present pins the expected set. Review fixes (continuing from PR review) ----------------------------------------- #1 alleles_real propagation. Added MutantProteinFragment.placeholder_alleles (default False). LENS path sets it True when pep_context came from a 2-part chr:pos row (no real ref/alt). Documented that downstream varcode-effect annotation MUST detect this flag rather than silently do the wrong thing on the placeholder genotype. #2 pVACseq tuple consistency. _parse_pvacseq_id now returns (Variant, alleles_real) matching the LENS-path shape. pVACseq IDs always carry real ref/alt so alleles_real is True on success; the symmetric return shape lets future code handle both paths uniformly. #3 Empty peptide_sequence dedup degeneration. _maybe_add now skips content-key dedup for empty-peptide predictions (passes them through; the annotation step skips them on its own). #4 Drift threshold scales with source length: max(3, 5% × len). Typical 25-aa SLP sources keep the absolute threshold; full-protein sources allow more drift before warning. 
#5 annotate_processing warns when caller passes a pre-built predictor AND non-default human_only / threshold (the params are ignored — caller already configured the predictor). #7 Test refactor: monkeypatch fixture replaces manual try/finally in test_dedup_by_content_when_duplicate_objects so cleanup is automatic on test failure. Documentation ------------- README gains a Data Model section explaining VaccinePeptide / MutantProteinFragment / EpitopePrediction structure, the variant → multiple VPs → multiple constructs fan-out, and where each per-VP report renders. Removes the "what's a VP?" reviewer trip-up. Tests (+10 new) --------------- - 4× parametrized end-to-end real-LENS CLI (one per fixture) - test_real_lens_fixtures_present (coverage gate) - test_lens_pep_context_with_stop_codon_truncates - test_truncate_at_stop_codon_helper - test_has_only_standard_amino_acids_helper - test_annotate_processing_warns_when_predictor_and_params_supplied - test_drift_threshold_scales_with_source_length 677 tests pass; 93% coverage. Real Pt02 LENS file produces both CSV + XLSX outputs end-to-end (was crashing pre-fix on stop codons). * LENS: use real snv_*_allele / indel_*_allele columns instead of placeholder ref/alt The previous fix substituted A/C placeholder nucleotides when LENS variant_coords arrived as 2-part chr:pos — but LENS files actually DO carry real ref/alt, just in dedicated per-source columns: SNV rows → snv_ref_allele, snv_alt_allele ('C', '[T]') INDEL rows → indel_ref_allele, indel_alt_allele ('CA', '[C]') The brackets are LENS's notation (multi-allelic-ready, though we've seen only single-allele cases in real files). The dedicated columns are populated for every SNV / INDEL row in the patient datasets we have, so there's no need to invent placeholder genotypes. Inventing a fictional A→C variant was a hack and would have caused varcode- effect annotation downstream to do the wrong thing silently. 
Architecture ------------ - ``_parse_variant_coords(coords)`` now extracts only ``(contig, pos)`` — a focused chr:pos parser. NaN / empty / malformed → None. - New ``_strip_lens_allele(value)`` strips the bracket notation (``'[T]'`` → ``'T'``). - New ``_variant_from_lens_row(row, genome=None)`` is the row-aware helper: extracts (contig, pos) from variant_coords AND looks up the real ref/alt from snv_*_allele or indel_*_allele based on ``antigen_source``. Returns the real-genotype Variant or None. - ``ranked_from_lens_predictions`` now calls ``_variant_from_lens_row`` on the representative row of each variant_coords group; placeholder genotypes are never produced. - ``MutantProteinFragment.placeholder_alleles`` is now always False on the LENS path. Field retained for hypothetical future loaders that genuinely lack alleles. Real Pt02 file (Hugo IPRES 2016) end-to-end -------------------------------------------- Before: 0 ranked entries (parser couldn't fabricate alleles cleanly). After: 409 ranked entries with REAL ref/alt: chr3:150742446 A> SIAH2 (CA → C frameshift, varcode-normalized) chr17:29502684 T> TAOK1 chr6:56459264 G>A DST (missense SNV) Test fixtures ------------- Added snv_ref_allele / snv_alt_allele / indel_ref_allele / indel_alt_allele columns to the toy LENS fixtures (lens_example.tsv, lens_multi_row_per_variant.tsv, lens_v1.4_with_stability.tsv, lens_v1.9_mhcflurry_only.tsv). The 4 real-LENS subset fixtures already had these columns — no changes needed. Tests (+5 new, 4 updated) ------------------------- - test_parse_variant_coords_extracts_chr_pos: pins the new (chr, pos)- only return shape. - test_parse_variant_coords_returns_none_on_missing_or_malformed: NaN / 'nan' / '' / garbage all return None. - test_strip_lens_allele_handles_bracket_form: '[T]' / '[CA]' / 'T' / None / 'nan' edge cases. - test_variant_from_lens_row_uses_real_snv_alleles: SNV path uses snv_ref_allele / snv_alt_allele. 
- test_variant_from_lens_row_uses_real_indel_alleles: INDEL path uses indel_*_allele; pins the varcode normalization (CA → C becomes ref='A' alt='' at start+1). - test_variant_from_lens_row_skips_non_snv_indel: SPLICE / FUSION / CTA-SELF / ERV → None (caller skipped earlier on NaN coords anyway). - test_variant_from_lens_row_skips_when_alleles_missing: defends the hypothetical future case of an SNV row missing its allele columns. Removed: 5 tests that asserted the placeholder-genotype behavior (pre-fix). Replaced with the new shape + real-allele tests. 678 tests pass; 93% coverage. Real Pt02 LENS file produces 409 ranked entries with real biological ref/alt nucleotides. * LENS / pVACseq: stop inventing data — read every field from real columns Audit found six places where vaxrank was fabricating values rather than reading what's actually in the LENS / pVACseq files. Each fix below replaces an invented value with a lookup against the real column. Verified end-to-end on a real Hugo IPRES Pt02 LENS file: ranked entries now carry real ref/alt, gene names, RNA-read counts, transcript IDs, mutation spans, and peptide offsets — no fabrications. Inventions removed ------------------ 1. ``load_lens`` set ``offset=0`` for every prediction. Real fix: compute via ``pep_context.find(peptide)``. (LENS centers the peptide within pep_context — it's not at position 0.) 2. ``_coerce_n_alt_reads(tpm)`` used TPM as a stand-in for alt-read count. TPM ≠ read count — different units, different meanings. Real fix: read ``rna_reads_covering_genomic_origin`` (total at locus) and ``rna_reads_covering_genomic_origin_with_peptide_cds`` (alt-supporting). New helper ``_read_counts_from_lens_row`` does the lookup. ``_coerce_n_alt_reads`` deleted. 3. ``_mut_offsets_in_context`` falsely claimed the entire pep_context was the mutation span when the peptide couldn't be located in it. That told downstream code "every residue is mutated." 
Real fix: return ``(None, None)`` so the caller drops the row instead. 4. ``gene_name = ... or 'unknown'`` (LENS + pVACseq paths). Real fix: preserve empty string when LENS / pVACseq doesn't supply a name — the codebase convention for "not known" — and let the existing ``iter_named_antigens`` fallback handle the display. 5. pVACseq path hardcoded ``n_overlapping_reads=1, n_alt_reads=1, n_ref_reads=0``. Real fix: read ``RNA Depth`` (total coverage) and ``RNA VAF`` (variant allele frequency); compute ``n_alt = round(depth × vaf)``, ``n_ref = depth - alt``. 6. LENS path passed ``supporting_reference_transcripts=[]`` (empty). Real fix: pull ``transcript_id`` and split ``all_transcript_ids_encoding_peptide`` into a real list. Future varcode-effect annotation now has the actual transcript context. Honest "no signal" defaults --------------------------- When a real column is genuinely missing (NaN / None), default to 0 for read counts (honest "no signal") instead of 1 (a fabricated weak signal). Users with LENS files lacking RNA columns will see ``combined_score`` collapse to expression-independent ranking; that's correct behavior — we don't have read data for those rows. Documentation ------------- ``overlaps_mutation=True`` and ``occurs_in_reference=False`` in ``load_lens`` are now documented as STRUCTURAL ASSUMPTIONS implied by the fact that a row exists in a LENS report (LENS pre-curates neoepitopes and pre-filters reference matches). Not invented — the report's existence implies them. Real Pt02 file end-to-end (post-fix) ------------------------------------- 406 ranked entries, each with real fields: chr3:150742446 A>'' SIAH2 reads=181/25 transcripts=1 mut_span=89-98 chr17:29502684 T>'' TAOK1 reads=32/2 transcripts=1 mut_span=59-68 chr6:56459264 G>'A' DST reads=84/21 transcripts=1 mut_span=29-38 Pre-fix offset=0 → post-fix offset=133 (the actual position of the first peptide in its 165-aa pep_context). 
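The read-count arithmetic from item 5, together with the "no signal" default, can be sketched as follows. The function name is hypothetical, and the sketch assumes ``RNA VAF`` arrives as a fraction between 0 and 1; a file that reports percentages would need dividing by 100 first.

```python
import math


def read_counts_from_depth_and_vaf(depth, vaf):
    """Return (n_overlapping_reads, n_alt_reads, n_ref_reads).

    Missing depth or VAF (None / NaN) yields zeros, meaning "no signal",
    rather than a made-up weak signal.
    """
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    if is_missing(depth) or is_missing(vaf):
        return 0, 0, 0
    depth = int(depth)
    n_alt = int(round(depth * vaf))
    n_alt = max(0, min(depth, n_alt))  # keep the split inside [0, depth]
    return depth, n_alt, depth - n_alt
```

For example, a depth of 84 with a VAF of 0.25 splits into 21 alt reads and 63 ref reads.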
Tests ----- - test_mut_offsets_in_context_returns_none_when_not_found replaces test_mut_offsets_in_context_falls_back_to_full_window. Pins the drop-the-row behavior. 678 tests pass; 92% coverage. Nothing in the LENS / pVACseq loaders fabricates a value when the real column exists. * LENS: derive wt_ic50 from mhcflurry_agretopicity (don't leave it None) Audit of the LENS loader found that wt_ic50 was always None, even though LENS does emit ``mhcflurry_agretopicity`` (defined as MT_IC50 / WT_IC50). Real biological wildtype affinity can be recovered as WT_IC50 = MT_IC50 / agretopicity. Computed once per row (shared across detected predictors) and applied only to the mhcflurry-affinity prediction (the value's scale matches mhcflurry's IC50). Other predictors leave wt_ic50 None — the agretopicity is mhcflurry-specific. Real Pt02 LENS file: 786 / 2153 predictions now carry real wt_ic50 (the rest are non-SNV antigens — ERV / SPLICE / FUSION / CTA-SELF — where mhcflurry_agretopicity is NaN). Sample: SPAMIPKDWPL ic50=123.43 wt_ic50=14319.52 (strong neoepitope) KYWHIILGGGR ic50=180.59 wt_ic50=20814.20 (strong neoepitope) GRFYGLDL ic50=179.45 wt_ic50= 179.45 (silent — no improvement) Test: existing toy fixture's mhcflurry_agretopicity=0.020 + IC50=95.4 → wt_ic50 = 4770. Test updated to assert this rather than None. Follow-up issues filed for bigger gaps surfaced in the audit ------------------------------------------------------------- - #263: Surface non-SNV antigens (SPLICE / FUSION / CTA-SELF / ERV) from LENS — currently dropping ~30% of antigens silently because variant_coords is NaN for those antigen sources. - #264: HLA LOH + B2M / TAP / antigen-presentation pathway integrity from LENS — clinically essential info that determines whether predictions for an allele or pathway are even meaningful. - #265: Surface mhcflurry pres_score / pres_perc + proc_score from LENS — currently running pepsickle to compute what LENS already emits, and ignoring presentation score entirely. 
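The wildtype-affinity recovery described in this commit is a single division. A minimal sketch, with a hypothetical function name and with the non-positive-ratio guard added here as an assumption:

```python
import math


def wt_ic50_from_agretopicity(mt_ic50, agretopicity):
    """Recover WT_IC50 = MT_IC50 / agretopicity, or None when it is undefined."""
    for value in (mt_ic50, agretopicity):
        if value is None or math.isnan(value):
            return None  # e.g. non-SNV antigens where the ratio is NaN
    if agretopicity <= 0:
        return None  # a ratio of two IC50 values should be positive
    return mt_ic50 / agretopicity
```

With the toy fixture's values, an IC50 of 95.4 and an agretopicity of 0.020 give a wildtype IC50 of 4770; an agretopicity of 1.0 returns the mutant IC50 unchanged, as in the GRFYGLDL sample.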
678 tests pass; 92% coverage. * LENS: warn on missing essential / important per-row data; correct #265 scope Per-load summary ---------------- load_lens / ranked_from_lens_predictions now emit a per-load summary of missing data so users know what's degraded: WARNING: N / total rows lack pep_context — antigens degenerate to bare neoepitope (~9 aa instead of ~25 aa SLP). INFO: N / total rows lack gene_name (typical for ERV). INFO: N / total rows lack transcript_id (no transcript context for downstream varcode-effect annotation). WARNING: N / total rows lack RNA-read counts — combined-score ranking collapses to epitope-only for those rows. On a real Pt02 LENS file: 337 / 2153 lack gene_name + transcript_id (ERV antigens) 128 / 2153 lack RNA-read counts Issue updates ------------- - #263 (non-SNV antigens) — concrete plan added: fusion via varcode.StructuralVariant (already in varcode 4.x as BND), viral / CTA / ERV / splice as Antigen-subclass types. Land fusion first, then antigen-source tagging on MutantProteinFragment, then full Antigen abstraction. - #264 (HLA LOH + APC pathway) — confirmed vaxrank has zero existing input format for this. Two-channel proposal: (A) extract from LENS rows automatically, (B) new --patient-context JSON/YAML for the VCF/BAM path. Both populate the same PatientInfo fields. - #265 (mhcflurry pres / proc) — corrected: mhcflurry's per-peptide proc_score and pepsickle's per-position cleavage scores are COMPLEMENTARY, not substitutes. Surface both. 678 tests pass; 92% coverage. * Pepsickle: default off (libomp clash on macOS); correct length scope The macOS abort --------------- On macOS, pepsickle's torch dependency ships its own libomp.dylib. By the time vaxrank tries to load pepsickle, pandas / numpy / pyarrow have already loaded a different libomp via OpenBLAS / MKL. 
The second OpenMP runtime to init either aborts the process at init time (loud, OMP Error #15) or — with KMP_DUPLICATE_LIB_OK=TRUE set — segfaults later when the two runtimes collide during inference (silent, exit 139). KMP_DUPLICATE_LIB_OK=TRUE quiets the init-time abort but doesn't prevent the inference-time segfault. So: - Default --processing-aware-annotation flipped to False. - The flag's help text documents the libomp issue and points users at typical Linux installs (single OpenMP) as the green path. - Issue #266 filed for the robust fix (subprocess isolation, or ONNX-runtime alternative). Test pinning the default flipped from on → off. Length scope correction ----------------------- Earlier comment on issue #265 / processing.py docstring said "one number per 9-mer" for mhcflurry's processing score. Wrong: mhcflurry handles MHC-I peptides 8-15 aa, and pepsickle's per- position scoring is length-agnostic. Updated docstring to "one number per peptide (8-15 aa for MHC-I)" and added a footnote on ``c_term × (1 - max_internal)`` being a *conservative approximation* (strict version is ``Π(1 - p_i)``; max heuristic dominates because proteasomes cut roughly once per substrate molecule). Plus the safe-wins fixes from the earlier audit ------------------------------------------------ - LENS wt_ic50 derived from mhcflurry_agretopicity (= MT/WT ratio) for mhcflurry-affinity predictions. 786/2153 predictions in Pt02 now carry real wildtype affinity. - Per-load summary warnings for missing essential / important data: * pep_context absent → SLP collapses to bare neoepitope (warning) * gene_name absent → typical for ERV (info) * transcript_id absent → no transcript context (info) * RNA-read counts absent → combined-score collapses to epitope-only (warning, suggests --combined-score-mode=epitope_only) Pt02 produces 20 peptide constructs end-to-end with default flags; no segfaults, no crashes. 
Issues filed ------------ - #263 (non-SNV antigens): concrete plan with varcode.StructuralVariant for fusions, Antigen-subtype hierarchy for viral / CTA / ERV / splice - #264 (HLA LOH + B2M / TAP integrity): two-channel input-format design (extract from LENS rows automatically + new --patient-context flag for VCF/BAM path) - #265 (mhcflurry pres_score / proc_score): corrected scope — complementary to pepsickle, not a substitute - #266 (libomp clash): subprocess isolation as the robust fix 678 tests pass; 92% coverage. * Pepsickle: run in isolated subprocess; default back to on (#266) Root-cause fix for the macOS libomp clash: torch's bundled libomp collides with the parent's pandas / numpy / pyarrow OpenMP runtime, which manifested as either OMP Error #15 (abort) or a quiet exit-139 segfault during pepsickle inference. Running pepsickle in a fresh Python subprocess (only torch + pepsickle + json imported, no pandas) gives torch a clean libomp load and no clash. processing.py is a clean rewrite: - one linear flow in annotate_processing (group by source → score → apply per-peptide), no parallel paths - _cleavage_probs_via_subprocess returns {} on any failure and never raises; the caller treats missing keys as "no signal" - obsolete in-process _load_default_predictor removed With subprocess isolation in place: - --processing-aware-annotation default flips back to True - the --no-processing-aware-annotation opt-outs added to the b16, pVACseq, LENS, and real-LENS-fixture pytest paths are gone — those workarounds existed only to dodge the OMP segfault under pytest+coverage, which the subprocess no longer triggers - test_processing_aware_annotation_default_on flipped accordingly Verified end-to-end on Pt02: 2105 / 2113 EpitopePredictions annotated across 489 unique source sequences in ~3s, no segfault; 666 / 666 pytest pass with pepsickle on by default. 
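The "returns {} on any failure and never raises" contract can be illustrated with a generic wrapper. This is a sketch of the pattern only; `run_json_worker` and the example commands are hypothetical and are not vaxrank's `_cleavage_probs_via_subprocess`.

```python
import json
import subprocess
import sys


def run_json_worker(command, payload, timeout=30):
    """Run `command`, send `payload` as JSON on stdin, return its JSON dict.

    Any failure (missing executable, non-zero exit, timeout, unparseable
    output, non-dict reply) yields an empty dict instead of an exception.
    """
    try:
        completed = subprocess.run(
            command,
            input=json.dumps(payload),
            capture_output=True,
            text=True,
            timeout=timeout,
            check=True,
        )
        result = json.loads(completed.stdout)
    except (OSError, subprocess.SubprocessError, ValueError):
        return {}
    return result if isinstance(result, dict) else {}


# Example commands for trying the wrapper; both are stand-ins, not pepsickle.
ECHO_LENGTHS = [
    sys.executable, "-c",
    "import json, sys; d = json.load(sys.stdin); "
    "json.dump({s: len(s) for s in d['sequences']}, sys.stdout)",
]
ALWAYS_FAILS = [sys.executable, "-c", "import sys; sys.exit(3)"]
```

The caller then looks up each source sequence with `.get()` and skips annotation for any key that is missing, which is how "no signal" propagates without a second code path.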
* Pepsickle: prefix annotation fields with predictor name

Three columns previously named ``c_term_cleavage_prob``, ``max_internal_cut_prob``, ``processing_score`` are renamed to ``pepsickle_c_term_cleavage_prob``, ``pepsickle_max_internal_cut_prob``, ``pepsickle_processing_score``. Report column headers (HTML / PDF / ASCII / JSON) get the matching ``Pepsickle`` prefix.

A reader looking at a CSV column needed to know, out of band, that those values came from pepsickle; the names didn't say so. With the prefix the predictor is unambiguous in data, log lines, and report headers, and a future per-position cleavage predictor (NetChop, PAProC, …) can land alongside without name collision.

No behavior change beyond the rename; all 666 tests pass with pepsickle on by default.

* Fail fast when no --output-* flag is set (LENS / pVACseq path too)

``check_args`` already validated that at least one primary output path was set, but it ran only on the VCF/BAM pipeline branch; the external-input branch (--input-lens / --input-pvacseq) skipped it. Result: a user could run a long LENS import to completion and end up with nothing on disk, the only signal a quiet ``wrote=['(none)']`` log line late in the run.

Hoist the check above the input-dispatch branch so it fires for both paths, before any input loading or prediction runs. Replace the ad-hoc string-concatenated error with a table-driven message that lists each --output-* flag and its purpose, sourced from a single ``_PRIMARY_OUTPUT_FLAGS`` tuple so the help text and the validation list can't drift apart.

* Pepsickle: defer per-peptide arithmetic to mhctools (closes #267)

vaxrank's ``_per_position_processing`` and ``_composite_processing_score`` were reimplementing ``ProcessingPredictor.c_term_prob`` / ``max_internal_prob`` and ``score_cterm_anti_max_internal``, which already live in mhctools and are byte-identical to the hand-rolled versions. Drop the local helpers and call the mhctools ones directly.

Net result: ~30 lines of arithmetic gone from vaxrank, single source of truth in mhctools, and a future scoring fix or length-edge-case tweak in mhctools flows here automatically. The unit tests for the two helpers go away with them; that arithmetic is mhctools' responsibility to test, not vaxrank's. One new test pins our local out-of-range guard around the slice helpers (peptide span past source → ``(None, None)`` so the caller drops the row).

Subprocess isolation for the macOS libomp clash stays here for now; ``mhctools.Pepsickle`` doesn't yet offer it built-in. Tracked in openvax/mhctools#200; once that ships, the subprocess wrapper here collapses to ``Pepsickle(..., isolate_subprocess=True)``. Imports of ``mhctools.processing_predictor`` at module level pull in only the wrapper module, so torch stays unloaded in the parent and subprocess isolation still works as designed.

* Reject pipeline-only reports on --input-lens / --input-pvacseq path

Passing ``--output-ascii-report`` (or ``--html`` / ``--pdf``) with ``--input-lens`` ran to completion and silently wrote nothing: the template-report block in ``main()`` was gated on ``source == 'pipeline'`` and short-circuited on the external path.

The deeper problem is that ``TemplateDataCreator`` walks ``mutant_protein_fragment.predicted_effect()``, which calls ``variant.effect_on_transcript`` and expects pyensembl ``Transcript`` objects; LENS / pVACseq carry transcript IDs as strings only. Variant-counting metadata (somatic / coding / RNA-supported) is also only produced from VCF + BAM.

Making the external path produce template reports is its own structural project (filed as #268). For this PR, just reject the combination in ``check_args`` with an explicit message that points the user at the outputs that ARE reachable from external input (``--output-csv`` / ``--output-xlsx-report`` / ``--output-neoepitope-report`` / ``--output-json-file``). Pinned by ``test_lens_cli_errors_when_paired_with_pipeline_only_report``.
* Reference vaxrank#268 from check_args docstring * Unify vaccine output flags: --vaccine-type single-valued + --vaccine-output The old design used a multi-valued ``--vaccine-type peptide mrna`` plus separate ``--output-peptide`` / ``--output-mrna`` flags whose names looked like data-format outputs (peer to ``--output-csv``) but actually drove vaccine-construct *design*. That mismatch made the flags easy to miss — passing ``--input-lens FILE`` alone left ``vaccine_type=['peptide']`` active but no peptide writer fired, producing a quiet ``wrote=['(none)']`` log line and zero files on disk. CLI rework — one mode per run: - ``--vaccine-type {peptide,mrna}`` — single-valued, default ``peptide`` - ``--vaccine-output PATH`` — destination, interpreted by the active type (peptide → FASTA file, mrna → directory) - ``--vaccine-manifest`` — JSON manifest, shared schema across types - ``--vaccine-order-form`` — peptide-only; ``check_args`` errors if --vaccine-type=mrna - ``--vaccine-csv`` / ``--vaccine-csv-no-full-rows`` — mrna-only; ``check_args`` errors if --vaccine-type=peptide Removed: ``--output-peptide``, ``--output-peptide-manifest``, ``--output-peptide-order-form``, ``--output-mrna``, ``--output-mrna-manifest``, ``--output-mrna-csv``, ``--output-mrna-csv-no-full-rows``. Internal cleanup: ``_resolve_vaccine_types`` (returning a set) → ``_resolve_vaccine_type`` (returning a string); the dispatch-table loop in ``_emit_outputs`` collapses to a single lookup; the mismatched-flag warning machinery goes away because the new companion flags are statically tied to one type. 
Verified end-to-end on Pt02 LENS: - ``--vaccine-type peptide --vaccine-output peptides.fasta`` → 20 SLPs written - ``--vaccine-type mrna --vaccine-output mrna_dir/`` → cds / no_polyA / full FASTAs written Test churn: ``test_resolve_vaccine_types_*`` collapses to one single-value test; ``test_emit_outputs_warns_when_output_path_lacks_matching_type`` and ``test_output_path_without_matching_type_does_not_write`` are no longer applicable (companion-flag mismatches are now hard errors); ``_mrna_args`` test helper centralizes the mRNA-design defaults so individual tests stay focused. 661 tests pass. * Make ASCII / HTML / PDF reports work on --input-lens / --input-pvacseq (closes #268) LENS and pVACseq aggregates carry every field the template reports need — ``transcript_id`` (per-row Ensembl ID), ``variant_coords`` + ref/alt allele columns, ``rna_reads_covering_genomic_origin``, ``gene_name``. Earlier the external-input path stored those as strings on ``MutantProteinFragment.supporting_reference_transcripts`` and the template renderer crashed deep inside ``variant.effect_on_transcript`` (TypeError: expected Transcript, got str), so the reports were short-circuited with ``source != 'pipeline'`` and ``check_args`` rejected the flags. This change resolves the data so reports render end-to-end: - ``_resolve_transcripts`` (vaxrank/external_input.py) — turns Ensembl-ID strings into pyensembl ``Transcript`` objects via the ``--ensembl-release``-configured genome. Strips version suffixes (``ENST00000312960.4`` → ``ENST00000312960``) since pyensembl 2.x doesn't auto-strip. Unresolvable IDs are dropped silently (DEBUG-logged) so a release mismatch downgrades rather than crashes. - ``_parse_variant_coords`` strips the ``chr`` prefix LENS emits. Without this the variant short_description rendered as ``chrchr3 …`` and ``effect_on_transcript`` found no transcripts (pyensembl uses bare contigs). 
- ``_patient_info_from_external`` synthesizes a ``PatientInfo`` from the loaded ranked output: ``num_somatic_variants`` = unique-variants-with-antigens, ``num_coding_effect_variants`` = unique-variants-with-resolved-Transcript, ``num_variants_with_rna_support`` = unique-variants-with-non-zero reads. ``load_external_ranked`` now returns a 4-tuple including this. - ``MutantProteinFragment.predicted_effect()`` returns ``None`` when no transcripts resolve OR when varcode rejects every transcript (e.g. ``ReferenceMismatchError`` when LENS / pVACseq was called against a different reference than the configured pyensembl release). Empty list → None and a list of failures → None — the template renderer tolerates both. - ``TemplateDataCreator._effect_data`` / ``_query_cancer_hotspots`` / ``compute_template_data`` render ``"—"`` placeholders when the effect is None. - ``external_input_arg_parser`` now exposes ``--ensembl-release`` and pulls in ``add_optional_output_args`` / ``add_supplemental_report_args`` so the namespace shape matches the pipeline parser. - The ``check_args`` rejection for "pipeline-only reports on LENS" is gone — those reports work now. - ``main()`` drops the ``source != 'pipeline'`` short-circuit; template reports run on both paths through one code path. - pVACseq's "Best Transcript" column is now consumed (was hardcoded to []). Verified end-to-end on Pt02 LENS (HugoLo IPRES 2016, 2153 predictions / 489 source proteins) with ``--ensembl-release 75``: ASCII report writes, 384 / 406 variants render with full effect annotation (Effect type / Transcript name / ID / description); the remaining 22 are reference-allele mismatches between LENS's upstream caller and pyensembl 75's reference. Test churn: ``test_parse_variant_coords_extracts_chr_pos`` updated to assert the bare-contig form; ``_variant_from_lens_row`` SNV / INDEL tests likewise; new ``test_lens_cli_writes_ascii_report`` pins the end-to-end behavior. 
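The two normalisation steps from this commit (stripping the Ensembl version suffix and the ``chr`` contig prefix) can be sketched as below. The function names are hypothetical, and `lookup` is a stand-in callable rather than a pyensembl object; treating `ValueError` and `KeyError` as the not-found signal is an assumption about the lookup.

```python
def strip_version_suffix(transcript_id: str) -> str:
    """'ENST00000312960.4' -> 'ENST00000312960'; unversioned IDs pass through."""
    return transcript_id.split(".", 1)[0]


def strip_chr_prefix(contig: str) -> str:
    """'chr3' -> '3'; bare contigs pass through unchanged."""
    return contig[3:] if contig.startswith("chr") else contig


def resolve_ids(transcript_ids, lookup):
    """Map ID strings to objects via `lookup`, dropping IDs that do not resolve.

    `lookup` is any callable that raises ValueError or KeyError for unknown
    IDs (an assumed stand-in for a genome's transcript lookup).
    """
    resolved = []
    for raw in transcript_ids:
        try:
            resolved.append(lookup(strip_version_suffix(raw)))
        except (ValueError, KeyError):
            continue  # unresolvable ID: degrade instead of crashing
    return resolved
```

With a plain dict as the lookup table, `resolve_ids(ids, table.__getitem__)` keeps only the IDs present in the table, which mirrors the "release mismatch downgrades rather than crashes" behaviour described above.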
* Review fixes: stale docstring, specific exceptions, dedup, test gaps Address every blocking finding from the /review on PR #262: 1. Stale docstring in ``_resolve_transcripts`` claimed pyensembl strips version suffixes internally; that's exactly what we discovered it *doesn't* do. Corrected to match the actual strip-then-lookup behavior. 2. Replace bare ``except Exception`` with specific exception types: - ``_resolve_transcripts``: ``(ValueError, KeyError)`` — pyensembl's not-found shape. - ``MutantProteinFragment.predicted_effect``: lazy-import ``varcode.errors.ReferenceMismatchError`` and catch ``(ReferenceMismatchError, ValueError, KeyError)``. Other exceptions propagate so genuine bugs aren't swallowed. 3. Aggregate INFO/WARN summary for transcript-resolution failures in ``ranked_from_lens_predictions`` and ``ranked_from_pvacseq_predictions``: a single end-of-load WARN covers (a) "had IDs but no --ensembl-release" and (b) "had IDs but release mismatch — N/M didn't resolve." The per-row debug logs stay so the individual IDs are still inspectable. End-to-end on Pt02 LENS without ``--ensembl-release`` now emits one clear WARN pointing the user at the fix instead of silently rendering "—" everywhere. 4. ``_KNOWN_VACCINE_TYPES`` set deleted; ``_resolve_vaccine_type`` now derives known types from ``_VACCINE_TYPE_DISPATCH`` directly. Adding a new vaccine writer is one registration, not two. 5. Dead intermediate (``variants = [v for v, _ in ranked]; n_total = len(variants)``) replaced with ``len(ranked)``. ``num_variants_with_vaccine_peptides`` rolled into the same loop instead of a duplicate ``sum(...)``. Comment pinned that the "first VP carries the transcript" assumption holds because LENS / pVACseq loaders emit one VP per variant — flag if that ever changes. 6. 
``_cleavage_probs_via_subprocess`` docstring now notes the ~1–2s per-call torch-import cost (one launch per ``annotate_processing``, batched across all sources) and points at openvax/mhctools#200 for the upstream fix. 7. New test coverage: - ``_resolve_transcripts``: version-suffix stripping, drop-on- unresolvable, no-genome short-circuit (3 tests, ``_StubGenome`` fixture). - ``_patient_info_from_external``: proxy-count semantics + empty- ranked path. - ``MutantProteinFragment.predicted_effect``: returns None on empty transcripts AND when varcode raises ``ReferenceMismatchError`` for every transcript. - ``--ensembl-release`` reachable through the external-input parser (regression guard). 669 tests pass (was 661; +8 from this commit). * pVACseq: strip chr prefix in _parse_pvacseq_id (matches LENS fix) The earlier ``chr``-prefix fix on the LENS path was missing its symmetric pVACseq counterpart. ``_parse_pvacseq_id`` builds Variants with ``normalize_contig_names=False``, so a pVACseq ID like ``chr1-100000-100001-A-T`` would land on the Variant as ``contig='chr1'`` and silently break ``effect_on_transcript`` against pyensembl (which uses bare contigs). Same one-liner strip as ``_parse_variant_coords``; tests updated to assert the bare-contig invariant on both parsers. Also fix the ``check_args`` error wording — it claimed the listed flags were "--output-* flags," but the list now includes ``--vaccine-output`` which doesn't carry the prefix. 669 / 669 pass. * Pre-flight ensembl-release hint, subprocess module, README rewrite, 2.17.0 Round-tripping the /review output: 1. Subprocess body extracted from inline ``_SUBPROCESS_SCRIPT`` into importable ``vaxrank/_pepsickle_subprocess.py`` with a ``main()`` entry. 
The body is unchanged (still pure ``mhctools.Pepsickle`` I/O), but it's now testable without spawning a subprocess — ``test_pepsickle_subprocess_main_io_contract`` exercises the JSON in/out contract via patched mhctools, and ``test_pepsickle_subprocess_main_returns_2_when_mhctools_missing`` pins the import-error exit-code path. Wrapper exists only because openvax/mhctools#200 hasn't shipped subprocess isolation yet. 2. Ensembl-release inference + pre-flight hint. New ``infer_genome_build_from_lens`` reads the LENS file's ``origin_descriptor`` column and returns ``GRCh37`` / ``GRCh38`` / None based on Hsap37 / Hsap38 markers (LENS ERV-row format). ``check_args`` calls it at startup when external input is paired with template-report flags but no ``--ensembl-release`` is set, and emits a single WARN that names the inferred build and a plausible release. Verified live on Pt02: WARN now fires before any LENS load, naming GRCh38 + release 102 specifically. 3. Doc + version updates: - README.md: every ``--output-peptide`` / ``--output-mrna`` example replaced with the new ``--vaccine-output`` form; ``--vaccine-type`` documented as single-valued (was multi-valued); LENS-driven full-report example added with ``--ensembl-release``. - vaxrank/version.py: 2.16.0 → 2.17.0 (additive features + breaking CLI: removed --output-peptide / --output-mrna / --output-peptide-manifest / --output-peptide-order-form / --output-mrna-manifest / --output-mrna-csv / --output-mrna-csv-no-full-rows). 4. New tests: - subprocess I/O contract (2 tests, no pepsickle dependency). - ``infer_genome_build_from_lens``: GRCh38 / GRCh37 / unknown (3 tests). - ``args.genome`` plumbing on the external path: ``parse + _resolve_ensembl_release → args.genome is EnsemblRelease(75)``. 675 / 675 tests pass. 
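A rough sketch of the JSON stdin/stdout contract that the subprocess wrapper relies on, assuming one short-lived child interpreter per batch. `WORKER_SOURCE` and `cleavage_probs_many_isolated` are hypothetical names, and the worker here returns a constant per residue instead of importing a real predictor, so this illustrates only the isolation and batching pattern, not the mhctools implementation.

```python
import json
import subprocess
import sys

# Stand-in worker body. A real worker would import the predictor here so that
# heavy native libraries load only in the child process; this toy version
# returns a constant probability per residue.
WORKER_SOURCE = r"""
import json, sys
request = json.load(sys.stdin)
result = {seq: [0.5] * len(seq) for seq in request["sequences"]}
json.dump(result, sys.stdout)
"""


def cleavage_probs_many_isolated(sequences, timeout=60):
    """Score every unique sequence in one short-lived child interpreter."""
    unique = list(dict.fromkeys(sequences))  # dedupe, keep first-seen order
    completed = subprocess.run(
        [sys.executable, "-c", WORKER_SOURCE],
        input=json.dumps({"sequences": unique}),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # a non-zero exit raises CalledProcessError
    )
    by_sequence = json.loads(completed.stdout)
    # Fan the per-sequence results back out to the caller's original order.
    return [by_sequence[seq] for seq in sequences]
```

Deduplicating before the call means repeated source sequences cost one prediction each, and the interpreter start-up cost is paid once per batch rather than once per sequence.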
* Pepsickle: drop in-tree subprocess wrapper; use mhctools 3.13.3 isolation

mhctools 3.13.3 (openvax/mhctools#201) ships built-in ``Pepsickle(isolate_subprocess=True)`` so the libomp-clash workaround collapses from a vaxrank-side launcher + bespoke subprocess module to a single kwarg.

Removed:
- ``vaxrank/_pepsickle_subprocess.py`` (the pure-mhctools subprocess body lifted to ``openvax/mhctools/_pepsickle_subprocess.py``).
- ``processing._cleavage_probs_via_subprocess`` and ``_SUBPROCESS_MODULE`` (was launching the deleted module).
- The two unit tests that exercised the local module's I/O contract.

Added:
- ``processing._load_default_predictor(human_only, threshold)``: thin builder that constructs ``Pepsickle(isolate_subprocess=True)`` and degrades to None with a logged warning if mhctools isn't importable. ``annotate_processing`` now resolves a default predictor through this helper and calls ``cleavage_probs`` directly on it (same loop body that the test-seam ``predictor=`` arg has always used).
- ``test_load_default_predictor_returns_none_when_mhctools_missing`` pins the graceful-degrade path.

Bumped:
- ``requirements.txt``: mhctools >=3.13.3 (was >=3.13.1).

End-to-end on Pt02 LENS: 2105/2113 EpitopePredictions annotated across 489 unique sources, the same numbers as before the refactor, confirming the subprocess invocation moved upstream without changing results.

Test count: 674 (was 675; net −1 from removing the two local-subprocess tests + adding one default-predictor smoke).

* Fix pepsickle perf cliff; promote ensembl-release WARN to pre-flight error; verify antigen kinds in skip logs

Three independent improvements that all came out of running PR #262 on real Pt02 data with auto-mode logging.

1. Perf cliff: ``annotate_processing`` was calling ``predictor.cleavage_probs(source)`` per unique source.
Under ``Pepsickle(isolate_subprocess=True)`` that's a fresh subprocess PER call (mhctools 3.13.3's ``cleavage_probs(s)`` delegates to ``cleavage_probs_many([s])`` internally) — ~1-2s startup × 489 sources turned a 30s run into ~15 minutes "stuck." Switch to ``cleavage_probs_many`` for one batched subprocess invocation. End-to-end on Pt02: 3.2s total (was effectively hung). Stub predictors that only implement ``cleavage_probs`` still work via the fallback per-source loop. 2. ``--ensembl-release`` requirement promoted from late warning to early error. External input + template-report flag (ASCII / HTML / PDF) without ``--ensembl-release`` would silently render a degraded report with empty effect annotations everywhere. ``check_args`` now raises ValueError pre-flight with the build-inferred release suggestion (``--ensembl-release 102`` when ``origin_descriptor`` says GRCh38, etc.). The redundant loader-level WARN is gone — the release-mismatch case is still logged from the loader since pre-flight can't predict resolution success. 3. Empty-coords / no-gene / no-transcript skip logs now show the actual ``antigen_source`` breakdown instead of asserting "typical for X". Pt02 reveals the empty-coords skip is genuinely composed of CTA/SELF=276, ERV=209, SPLICE=126, FUSION=2 — every kind that's expected to lack genome coords. When SNV / INDEL rows lack these fields, a separate WARN surfaces it as a likely upstream LENS bug rather than burying it in the breakdown. 4. ``annotate_processing`` now emits a single aggregate WARN summarizing peptides skipped because they aren't substrings of their pep_context source (real LENS-data anomaly: 8/2113 on Pt02 — peptide and pep_context came from different isoform/annotation snapshots; see prior analysis). The WARN includes a representative (peptide, pep_context) pair the user can paste into an upstream bug report. 
Test count: 676 (was 675; +2 new tests for the early-error and peptide-not-in-context paths, −1 fixed test that pinned the now- deleted late warning). * Pre-flight ensembl-release suggestion: list installed releases instead of hardcoding 102 Previously the pre-flight error message hardcoded a suggestion of ``--ensembl-release 102`` for GRCh38 inputs. That number was arbitrary — picked once during development as "stable mid-2020 default" — and ``origin_descriptor`` only tells us the *build* (GRCh37 vs GRCh38), not the release. GRCh38 spans Ensembl 76 through ~114+. Replace the hardcoded mapping with ``installed_ensembl_releases_for_build``, which walks the pyensembl cache (``<platformdirs>/pyensembl/<build>/ensembl<N>``) to enumerate releases the user actually has on disk. The pre-flight error now lists those releases and suggests the latest, with an explicit "origin_descriptor doesn't pin a specific release" caveat so the user knows the suggestion is best-effort, not derived from the file. Three branches: - Local cache has releases for this build → list them, suggest the latest as a quick-start. - Cache empty AND build is GRCh37 → 75 is canonical (final mainline release); suggest installing it. - Cache empty AND build is GRCh38 → no canonical release; acknowledge that and ask the user to pick one matching the LENS file's source build, with an install command. Live verification on Pt02 (29 locally-installed GRCh38 releases): the message now says "you have these GRCh38 releases installed: [77, 78, ..., 113, 114]; try --ensembl-release 114" — concrete, correct, and points at something the user can actually run without first downloading data. 677 / 677 tests pass; +1 test pinning the cache-walk helper. * LENS report quality: inputs section, DNA VAF, dedup epitopes, SLP-windowed fragment Four fixes from running the LENS-driven ASCII report on real Pt02: 1. 
**Inputs section**: ``PatientInfo`` gains an ``inputs: list[(label, path)]`` field rendered verbatim by the report builder. The external (LENS / pVACseq) path now labels its file "LENS report: …" or "pVACseq report: …" instead of misleadingly stuffing it into ``vcf_paths`` and rendering as "VCF (somatic variants) path(s)". The legacy ``vcf_paths``/``bam_path`` fields stay for backward compat but are no longer overloaded by external paths. 2. **DNA VAF**: read LENS's ``vaf`` column (sits between ``mhcflurry_agretopicity`` and the DNA-clonal columns ``totcopynum`` / ``multiplicity`` / ``ccf``; tracks DNA VAF on real Pt02 data — 0.141 vs computed RNA VAF 0.138). Plumb through ``dna_vaf_by_variant`` so the report's "DNA VAF" line populates instead of "n/a". pVACseq path likewise reads its explicit ``DNA VAF`` column (separate from ``RNA VAF``). ``ranked_from_*_predictions`` now returns ``(ranked, dna_vaf_by_variant)`` so the dispatcher can thread it through. 3. **Dedup epitope rows**: real LENS files have multiple rows per ``(peptide, allele)`` when the same peptide is encoded by several transcripts (~5% of Pt02 rows). The loader was emitting one ``EpitopePrediction`` per row and ``ranked_from_lens_predictions`` ``extend``ed them all into the VP, producing duplicate rows in the per-VP epitope table (one annotated by pepsickle, one not, because annotation iterates by id()). Dedup in the loader by a content key (peptide, allele, ic50, percentile_rank, prediction_method_name) before assembling the VP. 4. **Vaccine peptide too long → SLP window**: the LENS path was using ``pep_context`` directly as ``MutantProteinFragment.amino_acids``, but LENS sometimes emits a 100+ aa protein-prefix as pep_context. Vaccine peptides rendered as 100+ aa instead of the canonical 25mer. Add ``MutantProteinFragment.slp_window_around_mutation(...)`` — centers a target-length window on the mutation span — and use it from the LENS loader. 
The pipeline path gets correctly-sized fragments by construction (Isovar emits exactly the requested width); the LENS path now lands on the same shape via this shared helper instead of inheriting LENS's variable pep_context length. ``--vaccine-peptide-length`` (default 25) threads through ``load_external_ranked``. End-to-end on Pt02 (``--ensembl-release 113``): - Patient header: "LENS report: …" - DNA VAF: 0.141, RNA VAF: 0.138 (distinct) - Vaccine peptide: 25aa (was 103aa) - One row per (peptide, allele) in the epitope table 677 / 677 tests pass. * PatientInfo unification: pipeline path also uses inputs; LENS infers MHC alleles Two improvements that share more code between VCF/BAM and external (LENS / pVACseq) paths: 1. Pipeline path now populates ``PatientInfo.inputs`` with ``[("VCF (somatic variants)", path), ("BAM (RNAseq reads)", bam)]`` so both code paths feed the same renderer with the same shape. The legacy ``vcf_paths`` / ``bam_path`` fields stay populated for backward-compat with previously-saved JSON; new code reads ``inputs`` exclusively. 2. ``_patient_info_from_external`` now takes ``predictions`` and infers the unique MHC alleles from them — LENS / pVACseq don't carry an explicit alleles header, but the alleles are implicit in the per-row predictions. The patient-info block now shows them instead of leaving "MHC alleles:" blank, with an explicit "(inferred from report)" suffix so the reader knows the source. 3. Dropped the "— origin_descriptor doesn't pin a specific release" qualifier from the pre-flight error per user feedback; the "(or whichever matches the build LENS used)" is sufficient. 677 / 677 tests pass. * Report quality: name predictor, processing-agnostic columns, geometric-mean composite, modality-aware manufacturability Five report-quality changes from real-Pt02 review: 1. **MHC predictor visible per row**: new ``Predictor`` column in the per-VP epitope table (mhcflurry / netmhcpan / etc.). 
LENS files can be multi-predictor; pipeline runs are usually single-predictor but consistent rendering doesn't hurt. 2. **Score column header named for clarity**: "Score" → "Score (affinity, logistic IC50)". The score is a logistic transform of IC50, not mhcflurry's presentation_score / EL — vaxrank computes it locally so it's comparable across predictors. The longer header eliminates the ambiguity. 3. **Predictor-agnostic processing columns**: column headers "Pepsickle C-term cut" / "Pepsickle max internal cut" / "Pepsickle processing score" → "Processing: C-term" / "Processing: max internal" / "Processing: combined". The underlying field names stay ``pepsickle_*`` so a future per- position predictor (NetChop, PAProC) can land alongside without collision; the active predictor is named once in the patient-info header as "Processing predictor: pepsickle". 4. **Composite score → geometric mean**: ``pepsickle_processing_score`` was ``c_term * (1 - max_internal)`` (mhctools' canonical ``score_cterm_anti_max_internal``). Switch to ``sqrt(c_term * (1 - max_internal))`` — the geometric mean of the two factors. Less-aggressive penalty when both factors are mid-range; a row with c_term=0.6 and (1 - max_internal)=0.6 now scores ~0.6 instead of 0.36, which matches how downstream readers interpret the combined as a "credibility tag" rather than a strict joint probability. Pt02 example: c_term=0.84, max_internal=0.38 → was 0.52, now 0.72. 5. **Modality-aware manufacturability default**: the GRAVY / Cys- content / N-terminal-Q manufacturability section in template reports defaults on for ``--vaccine-type=peptide`` (relevant to peptide synthesis) and off for ``--vaccine-type=mrna`` (those features don't apply to mRNA constructs). Explicit ``--include-manufacturability-in-report`` / ``--no-manufacturability-in-report`` still override. 677 / 677 tests pass. 
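The change from the plain product to the geometric mean in item 4 can be checked with a few lines of arithmetic. This is only a sketch of the two formulas as stated in the commit message; the function names are hypothetical.

```python
import math


def product_score(c_term: float, max_internal: float) -> float:
    """Earlier composite: c_term * (1 - max_internal)."""
    return c_term * (1.0 - max_internal)


def geometric_mean_score(c_term: float, max_internal: float) -> float:
    """Newer composite: sqrt(c_term * (1 - max_internal))."""
    return math.sqrt(c_term * (1.0 - max_internal))


# The Pt02 example from the commit message: c_term=0.84, max_internal=0.38.
print(round(product_score(0.84, 0.38), 2))         # 0.52
print(round(geometric_mean_score(0.84, 0.38), 2))  # 0.72
```

Because both factors lie in [0, 1], the square root can only raise the score or leave it unchanged, which is why mid-range rows are penalised less under the geometric mean.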
* Generic, modality-aware reports: drop peptide-coded language; surface vaccine type Cleanup of the report-generation surface so the same flags (``--output-ascii-report`` / ``--output-html-report`` / ``--output-pdf-report`` / ``--output-csv`` / ``--output-xlsx-report`` / ``--output-json-file`` / ``--output-neoepitope-report``) work across every ``--vaccine-type`` and adapt their content to the active modality, rather than always rendering peptide-specific material. Visible changes on Pt02 LENS: - Patient header now opens with "Vaccine type: peptide" / "Vaccine type: mrna" so the rendered modality is explicit. - Section heading "Vaccine Peptides:" → "Vaccine antigens:" — same content, modality-neutral name. - Manufacturability section (GRAVY, Cys content, N-terminal Q/E/C, Asp-Pro bonds) renders by default for ``--vaccine-type=peptide`` and is suppressed for ``--vaccine-type=mrna`` (those metrics are about peptide synthesis; mRNA constructs translate in vivo). Explicit ``--include-manufacturability-in-report`` / ``--no-manufacturability-in-report`` still override. Help-text rewrites: every "vaccine peptide report" reference in ``arg_parser.py`` is now "summary report" (or similar) with an explicit "antigen-centric; same flag for all --vaccine-type modes" note. Future modalities (DNA, etc.) plug in by extending ``--vaccine-type`` choices and registering their writer in ``_VACCINE_TYPE_DISPATCH``; no rename of any output flag is needed. Filed openvax/vaxrank#269 for the related but separate work of HLA-balanced antigen packing across multi-construct mRNA vaccines — not included here because it changes assembly semantics and needs its own evaluation against the existing greedy strategy. 677 / 677 tests pass. 
* Multi-mode --vaccine-type + --output-dir; canonical filenames per modality Reworked the vaccine-type / output API to support designing multiple modalities in one run while keeping the single-mode case flat: CLI changes (breaking): - ``--vaccine-type`` is multi-valued again (``nargs='+'``, default ``['peptide']``). Previous single-mode runs still work (one value); multi-mode runs accept e.g. ``--vaccine-type peptide mrna``. - Removed ``--vaccine-output``, ``--vaccine-manifest``, ``--vaccine-order-form``, ``--vaccine-csv``, ``--vaccine-csv-no-full-rows``. The path-companion flags went away because they couldn't disambiguate in multi-mode (which manifest? peptide's or mrna's?). - Added ``--output-dir DIR``: always a directory, never a file path, no extension-based mode switching. Vaxrank picks canonical filenames inside. - Added ``--mrna-csv-no-full-rows`` (the boolean knob from the removed ``--vaccine-csv-no-full-rows``; now lives in the mrna knob group). Output layout: - Single-mode (one --vaccine-type): canonical files land directly in ``--output-dir``. peptide → vaccine.fasta + manifest.json + order_form.csv mrna → cds.fasta + no_polyA.fasta + full.fasta + manifest.json + layers.csv - Multi-mode (≥2 types): per-modality subdirs so canonical filenames don't collide on ``manifest.json``. DIR/peptide/... DIR/mrna/... Internals: - ``_resolve_vaccine_type`` → ``_resolve_vaccine_types`` (returns ordered, deduplicated list). Bare-string callers still tolerated. - New ``_vaccine_target_dir(output_dir, vtype, all_active)`` — shared "where do my files go" helper. Single-mode flat, multi-mode subdir. - ``_emit_outputs`` iterates active types; each writer takes ``(args, ranked, target_dir)`` and uses canonical filenames inside. - Manufacturability default tightened: was "on for vaccine_type=peptide", now "on iff 'peptide' in active types". Mixed peptide+mrna runs include manufacturability (peptide is active); mrna-only runs omit it. 
Verified end-to-end on Pt02 LENS: - ``--vaccine-type peptide --output-dir out/`` → flat layout with vaccine.fasta + manifest.json + order_form.csv - ``--vaccine-type peptide mrna --output-dir out/`` → subdir layout: out/peptide/{...} + out/mrna/{...}, both written successfully Test churn: ``_mrna_args`` rebuilt around the new fields; ``_resolve_vaccine_types`` test pins multi-mode + dedup + first-occurrence order; new ``_vaccine_target_dir`` test pins the single-vs-multi branching; new ``test_lens_with_multi_vaccine_type_uses_subdirs`` end-to-end test; old ``test_emit_outputs_skips_writer_when_no_vaccine_output`` renamed to ``..._no_output_dir``. README + arg_parser docstrings updated. 679 / 679 tests pass. * Default --vaccine-type to 'peptide mrna'; address /review findings Default change: ``--vaccine-type`` now defaults to ``['peptide', 'mrna']`` (was ``['peptide']``). Both modalities share the same ranking, pepsickle annotation, and load work; only the construct-assembly step differs. Designing both by default matches what users typically want and leaves single-modality runs to an explicit ``--vaccine-type peptide`` (or mrna). /review findings addressed: #2 Reject existing-file paths and FASTA-like suffixes in ``--output-dir`` early via ``check_args`` → new ``_reject_output_dir_misuse`` helper. Catches ``--output-dir out.fasta`` / ``--output-dir existing_file.txt`` before any ranking work happens, with an actionable error message. #3 Multi-mode partial failures: ``_emit_outputs`` now wraps each writer in try/except and logs ERROR + continues, so an mrna assembly bug doesn't silently abort a peptide run (or vice versa). Other modalities still get attempted; final dispatch line records what fired. #4 Cleaner dispatch log: ``wrote=[]`` instead of ``wrote=['(none)']``. Anyone parsing the line gets a real Python list instead of a list-with-sentinel. 
#5 Manufacturability resolution dedups the shape-tolerance code: ``args.manufacturability = 'peptide' in _resolve_vaccine_types(args)``. One source of truth for the type-list parsing. #9 ``--output-csv`` help text now spells out the format duality: pipeline path emits per-variant rows; LENS / pVACseq emit per-(peptide, allele) rows. Antigen-centric in both cases; doesn't change with --vaccine-type. #10 New ``test_manufacturability_default_on_for_peptide_only`` pins the modality-aware default for each combination of --vaccine-type, plus ``test_manufacturability_explicit_flag_overrides_default`` confirms ``--include-manufacturability-in-report`` / ``--no-manufacturability-in-report`` still win. #7 Test-fixture refactor: ``_mrna_args`` → ``_vaccine_args`` covering both peptide and mrna knobs (alias for backward-compat in tests that already say ``_mrna_args``). The multi-mode test collapses from 18 lines of manual attribute injection to one call. End-to-end on Pt02: default ``vaccine_type=['peptide', 'mrna']`` with ``--output-dir vaccines/`` writes vaccines/peptide/{...} and vaccines/mrna/{...}; without ``--output-dir`` the dispatch line shows ``wrote=[]`` and only requested analysis reports get written. 681 / 681 tests pass. * YAML config: vaccine_constructs.{shared,peptide,mrna} with CLI override New top-level YAML section ``vaccine_constructs`` that drives construct-assembly knobs. Layout: vaccine_constructs: shared: # cross-modality defaults (length bounds, antigen_content, …) peptide: # peptide-specific (mode, linker, n-term acetyl, …) mrna: # mRNA-specific (signal_peptide, UTRs, polyA, junctions, …) Resolution order (highest precedence first): explicit CLI flag > per-modality config > shared config > built-in default Implementation: - ``vaxrank/config/schema.py``: new ``VaccineConstructsConfigSchema`` with ``_SharedConstructConfigSchema``, ``_PeptideConstructConfigSchema``, ``_MrnaConstructConfigSchema`` subschemas. 
msgspec validates at YAML-load time so unknown keys (typos) error fast. Knobs whose defaults differ between modalities (linker, antigens_per_construct, max_constructs) live in the per-modality subsections; only knobs whose defaults are the same go in shared. - ``vaxrank/config/loader.py``: new ``extract_construct_kwargs(config, modality)`` merges ``vaccine_constructs.shared`` with ``vaccine_constructs.<modality>``; modality-specific values override shared values. None entries are dropped so callers can ``dict.get(key, fallback)`` cleanly. - ``parse_vaxrank_args`` now snapshots parser defaults into ``parsed._parser_defaults`` so downstream code can detect "user explicitly passed this flag" by comparing ``args.X`` to the snapshot. - ``_emit_peptide_constructs`` / ``_emit_mrna_constructs`` consume YAML config via a small ``cfg(cli_attr, yaml_key)`` closure that resolves through the precedence chain. Every CLI flag they take is now config-overridable; CLI still wins when explicit. - ``vaxrank/config/default.yaml``: new section with a complete commented-out template showing every knob and the layered ``shared`` / ``peptide`` / ``mrna`` structure. Default config is still a no-op — uncomment knobs to change behavior. Tests: - Four new ``test_extract_construct_kwargs_*`` tests pin the merge semantics (shared-only, modality-override, None-dropping, absent-section). - ``test_construct_config_overrides_cli_default`` end-to-end: ``vaccine_constructs.peptide.linker: AAY`` drives the writer when the user runs without ``--peptide-linker``, and CLI ``--peptide-linker EAAAK`` overrides the YAML. Live verification on lens_example.tsv: a config setting ``vaccine_constructs.peptide.linker: AAY`` plus ``antigens_per_construct: 3`` packs TP53 + BRAF + KRAS antigens into one construct with AAY linkers visible in the FASTA output — neither value passed on the CLI. 686 / 686 tests pass. 
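The precedence chain in this commit (explicit CLI flag > per-modality config > shared config > built-in default) might be sketched as follows. Both function names are hypothetical, plain dicts stand in for the validated config objects, and dropping None values from each layer before merging is an assumption about how unset entries are treated.

```python
def merge_construct_kwargs(section: dict, modality: str) -> dict:
    """Merge shared values with one modality's overrides.

    None entries are dropped from each layer first (an assumption), so an
    unset modality value never masks a shared one.
    """
    def _set_only(layer):
        return {k: v for k, v in (layer or {}).items() if v is not None}

    return {**_set_only(section.get("shared")), **_set_only(section.get(modality))}


def resolve_knob(cli_value, cli_default, config_kwargs: dict, key: str, builtin):
    """Return the effective value for one knob.

    The CLI value wins only when it differs from the parser default, which is
    how 'explicitly passed' is detected in this sketch. A user who passes the
    default value explicitly is indistinguishable from one who passed nothing.
    """
    if cli_value != cli_default:
        return cli_value
    return config_kwargs.get(key, builtin)
```

For example, with ``vaccine_constructs.peptide.linker: AAY`` in the config, `resolve_knob` returns ``AAY`` when the CLI value equals its default and returns ``EAAAK`` when that value is passed on the command line.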
* Multi-mode dispatch: any writer failure aborts the whole run Reverted the try/except + continue from the prior /review fix. Partial vaccine output is worse than no vaccine output: ending up with a peptide FASTA on disk but no mRNA directory (or vice versa) is the kind of half-state that quietly ships to a downstream reviewer. ``_emit_outputs`` now lets writer exceptions propagate so the run fails loud. The pre-raise log line names which modality crashed and which already wrote — enough operator context to decide whether to clean up partial files before retrying. Single-modality runs are unaffected (one writer; failure already bubbled). 686 / 686 tests pass. * Rename vaccine_constructs.shared → defaults ``shared`` reads weirdly in single-modality runs ("shared with what?"). ``defaults`` reads cleanly at any --vaccine-type count: "default values; per-modality blocks override them." Same merge semantics as before: vaccine_constructs: defaults: # values applied to every modality not overriding them peptide: # peptide-specific overrides mrna: # mRNA-specific overrides Resolution order unchanged: explicit CLI flag > per-modality config > defaults config > built-in default Schema, loader, default.yaml, tests all renamed in lockstep. No behavior change. 686 / 686 tests pass. * Flatten vaccine_constructs: cross-modality knobs at section top level User's point: ``defaults:`` was a misnomer. Those values aren't "defaults" the user is overriding — they're the actual values for the run. Drop the subsection wrapper; cross-modality knobs go directly under ``vaccine_constructs:``. Modality subsections (``peptide:`` / ``mrna:``) are only for modality-specific overrides or modality-only knobs. Before: vaccine_constructs: defaults: max_antigen_length_aa…

iskandr added 2 commits May 2, 2026 22:14

Add pepsickle subprocess isolation

5c5c08d

Preserve generator peptides in processing predictions

ddd8d2f

iskandr marked this pull request as ready for review May 4, 2026 13:02

iskandr merged commit 7823207 into master May 4, 2026
4 checks passed

iskandr deleted the pepsickle-subprocess-isolation branch May 4, 2026 13:03

iskandr mentioned this pull request May 4, 2026

Align deploy.sh with documented release workflow #202

Closed

iskandr mentioned this pull request May 12, 2026

Add NetCleave integration for MHC-I/II C-terminal cleavage prediction #213

Open


[codex] Add pepsickle subprocess isolation #201

iskandr merged 2 commits into master from pepsickle-subprocess-isolation

iskandr commented May 3, 2026
