[codex] Add pepsickle subprocess isolation #201
Merged
iskandr added a commit to openvax/vaxrank that referenced this pull request on May 4, 2026
…tion

mhctools 3.13.3 (openvax/mhctools#201) ships built-in ``Pepsickle(isolate_subprocess=True)`` so the libomp-clash workaround collapses from a vaxrank-side launcher + bespoke subprocess module to a single kwarg.

Removed:
- ``vaxrank/_pepsickle_subprocess.py`` (the pure-mhctools subprocess body lifted to ``openvax/mhctools/_pepsickle_subprocess.py``).
- ``processing._cleavage_probs_via_subprocess`` and ``_SUBPROCESS_MODULE`` (was launching the deleted module).
- The two unit tests that exercised the local module's I/O contract.

Added:
- ``processing._load_default_predictor(human_only, threshold)`` — thin builder that constructs ``Pepsickle(isolate_subprocess=True)`` and degrades to None with a logged warning if mhctools isn't importable. ``annotate_processing`` now resolves a default predictor through this helper and calls ``cleavage_probs`` directly on it (same loop body that the test-seam ``predictor=`` arg has always used).
- ``test_load_default_predictor_returns_none_when_mhctools_missing`` pins the graceful-degrade path.

Bumped:
- ``requirements.txt``: mhctools >=3.13.3 (was >=3.13.1).

End-to-end on Pt02 LENS: 2105/2113 EpitopePredictions annotated across 489 unique sources — same numbers as before the refactor, confirming the subprocess invocation moved upstream without changing results.

Test count: 674 (was 675; net −1 from removing the two local-subprocess tests + adding one default-predictor smoke).
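A minimal sketch of the graceful-degrade loader described above, not the code from this commit. The function name, the ``module_name`` / ``class_name`` parameters, and the keyword pass-through are hypothetical; only ``Pepsickle(isolate_subprocess=True)`` is taken from the message.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def load_default_predictor(module_name="mhctools", class_name="Pepsickle",
                           **predictor_kwargs):
    """Build an isolated predictor, or return None if it cannot be loaded."""
    try:
        module = importlib.import_module(module_name)
        predictor_cls = getattr(module, class_name)
        # isolate_subprocess=True is the kwarg named in the commit message;
        # any extra keyword arguments are hypothetical pass-throughs.
        return predictor_cls(isolate_subprocess=True, **predictor_kwargs)
    except Exception as exc:
        # One WARNING line for users, full traceback only at DEBUG.
        logger.warning("processing annotation disabled: %s", exc)
        logger.debug("predictor load failure", exc_info=True)
        return None
```

A caller would treat a ``None`` return as "skip annotation" rather than as an error.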
iskandr added a commit to openvax/vaxrank that referenced this pull request on May 5, 2026
…otation (#262)

* Pepsickle credibility tagging (#249) + LENS field-format fixes (#259, #260)

#249 — Pepsickle credibility tagging
------------------------------------
mhcflurry-presentation already includes a flank-aware processing prior, so a per-k-mer "presentation score" implicitly captures whether the ligand would be cleaved out and presented. What it can't tell us: whether the proteasome would cut INSIDE the ligand (destroying it before MHC) or whether the C-terminal cut at the ligand's boundary is clean. Pepsickle (Weeder et al., Bioinformatics 2021) gives per-position cleavage probabilities — we use that as a credibility filter on existing MHC ligand predictions, not as a separate score axis.

Each EpitopePrediction gains three optional fields, populated when ``--processing-aware-annotation`` is enabled (default: on):
- c_term_cleavage_prob: prob of clean release at the C-terminus
- max_internal_cut_prob: peak cleavage prob strictly inside the ligand (high → ligand likely destroyed before MHC)
- processing_score: composite c_term × (1 − max_internal); 1.0 ideal

Annotation is purely additive — never alters ranking. Surfaced in HTML / PDF / ASCII per-epitope tables as three extra columns when populated; absent reports keep their original 6-column shape.

New module: vaxrank/processing.py
- annotate_processing(predictions, predictor=None) — runs pepsickle once per unique source_sequence (not once per peptide); the predictor is loaded lazily and degrades gracefully if unavailable
- _per_position_processing / _composite_processing_score helpers

CLI flag: --processing-aware-annotation / --no-processing-aware-annotation (default ON).
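The three fields above can be illustrated with a short sketch. It assumes ``cleavage_probs[i]`` is the probability of a cut immediately after residue ``i`` of the source sequence; that indexing convention and the function name are assumptions of this example, not something the message specifies.

```python
def processing_scores(cleavage_probs, offset, length):
    """Return (c_term, max_internal, composite) for one ligand, or None.

    Assumption: cleavage_probs[i] is the probability of a cut right after
    residue i of the source sequence.
    """
    end = offset + length
    if offset < 0 or length < 1 or end > len(cleavage_probs):
        return None  # ligand span falls outside the scored source
    c_term = cleavage_probs[end - 1]            # cut after the last ligand residue
    internal = cleavage_probs[offset:end - 1]   # cuts strictly inside the ligand
    max_internal = max(internal) if internal else 0.0
    return c_term, max_internal, c_term * (1.0 - max_internal)
```

For a 3-residue ligand at offset 0 with probabilities ``[0.1, 0.2, 0.9, 0.05]``, the C-terminal term is 0.9, the peak internal term is 0.2, and the composite is 0.9 × 0.8 = 0.72.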
#259 + #260 — LENS field-format bug fixes
------------------------------------------
Surfaced while running on a real LENS v1.9-dev file (Hugo IPRES Pt10):

#259: variant_coords parser only handled 4-part chr:pos:ref:alt; real LENS files emit 2-part chr:pos (no ref/alt), NaN (non-SNV antigen rows: splice / fusion / intron retention), and the literal string "nan". Pre-fix: 3000+ warnings, 0 ranked entries. Post-fix: parses all real-LENS forms, NaN rows skipped silently (info-level summary instead of per-row warning), unparseable rows aggregated into one warning. Placeholder ref/alt nucleotides ('A'/'C') used when LENS omits them — varcode rejects 'N' and empty strings.

#260: write_neoepitope_report raised ValueError on duplicate (peptide, allele) rows. Real LENS files emit duplicates (same neoepitope from multiple transcripts / homologous regions). The score-merge already broadcasts the same score across duplicate rows correctly; the strict-uniqueness check was overly strict. Replaced with an info-level note documenting the duplicate count.

Tests (+19 new)
---------------
- test_processing.py: 14 tests covering helpers, annotation, ranking invariance, multi-source one-pass-per-source, off-by-one offset re-location, predictor failure degradation, and CLI flag default.
- test_external_input.py: 5 new tests covering the relaxed variant_coords parser (2-part, 3-part, NaN, 'N' rejection) and a regression test against the real-LENS-v1.9 fixture.
- test_epitope_io.py: replaced the strict-duplicates rejection test with one that pins score-broadcasting across duplicate rows.

Real-world end-to-end: a real Hugo IPRES Pt10 LENS file now produces both --output-csv and --output-neoepitope-report XLSX without errors (was crashing pre-fix).

Bumps 2.15.0 → 2.16.0. Subsumes #259, #260.
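A rough sketch of a tolerant ``variant_coords`` parser covering the forms listed in #259. It is an illustration only: the function name and return shape are hypothetical, ref/alt are returned as ``None`` when absent instead of being filled with placeholders, and any part count other than 2 or 4 is treated as unparseable, which may differ from the real parser.

```python
import math


def parse_variant_coords(value):
    """Return (contig, pos, ref, alt) or None; ref/alt are None when absent."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None  # pandas-style NaN for non-SNV antigen rows
    text = str(value).strip()
    if not text or text.lower() == "nan":
        return None  # the literal string "nan"
    parts = text.split(":")
    if len(parts) not in (2, 4) or not parts[0]:
        return None  # assumption: other shapes are unparseable
    try:
        pos = int(parts[1])
    except ValueError:
        return None
    ref, alt = (parts[2], parts[3]) if len(parts) == 4 else (None, None)
    return parts[0], pos, ref, alt
```

Returning ``None`` for every unusable form lets the caller count skipped rows and emit one aggregated log line instead of a warning per row.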
* Review fixes: HTML header drift, placeholder-genotype provenance, closest-occurrence re-location, dedup contract, pepsickle CLI passthrough #1 (correctness) HTML/PDF/ASCII per-epitope table header drift -------------------------------------------------------------- Pre-fix: _epitope_data emitted a 6-key dict for unannotated predictions and a 9-key dict for annotated ones. The HTML template builds the header from p.epitopes[0].keys() — if the first epitope happened to be unannotated and a later one was annotated, the header had 6 columns but rows had 9 → malformed table. This was reachable any time pepsickle annotation succeeded for some sources but failed for others (graceful per-source skip is a documented behavior). Fix: _epitope_data takes an ``include_processing`` parameter; the caller computes ``any(c_term is not None for p in mutant_predictions)`` per VaccinePeptide and passes the result. When True, every row gets the 9-column shape (with '—' fallback for unannotated predictions), matching the header. When False, the legacy 6-column shape is returned unchanged so unannotated reports don't change. #2 (provenance) ref_alt_fallback placeholder genotype ----------------------------------------------------- _parse_variant_coords now returns ``(Variant, alleles_real)`` so callers can detect when the genotype is fictional (placeholder ref/alt because LENS only emitted chr:pos). When alleles_real is False the synthesized Variant must NOT be fed to varcode-effect annotation — its genotype isn't real biology. Documented in the docstring + an info-level summary log line counts how many such placeholders were used per LENS run. Construct assembly only uses (chr, pos) and is unaffected. 
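The header-drift fix in #1 can be sketched as follows, using plain dicts as stand-ins for EpitopePrediction objects. The function name and the ``peptide`` key in the usage are hypothetical; the three processing field names and the '—' fallback come from the message.

```python
PROCESSING_FIELDS = (
    "c_term_cleavage_prob",
    "max_internal_cut_prob",
    "processing_score",
)


def epitope_rows(predictions):
    """Build table rows that all share one key set.

    predictions: list of dicts standing in for EpitopePrediction objects.
    """
    # Decide the table shape once per group, not once per row.
    include_processing = any(
        p.get("c_term_cleavage_prob") is not None for p in predictions
    )
    rows = []
    for p in predictions:
        row = {k: v for k, v in p.items() if k not in PROCESSING_FIELDS}
        if include_processing:
            for field in PROCESSING_FIELDS:
                value = p.get(field)
                row[field] = "—" if value is None else value
        rows.append(row)
    return rows
```

Because the shape is decided before any row is built, a template that reads the header from the first row sees the same columns as every later row.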
#3 (correctness) Re-location picks the occurrence closest to the declared offset, not the first match ------------------------------------------------------------------- When the declared offset of a peptide doesn't match the source at that position, processing.py re-locates by substring search. Pre-fix this used source.find() — always returning the first occurrence even if the declared offset pointed at a later match (e.g. homopolymer tracts with repeated short ligands). New _closest_occurrence helper scans all matches and picks the one with smallest |pos - declared|. A drift > 3 positions emits a warning so upstream loaders with broken offset accounting are flagged. #4 (UX) CLI passthrough for pepsickle parameters ------------------------------------------------- New flags: --pepsickle-human-only (use the human-only-trained model) and --pepsickle-threshold N (cleavage probability threshold). Threaded through annotate_processing → _load_default_predictor. #6 (clarity) Dup-rows log message ---------------------------------- Was "from multi-source LENS / pVACseq input"; pipeline-mode duplicates (if a future loader bug introduces them) would be misattributed. Now "from multi-source input." #7 (debuggability) _load_default_predictor warns + debug-traceback ------------------------------------------------------------------- Pepsickle import / instantiation failures previously logged a single warning line. Now also logger.debug(exc_info=True) so genuine bugs (bad CUDA libs, torch version mismatch) surface at DEBUG without spamming WARNING. #8 (robustness) Dedup contract in _annotate_predictions_with_processing ------------------------------------------------------------------------ Pre-fix dedup keyed on id(p), which assumes the same EpitopePrediction object is reachable from both the ranked-vaccine-peptides intermediate and the LENS predictions list. 
That's true for the current external_input loader but a future loader that copies predictions into VaccinePeptides would break the assumption and double-annotate. Belt-and-suspenders: also dedup on the content tuple (peptide_sequence, allele, source_sequence, offset). Both checks are O(1). Tests (+5 new) -------------- - test_epitope_data_header_consistent_when_some_predictions_unannotated: pins #1 — annotated and unannotated rows produce identical key sets when include_processing=True - test_re_location_picks_closest_to_declared_offset: pins #3 — homopolymer source where declared offset points at the second occurrence; re-location must snap there, not to the first - test_re_location_warns_on_large_offset_drift: pins #3's >3aa warning - test_dedup_by_content_when_duplicate_objects: pins #8 — two distinct Python objects with the same (peptide, allele, source, offset) collapse to a single annotation pass - test_pepsickle_cli_param_passthrough: pins #4 — --pepsickle-human-only / --pepsickle-threshold parse correctly Plus updated tests for the (Variant, alleles_real) tuple return shape of _parse_variant_coords (5 existing tests adapted; one new alleles_real assertion per case). 667 tests pass; 93% coverage. * Stop-codon truncation, placeholder-genotype propagation, parametrized real-LENS CI, more review fixes LENS data quality (real-world Pt02 file from HugoLo IPRES 2016) --------------------------------------------------------------- Pt02 surfaced a third LENS bug: pep_context can contain '*' (stop codon) for stop-loss / readthrough variants. Pre-fix this crashed manufacturability scoring with KeyError: '*'. Fix: new helpers in vaccine_library.py: - truncate_at_stop_codon(aa) — stop at first '*'; everything after doesn't exist as protein. Translation halts at the stop. - has_only_standard_amino_acids(aa) — guard against selenocysteine 'U', pyrrolysine 'O', ambiguous 'X' / 'B' / 'Z' / 'J', etc. 
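A minimal sketch of the two helpers named above. The function names match the message; the bodies are guesses at the simplest behavior consistent with the description, and treating an empty string as non-standard is an assumption of this example.

```python
# The 20 standard amino acids; excludes U, O, X, B, Z, J and '*'.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def truncate_at_stop_codon(aa):
    """Keep everything before the first '*'; translation halts at the stop."""
    return aa.split("*", 1)[0]


def has_only_standard_amino_acids(aa):
    """True when aa is non-empty and every residue is one of the 20 standard ones."""
    return bool(aa) and set(aa) <= STANDARD_AMINO_ACIDS
```

Applying the truncation first and the residue check second means a sequence such as ``MKT*LLV`` is kept as ``MKT``, while ``MKU`` is dropped.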
external_input.ranked_from_lens_predictions now truncates pep_context + peptide at first stop and drops rows where the result has any non-standard residue, with a warning. New fixture: tests/data/epitope_fixtures/real_lens_subsets/ lens_v1.9_with_stop_codons.tsv (52 rows from a real Pt02 dump) covers stop-codon, NaN variant_coords, multi-row-per-variant, and 2-part chr:pos shapes in one file. Parametrized end-to-end CI (no more "shouldn't have to run manually") ---------------------------------------------------------------------- test_real_lens_fixture_runs_end_to_end_via_cli is parametrized over every TSV in tests/data/epitope_fixtures/real_lens_subsets/. Each fixture drives main() with --input-lens / --output-csv / --output-neoepitope-report end-to-end. Future LENS shapes get covered by dropping a fixture in that directory; the test grows automatically. test_real_lens_fixtures_present pins the expected set. Review fixes (continuing from PR review) ----------------------------------------- #1 alleles_real propagation. Added MutantProteinFragment.placeholder_alleles (default False). LENS path sets it True when pep_context came from a 2-part chr:pos row (no real ref/alt). Documented that downstream varcode-effect annotation MUST detect this flag rather than silently do the wrong thing on the placeholder genotype. #2 pVACseq tuple consistency. _parse_pvacseq_id now returns (Variant, alleles_real) matching the LENS-path shape. pVACseq IDs always carry real ref/alt so alleles_real is True on success; the symmetric return shape lets future code handle both paths uniformly. #3 Empty peptide_sequence dedup degeneration. _maybe_add now skips content-key dedup for empty-peptide predictions (passes them through; the annotation step skips them on its own). #4 Drift threshold scales with source length: max(3, 5% × len). Typical 25-aa SLP sources keep the absolute threshold; full-protein sources allow more drift before warning. 
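The re-location rule from the earlier review fix and the scaled threshold in #4 can be sketched together. Both function names are hypothetical, and integer division is used for the 5% term because the message does not say how the real code rounds.

```python
def closest_occurrence(source, peptide, declared_offset):
    """Return the match position nearest declared_offset, or None if absent."""
    if not peptide:
        return None
    best = None
    start = source.find(peptide)
    while start != -1:
        if best is None or abs(start - declared_offset) < abs(best - declared_offset):
            best = start
        # Advance by one so overlapping matches (homopolymer tracts) are seen.
        start = source.find(peptide, start + 1)
    return best


def drift_threshold(source_length):
    """max(3, 5% of the source length); rounding down is an assumption."""
    return max(3, source_length // 20)
```

A caller would warn when ``abs(found - declared_offset)`` exceeds ``drift_threshold(len(source))``, so a 25-aa source keeps the threshold of 3 while a 200-aa source tolerates a drift of 10.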
#5 annotate_processing warns when caller passes a pre-built predictor AND non-default human_only / threshold (the params are ignored — caller already configured the predictor). #7 Test refactor: monkeypatch fixture replaces manual try/finally in test_dedup_by_content_when_duplicate_objects so cleanup is automatic on test failure. Documentation ------------- README gains a Data Model section explaining VaccinePeptide / MutantProteinFragment / EpitopePrediction structure, the variant → multiple VPs → multiple constructs fan-out, and where each per-VP report renders. Removes the "what's a VP?" reviewer trip-up. Tests (+10 new) --------------- - 4× parametrized end-to-end real-LENS CLI (one per fixture) - test_real_lens_fixtures_present (coverage gate) - test_lens_pep_context_with_stop_codon_truncates - test_truncate_at_stop_codon_helper - test_has_only_standard_amino_acids_helper - test_annotate_processing_warns_when_predictor_and_params_supplied - test_drift_threshold_scales_with_source_length 677 tests pass; 93% coverage. Real Pt02 LENS file produces both CSV + XLSX outputs end-to-end (was crashing pre-fix on stop codons). * LENS: use real snv_*_allele / indel_*_allele columns instead of placeholder ref/alt The previous fix substituted A/C placeholder nucleotides when LENS variant_coords arrived as 2-part chr:pos — but LENS files actually DO carry real ref/alt, just in dedicated per-source columns: SNV rows → snv_ref_allele, snv_alt_allele ('C', '[T]') INDEL rows → indel_ref_allele, indel_alt_allele ('CA', '[C]') The brackets are LENS's notation (multi-allelic-ready, though we've seen only single-allele cases in real files). The dedicated columns are populated for every SNV / INDEL row in the patient datasets we have, so there's no need to invent placeholder genotypes. Inventing a fictional A→C variant was a hack and would have caused varcode- effect annotation downstream to do the wrong thing silently. 
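The bracket notation described above can be unwrapped with a small helper along these lines. The name and the choice to return ``None`` for missing values are assumptions, and multi-allelic values are not handled because the message reports only single-allele cases.

```python
import math


def strip_lens_allele(value):
    """Turn '[T]' into 'T'; return None for missing values."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    text = str(value).strip()
    if text.lower() in ("", "nan"):
        return None
    if text.startswith("[") and text.endswith("]"):
        text = text[1:-1].strip()
    return text or None
```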
Architecture ------------ - ``_parse_variant_coords(coords)`` now extracts only ``(contig, pos)`` — a focused chr:pos parser. NaN / empty / malformed → None. - New ``_strip_lens_allele(value)`` strips the bracket notation (``'[T]'`` → ``'T'``). - New ``_variant_from_lens_row(row, genome=None)`` is the row-aware helper: extracts (contig, pos) from variant_coords AND looks up the real ref/alt from snv_*_allele or indel_*_allele based on ``antigen_source``. Returns the real-genotype Variant or None. - ``ranked_from_lens_predictions`` now calls ``_variant_from_lens_row`` on the representative row of each variant_coords group; placeholder genotypes are never produced. - ``MutantProteinFragment.placeholder_alleles`` is now always False on the LENS path. Field retained for hypothetical future loaders that genuinely lack alleles. Real Pt02 file (Hugo IPRES 2016) end-to-end -------------------------------------------- Before: 0 ranked entries (parser couldn't fabricate alleles cleanly). After: 409 ranked entries with REAL ref/alt: chr3:150742446 A> SIAH2 (CA → C frameshift, varcode-normalized) chr17:29502684 T> TAOK1 chr6:56459264 G>A DST (missense SNV) Test fixtures ------------- Added snv_ref_allele / snv_alt_allele / indel_ref_allele / indel_alt_allele columns to the toy LENS fixtures (lens_example.tsv, lens_multi_row_per_variant.tsv, lens_v1.4_with_stability.tsv, lens_v1.9_mhcflurry_only.tsv). The 4 real-LENS subset fixtures already had these columns — no changes needed. Tests (+5 new, 4 updated) ------------------------- - test_parse_variant_coords_extracts_chr_pos: pins the new (chr, pos)- only return shape. - test_parse_variant_coords_returns_none_on_missing_or_malformed: NaN / 'nan' / '' / garbage all return None. - test_strip_lens_allele_handles_bracket_form: '[T]' / '[CA]' / 'T' / None / 'nan' edge cases. - test_variant_from_lens_row_uses_real_snv_alleles: SNV path uses snv_ref_allele / snv_alt_allele. 
- test_variant_from_lens_row_uses_real_indel_alleles: INDEL path uses indel_*_allele; pins the varcode normalization (CA → C becomes ref='A' alt='' at start+1). - test_variant_from_lens_row_skips_non_snv_indel: SPLICE / FUSION / CTA-SELF / ERV → None (caller skipped earlier on NaN coords anyway). - test_variant_from_lens_row_skips_when_alleles_missing: defends the hypothetical future case of an SNV row missing its allele columns. Removed: 5 tests that asserted the placeholder-genotype behavior (pre-fix). Replaced with the new shape + real-allele tests. 678 tests pass; 93% coverage. Real Pt02 LENS file produces 409 ranked entries with real biological ref/alt nucleotides. * LENS / pVACseq: stop inventing data — read every field from real columns Audit found six places where vaxrank was fabricating values rather than reading what's actually in the LENS / pVACseq files. Each fix below replaces an invented value with a lookup against the real column. Verified end-to-end on a real Hugo IPRES Pt02 LENS file: ranked entries now carry real ref/alt, gene names, RNA-read counts, transcript IDs, mutation spans, and peptide offsets — no fabrications. Inventions removed ------------------ 1. ``load_lens`` set ``offset=0`` for every prediction. Real fix: compute via ``pep_context.find(peptide)``. (LENS centers the peptide within pep_context — it's not at position 0.) 2. ``_coerce_n_alt_reads(tpm)`` used TPM as a stand-in for alt-read count. TPM ≠ read count — different units, different meanings. Real fix: read ``rna_reads_covering_genomic_origin`` (total at locus) and ``rna_reads_covering_genomic_origin_with_peptide_cds`` (alt-supporting). New helper ``_read_counts_from_lens_row`` does the lookup. ``_coerce_n_alt_reads`` deleted. 3. ``_mut_offsets_in_context`` falsely claimed the entire pep_context was the mutation span when the peptide couldn't be located in it. That told downstream code "every residue is mutated." 
Real fix: return ``(None, None)`` so the caller drops the row instead. 4. ``gene_name = ... or 'unknown'`` (LENS + pVACseq paths). Real fix: preserve empty string when LENS / pVACseq doesn't supply a name — the codebase convention for "not known" — and let the existing ``iter_named_antigens`` fallback handle the display. 5. pVACseq path hardcoded ``n_overlapping_reads=1, n_alt_reads=1, n_ref_reads=0``. Real fix: read ``RNA Depth`` (total coverage) and ``RNA VAF`` (variant allele frequency); compute ``n_alt = round(depth × vaf)``, ``n_ref = depth - alt``. 6. LENS path passed ``supporting_reference_transcripts=[]`` (empty). Real fix: pull ``transcript_id`` and split ``all_transcript_ids_encoding_peptide`` into a real list. Future varcode-effect annotation now has the actual transcript context. Honest "no signal" defaults --------------------------- When a real column is genuinely missing (NaN / None), default to 0 for read counts (honest "no signal") instead of 1 (a fabricated weak signal). Users with LENS files lacking RNA columns will see ``combined_score`` collapse to expression-independent ranking; that's correct behavior — we don't have read data for those rows. Documentation ------------- ``overlaps_mutation=True`` and ``occurs_in_reference=False`` in ``load_lens`` are now documented as STRUCTURAL ASSUMPTIONS implied by the fact that a row exists in a LENS report (LENS pre-curates neoepitopes and pre-filters reference matches). Not invented — the report's existence implies them. Real Pt02 file end-to-end (post-fix) ------------------------------------- 406 ranked entries, each with real fields: chr3:150742446 A>'' SIAH2 reads=181/25 transcripts=1 mut_span=89-98 chr17:29502684 T>'' TAOK1 reads=32/2 transcripts=1 mut_span=59-68 chr6:56459264 G>'A' DST reads=84/21 transcripts=1 mut_span=29-38 Pre-fix offset=0 → post-fix offset=133 (the actual position of the first peptide in its 165-aa pep_context). 
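The pVACseq read-count arithmetic from item 5, together with the zero default for missing data, might look like the sketch below. The function name is hypothetical, the VAF is assumed to be a fraction in [0, 1], and the clamp and the all-zero return when either value is missing are additions of this example.

```python
import math


def _is_missing(x):
    return x is None or (isinstance(x, float) and math.isnan(x))


def read_counts_from_depth_and_vaf(depth, vaf):
    """Return (n_overlapping, n_alt, n_ref); all zeros when data is missing.

    Assumption: vaf is a fraction in [0, 1], not a percentage.
    """
    if _is_missing(depth) or _is_missing(vaf):
        return 0, 0, 0  # "no signal" instead of a fabricated weak signal
    depth = int(depth)
    n_alt = round(depth * float(vaf))
    n_alt = min(max(n_alt, 0), depth)  # keep the count within [0, depth]
    return depth, n_alt, depth - n_alt
```

With a depth of 100 and a VAF of 0.25 this gives 25 alt reads and 75 ref reads.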
Tests ----- - test_mut_offsets_in_context_returns_none_when_not_found replaces test_mut_offsets_in_context_falls_back_to_full_window. Pins the drop-the-row behavior. 678 tests pass; 92% coverage. Nothing in the LENS / pVACseq loaders fabricates a value when the real column exists. * LENS: derive wt_ic50 from mhcflurry_agretopicity (don't leave it None) Audit of the LENS loader found that wt_ic50 was always None, even though LENS does emit ``mhcflurry_agretopicity`` (defined as MT_IC50 / WT_IC50). Real biological wildtype affinity can be recovered as WT_IC50 = MT_IC50 / agretopicity. Computed once per row (shared across detected predictors) and applied only to the mhcflurry-affinity prediction (the value's scale matches mhcflurry's IC50). Other predictors leave wt_ic50 None — the agretopicity is mhcflurry-specific. Real Pt02 LENS file: 786 / 2153 predictions now carry real wt_ic50 (the rest are non-SNV antigens — ERV / SPLICE / FUSION / CTA-SELF — where mhcflurry_agretopicity is NaN). Sample: SPAMIPKDWPL ic50=123.43 wt_ic50=14319.52 (strong neoepitope) KYWHIILGGGR ic50=180.59 wt_ic50=20814.20 (strong neoepitope) GRFYGLDL ic50=179.45 wt_ic50= 179.45 (silent — no improvement) Test: existing toy fixture's mhcflurry_agretopicity=0.020 + IC50=95.4 → wt_ic50 = 4770. Test updated to assert this rather than None. Follow-up issues filed for bigger gaps surfaced in the audit ------------------------------------------------------------- - #263: Surface non-SNV antigens (SPLICE / FUSION / CTA-SELF / ERV) from LENS — currently dropping ~30% of antigens silently because variant_coords is NaN for those antigen sources. - #264: HLA LOH + B2M / TAP / antigen-presentation pathway integrity from LENS — clinically essential info that determines whether predictions for an allele or pathway are even meaningful. - #265: Surface mhcflurry pres_score / pres_perc + proc_score from LENS — currently running pepsickle to compute what LENS already emits, and ignoring presentation score entirely. 
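The wildtype-affinity recovery described above is a single division. A hedged sketch, with a hypothetical function name and with the non-positive guard added by this example:

```python
import math


def derive_wt_ic50(mt_ic50, agretopicity):
    """WT_IC50 = MT_IC50 / agretopicity, or None when either input is unusable."""
    for value in (mt_ic50, agretopicity):
        if value is None or math.isnan(value):
            return None
    if agretopicity <= 0:
        return None  # guard added for this sketch; a ratio of affinities is positive
    return mt_ic50 / agretopicity
```

The toy fixture quoted in the message works out as 95.4 / 0.020 = 4770.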
678 tests pass; 92% coverage. * LENS: warn on missing essential / important per-row data; correct #265 scope Per-load summary ---------------- load_lens / ranked_from_lens_predictions now emit a per-load summary of missing data so users know what's degraded: WARNING: N / total rows lack pep_context — antigens degenerate to bare neoepitope (~9 aa instead of ~25 aa SLP). INFO: N / total rows lack gene_name (typical for ERV). INFO: N / total rows lack transcript_id (no transcript context for downstream varcode-effect annotation). WARNING: N / total rows lack RNA-read counts — combined-score ranking collapses to epitope-only for those rows. On a real Pt02 LENS file: 337 / 2153 lack gene_name + transcript_id (ERV antigens) 128 / 2153 lack RNA-read counts Issue updates ------------- - #263 (non-SNV antigens) — concrete plan added: fusion via varcode.StructuralVariant (already in varcode 4.x as BND), viral / CTA / ERV / splice as Antigen-subclass types. Land fusion first, then antigen-source tagging on MutantProteinFragment, then full Antigen abstraction. - #264 (HLA LOH + APC pathway) — confirmed vaxrank has zero existing input format for this. Two-channel proposal: (A) extract from LENS rows automatically, (B) new --patient-context JSON/YAML for the VCF/BAM path. Both populate the same PatientInfo fields. - #265 (mhcflurry pres / proc) — corrected: mhcflurry's per-peptide proc_score and pepsickle's per-position cleavage scores are COMPLEMENTARY, not substitutes. Surface both. 678 tests pass; 92% coverage. * Pepsickle: default off (libomp clash on macOS); correct length scope The macOS abort --------------- On macOS, pepsickle's torch dependency ships its own libomp.dylib. By the time vaxrank tries to load pepsickle, pandas / numpy / pyarrow have already loaded a different libomp via OpenBLAS / MKL. 
The second OpenMP runtime to init either aborts the process at init time (loud, OMP Error #15) or — with KMP_DUPLICATE_LIB_OK=TRUE set — segfaults later when the two runtimes collide during inference (silent, exit 139). KMP_DUPLICATE_LIB_OK=TRUE quiets the init-time abort but doesn't prevent the inference-time segfault. So: - Default --processing-aware-annotation flipped to False. - The flag's help text documents the libomp issue and points users at typical Linux installs (single OpenMP) as the green path. - Issue #266 filed for the robust fix (subprocess isolation, or ONNX-runtime alternative). Test pinning the default flipped from on → off. Length scope correction ----------------------- Earlier comment on issue #265 / processing.py docstring said "one number per 9-mer" for mhcflurry's processing score. Wrong: mhcflurry handles MHC-I peptides 8-15 aa, and pepsickle's per- position scoring is length-agnostic. Updated docstring to "one number per peptide (8-15 aa for MHC-I)" and added a footnote on ``c_term × (1 - max_internal)`` being a *conservative approximation* (strict version is ``Π(1 - p_i)``; max heuristic dominates because proteasomes cut roughly once per substrate molecule). Plus the safe-wins fixes from the earlier audit ------------------------------------------------ - LENS wt_ic50 derived from mhcflurry_agretopicity (= MT/WT ratio) for mhcflurry-affinity predictions. 786/2153 predictions in Pt02 now carry real wildtype affinity. - Per-load summary warnings for missing essential / important data: * pep_context absent → SLP collapses to bare neoepitope (warning) * gene_name absent → typical for ERV (info) * transcript_id absent → no transcript context (info) * RNA-read counts absent → combined-score collapses to epitope-only (warning, suggests --combined-score-mode=epitope_only) Pt02 produces 20 peptide constructs end-to-end with default flags; no segfaults, no crashes. 
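To make the footnote on ``c_term × (1 - max_internal)`` concrete, the sketch below computes only the internal-survival factor in both forms: the max heuristic ``1 - max(p_i)`` and the strict product ``Π(1 - p_i)``. Function names are hypothetical. For probabilities in [0, 1] the strict product is never larger than the max form, since every extra factor is at most 1.

```python
import math


def survival_max(internal_probs):
    """1 - max(p_i): the heuristic used in the composite score."""
    return 1.0 - max(internal_probs, default=0.0)


def survival_strict(internal_probs):
    """Product of (1 - p_i): survival if every position is an independent cut."""
    return math.prod(1.0 - p for p in internal_probs)
```

For internal probabilities ``[0.1, 0.2, 0.05]`` the max form gives 0.8 and the strict form gives 0.9 × 0.8 × 0.95 = 0.684.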
Issues filed ------------ - #263 (non-SNV antigens): concrete plan with varcode.StructuralVariant for fusions, Antigen-subtype hierarchy for viral / CTA / ERV / splice - #264 (HLA LOH + B2M / TAP integrity): two-channel input-format design (extract from LENS rows automatically + new --patient-context flag for VCF/BAM path) - #265 (mhcflurry pres_score / proc_score): corrected scope — complementary to pepsickle, not a substitute - #266 (libomp clash): subprocess isolation as the robust fix 678 tests pass; 92% coverage. * Pepsickle: run in isolated subprocess; default back to on (#266) Root-cause fix for the macOS libomp clash: torch's bundled libomp collides with the parent's pandas / numpy / pyarrow OpenMP runtime, which manifested as either OMP Error #15 (abort) or a quiet exit-139 segfault during pepsickle inference. Running pepsickle in a fresh Python subprocess (only torch + pepsickle + json imported, no pandas) gives torch a clean libomp load and no clash. processing.py is a clean rewrite: - one linear flow in annotate_processing (group by source → score → apply per-peptide), no parallel paths - _cleavage_probs_via_subprocess returns {} on any failure and never raises; the caller treats missing keys as "no signal" - obsolete in-process _load_default_predictor removed With subprocess isolation in place: - --processing-aware-annotation default flips back to True - the --no-processing-aware-annotation opt-outs added to the b16, pVACseq, LENS, and real-LENS-fixture pytest paths are gone — those workarounds existed only to dodge the OMP segfault under pytest+coverage, which the subprocess no longer triggers - test_processing_aware_annotation_default_on flipped accordingly Verified end-to-end on Pt02: 2105 / 2113 EpitopePredictions annotated across 489 unique source sequences in ~3s, no segfault; 666 / 666 pytest pass with pepsickle on by default. 
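The isolation pattern described above (fresh interpreter, JSON in and out, ``{}`` on any failure) can be sketched as follows. The child body is a stand-in that returns zeros; the real child would import torch and pepsickle, whose calls are not shown in this message, and every name in the sketch is hypothetical.

```python
import json
import subprocess
import sys

# Stand-in child: reads a JSON list of sequences on stdin and writes a JSON
# dict of per-position scores on stdout. A real child would load the model here.
_CHILD_CODE = """
import json, sys
sequences = json.load(sys.stdin)
json.dump({s: [0.0] * len(s) for s in sequences}, sys.stdout)
"""


def scores_via_subprocess(sequences, child_code=_CHILD_CODE, timeout=60):
    """Score sequences in a fresh interpreter; return {} on any failure."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", child_code],
            input=json.dumps(list(sequences)),
            capture_output=True,
            text=True,
            timeout=timeout,
            check=True,
        )
        return json.loads(proc.stdout)
    except (subprocess.SubprocessError, OSError, ValueError):
        # Never raise: the caller treats missing keys as "no signal".
        return {}
```

Because the parent never imports the model's dependencies, only the child process loads their native libraries.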
* Pepsickle: prefix annotation fields with predictor name Three columns previously named ``c_term_cleavage_prob``, ``max_internal_cut_prob``, ``processing_score`` are renamed to ``pepsickle_c_term_cleavage_prob``, ``pepsickle_max_internal_cut_prob``, ``pepsickle_processing_score``. Report column headers (HTML / PDF / ASCII / JSON) get the matching ``Pepsickle`` prefix. A reader looking at a CSV column needed to know — out of band — that those values came from pepsickle; the names didn't say so. With the prefix the predictor is unambiguous in data, log lines, and report headers, and a future per-position cleavage predictor (NetChop, PAProC, …) can land alongside without name collision. No behavior change beyond the rename; all 666 tests pass with pepsickle on by default. * Fail fast when no --output-* flag is set (LENS / pVACseq path too) ``check_args`` already validated that at least one primary output path was set, but it ran only on the VCF/BAM pipeline branch — the external-input branch (--input-lens / --input-pvacseq) skipped it. Result: a user could run a long LENS import to completion and end up with nothing on disk, the only signal a quiet ``wrote=['(none)']`` log line late in the run. Hoist the check above the input-dispatch branch so it fires for both paths, before any input loading or prediction runs. Replace the ad-hoc string-concatenated error with a table-driven message that lists each --output-* flag and its purpose, sourced from a single ``_PRIMARY_OUTPUT_FLAGS`` tuple so the help text and the validation list can't drift apart. * Pepsickle: defer per-peptide arithmetic to mhctools (closes #267) vaxrank's ``_per_position_processing`` and ``_composite_processing_score`` were reimplementing ``ProcessingPredictor.c_term_prob`` / ``max_internal_prob`` and ``score_cterm_anti_max_internal``, which already live in mhctools and are byte-identical to the hand-rolled versions. Drop the local helpers and call the mhctools ones directly. 
Net result: ~30 lines of arithmetic gone from vaxrank, single source of truth in mhctools, and a future scoring fix or length-edge-case tweak in mhctools flows here automatically. The unit tests for the two helpers go away with them — that arithmetic is mhctools' responsibility to test, not vaxrank's. One new test pins our local out-of-range guard around the slice helpers (peptide span past source → ``(None, None)`` so the caller drops the row). Subprocess isolation for the macOS libomp clash stays here for now — ``mhctools.Pepsickle`` doesn't yet offer it built-in. Tracked in openvax/mhctools#200; once that ships, the subprocess wrapper here collapses to ``Pepsickle(..., isolate_subprocess=True)``. Imports of ``mhctools.processing_predictor`` at module level pull in only the wrapper module — torch stays unloaded in the parent, so subprocess isolation still works as designed. * Reject pipeline-only reports on --input-lens / --input-pvacseq path Passing ``--output-ascii-report`` (or ``--html`` / ``--pdf``) with ``--input-lens`` ran to completion and silently wrote nothing — the template-report block in ``main()`` was gated on ``source == 'pipeline'`` and short-circuited on the external path. The deeper problem is that ``TemplateDataCreator`` walks ``mutant_protein_fragment.predicted_effect()``, which calls ``variant.effect_on_transcript`` and expects pyensembl ``Transcript`` objects; LENS / pVACseq carry transcript IDs as strings only. Variant-counting metadata (somatic / coding / RNA-supported) is also only produced from VCF + BAM. Making the external path produce template reports is its own structural project (filed as #268) — for this PR, just reject the combination in ``check_args`` with an explicit message that points the user at the outputs that ARE reachable from external input (``--output-csv`` / ``--output-xlsx-report`` / ``--output-neoepitope-report`` / ``--output-json-file``). Pinned by ``test_lens_cli_errors_when_paired_with_pipeline_only_report``. 
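The table-driven output check described earlier in this PR could look roughly like this. The tuple contents are an assumption: the four flag names appear in the message, but the purpose strings and the exact membership of the real ``_PRIMARY_OUTPUT_FLAGS`` are not stated.

```python
import argparse

# (flag, purpose) pairs; one table feeds both the help text and the check.
# Purpose strings are this sketch's wording, not the project's.
PRIMARY_OUTPUT_FLAGS = (
    ("--output-csv", "ranked results as CSV"),
    ("--output-xlsx-report", "ranked results as an XLSX workbook"),
    ("--output-neoepitope-report", "per-epitope XLSX report"),
    ("--output-json-file", "serialized results as JSON"),
)


def build_parser():
    parser = argparse.ArgumentParser(prog="sketch")
    for flag, purpose in PRIMARY_OUTPUT_FLAGS:
        parser.add_argument(flag, default=None, help=purpose)
    return parser


def check_has_output(args):
    """Raise before any work starts if no primary output was requested."""
    dests = [flag.lstrip("-").replace("-", "_") for flag, _ in PRIMARY_OUTPUT_FLAGS]
    if any(getattr(args, dest) for dest in dests):
        return
    lines = ["no output requested; pass at least one of:"]
    lines += ["  %-28s %s" % (flag, purpose) for flag, purpose in PRIMARY_OUTPUT_FLAGS]
    raise ValueError("\n".join(lines))
```

Calling the check before the input-dispatch branch is what makes it apply to both the pipeline and the external-input paths.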
* Reference vaxrank#268 from check_args docstring * Unify vaccine output flags: --vaccine-type single-valued + --vaccine-output The old design used a multi-valued ``--vaccine-type peptide mrna`` plus separate ``--output-peptide`` / ``--output-mrna`` flags whose names looked like data-format outputs (peer to ``--output-csv``) but actually drove vaccine-construct *design*. That mismatch made the flags easy to miss — passing ``--input-lens FILE`` alone left ``vaccine_type=['peptide']`` active but no peptide writer fired, producing a quiet ``wrote=['(none)']`` log line and zero files on disk. CLI rework — one mode per run: - ``--vaccine-type {peptide,mrna}`` — single-valued, default ``peptide`` - ``--vaccine-output PATH`` — destination, interpreted by the active type (peptide → FASTA file, mrna → directory) - ``--vaccine-manifest`` — JSON manifest, shared schema across types - ``--vaccine-order-form`` — peptide-only; ``check_args`` errors if --vaccine-type=mrna - ``--vaccine-csv`` / ``--vaccine-csv-no-full-rows`` — mrna-only; ``check_args`` errors if --vaccine-type=peptide Removed: ``--output-peptide``, ``--output-peptide-manifest``, ``--output-peptide-order-form``, ``--output-mrna``, ``--output-mrna-manifest``, ``--output-mrna-csv``, ``--output-mrna-csv-no-full-rows``. Internal cleanup: ``_resolve_vaccine_types`` (returning a set) → ``_resolve_vaccine_type`` (returning a string); the dispatch-table loop in ``_emit_outputs`` collapses to a single lookup; the mismatched-flag warning machinery goes away because the new companion flags are statically tied to one type. 
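A sketch of the single-valued ``--vaccine-type`` design and its companion-flag checks, using argparse. Flag names come from the message; treating ``--vaccine-csv`` as a path-valued option and the error wording are assumptions of this example.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="sketch")
    parser.add_argument("--vaccine-type", choices=("peptide", "mrna"),
                        default="peptide")
    parser.add_argument("--vaccine-output")      # FASTA file or directory
    parser.add_argument("--vaccine-manifest")    # shared across types
    parser.add_argument("--vaccine-order-form")  # peptide-only
    parser.add_argument("--vaccine-csv")         # mrna-only (assumed path-valued)
    return parser


def check_vaccine_args(args):
    """Companion flags are tied to exactly one vaccine type."""
    if args.vaccine_type == "mrna" and args.vaccine_order_form:
        raise ValueError("--vaccine-order-form requires --vaccine-type=peptide")
    if args.vaccine_type == "peptide" and args.vaccine_csv:
        raise ValueError("--vaccine-csv requires --vaccine-type=mrna")
```

With one type per run, a mismatched companion flag can be a hard error instead of a warning that is easy to miss.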
Verified end-to-end on Pt02 LENS: - ``--vaccine-type peptide --vaccine-output peptides.fasta`` → 20 SLPs written - ``--vaccine-type mrna --vaccine-output mrna_dir/`` → cds / no_polyA / full FASTAs written Test churn: ``test_resolve_vaccine_types_*`` collapses to one single-value test; ``test_emit_outputs_warns_when_output_path_lacks_matching_type`` and ``test_output_path_without_matching_type_does_not_write`` are no longer applicable (companion-flag mismatches are now hard errors); ``_mrna_args`` test helper centralizes the mRNA-design defaults so individual tests stay focused. 661 tests pass. * Make ASCII / HTML / PDF reports work on --input-lens / --input-pvacseq (closes #268) LENS and pVACseq aggregates carry every field the template reports need — ``transcript_id`` (per-row Ensembl ID), ``variant_coords`` + ref/alt allele columns, ``rna_reads_covering_genomic_origin``, ``gene_name``. Earlier the external-input path stored those as strings on ``MutantProteinFragment.supporting_reference_transcripts`` and the template renderer crashed deep inside ``variant.effect_on_transcript`` (TypeError: expected Transcript, got str), so the reports were short-circuited with ``source != 'pipeline'`` and ``check_args`` rejected the flags. This change resolves the data so reports render end-to-end: - ``_resolve_transcripts`` (vaxrank/external_input.py) — turns Ensembl-ID strings into pyensembl ``Transcript`` objects via the ``--ensembl-release``-configured genome. Strips version suffixes (``ENST00000312960.4`` → ``ENST00000312960``) since pyensembl 2.x doesn't auto-strip. Unresolvable IDs are dropped silently (DEBUG-logged) so a release mismatch downgrades rather than crashes. - ``_parse_variant_coords`` strips the ``chr`` prefix LENS emits. Without this the variant short_description rendered as ``chrchr3 …`` and ``effect_on_transcript`` found no transcripts (pyensembl uses bare contigs). 
- ``_patient_info_from_external`` synthesizes a ``PatientInfo`` from the loaded ranked output: ``num_somatic_variants`` = unique-variants-with-antigens, ``num_coding_effect_variants`` = unique-variants-with-resolved-Transcript, ``num_variants_with_rna_support`` = unique-variants-with-non-zero reads. ``load_external_ranked`` now returns a 4-tuple including this. - ``MutantProteinFragment.predicted_effect()`` returns ``None`` when no transcripts resolve OR when varcode rejects every transcript (e.g. ``ReferenceMismatchError`` when LENS / pVACseq was called against a different reference than the configured pyensembl release). Empty list → None and a list of failures → None — the template renderer tolerates both. - ``TemplateDataCreator._effect_data`` / ``_query_cancer_hotspots`` / ``compute_template_data`` render ``"—"`` placeholders when the effect is None. - ``external_input_arg_parser`` now exposes ``--ensembl-release`` and pulls in ``add_optional_output_args`` / ``add_supplemental_report_args`` so the namespace shape matches the pipeline parser. - The ``check_args`` rejection for "pipeline-only reports on LENS" is gone — those reports work now. - ``main()`` drops the ``source != 'pipeline'`` short-circuit; template reports run on both paths through one code path. - pVACseq's "Best Transcript" column is now consumed (was hardcoded to []). Verified end-to-end on Pt02 LENS (HugoLo IPRES 2016, 2153 predictions / 489 source proteins) with ``--ensembl-release 75``: ASCII report writes, 384 / 406 variants render with full effect annotation (Effect type / Transcript name / ID / description); the remaining 22 are reference-allele mismatches between LENS's upstream caller and pyensembl 75's reference. Test churn: ``test_parse_variant_coords_extracts_chr_pos`` updated to assert the bare-contig form; ``_variant_from_lens_row`` SNV / INDEL tests likewise; new ``test_lens_cli_writes_ascii_report`` pins the end-to-end behavior. 
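The two normalizations this commit relies on (dropping the Ensembl version suffix before a pyensembl lookup, and dropping the ``chr`` prefix from contigs) can be sketched as below. This is a minimal illustration, not vaxrank's code: the helper names ``strip_transcript_version`` and ``strip_chr_prefix`` are hypothetical, and the real ``_resolve_transcripts`` / ``_parse_variant_coords`` may handle more cases.

```python
def strip_transcript_version(transcript_id: str) -> str:
    """Drop a trailing ``.N`` version suffix from an Ensembl-style ID.

    Hypothetical helper: only a purely numeric suffix after the last dot
    is treated as a version, so IDs without a version pass through as-is.
    """
    base, sep, version = transcript_id.rpartition(".")
    if sep and version.isdigit():
        return base
    return transcript_id


def strip_chr_prefix(contig: str) -> str:
    """Turn ``chr3`` into ``3``; leave bare contig names untouched.

    Hypothetical helper mirroring the bare-contig convention the commit
    message attributes to pyensembl.
    """
    return contig[3:] if contig.startswith("chr") else contig
```

Both helpers are idempotent, so applying them to already-normalized input is harmless.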
* Review fixes: stale docstring, specific exceptions, dedup, test gaps Address every blocking finding from the /review on PR #262: 1. Stale docstring in ``_resolve_transcripts`` claimed pyensembl strips version suffixes internally; that's exactly what we discovered it *doesn't* do. Corrected to match the actual strip-then-lookup behavior. 2. Replace bare ``except Exception`` with specific exception types: - ``_resolve_transcripts``: ``(ValueError, KeyError)`` — pyensembl's not-found shape. - ``MutantProteinFragment.predicted_effect``: lazy-import ``varcode.errors.ReferenceMismatchError`` and catch ``(ReferenceMismatchError, ValueError, KeyError)``. Other exceptions propagate so genuine bugs aren't swallowed. 3. Aggregate INFO/WARN summary for transcript-resolution failures in ``ranked_from_lens_predictions`` and ``ranked_from_pvacseq_predictions``: a single end-of-load WARN covers (a) "had IDs but no --ensembl-release" and (b) "had IDs but release mismatch — N/M didn't resolve." The per-row debug logs stay so the individual IDs are still inspectable. End-to-end on Pt02 LENS without ``--ensembl-release`` now emits one clear WARN pointing the user at the fix instead of silently rendering "—" everywhere. 4. ``_KNOWN_VACCINE_TYPES`` set deleted; ``_resolve_vaccine_type`` now derives known types from ``_VACCINE_TYPE_DISPATCH`` directly. Adding a new vaccine writer is one registration, not two. 5. Dead intermediate (``variants = [v for v, _ in ranked]; n_total = len(variants)``) replaced with ``len(ranked)``. ``num_variants_with_vaccine_peptides`` rolled into the same loop instead of a duplicate ``sum(...)``. Comment pinned that the "first VP carries the transcript" assumption holds because LENS / pVACseq loaders emit one VP per variant — flag if that ever changes. 6. 
``_cleavage_probs_via_subprocess`` docstring now notes the ~1–2s per-call torch-import cost (one launch per ``annotate_processing``, batched across all sources) and points at openvax/mhctools#200 for the upstream fix. 7. New test coverage: - ``_resolve_transcripts``: version-suffix stripping, drop-on- unresolvable, no-genome short-circuit (3 tests, ``_StubGenome`` fixture). - ``_patient_info_from_external``: proxy-count semantics + empty- ranked path. - ``MutantProteinFragment.predicted_effect``: returns None on empty transcripts AND when varcode raises ``ReferenceMismatchError`` for every transcript. - ``--ensembl-release`` reachable through the external-input parser (regression guard). 669 tests pass (was 661; +8 from this commit). * pVACseq: strip chr prefix in _parse_pvacseq_id (matches LENS fix) The earlier ``chr``-prefix fix on the LENS path was missing its symmetric pVACseq counterpart. ``_parse_pvacseq_id`` builds Variants with ``normalize_contig_names=False``, so a pVACseq ID like ``chr1-100000-100001-A-T`` would land on the Variant as ``contig='chr1'`` and silently break ``effect_on_transcript`` against pyensembl (which uses bare contigs). Same one-liner strip as ``_parse_variant_coords``; tests updated to assert the bare-contig invariant on both parsers. Also fix the ``check_args`` error wording — it claimed the listed flags were "--output-* flags," but the list now includes ``--vaccine-output`` which doesn't carry the prefix. 669 / 669 pass. * Pre-flight ensembl-release hint, subprocess module, README rewrite, 2.17.0 Round-tripping the /review output: 1. Subprocess body extracted from inline ``_SUBPROCESS_SCRIPT`` into importable ``vaxrank/_pepsickle_subprocess.py`` with a ``main()`` entry. 
The body is unchanged (still pure ``mhctools.Pepsickle`` I/O), but it's now testable without spawning a subprocess — ``test_pepsickle_subprocess_main_io_contract`` exercises the JSON in/out contract via patched mhctools, and ``test_pepsickle_subprocess_main_returns_2_when_mhctools_missing`` pins the import-error exit-code path. Wrapper exists only because openvax/mhctools#200 hasn't shipped subprocess isolation yet. 2. Ensembl-release inference + pre-flight hint. New ``infer_genome_build_from_lens`` reads the LENS file's ``origin_descriptor`` column and returns ``GRCh37`` / ``GRCh38`` / None based on Hsap37 / Hsap38 markers (LENS ERV-row format). ``check_args`` calls it at startup when external input is paired with template-report flags but no ``--ensembl-release`` is set, and emits a single WARN that names the inferred build and a plausible release. Verified live on Pt02: WARN now fires before any LENS load, naming GRCh38 + release 102 specifically. 3. Doc + version updates: - README.md: every ``--output-peptide`` / ``--output-mrna`` example replaced with the new ``--vaccine-output`` form; ``--vaccine-type`` documented as single-valued (was multi-valued); LENS-driven full-report example added with ``--ensembl-release``. - vaxrank/version.py: 2.16.0 → 2.17.0 (additive features + breaking CLI: removed --output-peptide / --output-mrna / --output-peptide-manifest / --output-peptide-order-form / --output-mrna-manifest / --output-mrna-csv / --output-mrna-csv-no-full-rows). 4. New tests: - subprocess I/O contract (2 tests, no pepsickle dependency). - ``infer_genome_build_from_lens``: GRCh38 / GRCh37 / unknown (3 tests). - ``args.genome`` plumbing on the external path: ``parse + _resolve_ensembl_release → args.genome is EnsemblRelease(75)``. 675 / 675 tests pass. 
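A rough sketch of the build inference described in item 2 above, under several assumptions: the LENS file is tab-separated, the markers appear as the literal substrings ``Hsap37`` / ``Hsap38`` in ``origin_descriptor``, and a file that mentions both builds is treated as unknown. The function name ``infer_build_from_rows`` is hypothetical; the actual ``infer_genome_build_from_lens`` may differ in all of these respects.

```python
import csv
import io

# Assumed marker-to-build mapping, taken from the commit message's wording.
_MARKERS = {"Hsap37": "GRCh37", "Hsap38": "GRCh38"}


def infer_build_from_rows(rows):
    """Return 'GRCh37', 'GRCh38', or None from an iterable of row dicts.

    Rows without an ``origin_descriptor`` value are ignored. Returning
    None when both builds are seen is an assumption of this sketch.
    """
    seen = set()
    for row in rows:
        descriptor = row.get("origin_descriptor") or ""
        for marker, build in _MARKERS.items():
            if marker in descriptor:
                seen.add(build)
    return next(iter(seen)) if len(seen) == 1 else None


# Usage with an in-memory TSV standing in for a LENS report.
example = "peptide\torigin_descriptor\nAAAAAAAAA\tHsap38.chr1.ERV\n"
build = infer_build_from_rows(csv.DictReader(io.StringIO(example), delimiter="\t"))
```

Because the marker only identifies the build, any release suggestion derived from it remains a guess.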
* Pepsickle: drop in-tree subprocess wrapper; use mhctools 3.13.3 isolation mhctools 3.13.3 (openvax/mhctools#201) ships built-in ``Pepsickle(isolate_subprocess=True)`` so the libomp-clash workaround collapses from a vaxrank-side launcher + bespoke subprocess module to a single kwarg. Removed: - ``vaxrank/_pepsickle_subprocess.py`` (the pure-mhctools subprocess body lifted to ``openvax/mhctools/_pepsickle_subprocess.py``). - ``processing._cleavage_probs_via_subprocess`` and ``_SUBPROCESS_MODULE`` (was launching the deleted module). - The two unit tests that exercised the local module's I/O contract. Added: - ``processing._load_default_predictor(human_only, threshold)`` — thin builder that constructs ``Pepsickle(isolate_subprocess=True)`` and degrades to None with a logged warning if mhctools isn't importable. ``annotate_processing`` now resolves a default predictor through this helper and calls ``cleavage_probs`` directly on it (same loop body that the test-seam ``predictor=`` arg has always used). - ``test_load_default_predictor_returns_none_when_mhctools_missing`` pins the graceful-degrade path. Bumped: - ``requirements.txt``: mhctools >=3.13.3 (was >=3.13.1). End-to-end on Pt02 LENS: 2105/2113 EpitopePredictions annotated across 489 unique sources — same numbers as before the refactor, confirming the subprocess invocation moved upstream without changing results. Test count: 674 (was 675; net −1 from removing the two local-subprocess tests + adding one default-predictor smoke). * Fix pepsickle perf cliff; promote ensembl-release WARN to pre-flight error; verify antigen kinds in skip logs Three independent improvements that all came out of running PR #262 on real Pt02 data with auto-mode logging. 1. Perf cliff: ``annotate_processing`` was calling ``predictor.cleavage_probs(source)`` per unique source. 
Under ``Pepsickle(isolate_subprocess=True)`` that's a fresh subprocess PER call (mhctools 3.13.3's ``cleavage_probs(s)`` delegates to ``cleavage_probs_many([s])`` internally) — ~1-2s startup × 489 sources turned a 30s run into ~15 minutes "stuck." Switch to ``cleavage_probs_many`` for one batched subprocess invocation. End-to-end on Pt02: 3.2s total (was effectively hung). Stub predictors that only implement ``cleavage_probs`` still work via the fallback per-source loop. 2. ``--ensembl-release`` requirement promoted from late warning to early error. External input + template-report flag (ASCII / HTML / PDF) without ``--ensembl-release`` would silently render a degraded report with empty effect annotations everywhere. ``check_args`` now raises ValueError pre-flight with the build-inferred release suggestion (``--ensembl-release 102`` when ``origin_descriptor`` says GRCh38, etc.). The redundant loader-level WARN is gone — the release-mismatch case is still logged from the loader since pre-flight can't predict resolution success. 3. Empty-coords / no-gene / no-transcript skip logs now show the actual ``antigen_source`` breakdown instead of asserting "typical for X". Pt02 reveals the empty-coords skip is genuinely composed of CTA/SELF=276, ERV=209, SPLICE=126, FUSION=2 — every kind that's expected to lack genome coords. When SNV / INDEL rows lack these fields, a separate WARN surfaces it as a likely upstream LENS bug rather than burying it in the breakdown. 4. ``annotate_processing`` now emits a single aggregate WARN summarizing peptides skipped because they aren't substrings of their pep_context source (real LENS-data anomaly: 8/2113 on Pt02 — peptide and pep_context came from different isoform/annotation snapshots; see prior analysis). The WARN includes a representative (peptide, pep_context) pair the user can paste into an upstream bug report. 
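The batching fix in item 1 can be sketched as follows. It assumes ``cleavage_probs_many`` takes a list of sequences and returns one result per sequence in the same order; that shape is an assumption here, not something this message confirms. The function name ``cleavage_probs_by_source`` is hypothetical.

```python
def cleavage_probs_by_source(predictor, sources):
    """Map each unique source sequence to its per-position probabilities.

    Prefers one batched call (a single subprocess launch under isolation)
    and falls back to a per-source loop for stub predictors that only
    implement ``cleavage_probs``.
    """
    unique = list(dict.fromkeys(sources))  # dedup, keep first-seen order
    many = getattr(predictor, "cleavage_probs_many", None)
    if callable(many):
        results = many(unique)  # assumed: one result per input, same order
    else:
        results = [predictor.cleavage_probs(seq) for seq in unique]
    return dict(zip(unique, results))
```

With an isolated predictor the batched branch pays the subprocess startup cost once rather than once per source, which is the difference the Pt02 timings above describe.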
Test count: 676 (was 675; +2 new tests for the early-error and peptide-not-in-context paths, −1 fixed test that pinned the now- deleted late warning). * Pre-flight ensembl-release suggestion: list installed releases instead of hardcoding 102 Previously the pre-flight error message hardcoded a suggestion of ``--ensembl-release 102`` for GRCh38 inputs. That number was arbitrary — picked once during development as "stable mid-2020 default" — and ``origin_descriptor`` only tells us the *build* (GRCh37 vs GRCh38), not the release. GRCh38 spans Ensembl 76 through ~114+. Replace the hardcoded mapping with ``installed_ensembl_releases_for_build``, which walks the pyensembl cache (``<platformdirs>/pyensembl/<build>/ensembl<N>``) to enumerate releases the user actually has on disk. The pre-flight error now lists those releases and suggests the latest, with an explicit "origin_descriptor doesn't pin a specific release" caveat so the user knows the suggestion is best-effort, not derived from the file. Three branches: - Local cache has releases for this build → list them, suggest the latest as a quick-start. - Cache empty AND build is GRCh37 → 75 is canonical (final mainline release); suggest installing it. - Cache empty AND build is GRCh38 → no canonical release; acknowledge that and ask the user to pick one matching the LENS file's source build, with an install command. Live verification on Pt02 (29 locally-installed GRCh38 releases): the message now says "you have these GRCh38 releases installed: [77, 78, ..., 113, 114]; try --ensembl-release 114" — concrete, correct, and points at something the user can actually run without first downloading data. 677 / 677 tests pass; +1 test pinning the cache-walk helper. * LENS report quality: inputs section, DNA VAF, dedup epitopes, SLP-windowed fragment Four fixes from running the LENS-driven ASCII report on real Pt02: 1. 
**Inputs section**: ``PatientInfo`` gains an ``inputs: list[(label, path)]`` field rendered verbatim by the report builder. The external (LENS / pVACseq) path now labels its file "LENS report: …" or "pVACseq report: …" instead of misleadingly stuffing it into ``vcf_paths`` and rendering as "VCF (somatic variants) path(s)". The legacy ``vcf_paths``/``bam_path`` fields stay for backward compat but are no longer overloaded by external paths. 2. **DNA VAF**: read LENS's ``vaf`` column (sits between ``mhcflurry_agretopicity`` and the DNA-clonal columns ``totcopynum`` / ``multiplicity`` / ``ccf``; tracks DNA VAF on real Pt02 data — 0.141 vs computed RNA VAF 0.138). Plumb through ``dna_vaf_by_variant`` so the report's "DNA VAF" line populates instead of "n/a". pVACseq path likewise reads its explicit ``DNA VAF`` column (separate from ``RNA VAF``). ``ranked_from_*_predictions`` now returns ``(ranked, dna_vaf_by_variant)`` so the dispatcher can thread it through. 3. **Dedup epitope rows**: real LENS files have multiple rows per ``(peptide, allele)`` when the same peptide is encoded by several transcripts (~5% of Pt02 rows). The loader was emitting one ``EpitopePrediction`` per row and ``ranked_from_lens_predictions`` ``extend``ed them all into the VP, producing duplicate rows in the per-VP epitope table (one annotated by pepsickle, one not, because annotation iterates by id()). Dedup in the loader by a content key (peptide, allele, ic50, percentile_rank, prediction_method_name) before assembling the VP. 4. **Vaccine peptide too long → SLP window**: the LENS path was using ``pep_context`` directly as ``MutantProteinFragment.amino_acids``, but LENS sometimes emits a 100+ aa protein-prefix as pep_context. Vaccine peptides rendered as 100+ aa instead of the canonical 25mer. Add ``MutantProteinFragment.slp_window_around_mutation(...)`` — centers a target-length window on the mutation span — and use it from the LENS loader. 
The pipeline path gets correctly-sized fragments by construction (Isovar emits exactly the requested width); the LENS path now lands on the same shape via this shared helper instead of inheriting LENS's variable pep_context length. ``--vaccine-peptide-length`` (default 25) threads through ``load_external_ranked``. End-to-end on Pt02 (``--ensembl-release 113``): - Patient header: "LENS report: …" - DNA VAF: 0.141, RNA VAF: 0.138 (distinct) - Vaccine peptide: 25aa (was 103aa) - One row per (peptide, allele) in the epitope table 677 / 677 tests pass. * PatientInfo unification: pipeline path also uses inputs; LENS infers MHC alleles Two improvements that share more code between VCF/BAM and external (LENS / pVACseq) paths: 1. Pipeline path now populates ``PatientInfo.inputs`` with ``[("VCF (somatic variants)", path), ("BAM (RNAseq reads)", bam)]`` so both code paths feed the same renderer with the same shape. The legacy ``vcf_paths`` / ``bam_path`` fields stay populated for backward-compat with previously-saved JSON; new code reads ``inputs`` exclusively. 2. ``_patient_info_from_external`` now takes ``predictions`` and infers the unique MHC alleles from them — LENS / pVACseq don't carry an explicit alleles header, but the alleles are implicit in the per-row predictions. The patient-info block now shows them instead of leaving "MHC alleles:" blank, with an explicit "(inferred from report)" suffix so the reader knows the source. 3. Dropped the "— origin_descriptor doesn't pin a specific release" qualifier from the pre-flight error per user feedback; the "(or whichever matches the build LENS used)" is sufficient. 677 / 677 tests pass. * Report quality: name predictor, processing-agnostic columns, geometric-mean composite, modality-aware manufacturability Five report-quality changes from real-Pt02 review: 1. **MHC predictor visible per row**: new ``Predictor`` column in the per-VP epitope table (mhcflurry / netmhcpan / etc.). 
LENS files can be multi-predictor; pipeline runs are usually single-predictor but consistent rendering doesn't hurt. 2. **Score column header named for clarity**: "Score" → "Score (affinity, logistic IC50)". The score is a logistic transform of IC50, not mhcflurry's presentation_score / EL — vaxrank computes it locally so it's comparable across predictors. The longer header eliminates the ambiguity. 3. **Predictor-agnostic processing columns**: column headers "Pepsickle C-term cut" / "Pepsickle max internal cut" / "Pepsickle processing score" → "Processing: C-term" / "Processing: max internal" / "Processing: combined". The underlying field names stay ``pepsickle_*`` so a future per- position predictor (NetChop, PAProC) can land alongside without collision; the active predictor is named once in the patient-info header as "Processing predictor: pepsickle". 4. **Composite score → geometric mean**: ``pepsickle_processing_score`` was ``c_term * (1 - max_internal)`` (mhctools' canonical ``score_cterm_anti_max_internal``). Switch to ``sqrt(c_term * (1 - max_internal))`` — the geometric mean of the two factors. Less-aggressive penalty when both factors are mid-range; a row with c_term=0.6 and (1 - max_internal)=0.6 now scores ~0.6 instead of 0.36, which matches how downstream readers interpret the combined as a "credibility tag" rather than a strict joint probability. Pt02 example: c_term=0.84, max_internal=0.38 → was 0.52, now 0.72. 5. **Modality-aware manufacturability default**: the GRAVY / Cys- content / N-terminal-Q manufacturability section in template reports defaults on for ``--vaccine-type=peptide`` (relevant to peptide synthesis) and off for ``--vaccine-type=mrna`` (those features don't apply to mRNA constructs). Explicit ``--include-manufacturability-in-report`` / ``--no-manufacturability-in-report`` still override. 677 / 677 tests pass. 
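The composite change in item 4 is easiest to see with the two formulas side by side. The function names below are illustrative only; the formulas are the ones quoted in the commit message.

```python
import math


def product_score(c_term: float, max_internal: float) -> float:
    """Previous composite: c_term * (1 - max_internal)."""
    return c_term * (1.0 - max_internal)


def geometric_mean_score(c_term: float, max_internal: float) -> float:
    """New composite: sqrt(c_term * (1 - max_internal))."""
    return math.sqrt(c_term * (1.0 - max_internal))


# Pt02 example from the commit message: c_term=0.84, max_internal=0.38.
old = round(product_score(0.84, 0.38), 2)         # 0.52
new = round(geometric_mean_score(0.84, 0.38), 2)  # 0.72
```

Both composites equal 1.0 only when ``c_term`` is 1 and ``max_internal`` is 0, and both are 0 when either factor is 0; they differ in how strongly mid-range values are penalized.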
* Generic, modality-aware reports: drop peptide-coded language; surface vaccine type Cleanup of the report-generation surface so the same flags (``--output-ascii-report`` / ``--output-html-report`` / ``--output-pdf-report`` / ``--output-csv`` / ``--output-xlsx-report`` / ``--output-json-file`` / ``--output-neoepitope-report``) work across every ``--vaccine-type`` and adapt their content to the active modality, rather than always rendering peptide-specific material. Visible changes on Pt02 LENS: - Patient header now opens with "Vaccine type: peptide" / "Vaccine type: mrna" so the rendered modality is explicit. - Section heading "Vaccine Peptides:" → "Vaccine antigens:" — same content, modality-neutral name. - Manufacturability section (GRAVY, Cys content, N-terminal Q/E/C, Asp-Pro bonds) renders by default for ``--vaccine-type=peptide`` and is suppressed for ``--vaccine-type=mrna`` (those metrics are about peptide synthesis; mRNA constructs translate in vivo). Explicit ``--include-manufacturability-in-report`` / ``--no-manufacturability-in-report`` still override. Help-text rewrites: every "vaccine peptide report" reference in ``arg_parser.py`` is now "summary report" (or similar) with an explicit "antigen-centric; same flag for all --vaccine-type modes" note. Future modalities (DNA, etc.) plug in by extending ``--vaccine-type`` choices and registering their writer in ``_VACCINE_TYPE_DISPATCH``; no rename of any output flag is needed. Filed openvax/vaxrank#269 for the related but separate work of HLA-balanced antigen packing across multi-construct mRNA vaccines — not included here because it changes assembly semantics and needs its own evaluation against the existing greedy strategy. 677 / 677 tests pass. 
* Multi-mode --vaccine-type + --output-dir; canonical filenames per modality Reworked the vaccine-type / output API to support designing multiple modalities in one run while keeping the single-mode case flat: CLI changes (breaking): - ``--vaccine-type`` is multi-valued again (``nargs='+'``, default ``['peptide']``). Previous single-mode runs still work (one value); multi-mode runs accept e.g. ``--vaccine-type peptide mrna``. - Removed ``--vaccine-output``, ``--vaccine-manifest``, ``--vaccine-order-form``, ``--vaccine-csv``, ``--vaccine-csv-no-full-rows``. The path-companion flags went away because they couldn't disambiguate in multi-mode (which manifest? peptide's or mrna's?). - Added ``--output-dir DIR``: always a directory, never a file path, no extension-based mode switching. Vaxrank picks canonical filenames inside. - Added ``--mrna-csv-no-full-rows`` (the boolean knob from the removed ``--vaccine-csv-no-full-rows``; now lives in the mrna knob group). Output layout: - Single-mode (one --vaccine-type): canonical files land directly in ``--output-dir``. peptide → vaccine.fasta + manifest.json + order_form.csv mrna → cds.fasta + no_polyA.fasta + full.fasta + manifest.json + layers.csv - Multi-mode (≥2 types): per-modality subdirs so canonical filenames don't collide on ``manifest.json``. DIR/peptide/... DIR/mrna/... Internals: - ``_resolve_vaccine_type`` → ``_resolve_vaccine_types`` (returns ordered, deduplicated list). Bare-string callers still tolerated. - New ``_vaccine_target_dir(output_dir, vtype, all_active)`` — shared "where do my files go" helper. Single-mode flat, multi-mode subdir. - ``_emit_outputs`` iterates active types; each writer takes ``(args, ranked, target_dir)`` and uses canonical filenames inside. - Manufacturability default tightened: was "on for vaccine_type=peptide", now "on iff 'peptide' in active types". Mixed peptide+mrna runs include manufacturability (peptide is active); mrna-only runs omit it. 
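The single-mode versus multi-mode layout rule can be sketched in a few lines. The signature follows the one named above (``output_dir, vtype, all_active``), but the bodies are an assumption, and the names are written without the leading underscore to make clear they are not vaxrank's functions.

```python
from pathlib import Path


def resolve_vaccine_types(requested):
    """Ordered, deduplicated list of types; a bare string is tolerated."""
    if isinstance(requested, str):
        requested = [requested]
    return list(dict.fromkeys(requested))


def vaccine_target_dir(output_dir, vtype, all_active):
    """Flat layout for one active type, per-modality subdir for several."""
    base = Path(output_dir)
    return base if len(all_active) == 1 else base / vtype


active = resolve_vaccine_types(["peptide", "mrna", "peptide"])
targets = {vtype: vaccine_target_dir("out", vtype, active) for vtype in active}
```

Here ``targets`` maps ``peptide`` to ``out/peptide`` and ``mrna`` to ``out/mrna``, so the two ``manifest.json`` files cannot collide.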
Verified end-to-end on Pt02 LENS: - ``--vaccine-type peptide --output-dir out/`` → flat layout with vaccine.fasta + manifest.json + order_form.csv - ``--vaccine-type peptide mrna --output-dir out/`` → subdir layout: out/peptide/{...} + out/mrna/{...}, both written successfully Test churn: ``_mrna_args`` rebuilt around the new fields; ``_resolve_vaccine_types`` test pins multi-mode + dedup + first-occurrence order; new ``_vaccine_target_dir`` test pins the single-vs-multi branching; new ``test_lens_with_multi_vaccine_type_uses_subdirs`` end-to-end test; old ``test_emit_outputs_skips_writer_when_no_vaccine_output`` renamed to ``..._no_output_dir``. README + arg_parser docstrings updated. 679 / 679 tests pass. * Default --vaccine-type to 'peptide mrna'; address /review findings Default change: ``--vaccine-type`` now defaults to ``['peptide', 'mrna']`` (was ``['peptide']``). Both modalities share the same ranking, pepsickle annotation, and load work; only the construct-assembly step differs. Designing both by default matches what users typically want and leaves single-modality runs to an explicit ``--vaccine-type peptide`` (or mrna). /review findings addressed: #2 Reject existing-file paths and FASTA-like suffixes in ``--output-dir`` early via ``check_args`` → new ``_reject_output_dir_misuse`` helper. Catches ``--output-dir out.fasta`` / ``--output-dir existing_file.txt`` before any ranking work happens, with an actionable error message. #3 Multi-mode partial failures: ``_emit_outputs`` now wraps each writer in try/except and logs ERROR + continues, so an mrna assembly bug doesn't silently abort a peptide run (or vice versa). Other modalities still get attempted; final dispatch line records what fired. #4 Cleaner dispatch log: ``wrote=[]`` instead of ``wrote=['(none)']``. Anyone parsing the line gets a real Python list instead of a list-with-sentinel. 
#5 Manufacturability resolution dedups the shape-tolerance code: ``args.manufacturability = 'peptide' in _resolve_vaccine_types(args)``. One source of truth for the type-list parsing. #9 ``--output-csv`` help text now spells out the format duality: pipeline path emits per-variant rows; LENS / pVACseq emit per-(peptide, allele) rows. Antigen-centric in both cases; doesn't change with --vaccine-type. #10 New ``test_manufacturability_default_on_for_peptide_only`` pins the modality-aware default for each combination of --vaccine-type, plus ``test_manufacturability_explicit_flag_overrides_default`` confirms ``--include-manufacturability-in-report`` / ``--no-manufacturability-in-report`` still win. #7 Test-fixture refactor: ``_mrna_args`` → ``_vaccine_args`` covering both peptide and mrna knobs (alias for backward-compat in tests that already say ``_mrna_args``). The multi-mode test collapses from 18 lines of manual attribute injection to one call. End-to-end on Pt02: default ``vaccine_type=['peptide', 'mrna']`` with ``--output-dir vaccines/`` writes vaccines/peptide/{...} and vaccines/mrna/{...}; without ``--output-dir`` the dispatch line shows ``wrote=[]`` and only requested analysis reports get written. 681 / 681 tests pass. * YAML config: vaccine_constructs.{shared,peptide,mrna} with CLI override New top-level YAML section ``vaccine_constructs`` that drives construct-assembly knobs. Layout: vaccine_constructs: shared: # cross-modality defaults (length bounds, antigen_content, …) peptide: # peptide-specific (mode, linker, n-term acetyl, …) mrna: # mRNA-specific (signal_peptide, UTRs, polyA, junctions, …) Resolution order (highest precedence first): explicit CLI flag > per-modality config > shared config > built-in default Implementation: - ``vaxrank/config/schema.py``: new ``VaccineConstructsConfigSchema`` with ``_SharedConstructConfigSchema``, ``_PeptideConstructConfigSchema``, ``_MrnaConstructConfigSchema`` subschemas. 
msgspec validates at YAML-load time so unknown keys (typos) error fast. Knobs whose defaults differ between modalities (linker, antigens_per_construct, max_constructs) live in the per-modality subsections; only knobs whose defaults are the same go in shared. - ``vaxrank/config/loader.py``: new ``extract_construct_kwargs(config, modality)`` merges ``vaccine_constructs.shared`` with ``vaccine_constructs.<modality>``; modality-specific values override shared values. None entries are dropped so callers can ``dict.get(key, fallback)`` cleanly. - ``parse_vaxrank_args`` now snapshots parser defaults into ``parsed._parser_defaults`` so downstream code can detect "user explicitly passed this flag" by comparing ``args.X`` to the snapshot. - ``_emit_peptide_constructs`` / ``_emit_mrna_constructs`` consume YAML config via a small ``cfg(cli_attr, yaml_key)`` closure that resolves through the precedence chain. Every CLI flag they take is now config-overridable; CLI still wins when explicit. - ``vaxrank/config/default.yaml``: new section with a complete commented-out template showing every knob and the layered ``shared`` / ``peptide`` / ``mrna`` structure. Default config is still a no-op — uncomment knobs to change behavior. Tests: - Four new ``test_extract_construct_kwargs_*`` tests pin the merge semantics (shared-only, modality-override, None-dropping, absent-section). - ``test_construct_config_overrides_cli_default`` end-to-end: ``vaccine_constructs.peptide.linker: AAY`` drives the writer when the user runs without ``--peptide-linker``, and CLI ``--peptide-linker EAAAK`` overrides the YAML. Live verification on lens_example.tsv: a config setting ``vaccine_constructs.peptide.linker: AAY`` plus ``antigens_per_construct: 3`` packs TP53 + BRAF + KRAS antigens into one construct with AAY linkers visible in the FASTA output — neither value passed on the CLI. 686 / 686 tests pass. 
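The merge and precedence rules above can be sketched with plain dicts. This is an illustration under assumptions: the real config objects are msgspec structs rather than dicts, ``resolve_knob`` is a hypothetical stand-in for the ``cfg(cli_attr, yaml_key)`` closure, and dropping ``None`` entries before merging (so an unset per-modality key does not mask a shared value) is a choice made for this sketch.

```python
def _without_none(mapping):
    """Copy of ``mapping`` with None-valued entries removed."""
    return {k: v for k, v in (mapping or {}).items() if v is not None}


def extract_construct_kwargs(config, modality):
    """Merge the shared knobs with one modality's knobs; modality wins."""
    section = config.get("vaccine_constructs") or {}
    merged = _without_none(section.get("shared"))
    merged.update(_without_none(section.get(modality)))
    return merged


def resolve_knob(cli_value, parser_default, yaml_kwargs, key, builtin_default):
    """Explicit CLI flag > merged YAML config > built-in default.

    "Explicit" is detected by comparing against the parser default, so a
    user who passes the default value explicitly is indistinguishable
    from one who passed nothing.
    """
    if cli_value != parser_default:
        return cli_value
    return yaml_kwargs.get(key, builtin_default)


config = {
    "vaccine_constructs": {
        "shared": {"max_constructs": 2, "linker": None},
        "peptide": {"linker": "AAY"},
    }
}
peptide_kwargs = extract_construct_kwargs(config, "peptide")
```

The comparison-against-defaults approach keeps the parser unchanged, at the cost of the ambiguity noted in the docstring.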
* Multi-mode dispatch: any writer failure aborts the whole run

Reverted the try/except + continue from the prior /review fix. Partial vaccine output is worse than no vaccine output: ending up with a peptide FASTA on disk but no mRNA directory (or vice versa) is the kind of half-state that quietly ships to a downstream reviewer. ``_emit_outputs`` now lets writer exceptions propagate so the run fails loud. The pre-raise log line names which modality crashed and which already wrote: enough operator context to decide whether to clean up partial files before retrying. Single-modality runs are unaffected (one writer; failure already bubbled).

686 / 686 tests pass.

* Rename vaccine_constructs.shared → defaults

``shared`` reads weirdly in single-modality runs ("shared with what?"). ``defaults`` reads cleanly at any --vaccine-type count: "default values; per-modality blocks override them." Same merge semantics as before:

    vaccine_constructs:
      defaults:   # values applied to every modality not overriding them
      peptide:    # peptide-specific overrides
      mrna:       # mRNA-specific overrides

Resolution order unchanged:

    explicit CLI flag > per-modality config > defaults config > built-in default

Schema, loader, default.yaml, tests all renamed in lockstep. No behavior change.

686 / 686 tests pass.

* Flatten vaccine_constructs: cross-modality knobs at section top level

User's point: ``defaults:`` was a misnomer. Those values aren't "defaults" the user is overriding; they're the actual values for the run. Drop the subsection wrapper; cross-modality knobs go directly under ``vaccine_constructs:``. Modality subsections (``peptide:`` / ``mrna:``) are only for modality-specific overrides or modality-only knobs.

Before: vaccine_constructs: defaults: max_antigen_length_aa…
Summary
Adds opt-in subprocess isolation for `mhctools.Pepsickle` via `Pepsickle(isolate_subprocess=True)`. The isolated path sends unique input sequences plus the existing `human_only` and `threshold` options to a short-lived Python subprocess over JSON stdin/stdout, so pepsickle can load torch without sharing the parent process's already-loaded OpenMP runtimes.

The processing predictor base now has a `cleavage_probs_many` hook, used by `predict`, `predict_proteins`, and `predict_cleavage_sites`, so isolated pepsickle calls batch unique sequences into one subprocess invocation while preserving the existing `cleavage_probs` API.

Also bumps `mhctools.__version__` from `3.13.2` to `3.13.3`.

Fixes #200.
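For readers unfamiliar with the pattern, the JSON-over-stdin/stdout isolation described in this summary looks roughly like the sketch below. It is not the mhctools implementation: the child script, the payload keys, the default values for `human_only` and `threshold`, and the placeholder scoring are all assumptions made for illustration. A real child would import the heavy library (pepsickle and torch) where the comment indicates.

```python
import json
import subprocess
import sys

# Hypothetical child body. It runs in a fresh interpreter, so any native
# runtime it loads stays out of the parent process.
_CHILD_SCRIPT = r"""
import json, sys
request = json.load(sys.stdin)
# A real child would import the heavy predictor here and score sequences.
results = [[0.0] * len(seq) for seq in request["sequences"]]  # placeholder
json.dump({"results": results}, sys.stdout)
"""


def run_isolated(sequences, human_only=False, threshold=0.5):
    """Score unique sequences in one short-lived subprocess.

    The defaults for ``human_only`` and ``threshold`` are placeholders.
    """
    unique = list(dict.fromkeys(sequences))  # one launch for the whole batch
    payload = json.dumps(
        {"sequences": unique, "human_only": human_only, "threshold": threshold}
    )
    completed = subprocess.run(
        [sys.executable, "-c", _CHILD_SCRIPT],
        input=payload,
        capture_output=True,
        text=True,
        check=True,  # a non-zero child exit raises CalledProcessError
    )
    response = json.loads(completed.stdout)
    return dict(zip(unique, response["results"]))
```

Batching all unique sequences into a single call matters because each launch pays the interpreter and library import cost once.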
Validation
- `./lint.sh`
- `./test.sh` (397 passed, 9 skipped, 2 xfailed)