Improve LENS/pVACseq-import UX, logging, and vaccine naming #302

Merged
iskandr merged 15 commits into main from fix-output-dir-creation on May 22, 2026

@iskandr iskandr commented May 20, 2026

Started as the parent-dir fix and broadened (per request) into a pass over the LENS/pVACseq-import experience.

Changes

Output / reports

  • Create the parent directory for neoepitope report outputs (the original crash: Cannot save file into a non-existent directory).
  • --output-dir now auto-populates the ASCII text + PDF reports too, not just the CSV.
  • Write a top-level run_summary.txt (inputs, MHC alleles in play, antigen counts, output map) so multi-type runs have shared run context at the top level alongside the neoepitope table and reports; per-modality constructs stay in subdirs.
  • Prompt before overwriting an existing non-empty --output-dir (interactive only; batch runs warn and proceed). The new --force-overwrite flag skips the prompt.

Logging

  • Replace the scattered LENS-load count lines (which double-counted the same non-coord rows across two passes) with a single load funnel that makes the two row uses explicit: every row is scored as a candidate epitope for the neoepitope report, while only genomic-coord rows are placed on the genome for construct ranking.
  • Source-specific dispatch label: [LENS import] / [pVACseq import] / [full pipeline] instead of [external].
  • Rename "Pepsickle credibility tagging" -> "proteasomal-cleavage annotation".

Correctness

  • Fix inferred-MHC-allele aggregation: read alleles from CandidateEpitope.per_allele_scores (it has no .allele), so the per-junction linker optimizer + mhcflurry default actually engage on the external path. Surface the alleles once, up front.
  • Drop peptide-not-in-pep_context rows at load (excluded from report + constructs, warned once) instead of late inside pepsickle.

Naming / display

  • Redesign vaccine antigen naming: GENE (CATEGORY) (e.g. FYN (INDEL)), with the specific mutation appended only to disambiguate when a gene contributes 2+ variants (TP53 (SNV @ 1:100 A>T)). Empty indel alleles render as '-' rather than varcode's '.'. FASTA headers gain a trailing genes= field.
  • Neoepitope CSVs use blank uniformly for missing values (was a mix of "No prediction" strings and blanks).

Docs

  • README links pVACseq, BigMHC, topiary, and the input-format tools.

All 847 tests pass; tests updated for the intentional format changes.

iskandr added 8 commits May 20, 2026 13:49
write_neoepitope_report wrote the CSV/XLSX directly via pandas, which
refuses to write into a non-existent directory. The vaccine-construct
writers already os.makedirs their target dir, and arg_parser assumes the
writer does so for tabular outputs, but this path never did -- so
--output-csv foo/bar.csv crashed if foo/ didn't exist.

Add _ensure_parent_dir and call it before both writes.
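
A minimal sketch of the kind of helper this commit describes. The body below is an assumption (only the name _ensure_parent_dir comes from the commit message), and write_text_report is a hypothetical stand-in for the pandas CSV/XLSX writers:

```python
import os


def ensure_parent_dir(path):
    """Create the directory that will contain `path` if it is missing."""
    # abspath makes a bare filename like "bar.csv" resolve to the current
    # working directory instead of an empty dirname.
    parent = os.path.dirname(os.path.abspath(path))
    # exist_ok=True makes this a no-op when the directory already exists.
    os.makedirs(parent, exist_ok=True)
    return parent


def write_text_report(path, text):
    # Hypothetical writer: create the parent directory first, then write.
    ensure_parent_dir(path)
    with open(path, "w") as handle:
        handle.write(text)
```

With this in place, a target such as foo/bar.csv no longer depends on foo/ having been created by the caller.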
- Fix inferred-MHC-allele aggregation: read alleles from CandidateEpitope
  per_allele_scores (it has no .allele), so the linker optimizer + mhcflurry
  default actually engage on the external path. Surface them once up front.
- Auto-populate ASCII + PDF reports under --output-dir (not just CSV).
- Drop peptide-not-in-pep_context rows at load instead of late in pepsickle;
  excluded from report + constructs, warned once at load.
- Source-specific dispatch label: [LENS import] / [pVACseq import] / [full pipeline].
- Rename 'Pepsickle credibility tagging' -> 'proteasomal-cleavage annotation'.

Tests updated accordingly.
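
The allele-aggregation fix can be illustrated roughly as follows. This is a sketch, not the project's code: the CandidateEpitope class here is a simplified stand-in, and it assumes per_allele_scores is a mapping keyed by MHC allele name, which the commit message does not spell out.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class CandidateEpitope:
    # Simplified stand-in for the real class. Assumption: per_allele_scores
    # maps an MHC allele name to some score for this peptide.
    peptide: str
    per_allele_scores: Dict[str, float] = field(default_factory=dict)


def inferred_alleles(epitopes: Iterable[CandidateEpitope]) -> List[str]:
    """Collect the distinct alleles seen across all epitopes, sorted."""
    alleles = set()
    for epitope in epitopes:
        # There is no single `.allele` attribute; an epitope may carry
        # scores for several alleles, so take every key.
        alleles.update(epitope.per_allele_scores)
    return sorted(alleles)
```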
Replace the scattered per-stage count lines (input breakdown, skipped-no-coords,
lack gene_name/transcript/reads) -- which double-counted the same non-coord rows
across two independent passes -- with a single funnel that makes the two row uses
explicit: every row is scored as a candidate epitope for the neoepitope report,
while only genomic-coord rows are placed on the genome for construct ranking.
Keep the SNV/INDEL anomaly WARNINGs. Tests repointed at the funnel line.
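
One way to picture the single-pass funnel, as a sketch under assumptions: the column names chrom and pos and the counter names are hypothetical, and the real funnel line may report more categories. The point is that each row is visited once, so non-coordinate rows cannot be counted twice.

```python
from typing import Dict, Iterable, Mapping, Optional


def load_funnel(rows: Iterable[Mapping[str, Optional[str]]]) -> Dict[str, int]:
    """Tally both row uses in one pass over the input rows."""
    counts = {"rows": 0, "scored_for_report": 0, "placed_on_genome": 0}
    for row in rows:
        counts["rows"] += 1
        # Every row is scored as a candidate epitope for the report.
        counts["scored_for_report"] += 1
        # Only rows with genomic coordinates are used for construct ranking.
        if row.get("chrom") and row.get("pos"):
            counts["placed_on_genome"] += 1
    return counts
```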
The LENS/pVACseq neoepitope CSVs mixed a 'No prediction' string in the affinity
columns with blank numeric cells. Use blank everywhere so each CSV is internally
consistent and loads cleanly into pandas/Excel; a blank prediction column reads
as 'not predicted'. The rich XLSX report keeps its own 'No prediction' prose.
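
A small sketch of the blank-for-missing convention, using only the standard library. The helper names and column names are hypothetical; the actual writers go through pandas.

```python
import csv
import io
import math


def blank_if_missing(value):
    """Map None, NaN, empty strings and 'No prediction' to a blank cell."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    if isinstance(value, str) and value.strip() in ("", "No prediction"):
        return ""
    return value


def rows_to_csv(rows, columns):
    """Serialize dict rows to CSV text with uniform blanks for missing data."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        writer.writerow([blank_if_missing(row.get(col)) for col in columns])
    return buffer.getvalue()
```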
Interactive runs get a y/N prompt (declining aborts); non-interactive runs warn
and proceed so batch pipelines don't hang. New --force-overwrite skips the
prompt. Added to both arg parsers via add_output_args.
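
The decision logic described above might look roughly like this. It is a sketch only: the function name, its parameters, and the message wording are assumptions, and the prompt and warning callables are injectable so the logic can be exercised without a terminal.

```python
import os
import sys


def should_write_into(output_dir, force_overwrite=False, interactive=None,
                      ask=input, warn=print):
    """Return True if the run may write into output_dir."""
    # A missing or empty directory has nothing to overwrite.
    if not os.path.isdir(output_dir) or not os.listdir(output_dir):
        return True
    # The force flag skips the prompt entirely.
    if force_overwrite:
        return True
    if interactive is None:
        interactive = sys.stdin.isatty()
    if not interactive:
        # Batch runs warn and proceed so pipelines do not hang on a prompt.
        warn("WARNING: %s is not empty; same-named files will be overwritten"
             % output_dir)
        return True
    answer = ask("%s is not empty. Overwrite same-named files? [y/N] "
                 % output_dir)
    # Anything other than an explicit yes declines, matching the y/N default.
    return answer.strip().lower() in ("y", "yes")
```

A caller would abort the run when this returns False.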
Multi-type runs keep per-modality constructs in subdirs, but the shared run
context (inputs, MHC alleles in play, antigen counts, output map) now sits at
the top level alongside the neoepitope table and ASCII/PDF reports.

Antigen names now read 'GENE (CATEGORY)' (e.g. 'FYN (INDEL)') instead of
'gene_chr_pos_ref_alt'. The specific mutation is appended only to disambiguate
when a gene contributes 2+ variants to the construct set:
'TP53 (SNV @ 1:100 A>T)'. Empty indel alleles render as '-' rather than
varcode's normalized '.'. FASTA construct headers gain a trailing genes= field
listing the construct's distinct gene(s). Category derives from the varcode
variant so it's identical on the pipeline and LENS/pVACseq paths.

Minimal-epitope tags fold into the parens ('GENE (SNV, epitope)') so the gene
stays parseable. Tests updated to the new format.
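
The naming rule can be sketched as below. The Variant tuple and function names are hypothetical stand-ins (the real code derives the category from a varcode variant), and the epitope-tag and FASTA-header parts are left out.

```python
from collections import Counter
from typing import List, NamedTuple


class Variant(NamedTuple):
    # Hypothetical minimal record; not the varcode Variant class.
    gene: str
    category: str  # e.g. "SNV" or "INDEL"
    contig: str
    pos: int
    ref: str
    alt: str


def display_allele(seq: str) -> str:
    """Render an empty (or '.') indel allele as '-'."""
    return seq if seq and seq != "." else "-"


def antigen_names(variants: List[Variant]) -> List[str]:
    """Name each variant 'GENE (CATEGORY)', adding the mutation only when
    the same gene contributes two or more variants."""
    per_gene = Counter(v.gene for v in variants)
    names = []
    for v in variants:
        if per_gene[v.gene] < 2:
            names.append("%s (%s)" % (v.gene, v.category))
        else:
            names.append("%s (%s @ %s:%d %s>%s)" % (
                v.gene, v.category, v.contig, v.pos,
                display_allele(v.ref), display_allele(v.alt)))
    return names
```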
@iskandr iskandr changed the title Create parent dir for neoepitope report outputs Improve LENS/pVACseq-import UX, logging, and vaccine naming May 20, 2026
iskandr added 4 commits May 20, 2026 21:03
Review follow-ups + shared-state unification:
- Reword the output-dir overwrite prompt (it overwrites same-named files, not a wipe).
- Consistent funnel antigen_source ordering via one shared sort key (SNV/INDEL first).
- ranked_from_{lens,pvacseq}_predictions now early-return ([], {}), matching the
  normal 2-tuple return, so callers' unpacking can't crash on empty input.
- Verified the load-time mismatch drop has the SAME tolerance as the prior
  pepsickle path (_closest_occurrence is exact-match, not fuzzy) — no change.

Shared epitope state: the VCF pipeline, LENS, and pVACseq paths already converge
on CandidateEpitope (candidate_epitopes_from_rows), DSL scoring, and VaccinePeptide
assembly. The remaining divergence was the neoepitope report table — three
hand-rolled row builders (where the 'No prediction'/blank drift originated). Extract
neoepitope_core_row() and route all three (_build_lens_report_row,
_build_pvacseq_report_row, report.make_minimal_neoepitope_report) through it, so the
shared columns + missing-value convention are produced by one piece of code. LENS
CSV column order preserved; pipeline report's WT affinity now blank (consistent).
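
To illustrate the idea of one shared row builder, here is a sketch in which every name is hypothetical: core_row stands in for neoepitope_core_row, and the column names and record keys are invented for the example rather than taken from LENS or pVACseq.

```python
# Hypothetical shared columns; the real report has its own column set.
CORE_COLUMNS = ("gene", "peptide", "allele", "mutant_affinity", "wt_affinity")


def core_row(gene, peptide, allele, mutant_affinity=None, wt_affinity=None):
    """Build the shared columns, writing a blank for any missing value."""
    values = (gene, peptide, allele, mutant_affinity, wt_affinity)
    return {col: ("" if val is None else val)
            for col, val in zip(CORE_COLUMNS, values)}


def lens_report_row(record):
    # Source-specific builders only map their own keys onto the core row
    # and then append their extra columns.
    row = core_row(record["gene"], record["peptide"], record["allele"],
                   record.get("affinity"), record.get("wt_affinity"))
    row["priority_score"] = record.get("priority_score", "")
    return row


def pvacseq_report_row(record):
    return core_row(record["gene"], record["epitope"], record["hla"],
                    record.get("ic50_mt"), record.get("ic50_wt"))
```

Because the missing-value convention lives only in core_row, the per-source builders cannot drift apart on it.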
The report's parameter dump leaked the argparse 'version' SUPPRESS sentinel and
listed every unset (None/'') field (e.g. 'manufacturability: None'), which
misrepresents what actually ran. Drop output paths, internal config-plumbing
keys, the SUPPRESS sentinel, and unset values — leaving the effective parameters.
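
A sketch of the filtering described above. The function name and the hidden_keys parameter are assumptions; argparse.SUPPRESS is the real sentinel from the standard library.

```python
import argparse


def effective_parameters(args, hidden_keys=()):
    """Return only the parameters that were actually set for this run."""
    params = {}
    for key, value in vars(args).items():
        # Output paths and internal plumbing keys are passed as hidden_keys.
        if key in hidden_keys:
            continue
        # Skip unset values and the argparse SUPPRESS sentinel. False and 0
        # are kept because they are real settings.
        if value is None or value == "" or value == argparse.SUPPRESS:
            continue
        params[key] = value
    return params
```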
Multi-vaccine-type runs now write a modality-agnostic core report (antigen
ranking + coverage, no construction blocks or manufacturability) at the top
level, plus a per-modality report in each subdir (peptide/, mrna/) = core +
that modality's construction detail. Manufacturability follows
--manufacturability for the peptide report and is off for the core + mRNA reports.

Single-modality / no --output-dir runs keep the one combined report.

TemplateDataCreator gains an include_manufacturability override; the
antigen-ranking blurb points to the per-modality reports when the current
report has no construction sections.

Shared, modality-agnostic helpers in external_input (used by both the LENS and
pVACseq loaders, not duplicated per modality):

- check_varcode_annotation / AnnotationCheck: validate the provider's gene +
  effect class against varcode on the provider's own transcript (positions are
  isoform-dependent, so compare class not residue numbers). varcode is used only
  to validate/interpret provider columns, never to supply data the provider lacks.
  Emits one per-load summary ('varcode agrees with LENS on all N variants').

- variant_is_frameshift: frameshift from the variant's own ref/alt length delta
  (provider data, works for any modality), not a tool-specific column.

- maximal_mutant_span: for frameshifts, combine ALL of a variant's neoepitope
  rows — the union start (~frameshift onset) plus extension to the end of the
  translated context — so the full novel tail is marked mutant instead of just
  the representative neoepitope. Non-frameshift variants keep the local span.
  Fixes FYN/BTN3A3/TUBB2A showing only a partial mutant region.

pVACseq's best-peptide antigen is already span [0,len] (fully mutant), so only
the sanity check applies there.
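
The two span helpers can be sketched as follows. The signatures are assumptions, spans are taken to be half-open (start, end) offsets into the translated context, and the frameshift test uses the usual rule that a length change not divisible by 3 shifts the reading frame.

```python
def variant_is_frameshift(ref, alt):
    """True when the ref/alt length delta is not a multiple of 3."""
    # Treat '.' and '-' as empty alleles so pure insertions and deletions
    # are measured correctly.
    ref = "" if ref in (".", "-") else ref
    alt = "" if alt in (".", "-") else alt
    return (len(alt) - len(ref)) % 3 != 0


def maximal_mutant_span(epitope_spans, context_length, is_frameshift,
                        representative_span):
    """Return the (start, end) region to mark as mutant.

    Precondition (from the follow-up commit): the context ends at the new
    stop codon, with no trailing wild-type sequence.
    """
    if not is_frameshift:
        # Non-frameshift variants keep the local span.
        return representative_span
    # Earliest start across all of the variant's neoepitope rows
    # approximates the frameshift onset; everything after it is novel.
    start = min(span_start for span_start, _ in epitope_spans)
    return (start, context_length)
```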
iskandr added 3 commits May 21, 2026 12:37
…nt()

The stateful accumulator class was heavier than needed for a tally. Each loader
now collects (gene_ok, effect_ok, label) tuples in a plain list and calls the
shared free function log_varcode_agreement(results, source_name) once. Same
behavior, less ceremony.
…ne count

- Add unit tests for check_varcode_annotation (stubbed), log_varcode_agreement
  (agreement / mismatch / silent), and _write_run_summary (#1).
- log_varcode_agreement headline now counts DISTINCT mismatched variants
  (set union), not max(gene, effect) (#2).
- Document maximal_mutant_span's precondition: context ends at the new stop,
  no trailing wild-type (#3).
- check_varcode_annotation logs the swallowed exception at DEBUG so genuine
  effect-computation bugs stay diagnosable (#4).
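
The distinct-variant headline count can be illustrated with a sketch like this. The function name, the mismatch wording, and returning None for an empty tally are assumptions. Only the (gene_ok, effect_ok, label) tuple shape and the agreement message come from the commits above, and labels are assumed to be unique per variant.

```python
def summarize_varcode_agreement(results, source_name):
    """Build a one-line summary from (gene_ok, effect_ok, label) tuples."""
    if not results:
        return None  # nothing was checked, so stay silent
    gene_mismatches = {label for gene_ok, _, label in results if not gene_ok}
    effect_mismatches = {label for _, effect_ok, label in results
                         if not effect_ok}
    # Set union counts a variant once even if both checks failed, and it
    # does not undercount the way max(gene, effect) can.
    mismatched = gene_mismatches | effect_mismatches
    total = len(results)
    if not mismatched:
        return "varcode agrees with %s on all %d variants" % (source_name, total)
    return "varcode disagrees with %s on %d of %d variants (gene: %d, effect class: %d)" % (
        source_name, len(mismatched), total,
        len(gene_mismatches), len(effect_mismatches))
```

In the mismatch case, two gene mismatches and two effect mismatches spread over three variants report 3, where max(gene, effect) would have reported 2.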
Addresses review #1-#3 (done now, not deferred):
- run_summary lists per-modality reports: 'peptide/ (constructs + vaccine_report)'
  in the split-report layout (#1).
- log_transcript_resolution() and ranked_sorted_by_target_score() are shared by
  both the LENS and pVACseq loaders, replacing the near-duplicated 'didn't
  resolve' warning and the duplicated sort/return tail (#2).
- Extract the LENS per-variant build into _lens_ranked_entry() returning an
  _ExternalVariantEntry; ranked_from_lens_predictions is now a thin
  accumulate-and-summarize loop (#3). Dropped the dead n_skipped_post_stop
  locals() hack. Behavior-preserving (verified: identical funnel, FASTA,
  varcode-agreement, constructs).
@iskandr iskandr merged commit 7501937 into main May 22, 2026
13 of 14 checks passed
@iskandr iskandr deleted the fix-output-dir-creation branch May 22, 2026 11:50