Improve LENS/pVACseq-import UX, logging, and vaccine naming #302
Merged
write_neoepitope_report wrote the CSV/XLSX directly via pandas, which refuses to write into a non-existent directory. The vaccine-construct writers already os.makedirs their target dir, and arg_parser assumes the writer does so for tabular outputs, but this path never did -- so --output-csv foo/bar.csv crashed if foo/ didn't exist. Add _ensure_parent_dir and call it before both writes.
- Fix inferred-MHC-allele aggregation: read alleles from CandidateEpitope per_allele_scores (it has no .allele), so the linker optimizer + mhcflurry default actually engage on the external path. Surface them once up front.
- Auto-populate ASCII + PDF reports under --output-dir (not just CSV).
- Drop peptide-not-in-pep_context rows at load instead of late in pepsickle; excluded from report + constructs, warned once at load.
- Source-specific dispatch label: [LENS import] / [pVACseq import] / [full pipeline].
- Rename 'Pepsickle credibility tagging' -> 'proteasomal-cleavage annotation'.
Tests updated accordingly.
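The allele-aggregation fix can be pictured roughly as below. The CandidateEpitope here is a stand-in dataclass, and the assumption that per_allele_scores maps allele name to score is mine; the real container may be shaped differently. The helper name inferred_alleles is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class CandidateEpitope:
    """Stand-in for the real class; only the field used here is modeled."""
    peptide: str
    # ASSUMPTION: allele name -> score. The real structure may differ.
    per_allele_scores: Dict[str, float] = field(default_factory=dict)


def inferred_alleles(epitopes: Iterable[CandidateEpitope]) -> List[str]:
    """Collect the distinct alleles seen across all epitopes, sorted."""
    alleles = set()
    for epitope in epitopes:
        # There is no single `.allele` attribute, so read the score keys.
        alleles.update(epitope.per_allele_scores)
    return sorted(alleles)
```

Computing this list once at load time is also what makes it possible to log the alleles a single time up front.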
Replace the scattered per-stage count lines (input breakdown, skipped-no-coords, lack gene_name/transcript/reads) -- which double-counted the same non-coord rows across two independent passes -- with a single funnel that makes the two row uses explicit: every row is scored as a candidate epitope for the neoepitope report, while only genomic-coord rows are placed on the genome for construct ranking. Keep the SNV/INDEL anomaly WARNINGs. Tests repointed at the funnel line.
The LENS/pVACseq neoepitope CSVs mixed a 'No prediction' string in the affinity columns with blank numeric cells. Use blank everywhere so each CSV is internally consistent and loads cleanly into pandas/Excel; a blank prediction column reads as 'not predicted'. The rich XLSX report keeps its own 'No prediction' prose.
Interactive runs get a y/N prompt before overwriting files in --output-dir (declining aborts); non-interactive runs warn and proceed so batch pipelines don't hang. New --force-overwrite skips the prompt. Added to both arg parsers via add_output_args.
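The prompt/warn/force behavior described above could be sketched like this. The function name confirm_overwrite, its parameters, and the message wording are hypothetical; the injectable ask callable exists only so the sketch can be exercised without a terminal.

```python
import sys
from typing import Callable, Optional


def confirm_overwrite(
    output_dir: str,
    force: bool = False,
    interactive: Optional[bool] = None,
    ask: Callable[[str], str] = input,
) -> bool:
    """Return True if the run may proceed, False if the user declined.

    ASSUMPTION: hypothetical helper; the caller decides when the
    directory counts as "already populated".
    """
    if force:
        # --force-overwrite: skip the prompt entirely.
        return True
    if interactive is None:
        interactive = sys.stdin.isatty()
    if not interactive:
        # Batch run: warn and proceed rather than block on input.
        print(
            f"WARNING: same-named files in {output_dir} will be overwritten",
            file=sys.stderr,
        )
        return True
    answer = ask(f"Overwrite same-named files in {output_dir}? [y/N] ")
    # Anything other than an explicit yes counts as "no" (the N default).
    return answer.strip().lower() in ("y", "yes")
```

A caller would abort the run when this returns False.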
Multi-type runs keep per-modality constructs in subdirs, but the shared run context (inputs, MHC alleles in play, antigen counts, output map) now sits at the top level alongside the neoepitope table and ASCII/PDF reports.
Antigen names now read 'GENE (CATEGORY)' (e.g. 'FYN (INDEL)') instead of
'gene_chr_pos_ref_alt'. The specific mutation is appended only to disambiguate
when a gene contributes 2+ variants to the construct set:
'TP53 (SNV @ 1:100 A>T)'. Empty indel alleles render as '-' rather than
varcode's normalized '.'. FASTA construct headers gain a trailing genes= field
listing the construct's distinct gene(s). Category derives from the varcode
variant so it's identical on the pipeline and LENS/pVACseq paths.
Minimal-epitope tags fold into the parens ('GENE (SNV, epitope)') so the gene
stays parseable. Tests updated to the new format.
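As an illustration of the naming rule, here is a minimal sketch. The Variant tuple and the antigen_names function are hypothetical stand-ins (the real code derives the category from the varcode variant); only the output format follows the examples in the commit message.

```python
from collections import Counter
from typing import List, NamedTuple


class Variant(NamedTuple):
    """Hypothetical stand-in for the fields the name is built from."""
    gene: str
    category: str  # e.g. "SNV" or "INDEL"
    contig: str
    pos: int
    ref: str
    alt: str


def _allele(allele: str) -> str:
    # Empty (or '.') indel alleles render as '-'.
    return allele if allele and allele != "." else "-"


def antigen_names(variants: List[Variant]) -> List[str]:
    """Name each variant 'GENE (CATEGORY)'; add the mutation only when
    the same gene contributes two or more variants."""
    per_gene = Counter(v.gene for v in variants)
    names = []
    for v in variants:
        detail = v.category
        if per_gene[v.gene] >= 2:
            detail += f" @ {v.contig}:{v.pos} {_allele(v.ref)}>{_allele(v.alt)}"
        names.append(f"{v.gene} ({detail})")
    return names
```

Keeping the gene as the first whitespace-delimited token is what leaves the name easy to parse downstream.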
Review follow-ups + shared-state unification:
- Reword the output-dir overwrite prompt (it overwrites same-named files, not a wipe).
- Consistent funnel antigen_source ordering via one shared sort key (SNV/INDEL first).
- ranked_from_{lens,pvacseq}_predictions early-returns now ([], {}) to match the
2-tuple normal return so callers' unpacking can't crash on empty input.
- Verified the load-time mismatch drop has the SAME tolerance as the prior
pepsickle path (_closest_occurrence is exact-match, not fuzzy) — no change.
Shared epitope state: the VCF pipeline, LENS, and pVACseq paths already converge
on CandidateEpitope (candidate_epitopes_from_rows), DSL scoring, and VaccinePeptide
assembly. The remaining divergence was the neoepitope report table — three
hand-rolled row builders (where the 'No prediction'/blank drift originated). Extract
neoepitope_core_row() and route all three (_build_lens_report_row,
_build_pvacseq_report_row, report.make_minimal_neoepitope_report) through it, so the
shared columns + missing-value convention are produced by one piece of code. LENS
CSV column order preserved; pipeline report's WT affinity now blank (consistent).
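To make the "one piece of code" idea concrete, here is a rough sketch of a shared core-row builder that also fixes the missing-value convention (blank, never 'No prediction'). The column names, the argument list, and the rows_to_csv helper are assumptions for illustration; only the name neoepitope_core_row comes from the commit message.

```python
import csv
import io
from typing import Dict, Iterable, Optional

# ASSUMPTION: illustrative column names, not the project's actual schema.
CORE_COLUMNS = ["gene", "peptide", "allele", "mut_affinity_nM", "wt_affinity_nM"]


def _blank_if_missing(value: Optional[float]) -> str:
    """Single missing-value convention for every path: an empty cell."""
    return "" if value is None else f"{value:g}"


def neoepitope_core_row(
    gene: str,
    peptide: str,
    allele: str,
    mut_affinity: Optional[float] = None,
    wt_affinity: Optional[float] = None,
) -> Dict[str, str]:
    """Build the shared columns; source-specific builders would extend this."""
    return {
        "gene": gene,
        "peptide": peptide,
        "allele": allele,
        "mut_affinity_nM": _blank_if_missing(mut_affinity),
        "wt_affinity_nM": _blank_if_missing(wt_affinity),
    }


def rows_to_csv(rows: Iterable[Dict[str, str]]) -> str:
    """Render core rows as CSV text (hypothetical helper for the demo)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=CORE_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Because every row builder goes through the same function, a missing prediction cannot come out as a string on one path and a blank on another.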
The report's parameter dump leaked the argparse 'version' SUPPRESS sentinel and listed every unset (None/'') field (e.g. 'manufacturability: None'), which misrepresents what actually ran. Drop output paths, internal config-plumbing keys, the SUPPRESS sentinel, and unset values — leaving the effective parameters.
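The filtering rule can be sketched as follows. The function name effective_parameters and the example key names are hypothetical; argparse.SUPPRESS is the real sentinel exposed by the standard library.

```python
import argparse
from typing import Any, Dict, Iterable


def effective_parameters(
    namespace: argparse.Namespace,
    skip_keys: Iterable[str] = (),
) -> Dict[str, Any]:
    """Keep only parameters that were actually set.

    ASSUMPTION: `skip_keys` stands in for the output-path and internal
    config-plumbing keys; the real lists live in the project.
    """
    skip = set(skip_keys)
    kept = {}
    for key, value in vars(namespace).items():
        if key in skip:
            continue
        if value is None or value == "":
            continue  # unset fields misrepresent what ran
        if value == argparse.SUPPRESS:
            continue  # leaked argparse sentinel
        kept[key] = value
    return kept
```

Values such as False or 0 are kept on purpose, since they are explicit settings rather than unset fields.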
Multi-vaccine-type runs now write a modality-agnostic core report (antigen ranking + coverage, no construction blocks or manufacturability) at the top level, plus a per-modality report in each subdir (peptide/, mrna/) = core + that modality's construction detail. Manufacturability follows --manufacturability for the peptide report and is off for the core + mRNA reports. Single-modality / no --output-dir runs keep the one combined report. TemplateDataCreator gains an include_manufacturability override; the antigen-ranking blurb points to the per-modality reports when the current report has no construction sections.
Shared, modality-agnostic helpers in external_input (used by both the LENS and
pVACseq loaders, not duplicated per modality):
- check_varcode_annotation / AnnotationCheck: validate the provider's gene +
effect class against varcode on the provider's own transcript (positions are
isoform-dependent, so compare class not residue numbers). varcode is used only
to validate/interpret provider columns, never to supply data the provider lacks.
Emits one per-load summary ('varcode agrees with LENS on all N variants').
- variant_is_frameshift: frameshift from the variant's own ref/alt length delta
(provider data, works for any modality), not a tool-specific column.
- maximal_mutant_span: for frameshifts, combine ALL of a variant's neoepitope
rows — the union start (~frameshift onset) plus extension to the end of the
translated context — so the full novel tail is marked mutant instead of just
the representative neoepitope. Non-frameshift variants keep the local span.
Fixes FYN/BTN3A3/TUBB2A showing only a partial mutant region.
pVACseq's best-peptide antigen is already span [0,len] (fully mutant), so only
the sanity check applies there.
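A simplified sketch of the two span-related helpers above. The signatures are assumptions: spans are taken to be half-open (start, end) offsets into the translated context, and the frameshift test looks only at the ref/alt length delta, ignoring whether the variant is coding at all.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open [start, end) offsets into the context


def _norm(allele: str) -> str:
    # Treat '-' and '.' placeholders as an empty allele.
    return "" if allele in ("-", ".") else allele


def variant_is_frameshift(ref: str, alt: str) -> bool:
    """True when the ref/alt length difference is not a multiple of 3."""
    return (len(_norm(alt)) - len(_norm(ref))) % 3 != 0


def maximal_mutant_span(
    row_spans: List[Span],
    context_len: int,
    is_frameshift: bool,
    representative: Span,
) -> Span:
    """Widen the mutant span for frameshifts, else keep the local span.

    PRECONDITION (from the commit notes): the context ends at the new
    stop, with no trailing wild-type sequence.
    """
    if not is_frameshift or not row_spans:
        return representative
    # Union start approximates the frameshift onset; everything from
    # there to the end of the context is novel sequence.
    start = min(s for s, _ in row_spans)
    return (start, context_len)
```

With neoepitope rows at (12, 21), (9, 18), and (15, 24) in a 40-residue context, a frameshift variant would be marked mutant over (9, 40) rather than only over its representative row.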
…nt()
The stateful accumulator class was heavier than needed for a tally. Each loader now collects (gene_ok, effect_ok, label) tuples in a plain list and calls the shared free function log_varcode_agreement(results, source_name) once. Same behavior, less ceremony.
…ne count
- Add unit tests for check_varcode_annotation (stubbed), log_varcode_agreement (agreement / mismatch / silent), and _write_run_summary (#1).
- log_varcode_agreement headline now counts DISTINCT mismatched variants (set union), not max(gene, effect) (#2).
- Document maximal_mutant_span's precondition: context ends at the new stop, no trailing wild-type (#3).
- check_varcode_annotation logs the swallowed exception at DEBUG so genuine effect-computation bugs stay diagnosable (#4).
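The tally over (gene_ok, effect_ok, label) tuples, including the distinct-variant headline count, might look roughly like this. The function name and tuple shape come from the commit messages; the mismatch wording, the choice to stay silent on an empty list, and returning the message (done here only so the sketch is easy to test) are assumptions.

```python
import logging
from typing import List, Optional, Tuple

logger = logging.getLogger(__name__)

Result = Tuple[bool, bool, str]  # (gene_ok, effect_ok, label)


def log_varcode_agreement(results: List[Result], source_name: str) -> Optional[str]:
    """Emit one per-load summary line and return it (None when silent)."""
    if not results:
        return None  # ASSUMPTION: nothing checked means nothing logged
    gene_bad = {label for gene_ok, _, label in results if not gene_ok}
    effect_bad = {label for _, effect_ok, label in results if not effect_ok}
    # Headline counts DISTINCT mismatched variants: the set union, not
    # max(len(gene_bad), len(effect_bad)).
    mismatched = gene_bad | effect_bad
    total = len(results)
    if not mismatched:
        message = f"varcode agrees with {source_name} on all {total} variants"
        logger.info(message)
    else:
        message = (
            f"varcode disagrees with {source_name} on {len(mismatched)} of "
            f"{total} variants (gene: {len(gene_bad)}, effect: {len(effect_bad)})"
        )
        logger.warning(message)
    return message
```

With one gene-only mismatch and one effect-only mismatch, max(gene, effect) would report 1 while the union correctly reports 2.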
Addresses review #1-#3 (done now, not deferred):
- run_summary lists per-modality reports: 'peptide/ (constructs + vaccine_report)' in the split-report layout (#1).
- log_transcript_resolution() and ranked_sorted_by_target_score() are shared by both the LENS and pVACseq loaders, replacing the near-duplicated 'didn't resolve' warning and the duplicated sort/return tail (#2).
- Extract the LENS per-variant build into _lens_ranked_entry() returning an _ExternalVariantEntry; ranked_from_lens_predictions is now a thin accumulate-and-summarize loop (#3).
Dropped the dead n_skipped_post_stop locals() hack. Behavior-preserving (verified: identical funnel, FASTA, varcode-agreement, constructs).
Started as the parent-dir fix and broadened (per request) into a pass over the LENS/pVACseq-import experience.
Changes
Output / reports
- Create the parent directory before writing the neoepitope CSV/XLSX (fixes pandas' 'Cannot save file into a non-existent directory').
- --output-dir now auto-populates the ASCII text + PDF reports too, not just the CSV.
- Top-level run_summary.txt (inputs, MHC alleles in play, antigen counts, output map) so multi-type runs have shared run context at the top level alongside the neoepitope table and reports; per-modality constructs stay in subdirs.
- Overwrite prompt for --output-dir (interactive only; batch runs warn and proceed). New --force-overwrite.
Logging
- Source-specific dispatch label: [LENS import] / [pVACseq import] / [full pipeline] instead of [external].
Correctness
- Fix inferred-MHC-allele aggregation: read alleles from CandidateEpitope.per_allele_scores (it has no .allele), so the per-junction linker optimizer + mhcflurry default actually engage on the external path. Surface the alleles once, up front.
- Drop peptide-not-in-pep_context rows at load (excluded from report + constructs, warned once) instead of late inside pepsickle.
Naming / display
- Antigen names now read GENE (CATEGORY) (e.g. FYN (INDEL)), with the specific mutation appended only to disambiguate when a gene contributes 2+ variants (TP53 (SNV @ 1:100 A>T)). Empty indel alleles render as '-' rather than varcode's '.'. FASTA headers gain a trailing genes= field.
Docs
All 847 tests pass; tests updated for the intentional format changes.