Problem
Topiary's DSL accessors (Affinity, Presentation, Stability, Processing) read one value per peptide-allele group. But the right per-peptide reduction depends on mhc_dependence — and today that dispatch is implicit, partial, and only correct for the most common case.
| mhc_dependence | Rows per peptide | Correct per-peptide projection |
|---|---|---|
| single_allele | one row per (peptide, allele) | BestAlleleField: aggregate across alleles for the peptide |
| haplotype | one row per peptide | direct read: the row already IS peptide-level |
| none | one row per peptide | direct read: same |
BestAlleleField exists today for the single_allele case (Affinity.best_value / best_score / best_rank). Haplotype and none work implicitly because there's only one row per peptide in those cases, so any group reduction is a no-op. But that's coincidental, not principled. If a future MHCflurry mode produced multiple haplotype rows per peptide (e.g., different allele_sets scored separately), the implicit pattern would silently pick the first one and produce wrong results.
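To make the failure mode concrete, here is a minimal sketch in plain Python with no Topiary imports. The row layout, the peptides, alleles, allele-set labels and scores, and the assumption that "best" means the highest score are all illustrative and not taken from Topiary.

```python
from collections import defaultdict

# Hypothetical rows: (peptide, allele or allele_set label, score).
# Assumption for this sketch: a higher score is better.
single_allele_rows = [
    ("SIINFEKL", "HLA-A*02:01", 0.20),
    ("SIINFEKL", "HLA-B*07:02", 0.90),
    ("GILGFVFTL", "HLA-A*02:01", 0.95),
    ("GILGFVFTL", "HLA-B*07:02", 0.10),
]
haplotype_rows = [
    ("SIINFEKL", "set-1", 0.70),
    ("GILGFVFTL", "set-1", 0.60),
]


def group_by_peptide(rows):
    """Collect the scores of every row that shares a peptide."""
    groups = defaultdict(list)
    for peptide, _label, score in rows:
        groups[peptide].append(score)
    return dict(groups)


def first_row(rows):
    """Implicit group read: take whichever row happens to come first."""
    return {pep: scores[0] for pep, scores in group_by_peptide(rows).items()}


def best_across_alleles(rows):
    """BestAlleleField-style reduction, assumed here to be max(score)."""
    return {pep: max(scores) for pep, scores in group_by_peptide(rows).items()}


# One row per peptide: both reductions agree, so the implicit read looks fine.
print(first_row(haplotype_rows) == best_across_alleles(haplotype_rows))

# Several rows per peptide: the implicit read depends on row order.
print(first_row(single_allele_rows))
print(best_across_alleles(single_allele_rows))
```

With one row per peptide the two reductions cannot differ, which is why the implicit pattern appears to work. As soon as a group holds more than one row, the first-row read returns whatever happens to sort first.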
Why this matters
This is the query-layer half of the haplotype faithfulness story. Without it, even with allele_set in the schema (issue #168), cross-kind composition in the DSL can't be elegant. A sort expression like:
sort_by=[
    Presentation.score,   # mhcflurry haplotype: 1 row per peptide
    Affinity.best_score,  # netmhcpan single-allele: aggregate 6 rows per peptide
]
…should "just work" — both expressions evaluating to one value per peptide, regardless of how many rows their kinds happen to produce. Today the user has to know which kind needs BestAlleleField and which doesn't, and has to remember to use .best_* on the right one. Get it wrong and the DSL silently reads the first row in a group.
Proposal
Make the per-peptide projection explicit per mhc_dependence. Two possible API shapes:
Option A: a peptide_view(node) helper
from topiary import peptide_view

sort_by=[
    peptide_view(Presentation.score),
    peptide_view(Affinity.score),  # auto-picks best across alleles
]
peptide_view(node) inspects the node's kind, looks up mhc_dependence from the context's kind_support, and:
- single_allele → wraps in BestAlleleField-style aggregation
- haplotype → reads direct (with a sanity check that exactly one row per peptide exists)
- none → reads direct (same check)
Explicit, opt-in, doesn't change existing behavior.
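The dispatch described for Option A could look roughly like the sketch below. It is a simplification under stated assumptions: it operates on plain (peptide, score) tuples and a kind name rather than on a DSL node, KIND_SUPPORT is a hypothetical stand-in for the context's kind_support lookup, and "best" is assumed to mean the highest score.

```python
from collections import defaultdict

# Hypothetical stand-in for the context's kind_support lookup.
KIND_SUPPORT = {
    "affinity": "single_allele",
    "presentation": "haplotype",
    "processing": "none",
}


def peptide_view(rows, kind, kind_support=KIND_SUPPORT):
    """Project the rows of one kind to a single score per peptide.

    rows: iterable of (peptide, score) tuples, one per prediction row.
    """
    dependence = kind_support[kind]

    groups = defaultdict(list)
    for peptide, score in rows:
        groups[peptide].append(score)

    if dependence == "single_allele":
        # Aggregate across alleles (assumption: higher score is better).
        return {pep: max(scores) for pep, scores in groups.items()}

    if dependence in ("haplotype", "none"):
        # Direct read, guarded by the one-row-per-peptide sanity check.
        for pep, scores in groups.items():
            if len(scores) != 1:
                raise ValueError(
                    f"{kind} ({dependence}): expected exactly one row for "
                    f"peptide {pep!r}, found {len(scores)}"
                )
        return {pep: scores[0] for pep, scores in groups.items()}

    raise ValueError(f"unknown mhc_dependence: {dependence!r}")


print(peptide_view([("P1", 0.2), ("P1", 0.9), ("P2", 0.5)], "affinity"))
print(peptide_view([("P1", 0.7), ("P2", 0.4)], "presentation"))
```

Keeping the lookup table as a parameter means the caller never names the reduction. Only the kind's declared mhc_dependence selects it.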
Option B: DSL kind accessors auto-reduce at the peptide-group level
Affinity.score evaluated at the peptide-group level (e.g., inside a sort_by or evaluate_scores(group_keys=["peptide"]) call) auto-dispatches per mhc_dependence. The DSL contract becomes: "kind accessors return one value per group; the reduction is correct for the kind's mhc_dependence."
Cleaner at the use site. But changes the semantics of existing accessors — Affinity.score in a single_allele context today reads the first allele's row; under this proposal it would aggregate.
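The semantic change under Option B can be illustrated with a small stand-in accessor. ScoreAccessor, evaluate_legacy and evaluate are hypothetical names invented for this sketch, not Topiary classes or methods, and "best" is again assumed to mean the highest score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreAccessor:
    """Hypothetical stand-in for a kind accessor such as Affinity.score."""

    kind: str
    mhc_dependence: str  # "single_allele", "haplotype", or "none"

    def evaluate_legacy(self, group_scores):
        # Current behavior as described above: read the first row in the group.
        return group_scores[0]

    def evaluate(self, group_scores):
        # Option B: the reduction is chosen by the kind's mhc_dependence.
        if self.mhc_dependence == "single_allele":
            return max(group_scores)  # assumption: higher score is better
        if len(group_scores) != 1:
            raise ValueError(
                f"{self.kind}: expected one row per peptide, "
                f"found {len(group_scores)}"
            )
        return group_scores[0]


affinity_score = ScoreAccessor("affinity", "single_allele")
presentation_score = ScoreAccessor("presentation", "haplotype")

six_allele_group = [0.12, 0.85, 0.40, 0.07, 0.33, 0.51]  # one peptide, six alleles
print(affinity_score.evaluate_legacy(six_allele_group))  # first allele's row
print(affinity_score.evaluate(six_allele_group))         # aggregated across alleles
print(presentation_score.evaluate([0.66]))               # direct read
```

Any existing query that relies on the first-row read would return a different value for multi-allele groups once the accessor aggregates, which is the compatibility cost noted above.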
Recommendation
Option A for the first cut — explicit, non-breaking. If it proves correct and useful, the DSL can adopt Option B as a follow-up by making peptide_view implicit.
Validation guarantees
Both options should error helpfully when the input is inconsistent:
- A haplotype kind with multiple rows per peptide (e.g., two different allele_sets for the same peptide) → error pointing the user at filtering to one set first.
- A single_allele kind without BestAlleleField-resolvable rows (e.g., all value NaN) → error or NaN-pass-through per existing topiary conventions.
- A none kind with duplicate rows per peptide → likely a data error; surface it.
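A sketch of how these three checks might be expressed for a single peptide group follows. PeptideProjectionError and project_group are hypothetical names. The choice of NaN pass-through over raising, and of "lower value is better" (as for an IC50-like value), are assumptions made only for illustration. The actual behavior should follow existing topiary conventions.

```python
import math


class PeptideProjectionError(ValueError):
    """Hypothetical error type for inconsistent per-peptide input."""


def project_group(peptide, mhc_dependence, rows):
    """Reduce one peptide's rows for one kind to a single value.

    rows: non-empty list of (allele or allele_set label, value) tuples.
    """
    if mhc_dependence == "single_allele":
        usable = [value for _label, value in rows if not math.isnan(value)]
        # Assumption: NaN pass-through when nothing is resolvable,
        # and lower value is better.
        return min(usable) if usable else math.nan

    if mhc_dependence == "haplotype":
        if len(rows) > 1:
            labels = sorted({label for label, _value in rows})
            raise PeptideProjectionError(
                f"{peptide}: {len(rows)} haplotype rows (allele_sets: {labels}); "
                "filter to one allele_set before projecting"
            )
        return rows[0][1]

    if mhc_dependence == "none":
        if len(rows) > 1:
            raise PeptideProjectionError(
                f"{peptide}: {len(rows)} rows for an MHC-independent kind; "
                "duplicate rows usually indicate a data error"
            )
        return rows[0][1]

    raise PeptideProjectionError(f"unknown mhc_dependence: {mhc_dependence!r}")


try:
    project_group("SIINFEKL", "haplotype", [("set-1", 0.7), ("set-2", 0.4)])
except PeptideProjectionError as err:
    print(err)
```

Putting the offending peptide and the conflicting allele_set labels in the message gives the user enough to write the filter the error asks for.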
Depends on
… haplotype branch of the projection table can't be implemented correctly. The current "best allele in haplotype" workaround makes haplotype rows look identical to single-allele rows at the row level.
See also