DSL: explicit per-mhc_dependence peptide-level projection #169

@iskandr

Problem

Topiary's DSL accessors (Affinity, Presentation, Stability, Processing) read one value per peptide-allele group. But the right per-peptide reduction depends on mhc_dependence — and today that dispatch is implicit, partial, and only correct for the most common case.

| mhc_dependence | Rows per peptide | Correct per-peptide projection |
|---|---|---|
| single_allele | one row per (peptide, allele) | BestAlleleField: aggregate across alleles for the peptide |
| haplotype | one row per peptide | direct read: the row is already peptide-level |
| none | one row per peptide | direct read: same as haplotype |

BestAlleleField exists today for the single_allele case (Affinity.best_value / best_score / best_rank). Haplotype and none work implicitly because there's only one row per peptide in those cases, so any group reduction is a no-op. But that's coincidental, not principled. If a future MHCflurry mode produced multiple haplotype rows per peptide (e.g., different allele_sets scored separately), the implicit pattern would silently pick the first one and produce wrong results.
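The failure mode described above can be illustrated with a small, self-contained sketch. It does not use Topiary at all: the row dictionaries, the `allele_set` labels, and the helper names are hypothetical stand-ins chosen for illustration. It shows that a "take the first row in the group" reduction is only safe while each peptide has exactly one row.

```python
from collections import defaultdict

# Toy prediction rows. Field names and values are illustrative only and do
# not reflect Topiary's actual schema.
rows = [
    {"peptide": "SIINFEKL", "allele_set": "set_A", "score": 0.20},
    {"peptide": "SIINFEKL", "allele_set": "set_B", "score": 0.90},
    {"peptide": "GILGFVFTL", "allele_set": "set_A", "score": 0.70},
]


def group_by_peptide(rows):
    """Group rows by peptide, preserving input order within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["peptide"]].append(row)
    return dict(groups)


def first_row_score(rows):
    """Implicit pattern: read whichever row happens to come first."""
    return {pep: grp[0]["score"] for pep, grp in group_by_peptide(rows).items()}


def rows_per_peptide(rows):
    """Count rows per peptide so multi-row groups can be detected."""
    return {pep: len(grp) for pep, grp in group_by_peptide(rows).items()}


implicit = first_row_score(rows)
implicit_reversed = first_row_score(list(reversed(rows)))
counts = rows_per_peptide(rows)
```

With two rows for the same peptide, the implicit result depends on row order (`implicit` and `implicit_reversed` disagree for `SIINFEKL`), and nothing signals that a choice was made.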

Why this matters

This is the query-layer half of the haplotype faithfulness story. Without it, even with allele_set in the schema (issue #168), cross-kind composition in the DSL can't be elegant. A sort expression like:

sort_by=[
    Presentation.score,      # mhcflurry haplotype: 1 row per peptide
    Affinity.best_score,     # netmhcpan single-allele: aggregate 6 rows per peptide
]

…should "just work" — both expressions evaluating to one value per peptide, regardless of how many rows their kinds happen to produce. Today the user has to know which kind needs BestAlleleField and which doesn't, and has to remember to use .best_* on the right one. Get it wrong and the DSL silently reads the first row in a group.

Proposal

Make the per-peptide projection explicit per mhc_dependence. Two possible API shapes:

Option A: a peptide_view(node) helper

from topiary import peptide_view

sort_by=[
    peptide_view(Presentation.score),
    peptide_view(Affinity.score),   # auto-picks best across alleles
]

peptide_view(node) inspects the node's kind, looks up mhc_dependence from the context's kind_support, and:

  • single_allele → wraps in BestAlleleField-style aggregation
  • haplotype → reads direct (with a sanity check that exactly one row per peptide exists)
  • none → reads direct (same check)

Explicit, opt-in, doesn't change existing behavior.
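A minimal sketch of the dispatch behind Option A, under several simplifying assumptions: it operates on plain row dictionaries instead of DSL nodes, the `KIND_MHC_DEPENDENCE` mapping is a hypothetical stand-in for the context's kind_support lookup (the kind-to-dependence assignments are illustrative), and "best" is assumed to mean the highest score. It is not Topiary's implementation.

```python
from collections import defaultdict

# Hypothetical stand-in for looking up mhc_dependence via kind_support.
KIND_MHC_DEPENDENCE = {
    "affinity": "single_allele",
    "presentation": "haplotype",
    "processing": "none",
}


def peptide_view(rows, kind, field="score", best=max):
    """Return {peptide: value} for one kind, reduced per its mhc_dependence.

    `best` is the aggregation used for single_allele kinds. max is an
    assumption that suits scores; a rank or IC50 value would need min.
    """
    dependence = KIND_MHC_DEPENDENCE[kind]
    groups = defaultdict(list)
    for row in rows:
        if row["kind"] == kind:
            groups[row["peptide"]].append(row[field])

    result = {}
    for peptide, values in groups.items():
        if dependence == "single_allele":
            # Aggregate across alleles, in the spirit of BestAlleleField.
            result[peptide] = best(values)
        else:
            # haplotype / none: direct read, guarded by a sanity check.
            if len(values) != 1:
                raise ValueError(
                    f"{kind} ({dependence}) has {len(values)} rows for "
                    f"peptide {peptide!r}; expected exactly one"
                )
            result[peptide] = values[0]
    return result


rows = [
    {"kind": "affinity", "peptide": "SIINFEKL", "allele": "A*02:01", "score": 0.3},
    {"kind": "affinity", "peptide": "SIINFEKL", "allele": "B*07:02", "score": 0.8},
    {"kind": "presentation", "peptide": "SIINFEKL", "score": 0.6},
]

affinity_view = peptide_view(rows, "affinity")
presentation_view = peptide_view(rows, "presentation")
```

Both views yield one value per peptide, so they can be combined in a sort key without the caller knowing how many rows each kind produced.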

Option B: DSL kind accessors auto-reduce at the peptide-group level

Affinity.score evaluated at the peptide-group level (e.g., inside a sort_by or evaluate_scores(group_keys=["peptide"]) call) auto-dispatches per mhc_dependence. The DSL contract becomes: "kind accessors return one value per group; the reduction is correct for the kind's mhc_dependence."

This is cleaner at the use site, but it changes the semantics of existing accessors: Affinity.score in a single_allele context today reads the first allele's row, whereas under this proposal it would aggregate.

Recommendation

Option A for the first cut — explicit, non-breaking. If it proves correct and useful, the DSL can adopt Option B as a follow-up by making peptide_view implicit.

Validation guarantees

Both options should error helpfully when the input is inconsistent:

  • A haplotype kind with multiple rows per peptide (e.g., two different allele_sets for the same peptide) → error pointing the user at filtering to one set first.
  • A single_allele kind without BestAlleleField-resolvable rows (e.g., all values NaN) → error or NaN-pass-through per existing Topiary conventions.
  • A none kind with duplicate rows per peptide → likely a data error; surface it.
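The three checks above could be sketched as a single validation pass. This is a hedged illustration only: `validate_peptide_rows`, the row layout, and the message wording are hypothetical, and the sketch reports an all-NaN group as a problem even though the issue leaves open whether that case should error or pass NaN through.

```python
import math
from collections import Counter


def validate_peptide_rows(rows, dependence):
    """Return human-readable problems; an empty list means consistent input."""
    problems = []
    if dependence in ("haplotype", "none"):
        # These kinds must have exactly one row per peptide.
        counts = Counter(row["peptide"] for row in rows)
        for peptide, n in counts.items():
            if n > 1:
                if dependence == "haplotype":
                    hint = "filter to one allele_set first"
                else:
                    hint = "likely duplicated input rows"
                problems.append(
                    f"{peptide}: {n} rows for {dependence} kind; {hint}"
                )
    elif dependence == "single_allele":
        # Aggregation needs at least one non-NaN value per peptide.
        by_peptide = {}
        for row in rows:
            by_peptide.setdefault(row["peptide"], []).append(row["value"])
        for peptide, values in by_peptide.items():
            if all(math.isnan(v) for v in values):
                problems.append(
                    f"{peptide}: all values are NaN; nothing to aggregate"
                )
    else:
        raise ValueError(f"unknown mhc_dependence: {dependence!r}")
    return problems


haplotype_rows = [
    {"peptide": "SIINFEKL", "allele_set": "set_A", "value": 0.2},
    {"peptide": "SIINFEKL", "allele_set": "set_B", "value": 0.9},
]
nan_rows = [
    {"peptide": "SIINFEKL", "allele": "A*02:01", "value": float("nan")},
    {"peptide": "GILGFVFTL", "allele": "A*02:01", "value": 12.5},
]

haplotype_problems = validate_peptide_rows(haplotype_rows, "haplotype")
nan_problems = validate_peptide_rows(nan_rows, "single_allele")
```

Returning a list of messages instead of raising on the first problem is one design choice among several; it lets a caller report every inconsistent peptide at once.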

Depends on

See also
