DSL: explicit per-mhc_dependence peptide-level projection #169

## Problem

Topiary's DSL accessors (`Affinity`, `Presentation`, `Stability`, `Processing`) read one value per peptide-allele group. But the **right per-peptide reduction depends on `mhc_dependence`** — and today that dispatch is implicit, partial, and only correct for the most common case.

| `mhc_dependence` | Rows per peptide | Correct per-peptide projection |
|---|---|---|
| `single_allele` | one row per (peptide, allele) | `BestAlleleField` — aggregate across alleles for the peptide |
| `haplotype` | one row per peptide | direct read — the row already IS peptide-level |
| `none` | one row per peptide | direct read — same |

`BestAlleleField` exists today for the `single_allele` case (`Affinity.best_value` / `best_score` / `best_rank`). Haplotype and `none` work *implicitly* because there's only one row per peptide in those cases, so any group reduction is a no-op. **But that's coincidental, not principled.** If a future MHCflurry mode produced multiple haplotype rows per peptide (e.g., different `allele_set`s scored separately), the implicit pattern would silently pick the first one and produce wrong results.
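To make the failure mode concrete, here is a toy sketch in plain Python. It does not use topiary's API: the row dicts, the field names, and the helpers `first_row_value` and `best_allele_value` are hypothetical stand-ins, and it assumes affinity values in nM where lower is better.

```python
# Toy rows for one peptide scored against three alleles (a single_allele kind).
# Field names and numbers are illustrative, not topiary's real schema.
rows = [
    {"peptide": "SIINFEKL", "allele": "HLA-A*02:01", "value": 850.0},
    {"peptide": "SIINFEKL", "allele": "HLA-B*07:02", "value": 42.0},
    {"peptide": "SIINFEKL", "allele": "HLA-C*07:01", "value": 3100.0},
]


def first_row_value(group):
    """What an implicit group read amounts to: take whichever row comes first."""
    return group[0]["value"]


def best_allele_value(group):
    """Explicit reduction: strongest binder (lowest nM) across alleles."""
    return min(row["value"] for row in group)


print(first_row_value(rows))    # 850.0 (depends on row order)
print(best_allele_value(rows))  # 42.0 (independent of row order)
```

With exactly one row per peptide the two functions agree, which is why the haplotype and `none` cases look correct today.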

## Why this matters

This is the query-layer half of the haplotype faithfulness story. Without it, even with `allele_set` in the schema (issue #168), cross-kind composition in the DSL can't be elegant. A sort expression like:

```python
sort_by=[
    Presentation.score,      # mhcflurry haplotype: 1 row per peptide
    Affinity.best_score,     # netmhcpan single-allele: aggregate 6 rows per peptide
]
```

…should "just work": both expressions evaluate to one value per peptide, regardless of how many rows their kinds happen to produce. Today the user has to know which kind needs `BestAlleleField` and which doesn't, and has to remember to use `.best_*` on the right one. Get it wrong and the DSL silently reads the first row in a group.

## Proposal

**Make the per-peptide projection explicit per `mhc_dependence`.** Two possible API shapes:

### Option A: a `peptide_view(node)` helper

```python
from topiary import peptide_view

sort_by=[
    peptide_view(Presentation.score),
    peptide_view(Affinity.score),   # auto-picks best across alleles
]
```

`peptide_view(node)` inspects the node's kind, looks up `mhc_dependence` from the context's `kind_support`, and:
- `single_allele` → wraps in `BestAlleleField`-style aggregation
- `haplotype` → reads direct (with a sanity check that exactly one row per peptide exists)
- `none` → reads direct (same check)

This is explicit and opt-in, and it doesn't change existing behavior.
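The dispatch described above could look roughly like the following. This is a minimal, self-contained sketch rather than topiary code: `KindAccessor`, the shape of `kind_support` (a plain dict from kind name to `mhc_dependence`), the row dicts, and the `higher_is_better` flag are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KindAccessor:
    """Hypothetical stand-in for a DSL node such as Affinity.score."""
    kind: str
    field: str


def peptide_view(node, kind_support, higher_is_better=True):
    """Return a function mapping one peptide's rows to a single value."""
    dependence = kind_support[node.kind]

    def project(rows):
        values = [r[node.field] for r in rows if r["kind"] == node.kind]
        if not values:
            raise ValueError(f"no rows of kind {node.kind!r} for this peptide")
        if dependence == "single_allele":
            # Aggregate across alleles, in the spirit of BestAlleleField.
            return max(values) if higher_is_better else min(values)
        if dependence in ("haplotype", "none"):
            # Direct read, guarded by the one-row-per-peptide sanity check.
            if len(values) != 1:
                raise ValueError(
                    f"expected 1 {node.kind!r} row per peptide, got {len(values)}"
                )
            return values[0]
        raise ValueError(f"unknown mhc_dependence: {dependence!r}")

    return project


# Usage with toy data for a single peptide.
kind_support = {"presentation": "haplotype", "affinity": "single_allele"}
rows = [
    {"kind": "presentation", "allele": None, "score": 0.91},
    {"kind": "affinity", "allele": "HLA-A*02:01", "score": 0.35},
    {"kind": "affinity", "allele": "HLA-B*07:02", "score": 0.80},
]
presentation = peptide_view(KindAccessor("presentation", "score"), kind_support)
affinity = peptide_view(KindAccessor("affinity", "score"), kind_support)
print(presentation(rows), affinity(rows))  # 0.91 0.8
```

Returning a closure keeps the `mhc_dependence` lookup out of the per-peptide loop; a real implementation would presumably return a DSL node instead.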

### Option B: DSL kind accessors auto-reduce at the peptide-group level

`Affinity.score` evaluated at the peptide-group level (e.g., inside a `sort_by` or `evaluate_scores(group_keys=["peptide"])` call) auto-dispatches per `mhc_dependence`. The DSL contract becomes: "kind accessors return one value per group; the reduction is correct for the kind's `mhc_dependence`."

This is cleaner at the use site, but it changes the semantics of existing accessors: `Affinity.score` in a `single_allele` context today reads the first allele's row; under this proposal it would aggregate.

### Recommendation

**Option A** for the first cut — explicit, non-breaking. If it proves correct and useful, the DSL can adopt Option B as a follow-up by making `peptide_view` implicit.

## Validation guarantees

Both options should error helpfully when the input is inconsistent:

- A `haplotype` kind with multiple rows per peptide (e.g., two different `allele_set`s for the same peptide) → error pointing the user at filtering to one set first.
- A `single_allele` kind without `BestAlleleField`-resolvable rows (e.g., all `value` NaN) → error or NaN-pass-through per existing topiary conventions.
- A `none` kind with duplicate rows per peptide → likely a data error; surface it.
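As a rough illustration of these three checks, the sketch below validates one peptide's rows before projection. Everything in it is hypothetical: `ProjectionError`, `validate_group`, the `value` and `allele_set` keys, and the string form of `allele_set`. For the all-NaN case it picks NaN pass-through, which is only one of the two behaviors left open above.

```python
import math


class ProjectionError(ValueError):
    """Hypothetical error type for inconsistent per-peptide input."""


def validate_group(peptide, dependence, rows):
    """Check one peptide's rows and return the values that are safe to reduce."""
    if dependence == "haplotype":
        if len(rows) > 1:
            sets = sorted({str(r.get("allele_set")) for r in rows})
            raise ProjectionError(
                f"{peptide}: {len(rows)} haplotype rows (allele_set values: "
                f"{sets}); filter to a single allele_set before projecting"
            )
        return [r["value"] for r in rows]
    if dependence == "none":
        if len(rows) > 1:
            raise ProjectionError(
                f"{peptide}: {len(rows)} rows for an MHC-independent kind; "
                "the input is probably duplicated"
            )
        return [r["value"] for r in rows]
    if dependence == "single_allele":
        usable = [r["value"] for r in rows if not math.isnan(r["value"])]
        # Sketch choice: pass NaN through when no allele has a usable value.
        return usable or [math.nan]
    raise ProjectionError(f"{peptide}: unknown mhc_dependence {dependence!r}")


# Usage: two allele_sets scored separately for the same peptide.
try:
    validate_group(
        "SIINFEKL",
        "haplotype",
        [
            {"allele_set": "set-1", "value": 0.9},
            {"allele_set": "set-2", "value": 0.4},
        ],
    )
except ProjectionError as err:
    print(err)
```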

## Depends on

- **Issue #168 (allele_set column)** — without faithful haplotype representation, the `haplotype` branch of the projection table can't be implemented correctly. The current "best allele in haplotype" workaround makes haplotype rows look identical to single-allele rows at the row level.

## See also

- #168 — schema half of this story.
- #167 — pVACseq loader, ships independently of both. The vaxrank consumer use case (mostly single-allele affinity comparison across sources) works fine without this projection layer; only haplotype + cross-kind composition needs it.
