Proposal: canonical bundle_hash to bind raw responses ↔ grades for post-hoc audit

Hi maintainers,

Publishing per-sample raw responses alongside grades in `putnam_like` and `live_math` is genuinely unusual — full `sample.md` + `grade.json` pairs are rarer and more useful than aggregate leaderboard numbers, so thank you.

One structural gap I noticed: there's currently no cryptographic link between a `sample.md` and its paired `grade.json`. Issues #2 and #3 both correctly flag rubric errors, but they also surface a deeper problem: once a grade is corrected in place, there's no record of which grade version was in effect when any external comparison was made. Results benchmarked against `putnam_like` before those fixes are silently non-comparable to results benchmarked after.

**The structural gap:**

```
Set_1/A1/samples/gemini-2-5-pro_20250718/
├── sample.md           # raw model response
└── grade_...json       # assigned by Gemini API grader; no hash binds it to the response above
```

Nothing in the repo anchors these two files together at a point in time. A grade can be corrected (legitimately) or drift (accidentally) without any external signal.

**A lightweight fix: a committed `bundle_hash` per benchmark**

I've been building [`valichord_attestation`](https://github.com/topeuph-ai/ValiChord/tree/main/valichord_attestation), a small MIT-licensed Python library that generates canonical Merkle commitments over per-sample evaluation outputs. It has no external dependencies beyond `jcs` (the RFC 8785 encoder) and the standard library.

The mechanics:

- **RFC 8785 (JCS) encoding** — deterministic JSON byte representation, key-order and float-representation independent
- **SHA-256 Merkle root** over per-sample `{id, response, grade}` dicts — commits to both files simultaneously
- **`bundle_hash`** = `SHA-256(JCS(bundle))` — a single 64-char hex string covering the Merkle root, aggregate metrics, model, and task
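
To make the three steps above concrete, here is a minimal standard-library sketch. It is not the `valichord_attestation` implementation: `canonical_bytes` only approximates RFC 8785 (it agrees with JCS for ASCII keys and string/integer/boolean/null values, but not for floats), and the Merkle construction (prefix bytes, duplicating the last node on odd levels) and the bundle field names are assumptions chosen for illustration.

```python
import hashlib
import json


def canonical_bytes(obj) -> bytes:
    # Approximation of RFC 8785: sorted keys, no insignificant whitespace, UTF-8.
    # A real JCS encoder also fixes number formatting, which json.dumps does not.
    return json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def merkle_root(leaves) -> str:
    # Assumed scheme: 0x00-prefixed leaf hashes, 0x01-prefixed inner nodes,
    # last node duplicated when a level has an odd number of entries.
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [hashlib.sha256(b"\x00" + leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


# Hypothetical samples; real ids and grades come from the benchmark directory.
samples = [
    {"id": "Set_1/A1/model_x", "response": "proof text", "grade": "7"},
    {"id": "Set_1/A2/model_x", "response": "another proof", "grade": "3"},
]
root = merkle_root([canonical_bytes(s) for s in samples])

# Hypothetical bundle layout: the root plus identifying metadata.
bundle = {"task_id": "putnam_like", "model_id": "model_x", "merkle_root": root}
bundle_hash = hashlib.sha256(canonical_bytes(bundle)).hexdigest()
print(bundle_hash)
```

Because each leaf covers the response and the grade together, editing either one changes the root, and with it the `bundle_hash`.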

For `putnam_like`, the adapter is direct — each "sample" is one (problem × model) evaluation pair:

```python
from pathlib import Path

from valichord_attestation import build_bundle, hash_bundle

ROOT = Path("putnam_like")

samples = []
# Sort directories and files so the sample order (and hence the hash) is reproducible.
for problem_dir in sorted(ROOT.glob("Set_*/*/samples/*/")):
    response_file = next(iter(sorted(problem_dir.glob("sample*.md"))), None)
    grade_file    = next(iter(sorted(problem_dir.glob("grade*.json"))), None)
    if response_file and grade_file:
        samples.append({
            # POSIX-style id so the hash does not depend on the OS path separator
            "id":       problem_dir.relative_to(ROOT).as_posix(),
            "response": response_file.read_text(encoding="utf-8"),
            "grade":    grade_file.read_text(encoding="utf-8"),
        })

# PLACEHOLDER: replace with the mean computed from the grade files
# (the grade.json schema is not assumed here).
mean_grade = 0.0

bundle = build_bundle(
    model_id="putnam_like-v1",
    task_id="putnam_like",
    raw_metrics=[{"key": "mean_grade", "value": mean_grade}],
    samples=samples,
    repo_commit="<git SHA at time of this publication>",
)
print(hash_bundle(bundle))  # 64-char hex; publish this in the README
```

**What this enables without changing your evaluation workflow:**

1. **Grade-to-response binding.** The Merkle root commits to both the response text and the grade text for every sample simultaneously. A silently modified grade changes the root and therefore the `bundle_hash`.

2. **Version anchoring for corrections.** When grades are corrected (as in #2 and #3), publishing a new `bundle_hash` makes the correction explicit and traceable. Downstream users know exactly which hash their results were computed against.

3. **Partial-download audit.** The library includes a challenge-response protocol: a verifier sends a nonce and requests k samples; the holder of the corpus returns Merkle proof paths for those k samples. The sample indices are derived from `HMAC-SHA256(nonce, bundle_hash)`, so the verifier's nonce controls selection and the holder cannot cherry-pick. The verifier downloads only those k files and checks them against the published root. For a 2,214-sample corpus, k=60 catches one specific altered sample with probability only 60/2,214 ≈ 2.7%; it reaches >99% confidence of detecting tampering once roughly 7% or more of the samples have been altered, so a guarantee about any single sample still requires hashing the full corpus.
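
The sampling step of the audit can be sketched as follows. The index derivation is a hypothetical one (HMAC in counter mode, reduced modulo the corpus size, duplicates skipped) and may differ from what the library does. `detection_probability` is the standard hypergeometric calculation for drawing k distinct samples out of n when a given number have been altered.

```python
import hashlib
import hmac
from math import comb


def select_indices(nonce: bytes, bundle_hash: str, n: int, k: int):
    # Hypothetical derivation: HMAC-SHA256 keyed by the verifier's nonce over
    # bundle_hash plus a counter, reduced mod n; repeated indices are skipped.
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    chosen, seen, counter = [], set(), 0
    while len(chosen) < k:
        msg = bundle_hash.encode("ascii") + counter.to_bytes(4, "big")
        digest = hmac.new(nonce, msg, hashlib.sha256).digest()
        idx = int.from_bytes(digest, "big") % n
        if idx not in seen:
            seen.add(idx)
            chosen.append(idx)
        counter += 1
    return chosen


def detection_probability(n: int, altered: int, k: int) -> float:
    # Probability that at least one of k distinct sampled items is altered.
    return 1 - comb(n - altered, k) / comb(n, k)


indices = select_indices(b"verifier-nonce", "ab" * 32, n=2214, k=60)
print(sorted(indices)[:5])
print(detection_probability(2214, 1, 60))    # one altered sample: 60/2214
print(detection_probability(2214, 164, 60))  # about 7.4% of the corpus altered
```

Both parties can recompute the same index list from the nonce and the published hash, which is what prevents the holder from choosing which samples get checked.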

**The minimal proposal**

Even a single `bundle_hash:` line in each benchmark's README would close the version-anchoring gap. I'd be happy to submit a PR with:

- `attestation/adapter.py` — reads the `putnam_like` directory structure, produces a `bundle.json`
- `attestation/bundle.json` — committed hash for the current state of the benchmark
- `attestation/verify.py` — ~15-line script: download any sample, check its Merkle proof against the committed root
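
For a sense of what the proof check in `verify.py` might involve, here is a self-contained sketch of building and verifying a Merkle inclusion proof. The tree layout (prefix bytes, duplicating the last node on odd levels) and the function names are assumptions for illustration, not the library's actual format.

```python
import hashlib


def _leaf(data: bytes) -> bytes:
    return hashlib.sha256(b"\x00" + data).digest()


def _node(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def build_levels(leaves):
    # Returns every tree level, leaf hashes first and the root level last.
    level = [_leaf(x) for x in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]  # pad odd levels by duplicating the last node
            levels[-1] = level
        level = [_node(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def prove(levels, index):
    # Proof path: one (sibling_hash, sibling_is_left) pair per level below the root.
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1
        path.append((level[sibling], sibling < index))
        index //= 2
    return path


def verify(leaf_data: bytes, path, root: bytes) -> bool:
    # Recompute the root from the downloaded sample bytes and the proof path.
    h = _leaf(leaf_data)
    for sibling, sibling_is_left in path:
        h = _node(sibling, h) if sibling_is_left else _node(h, sibling)
    return h == root


leaves = [f"sample-{i}".encode() for i in range(5)]  # stand-ins for canonical sample bytes
levels = build_levels(leaves)
root = levels[-1][0]
print(verify(leaves[2], prove(levels, 2), root))
print(verify(b"tampered", prove(levels, 2), root))
```

The verifier needs only the sample's bytes, a proof path whose length grows logarithmically with the corpus size, and the committed root.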

Happy to scope it to just the hash in the README if a full adapter adds more maintenance than warranted. The key value is a published commitment that makes any grade-level change in the corpus visible.
