Proposal: canonical bundle_hash to bind raw responses ↔ grades for post-hoc audit #6

@topeuph-ai

Hi maintainers,

Publishing per-sample raw responses alongside grades in putnam_like and live_math is genuinely unusual — full sample.md + grade.json pairs are rarer and more useful than aggregate leaderboard numbers, so thank you.

One structural gap I noticed: there's currently no cryptographic link between a sample.md and its paired grade.json. Issues #2 and #3 both flag genuine errors (one in a grading scheme, one in a reference solution), but they surface a deeper problem: once a grade is corrected in place, there's no record of which grade version was in effect when any external comparison was made. Results benchmarked against putnam_like before those fixes are silently non-comparable to results benchmarked after.

The structural gap:

Set_1/A1/samples/gemini-2-5-pro_20250718/
├── sample.md           # raw model response
└── grade_...json       # assigned by Gemini API grader; no hash binds it to the response above

Nothing in the repo anchors these two files together at a point in time. A grade can be corrected (legitimately) or drift (accidentally) without any external signal.

A lightweight fix: a committed bundle_hash per benchmark

I've been building valichord_attestation, a small MIT-licensed Python library that generates canonical Merkle commitments over per-sample evaluation outputs. It has no external dependencies beyond jcs (the RFC 8785 encoder) and the standard library.

The mechanics:

  • RFC 8785 (JCS) encoding — deterministic JSON byte representation, key-order and float-representation independent
  • SHA-256 Merkle root over per-sample {id, response, grade} dicts — commits to both files simultaneously
  • bundle_hash = SHA-256(JCS(bundle)) — a single 64-char hex string covering the Merkle root, aggregate metrics, model, and task
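To make the three steps concrete, here is a minimal standard-library sketch. It is not the valichord_attestation implementation: `canonical_bytes` only approximates RFC 8785 (real JCS also fixes number serialization and key ordering by UTF-16 code units), the 0x00/0x01 leaf and node prefixes follow the RFC 6962 convention, duplicating the last node on odd levels is an assumed choice, and the sample data and bundle field names are made up for illustration.

```python
import hashlib
import json


def canonical_bytes(obj):
    # Approximation of JCS: sorted keys, no insignificant whitespace, UTF-8.
    # Use a real RFC 8785 encoder for anything beyond illustration.
    return json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def leaf_hash(sample):
    # The 0x00 prefix marks a leaf so it cannot be confused with an inner node.
    return hashlib.sha256(b"\x00" + canonical_bytes(sample)).digest()


def merkle_root(leaves):
    if not leaves:
        raise ValueError("need at least one leaf")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumed convention: duplicate the last node
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


def bundle_hash(bundle):
    return hashlib.sha256(canonical_bytes(bundle)).hexdigest()


# Hypothetical samples; each dict binds a response to its grade.
samples = [
    {"id": "Set_1/A1/model_x", "response": "proof text", "grade": '{"score": 7}'},
    {"id": "Set_1/A2/model_x", "response": "another proof", "grade": '{"score": 3}'},
    {"id": "Set_1/B1/model_x", "response": "third proof", "grade": '{"score": 10}'},
]
root = merkle_root([leaf_hash(s) for s in samples])
bundle = {"task_id": "putnam_like", "merkle_root": root.hex(), "n_samples": len(samples)}
digest = bundle_hash(bundle)

# Changing a single grade changes the root, and therefore the bundle hash.
tampered = [dict(s) for s in samples]
tampered[1]["grade"] = '{"score": 9}'
tampered_root = merkle_root([leaf_hash(s) for s in tampered])
print(digest, root != tampered_root)
```

Because each leaf hashes the response and the grade together, neither file can change without changing the root.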

For putnam_like, the adapter is direct — each "sample" is one (problem × model) evaluation pair:

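from pathlib import Path

from valichord_attestation import build_bundle, hash_bundle

samples = []
# Sort the directories so the sample order (and hence the Merkle root) is deterministic.
for problem_dir in sorted(Path("putnam_like").glob("Set_*/*/samples/*/")):
    # min() picks the same file on every run if a directory holds several matches.
    response_file = min(problem_dir.glob("sample*.md"), default=None)
    grade_file    = min(problem_dir.glob("grade*.json"), default=None)
    if response_file and grade_file:
        samples.append({
            # POSIX-style ids and explicit UTF-8 keep the hash identical across platforms.
            "id":       problem_dir.relative_to("putnam_like").as_posix(),
            "response": response_file.read_text(encoding="utf-8"),
            "grade":    grade_file.read_text(encoding="utf-8"),
        })

# mean_grade must be computed from the grade files beforehand;
# how to do that depends on the grade.json schema.
bundle = build_bundle(
    model_id="putnam_like-v1",
    task_id="putnam_like",
    raw_metrics=[{"key": "mean_grade", "value": mean_grade}],
    samples=samples,
    repo_commit="<git SHA at time of this publication>",
)
print(hash_bundle(bundle))  # 64-char hex; publish this in the README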

What this enables without changing your evaluation workflow:

  1. Grade-to-response binding. The Merkle root commits to both the response text and the grade text for every sample simultaneously. A silently modified grade changes the root and therefore the bundle_hash.

  2. Version anchoring for corrections. When grades are corrected (as in #2, "putnam_like_set6_a1 - mistake in grading_scheme.md", and #3, "Mistake in solution to putnam_like_set3_b4"), publishing a new bundle_hash makes the correction explicit and traceable. Downstream users know exactly which hash their results were computed against.

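  3. Partial-download audit. The library includes a challenge-response protocol: a verifier sends a nonce and requests k samples. The sample indices are derived from HMAC-SHA256(nonce, bundle_hash), so the verifier controls selection and the data holder cannot cherry-pick. The holder returns Merkle proof paths for those k samples; the verifier downloads only those k files and checks them against the published root. Spot checks catch widespread changes rather than isolated ones: for a 2,214-sample corpus, k=60 catches a single altered sample with probability only 60/2,214 ≈ 2.7%, and reaches >99% detection confidence once about 7.4% of the samples (roughly 164) have been altered. Detecting an arbitrary single-sample change requires recomputing the full root.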
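The selection and confidence arithmetic in point 3 can be sketched as follows. This is an illustration under stated assumptions, not the library's protocol: `select_indices` and `detection_probability` are hypothetical names, appending a counter to the HMAC message to obtain k distinct indices is my own choice, and the nonce and hash values are placeholders. The probability is the standard hypergeometric result for sampling without replacement.

```python
import hashlib
import hmac
from math import comb


def select_indices(nonce: bytes, bundle_hash_hex: str, n: int, k: int) -> list:
    """Derive k distinct indices in [0, n) from HMAC-SHA256 keyed by the nonce."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    chosen, seen, counter = [], set(), 0
    while len(chosen) < k:
        # Assumed construction: message = bundle_hash || 8-byte counter.
        msg = bytes.fromhex(bundle_hash_hex) + counter.to_bytes(8, "big")
        digest = hmac.new(nonce, msg, hashlib.sha256).digest()
        idx = int.from_bytes(digest, "big") % n  # modulo bias is negligible for 256 bits
        if idx not in seen:
            seen.add(idx)
            chosen.append(idx)
        counter += 1
    return chosen


def detection_probability(n: int, altered: int, k: int) -> float:
    """P(at least one altered sample appears among k drawn without replacement)."""
    return 1 - comb(n - altered, k) / comb(n, k)


indices = select_indices(b"verifier-nonce", "ab" * 32, n=2214, k=60)
p_single = detection_probability(2214, 1, 60)    # equals 60/2214, about 0.027
p_spread = detection_probability(2214, 164, 60)  # exceeds 0.99
print(len(indices), round(p_single, 4), p_spread > 0.99)
```

Since the indices depend on a nonce the holder cannot predict, the holder has to be able to serve valid proofs for any subset, which is what gives the spot check its statistical meaning.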

The minimal proposal

Even a single bundle_hash: line in each benchmark's README would close the version-anchoring gap. I'd be happy to submit a PR with:

  • attestation/adapter.py — reads the putnam_like directory structure, produces a bundle.json
  • attestation/bundle.json — committed hash for the current state of the benchmark
  • attestation/verify.py — ~15-line script: download any sample, check its Merkle proof against the committed root
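For a sense of what the verify step involves, here is a self-contained sketch of building and checking a Merkle inclusion proof. The tree layout is an assumption (0x00/0x01 prefixes as in RFC 6962, last node duplicated on odd levels) and may differ from what valichord_attestation actually does; `build_levels`, `make_proof`, and `verify_proof` are hypothetical helper names, and the payloads are placeholders.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(leaves):
    """Return all tree levels, from the (padded) leaves up to the root."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = list(levels[-1])
        if len(cur) % 2 == 1:
            cur.append(cur[-1])  # assumed convention: duplicate the last node
        levels[-1] = cur
        levels.append([_h(b"\x01" + cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels


def make_proof(levels, index):
    """Collect (sibling_hash, sibling_is_left) pairs from leaf to root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify_proof(leaf, proof, root):
    """Recompute the path to the root using only the leaf and its siblings."""
    node = leaf
    for sibling, sibling_is_left in proof:
        node = _h(b"\x01" + (sibling + node if sibling_is_left else node + sibling))
    return node == root


# Placeholder payloads standing in for canonicalized {id, response, grade} dicts.
payloads = [f"sample-{i}".encode("utf-8") for i in range(5)]
leaves = [_h(b"\x00" + p) for p in payloads]
levels = build_levels(leaves)
root = levels[-1][0]

proof = make_proof(levels, 3)
ok = verify_proof(leaves[3], proof, root)
bad = verify_proof(_h(b"\x00" + b"edited grade"), proof, root)
print(ok, bad)
```

The verifier needs only the downloaded sample, its proof path, and the published root, so the check costs about log2(n) hashes rather than a full corpus download.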

Happy to scope it to just the hash in the README if a full adapter adds more maintenance than warranted. The key value is a published commitment that makes any grade-level change in the corpus visible.
