Hi maintainers,
Publishing per-sample raw responses alongside grades in putnam_like and live_math is genuinely unusual: full sample.md + grade.json pairs are rarer and more useful than aggregate leaderboard numbers, so thank you.
One structural gap I noticed: there is currently no cryptographic link between a sample.md and its paired grade.json. Issues #2 and #3 both flag genuine errors (one in a grading scheme, one in a solution), but they surface a deeper problem: once a grade is corrected in place, there is no record of which grade version was in effect when an external comparison was made. Results benchmarked against putnam_like before those fixes are silently non-comparable to results benchmarked after.
The structural gap:
Set_1/A1/samples/gemini-2-5-pro_20250718/
├── sample.md # raw model response
└── grade_...json # assigned by Gemini API grader; no hash binds it to the response above
Nothing in the repo anchors these two files together at a point in time. A grade can be corrected (legitimately) or drift (accidentally) without any external signal.
A lightweight fix: a committed bundle_hash per benchmark
I've been building valichord_attestation, a small MIT-licensed Python library that generates canonical Merkle commitments over per-sample evaluation outputs. It has no external dependencies beyond jcs (the RFC 8785 encoder) and the standard library.
The mechanics:
SHA-256 Merkle root over per-sample {id, response, grade} dicts — commits to both files simultaneously
bundle_hash = SHA-256(JCS(bundle)) — a single 64-char hex string covering the Merkle root, aggregate metrics, model, and task
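To make these two mechanics concrete, here is a minimal standard-library sketch. It is not valichord_attestation's implementation. The leaf encoding, the rule of duplicating the last node on odd levels, the bundle field names, and the use of json.dumps with sorted keys as a stand-in for a real RFC 8785 (JCS) encoder are all assumptions made for illustration, and every function name is hypothetical.

```python
import hashlib
import json


def canonical_bytes(obj):
    # Stand-in for JCS (RFC 8785): sorted keys, no insignificant whitespace.
    # This agrees with JCS only for simple string/integer data; a real
    # encoder also fixes number formatting and string escaping.
    return json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def leaf_hash(sample):
    # One leaf per sample: hashing the whole dict binds id, response and grade.
    return hashlib.sha256(canonical_bytes(sample)).digest()


def merkle_root(leaves):
    # Assumed construction: pair adjacent nodes, duplicate the last node
    # whenever a level has an odd number of entries.
    if not leaves:
        raise ValueError("need at least one leaf")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


def sketch_bundle_hash(samples, model_id, task_id, raw_metrics):
    # Hypothetical bundle layout: Merkle root plus the metadata named above.
    bundle = {
        "merkle_root": merkle_root([leaf_hash(s) for s in samples]).hex(),
        "model_id": model_id,
        "task_id": task_id,
        "raw_metrics": raw_metrics,
    }
    return hashlib.sha256(canonical_bytes(bundle)).hexdigest()


# Made-up sample data, for illustration only.
samples = [
    {"id": "Set_1/A1/samples/model_a", "response": "proof text", "grade": '{"score": 7}'},
    {"id": "Set_1/A2/samples/model_a", "response": "another proof", "grade": '{"score": 3}'},
    {"id": "Set_1/A3/samples/model_a", "response": "third proof", "grade": '{"score": 10}'},
]
metrics = [{"key": "n_samples", "value": 3}]
before = sketch_bundle_hash(samples, "example-model", "putnam_like", metrics)

# Silently changing one grade changes the root and therefore the bundle hash.
samples[1]["grade"] = '{"score": 4}'
after = sketch_bundle_hash(samples, "example-model", "putnam_like", metrics)
print(before != after)  # True
```

A production tree would normally also add domain separation between leaf and inner-node hashes; the sketch omits that for brevity.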
For putnam_like, the adapter is direct — each "sample" is one (problem × model) evaluation pair:

    from pathlib import Path

    from valichord_attestation import build_bundle, hash_bundle

    samples = []
    for problem_dir in sorted(Path("putnam_like").glob("Set_*/*/samples/*/")):
        response_file = next(problem_dir.glob("sample*.md"), None)
        grade_file = next(problem_dir.glob("grade*.json"), None)
        if response_file and grade_file:
            samples.append({
                # POSIX-style id and explicit UTF-8 keep the hash platform-independent.
                "id": problem_dir.relative_to("putnam_like").as_posix(),
                "response": response_file.read_text(encoding="utf-8"),
                "grade": grade_file.read_text(encoding="utf-8"),
            })

    # mean_grade must be computed from the grade files before this point;
    # how to do that depends on the grade JSON schema, which is not shown here.
    bundle = build_bundle(
        model_id="putnam_like-v1",
        task_id="putnam_like",
        raw_metrics=[{"key": "mean_grade", "value": mean_grade}],
        samples=samples,
        repo_commit="<git SHA at time of this publication>",
    )
    print(hash_bundle(bundle))  # 64-char hex; publish this in the README

What this enables without changing your evaluation workflow:
Grade-to-response binding. The Merkle root commits to both the response text and the grade text for every sample simultaneously. A silently modified grade changes the root and therefore the bundle_hash.
Version anchoring for corrections. When grades are corrected (as in #2 and #3), publishing a new bundle_hash makes the correction explicit and traceable. Downstream users know exactly which hash their results were computed against.
Partial-download audit. The library includes a challenge-response protocol: a verifier sends a nonce and requests k samples; the indices of those samples are derived from HMAC-SHA256(nonce, bundle_hash), so the verifier controls selection and the holder cannot cherry-pick. The holder returns Merkle proof paths for those k samples, and the verifier downloads only those k files and checks them against the published root. For a 2,214-sample corpus, k=60 gives >99% confidence of detecting tampering when roughly 7% or more of the samples (about 160) have been altered. A single altered sample is caught with probability 60/2214 ≈ 2.7% per audit, so catching an isolated change reliably means recomputing the full root.
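The selection step and the audit arithmetic can be sketched as follows. How the library expands the HMAC output into k indices is not specified here, so the counter-based expansion below is an assumption, and derive_indices and detection_probability are hypothetical names. The probability itself is ordinary sampling-without-replacement arithmetic: the chance that k distinct audited samples include at least one of m altered ones.

```python
import hashlib
import hmac
from math import comb


def derive_indices(nonce, bundle_hash, n_samples, k):
    # Seed = HMAC-SHA256(key=nonce, msg=bundle_hash), as described above.
    # Assumed expansion: hash seed || counter, reduce modulo n_samples,
    # and skip repeats until k distinct indices are collected.
    if not 0 < k <= n_samples:
        raise ValueError("k must be between 1 and n_samples")
    seed = hmac.new(nonce, bundle_hash.encode("ascii"), hashlib.sha256).digest()
    chosen = []
    counter = 0
    while len(chosen) < k:
        block = hashlib.sha256(seed + counter.to_bytes(8, "big")).digest()
        idx = int.from_bytes(block, "big") % n_samples
        if idx not in chosen:
            chosen.append(idx)
        counter += 1
    return chosen


def detection_probability(n_samples, k, altered):
    # 1 - P(all k audited samples fall among the unaltered ones).
    return 1 - comb(n_samples - altered, k) / comb(n_samples, k)


# Placeholder nonce and bundle hash, for illustration only.
picked = derive_indices(b"verifier-chosen-nonce", "ab" * 32, 2214, 60)
print(len(picked), len(set(picked)))  # 60 60

for m in (1, 50, 165):
    print(m, round(detection_probability(2214, 60, m), 4))
```

Because the indices depend only on the nonce and the published hash, both sides can recompute them independently, and the holder learns the selection only after the verifier has fixed the nonce.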
The minimal proposal
Even a single bundle_hash: line in each benchmark's README would close the version-anchoring gap. I'd be happy to submit a PR with:
attestation/adapter.py — reads the putnam_like directory structure, produces a bundle.json
attestation/bundle.json — committed hash for the current state of the benchmark
attestation/verify.py — ~15-line script: download any sample, check its Merkle proof against the committed root
Happy to scope it to just the hash in the README if a full adapter adds more maintenance than warranted. The key value is a published commitment that makes any grade-level change in the corpus visible.