feat(bench): frontier-comparison Sandboxes E/F/G/H/I as INACTIVE stubs by OpenCircuitDev · Pull Request #48 · OpenCircuitDev/opencircuitmodel

OpenCircuitDev · 2026-05-09T22:32:29Z

Summary

Spec v0.4 references Sandboxes E/F/G/H/I as load-bearing measurements for the rev's locked decisions, but the slots didn't exist on disk. Adds them as INACTIVE stubs so the harness has targets when the underlying tech matures.

| ID | Hypothesis | Spec lock |
| --- | --- | --- |
| E | Schema compression cuts tokens in MCP-tool-rich requests by 30-60% at ≤2pp accuracy loss | row 21 |
| F | Full memory stack scores ≥92% on long-horizon QA at ≤8K tokens vs a frontier 1M-token model | rows 26 + 27 (load-bearing for Effective-Context Triad) |
| G | zstd-6 on mesh payloads gives ≥60% bandwidth reduction at ≤2ms overhead | row 28 |
| H | fp8 activation transfer keeps ≥99% of fp16-baseline quality at half the bandwidth | row 28 (gates v6 sharded inference) |
| I | SQLCipher AES-256 + Argon2id adds ≤15% latency, no accuracy regression | row 29 |
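As a rough illustration of how one of these hypotheses could be turned into a check, here is a minimal sketch for Sandbox G's confirm condition (≥60% bandwidth reduction at ≤2ms overhead). The `WireSample` type and the function names are hypothetical and not taken from the harness; the PR does not state G's refute threshold, so the sketch only answers "confirm or not".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WireSample:
    """One measured payload (hypothetical shape, not the harness's own type)."""
    raw_bytes: int
    compressed_bytes: int
    overhead_ms: float


def bandwidth_reduction(sample: WireSample) -> float:
    """Fraction of bytes saved on the wire, e.g. 0.65 means 65% fewer bytes."""
    return 1.0 - sample.compressed_bytes / sample.raw_bytes


def meets_g_confirm(sample: WireSample,
                    min_reduction: float = 0.60,
                    max_overhead_ms: float = 2.0) -> bool:
    """True when the sample satisfies both halves of the Sandbox G claim."""
    return (bandwidth_reduction(sample) >= min_reduction
            and sample.overhead_ms <= max_overhead_ms)


# Illustrative numbers only; these are not measurements from the project.
good = WireSample(raw_bytes=1000, compressed_bytes=350, overhead_ms=1.5)
slow = WireSample(raw_bytes=1000, compressed_bytes=350, overhead_ms=3.0)
weak = WireSample(raw_bytes=1000, compressed_bytes=500, overhead_ms=1.0)
```

Both conditions must hold together: a sample that compresses well but exceeds the overhead budget does not confirm the claim.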

Each sandbox documents

- Claim with explicit confirm/refute thresholds
- `comparison_anchor` so comparisons stay concrete
- `decision_rule` mapping verdicts to next-action
- `blocked_on` list of what needs to land before flipping ACTIVE
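The fields listed above could be modelled roughly as follows. This is a sketch under assumptions: only `comparison_anchor`, `decision_rule`, and `blocked_on` are names from the PR; the class name, the remaining field names, the verdict keys, and every placeholder value are hypothetical, and the real stubs may use a different format entirely.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    ACTIVE = "ACTIVE"
    INACTIVE = "INACTIVE"


@dataclass
class SandboxStub:
    sandbox_id: str
    claim: str
    confirm_threshold: str
    refute_threshold: str
    comparison_anchor: str                 # what the result is compared against
    decision_rule: dict[str, str]          # verdict -> next action
    blocked_on: list[str] = field(default_factory=list)
    status: Status = Status.INACTIVE       # stubs start INACTIVE

    def can_activate(self) -> bool:
        """A stub is ready to flip to ACTIVE once nothing blocks it."""
        return not self.blocked_on


# Example values: the claim and confirm threshold come from the PR summary;
# everything marked "placeholder" is invented for illustration.
sandbox_g = SandboxStub(
    sandbox_id="G",
    claim="zstd-6 on mesh payloads reduces bandwidth",
    confirm_threshold=">=60% bandwidth reduction at <=2ms overhead",
    refute_threshold="placeholder: not stated in this PR",
    comparison_anchor="placeholder: uncompressed mesh payloads",
    decision_rule={
        "confirm": "placeholder: keep the spec lock",
        "refute": "placeholder: revisit the spec lock",
    },
    blocked_on=["placeholder: prerequisite not yet landed"],
)
```

Keeping `blocked_on` as an explicit list makes the INACTIVE state self-explaining: a stub stays parked until the list is empty.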

Test

`bench dry-run-all` passes: 14 sandboxes total (1 ACTIVE, 13 INACTIVE).
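The PR does not show how `bench dry-run-all` produces this summary. A minimal sketch of the tally it reports, assuming each sandbox exposes a status string, might look like this; the sandbox names and the choice of which one is ACTIVE are placeholders.

```python
from collections import Counter


def tally(statuses: dict[str, str]) -> dict[str, int]:
    """Count sandboxes per status; missing statuses count as zero."""
    counts = Counter(statuses.values())
    return {"ACTIVE": counts.get("ACTIVE", 0),
            "INACTIVE": counts.get("INACTIVE", 0)}


# Placeholder registry: 14 sandboxes, one of them ACTIVE (which one is assumed).
statuses = {f"sandbox_{i:02d}": "INACTIVE" for i in range(1, 15)}
statuses["sandbox_01"] = "ACTIVE"

summary = tally(statuses)
report = (f"{len(statuses)} sandboxes total "
          f"({summary['ACTIVE']} ACTIVE, {summary['INACTIVE']} INACTIVE)")
print(report)
```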

🤖 Generated with Claude Code

Spec v0.4 references Sandboxes E/F/G/H/I as load-bearing measurements for the rev's locked decisions, but the sandbox slots didn't exist on disk. Adds them as INACTIVE stubs so the harness has targets when the underlying tech matures.

- Sandbox E, schema compression token impact (spec row 21): cuts MCP-tool-rich requests by 30-60% with <=2pp accuracy loss.
- Sandbox F, Memory Palace effective-corpus (spec rows 26 + 27): full memory stack >=92% on long-horizon QA at <=8K tokens, vs frontier 1M-token model paying 200x more per query. THE load-bearing measurement of the v0.4 Effective-Context Triad.
- Sandbox G, wire compression bandwidth (spec row 28): zstd-6 on mesh payloads >=60% bandwidth reduction at <=2ms overhead.
- Sandbox H, fp8 activation transfer (spec row 28): >=99% fp16-baseline output quality at half the wire bandwidth. Spec explicitly gates v6 sharded-inference rollout on H confirmation.
- Sandbox I, Mem0 SQLCipher overhead (spec row 29): AES-256 + Argon2id key derivation adds <=15% latency, no accuracy regression. Becomes load-bearing on remote-VM deployments (row 31).

Each sandbox documents:

- claim with explicit confirm/refute thresholds
- comparison_anchor (so the comparison is always concrete)
- decision_rule mapping verdicts to next-action
- blocked_on list: what tech needs to land before the sandbox flips to ACTIVE

Bench dry-run-all passes: 14 sandboxes total (1 ACTIVE, 13 INACTIVE).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

OpenCircuitDev merged commit 8714554 into main May 9, 2026
1 check passed

OpenCircuitDev deleted the feat/bench-frontier-sandboxes-efghi branch May 9, 2026 22:35

OpenCircuitDev merged 1 commit into main from feat/bench-frontier-sandboxes-efghi


2 participants
