feat(bench): frontier-comparison Sandboxes E/F/G/H/I as INACTIVE stubs #48

Merged: OpenCircuitDev merged 1 commit into main from feat/bench-frontier-sandboxes-efghi on May 9, 2026

Summary

Spec v0.4 references Sandboxes E/F/G/H/I as load-bearing measurements for the rev's locked decisions, but the slots didn't exist on disk. Adds them as INACTIVE stubs so the harness has targets when the underlying tech matures.

| ID | Hypothesis | Spec lock |
|----|------------|-----------|
| E | Schema compression cuts MCP-tool requests 30-60% at ≤2pp accuracy loss | row 21 |
| F | Full memory stack ≥92% on long-horizon QA at ≤8K tokens vs frontier 1M-token | row 26+27 (load-bearing for Effective-Context Triad) |
| G | zstd-6 on mesh payloads ≥60% bandwidth reduction at ≤2ms overhead | row 28 |
| H | fp8 activation transfer ≥99% fp16-baseline quality at half bandwidth | row 28 (gates v6 sharded inference) |
| I | SQLCipher AES-256 + Argon2id ≤15% latency, no accuracy regression | row 29 |

Each sandbox documents

  • Claim with explicit confirm/refute thresholds
  • `comparison_anchor` so comparisons stay concrete
  • `decision_rule` mapping verdicts to next-action
  • `blocked_on` list of what needs to land before flipping ACTIVE
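
For readers unfamiliar with the harness, here is a minimal sketch of what one such stub could look like. It is not the actual bench schema: the class name, the `confirm_at` / `refute_below` fields, the refute value of 40, the anchor text, the action strings, and the `blocked_on` entry are all hypothetical. Only the 60% confirm threshold for Sandbox G and the four documented field names come from this PR.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    ACTIVE = "ACTIVE"
    INACTIVE = "INACTIVE"


@dataclass
class SandboxStub:
    """Hypothetical shape of a sandbox stub; not the real bench schema."""
    sandbox_id: str
    claim: str
    confirm_at: float        # measured value >= this confirms the claim
    refute_below: float      # measured value < this refutes the claim
    comparison_anchor: str   # what the measurement is compared against
    decision_rule: dict[str, str]  # verdict -> next action
    blocked_on: list[str] = field(default_factory=list)
    status: Status = Status.INACTIVE  # stubs start INACTIVE

    def verdict(self, measured: float) -> str:
        """Map a measured value onto confirm / refute / inconclusive."""
        if measured >= self.confirm_at:
            return "confirm"
        if measured < self.refute_below:
            return "refute"
        return "inconclusive"

    def next_action(self, measured: float) -> str:
        """Look up the next action for the verdict this measurement yields."""
        return self.decision_rule[self.verdict(measured)]


# Sandbox G, using the 60% bandwidth-reduction threshold from the table.
# Everything else here is an illustrative placeholder.
sandbox_g = SandboxStub(
    sandbox_id="G",
    claim="zstd-6 on mesh payloads gives >=60% bandwidth reduction",
    confirm_at=60.0,
    refute_below=40.0,  # assumed; the PR does not state the refute threshold
    comparison_anchor="uncompressed mesh payloads (assumed anchor)",
    decision_rule={
        "confirm": "keep the spec lock",
        "refute": "reopen the spec row",
        "inconclusive": "collect more samples",
    },
    blocked_on=["mesh transport (placeholder)"],
)

print(sandbox_g.verdict(65.0), "->", sandbox_g.next_action(65.0))
```

Keeping the thresholds and the `decision_rule` as data rather than code means an INACTIVE stub can be reviewed and validated before any measurement exists.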

Test

`bench dry-run-all` passes: 14 sandboxes total (1 ACTIVE, 13 INACTIVE).
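
A rough sketch of what a dry run over the registry might check, assuming it validates each stub's documented fields and tallies statuses without executing any measurement. The function name `dry_run_all`, the dict-based registry, and the placeholder IDs `S01`..`S14` are assumptions for illustration; the real `bench` command may work differently.

```python
from collections import Counter

# Field names taken from the PR description; the check itself is hypothetical.
REQUIRED_FIELDS = ("claim", "comparison_anchor", "decision_rule", "blocked_on")


def dry_run_all(registry: dict) -> Counter:
    """Validate every sandbox spec and count sandboxes per status.

    Nothing is measured here: a spec only has to be structurally complete.
    """
    counts = Counter()
    for sandbox_id, spec in registry.items():
        missing = [name for name in REQUIRED_FIELDS if name not in spec]
        if missing:
            raise ValueError(f"{sandbox_id}: missing fields {missing}")
        counts[spec.get("status", "INACTIVE")] += 1
    return counts


def make_stub(status: str) -> dict:
    """Build a structurally complete placeholder spec."""
    return {
        "claim": "placeholder claim",
        "comparison_anchor": "placeholder anchor",
        "decision_rule": {"confirm": "keep", "refute": "revisit"},
        "blocked_on": [],
        "status": status,
    }


# 14 placeholder sandboxes: one ACTIVE, the rest INACTIVE, matching the
# totals reported in this PR. Which sandbox is ACTIVE is not stated there.
registry = {f"S{i:02d}": make_stub("INACTIVE") for i in range(1, 15)}
registry["S01"] = make_stub("ACTIVE")

counts = dry_run_all(registry)
print(f"{sum(counts.values())} sandboxes total "
      f"({counts['ACTIVE']} ACTIVE, {counts['INACTIVE']} INACTIVE)")
```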

🤖 Generated with Claude Code

Spec v0.4 references Sandboxes E/F/G/H/I as load-bearing measurements
for the rev's locked decisions, but the sandbox slots didn't exist on
disk. Adds them as INACTIVE stubs so the harness has targets when the
underlying tech matures.

Sandbox E — Schema compression token impact (spec row 21):
  cuts MCP-tool-rich requests by 30-60% with <=2pp accuracy loss

Sandbox F — Memory Palace effective-corpus (spec rows 26 + 27):
  full memory stack >=92% on long-horizon QA at <=8K tokens, vs
  frontier 1M-token model paying 200x more per query. THE load-bearing
  measurement of the v0.4 Effective-Context Triad.

Sandbox G — Wire compression bandwidth (spec row 28):
  zstd-6 on mesh payloads >=60% bandwidth reduction at <=2ms overhead

Sandbox H — fp8 activation transfer (spec row 28):
  >=99% fp16-baseline output quality at half the wire bandwidth.
  Spec explicitly gates v6 sharded-inference rollout on H confirmation.

Sandbox I — Mem0 SQLCipher overhead (spec row 29):
  AES-256 + Argon2id key derivation adds <=15% latency, no accuracy
  regression. Becomes load-bearing on remote-VM deployments (row 31).

Each sandbox documents:
  - claim with explicit confirm/refute thresholds
  - comparison_anchor (so the comparison is always concrete)
  - decision_rule mapping verdicts to next-action
  - blocked_on list — what tech needs to land before the sandbox flips
    to ACTIVE

Bench dry-run-all passes: 14 sandboxes total (1 ACTIVE, 13 INACTIVE).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@OpenCircuitDev OpenCircuitDev merged commit 8714554 into main May 9, 2026
1 check passed
@OpenCircuitDev OpenCircuitDev deleted the feat/bench-frontier-sandboxes-efghi branch May 9, 2026 22:35