feat(bench): frontier-comparison Sandboxes E/F/G/H/I as INACTIVE stubs #48
Merged
Spec v0.4 references Sandboxes E/F/G/H/I as load-bearing measurements
for the rev's locked decisions, but the sandbox slots didn't exist on
disk. Adds them as INACTIVE stubs so the harness has targets when the
underlying tech matures.
Sandbox E — Schema compression token impact (spec row 21):
cuts token count of MCP-tool-rich requests by 30-60% with <=2pp accuracy loss
Sandbox F — Memory Palace effective-corpus (spec rows 26 + 27):
full memory stack >=92% on long-horizon QA at <=8K tokens, vs
frontier 1M-token model paying 200x more per query. THE load-bearing
measurement of the v0.4 Effective-Context Triad.
Sandbox G — Wire compression bandwidth (spec row 28):
zstd-6 on mesh payloads >=60% bandwidth reduction at <=2ms overhead
Sandbox H — fp8 activation transfer (spec row 28):
>=99% fp16-baseline output quality at half the wire bandwidth.
Spec explicitly gates v6 sharded-inference rollout on H confirmation.
Sandbox I — Mem0 SQLCipher overhead (spec row 29):
AES-256 + Argon2id key derivation adds <=15% latency, no accuracy
regression. Becomes load-bearing on remote-VM deployments (row 31).
Each sandbox documents:
- claim with explicit confirm/refute thresholds
- comparison_anchor (so the comparison is always concrete)
- decision_rule mapping verdicts to next-action
- blocked_on list — what tech needs to land before the sandbox flips
to ACTIVE
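
For illustration only, the four documented fields could be modeled roughly as below. This is a minimal sketch, not the stub format used in this PR: the `SandboxStub` class, the numeric thresholds other than the 60% figure quoted for Sandbox G, the `comparison_anchor` wording, the `decision_rule` actions, and the `blocked_on` entry are all hypothetical placeholders.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    ACTIVE = "ACTIVE"
    INACTIVE = "INACTIVE"


@dataclass
class SandboxStub:
    """Hypothetical shape of one sandbox stub (not the repo's real schema)."""
    name: str
    claim: str
    confirm_threshold: float      # measured value at or above this confirms the claim
    refute_threshold: float       # measured value below this refutes the claim
    comparison_anchor: str        # what the measurement is compared against
    decision_rule: dict[str, str]  # verdict -> next action
    blocked_on: list[str] = field(default_factory=list)

    @property
    def status(self) -> Status:
        # A stub stays INACTIVE while anything is still listed in blocked_on.
        return Status.INACTIVE if self.blocked_on else Status.ACTIVE

    def verdict(self, measured: float) -> str:
        if measured >= self.confirm_threshold:
            return "confirm"
        if measured < self.refute_threshold:
            return "refute"
        return "inconclusive"

    def next_action(self, measured: float) -> str:
        return self.decision_rule[self.verdict(measured)]


# Example loosely based on Sandbox G; only the 0.60 confirm threshold comes
# from the description above, every other value is an assumed placeholder.
sandbox_g = SandboxStub(
    name="G",
    claim="zstd-6 on mesh payloads gives >=60% bandwidth reduction",
    confirm_threshold=0.60,
    refute_threshold=0.40,  # assumed, not from the spec
    comparison_anchor="uncompressed mesh payloads",  # assumed
    decision_rule={
        "confirm": "keep the locked decision",
        "refute": "reopen the decision",
        "inconclusive": "collect more runs",
    },
    blocked_on=["placeholder: underlying tech not yet landed"],
)

print(sandbox_g.status.value, sandbox_g.verdict(0.65), sandbox_g.next_action(0.30))
```

Deriving the status from `blocked_on` rather than storing it separately is one way to keep a stub from flipping to ACTIVE while a blocker is still listed.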
Bench dry-run-all passes: 14 sandboxes total (1 ACTIVE, 13 INACTIVE).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Spec v0.4 references Sandboxes E/F/G/H/I as load-bearing measurements for the rev's locked decisions, but the slots didn't exist on disk. Adds them as INACTIVE stubs so the harness has targets when the underlying tech matures.
Test
`bench dry-run-all` passes: 14 sandboxes total (1 ACTIVE, 13 INACTIVE).
🤖 Generated with Claude Code