Summary
Eight BOI pipeline experiments were designed and pre-registered for the 2026-04-29 study. Zero experiments ran. The entire experimental run was blocked by a host-isolation constraint built into the bench harness: run-local.sh requires pgrep -x claude = 0, but BOI workers are Claude processes — so the check fires reliably during every worker session (19+ PIDs observed).
Full recommendations paper: me/decisions/boi-pipeline-recommendations-2026-04-29.md
Analysis plan SHA-256: 563445905efa31f05e7f0c3e3d4b91507323282d59a0199db2cd9946e572548e
Root Cause
# In tests/bench/run-local.sh — this check always fires during a BOI worker session:
pgrep -x "claude" && { echo "ERROR: host claude process detected — abort"; exit 1; }
BOI workers are Claude CLI processes. The constraint is correct (prevents contamination) but incompatible with worker-driven execution.
Unblocking Action (highest priority)
Option A — Local run from a clean terminal (no code changes needed):
# Close all Claude sessions first, then:
pgrep -x claude && echo "ABORT: close all Claude sessions first" && exit 1
cd /Users/mrap/mrap-hex/projects/hex-autonomy/boi-experiments/results/2026-04-29
bash run-all-experiments.sh
Option B — GCP run (not blocked by pgrep):
export GCP_PROJECT=scav-hunt-2026
cd /Users/mrap/github.com/mrap/boi/tests/bench
bash run-gcp.sh coldstart # Phase 1 — can run in parallel
bash run-gcp.sh hygiene
bash run-gcp.sh detverify
# Phase 2 after Phase 1: modelassign, cacheopt, condphase
# Phase 3: forkvote, timeout
Recommended Next Experiments (priority order)
| Priority |
Experiment |
Why First |
Blocker |
| 1 |
Exp 7: Deterministic Pre-Verification |
Low cost (~$50), binary question, no model API dependency |
30-min code audit gate (manual) |
| 2 |
Exp 1: Cold-Start Runtime Swap |
Highest expected impact (−80%+ cold start), unblocks Exps 2+3 |
pgrep constraint (clean terminal or GCP) |
| 3 |
Exp 4: Prompt Hygiene Bundle |
Independent of Exp 1, moderate cost (~$25) |
Token calibration (4 runs, clean terminal) |
| 4 |
Exp 5: Conditional Phase Execution |
Independent, moderate impact (≥20% phase reduction) |
pgrep constraint only |
| 5 |
Exp 2: Per-Phase Model Assignment |
High cost impact, but decision depends on Exp 1 runtime |
pgrep constraint + Exp 1 result first |
| 6 |
Exp 3: Prompt Cache Optimization |
−30–80% TTFT expected; requires API dispatch path |
Exp 1 result first |
| 7 |
Exp 8: Adaptive Timeout |
Low priority — 5.2% failure rate barely clears 5% gate |
Historical p95 data per phase type |
| 8 |
Exp 6: Fork-and-Vote |
Highest effort; reserve for when pipeline is stable |
~4 hours Mike annotation time |
Open Questions (what the experiments would have answered)
- Does direct API dispatch (no Claude CLI) reduce per-phase wall time by ≥80% vs. bare-CLI?
- Does routing critic to gemini-flash or Haiku reduce cost ≥50% without quality degradation?
- What is the actual cache hit rate for sequential phases? Does TTFT improve ≥30%?
- Do tool allowlists meaningfully reduce input tokens? (Requires token decomposition calibration first.)
- Which specs trigger conditional phase skipping, and does skipping cause more execute retries?
- What fraction of BOI task failures are deterministic (compile/test) vs. semantic?
Current Deployed State (observational evidence only)
The following were deployed to production before experiments ran — no RCT data:
bare=true for critic phase: observed −96.5% cold-start (5,257ms → 184ms). Confounders possible.
openrouter/gemini-flash for plan-critique + spec-critique: live, no regression reported, no formal cost measurement.
Recommendation: keep the deployed hybrid pending formal RCT results.
Opened by BOI worker S8253 (T0E44). Infrastructure is ready; the only blocker is running from outside a Claude process.
Summary
Eight BOI pipeline experiments were designed and pre-registered for the 2026-04-29 study. Zero experiments ran. The entire experimental run was blocked by a host-isolation constraint built into the bench harness:
run-local.shrequirespgrep -x claude = 0, but BOI workers are Claude processes — so the check fires reliably during every worker session (19+ PIDs observed).Full recommendations paper:
me/decisions/boi-pipeline-recommendations-2026-04-29.mdAnalysis plan SHA-256:
563445905efa31f05e7f0c3e3d4b91507323282d59a0199db2cd9946e572548eRoot Cause
BOI workers are Claude CLI processes. The constraint is correct (prevents contamination) but incompatible with worker-driven execution.
Unblocking Action (highest priority)
Option A — Local run from a clean terminal (no code changes needed):
Option B — GCP run (not blocked by pgrep):
Recommended Next Experiments (priority order)
Open Questions (what the experiments would have answered)
Current Deployed State (observational evidence only)
The following were deployed to production before experiments ran — no RCT data:
bare=truefor critic phase: observed −96.5% cold-start (5,257ms → 184ms). Confounders possible.openrouter/gemini-flashfor plan-critique + spec-critique: live, no regression reported, no formal cost measurement.Recommendation: keep the deployed hybrid pending formal RCT results.
Opened by BOI worker S8253 (T0E44). Infrastructure is ready; the only blocker is running from outside a Claude process.