no experiment cleared the bar; here's what to try next (2026-04-29 study) #16

@mrap

Summary

Eight BOI pipeline experiments were designed and pre-registered for the 2026-04-29 study. None ran. The whole run was blocked by a host-isolation check in the bench harness: run-local.sh aborts unless pgrep -x claude finds no matching process, but BOI workers are themselves Claude processes, so the check fires in every worker session (19+ PIDs observed).

Full recommendations paper: me/decisions/boi-pipeline-recommendations-2026-04-29.md
Analysis plan SHA-256: 563445905efa31f05e7f0c3e3d4b91507323282d59a0199db2cd9946e572548e
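
Since the study is pre-registered, it may be worth confirming that the analysis plan on disk still matches the hash above before any run. A minimal sketch, assuming the plan is a single file; the function name verify_plan_hash and the path in the usage comment are hypothetical, because this issue does not name the file:

```shell
# Hypothetical helper: compare a file's SHA-256 against an expected hex digest.
# Uses sha256sum where available, otherwise falls back to shasum -a 256.
verify_plan_hash() {
  plan_file="$1"
  expected="$2"
  if command -v sha256sum >/dev/null 2>&1; then
    actual="$(sha256sum "$plan_file" | awk '{print $1}')"
  else
    actual="$(shasum -a 256 "$plan_file" | awk '{print $1}')"
  fi
  [ "$actual" = "$expected" ]
}

# Usage (the path is a placeholder, not taken from this issue):
# verify_plan_hash path/to/analysis-plan.md \
#   563445905efa31f05e7f0c3e3d4b91507323282d59a0199db2cd9946e572548e \
#   || echo "analysis plan differs from the pre-registered hash"
```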


Root Cause

# In tests/bench/run-local.sh — this check always fires during a BOI worker session:
pgrep -x "claude" && { echo "ERROR: host claude process detected — abort"; exit 1; }

BOI workers are Claude CLI processes. The constraint is correct (prevents contamination) but incompatible with worker-driven execution.
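
To see why the check cannot pass inside a worker: pgrep -x matches on the exact process name, so any running Claude CLI process, including the worker that launched the bench, counts as a hit. The sketch below is not the harness code; it is an illustrative variant that also reports the offending PIDs. The function name host_is_clean and the PGREP_CMD override are assumptions, added only so the guard can be dry-run:

```shell
# Illustrative guard, not the code in run-local.sh.
# PGREP_CMD is a hypothetical override used only to dry-run the check.
host_is_clean() {
  pids="$("${PGREP_CMD:-pgrep}" -x claude 2>/dev/null | tr '\n' ' ' | sed 's/ $//')"
  if [ -n "$pids" ]; then
    echo "ERROR: host claude process detected (PIDs: $pids)" >&2
    return 1
  fi
}

# Usage:
# host_is_clean || exit 1
```

Run from inside a BOI worker, this would always take the error branch, because the worker's own process is among the matches.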


Unblocking Action (highest priority)

Option A — Local run from a clean terminal (no code changes needed):

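# Close all Claude sessions first, then run this block.
# Uses if/else rather than "exit 1" so pasting it into an interactive shell does not close the terminal.
if pgrep -x claude >/dev/null; then
  echo "ABORT: close all Claude sessions first"
else
  cd /Users/mrap/mrap-hex/projects/hex-autonomy/boi-experiments/results/2026-04-29 &&
    bash run-all-experiments.sh
fi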

Option B — GCP run (not blocked by pgrep):

export GCP_PROJECT=scav-hunt-2026
cd /Users/mrap/github.com/mrap/boi/tests/bench
bash run-gcp.sh coldstart   # Phase 1 — can run in parallel
bash run-gcp.sh hygiene
bash run-gcp.sh detverify
# Phase 2 after Phase 1: modelassign, cacheopt, condphase
# Phase 3: forkvote, timeout
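
The Phase 1 commands above run one after another as written, although the comment says they can run in parallel. A minimal sketch of launching them concurrently and waiting for all three, assuming run-gcp.sh is safe to invoke several times at once; the helper name run_phase_parallel and the RUNNER override are hypothetical:

```shell
# Sketch only. RUNNER is a hypothetical override so the helper can be dry-run;
# the default mirrors the run-gcp.sh invocations listed above.
RUNNER="${RUNNER:-bash run-gcp.sh}"

# Start one background job per experiment name, then wait for all of them.
# Returns non-zero if any job failed.
run_phase_parallel() {
  pids=""
  for name in "$@"; do
    $RUNNER "$name" &   # intentionally unquoted: RUNNER may contain arguments
    pids="$pids $!"
  done
  rc=0
  for pid in $pids; do
    wait "$pid" || rc=1
  done
  return "$rc"
}

# Usage, from tests/bench with GCP_PROJECT exported:
# run_phase_parallel coldstart hygiene detverify && echo "Phase 1 done"
```

Phases 2 and 3 are left as manual steps here, since the Exp 2 and Exp 3 decisions depend on reviewing the Exp 1 result first.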

Recommended Next Experiments (priority order)

| Priority | Experiment | Why First | Blocker |
|---|---|---|---|
| 1 | Exp 7: Deterministic Pre-Verification | Low cost (~$50), binary question, no model API dependency | 30-min code audit gate (manual) |
| 2 | Exp 1: Cold-Start Runtime Swap | Highest expected impact (−80%+ cold start), unblocks Exps 2+3 | pgrep constraint (clean terminal or GCP) |
| 3 | Exp 4: Prompt Hygiene Bundle | Independent of Exp 1, moderate cost (~$25) | Token calibration (4 runs, clean terminal) |
| 4 | Exp 5: Conditional Phase Execution | Independent, moderate impact (≥20% phase reduction) | pgrep constraint only |
| 5 | Exp 2: Per-Phase Model Assignment | High cost impact, but decision depends on Exp 1 runtime | pgrep constraint + Exp 1 result first |
| 6 | Exp 3: Prompt Cache Optimization | −30–80% TTFT expected; requires API dispatch path | Exp 1 result first |
| 7 | Exp 8: Adaptive Timeout | Low priority: 5.2% failure rate barely clears 5% gate | Historical p95 data per phase type |
| 8 | Exp 6: Fork-and-Vote | Highest effort; reserve for when pipeline is stable | ~4 hours Mike annotation time |

Open Questions (what the experiments would have answered)

  1. Does direct API dispatch (no Claude CLI) reduce per-phase wall time by ≥80% vs. bare-CLI?
  2. Does routing critic to gemini-flash or Haiku reduce cost ≥50% without quality degradation?
  3. What is the actual cache hit rate for sequential phases? Does TTFT improve ≥30%?
  4. Do tool allowlists meaningfully reduce input tokens? (Requires token decomposition calibration first.)
  5. Which specs trigger conditional phase skipping, and does skipping cause more execute retries?
  6. What fraction of BOI task failures are deterministic (compile/test) vs. semantic?

Current Deployed State (observational evidence only)

The following were deployed to production before experiments ran — no RCT data:

  • bare=true for critic phase: observed −96.5% cold-start (5,257ms → 184ms). Confounders possible.
  • openrouter/gemini-flash for plan-critique + spec-critique: live, no regression reported, no formal cost measurement.
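
The −96.5% figure follows directly from the two observed timings. A small sketch of the arithmetic, with pct_change as a hypothetical helper name; it reproduces the observational number only and says nothing about the confounders:

```shell
# Hypothetical helper: percent change from a baseline to a new value,
# rounded to one decimal place.
pct_change() {
  LC_ALL=C awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f\n", (after - before) / before * 100 }'
}

pct_change 5257 184   # prints -96.5
```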

Recommendation: keep the deployed hybrid pending formal RCT results.


Opened by BOI worker S8253 (T0E44). Infrastructure is ready; the only blocker is running from outside a Claude process.
