no experiment cleared the bar; here's what to try next (2026-04-29 study)

## Summary

Eight BOI pipeline experiments were designed and pre-registered for the 2026-04-29 study. **Zero experiments ran.** A host-isolation constraint built into the bench harness blocked the entire run: `run-local.sh` aborts unless `pgrep -x claude` finds no matching processes, but BOI workers are themselves Claude processes, so the check fires in every worker session (19+ PIDs observed).

**Full recommendations paper:** `me/decisions/boi-pipeline-recommendations-2026-04-29.md`
**Analysis plan SHA-256:** `563445905efa31f05e7f0c3e3d4b91507323282d59a0199db2cd9946e572548e`

---

## Root Cause

```bash
# In tests/bench/run-local.sh — this check always fires during a BOI worker session:
pgrep -x "claude" && { echo "ERROR: host claude process detected — abort"; exit 1; }
```

BOI workers are Claude CLI processes. The constraint is correct (prevents contamination) but incompatible with worker-driven execution.
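Before attempting a local run, it can help to see how many processes would trip the guard. The following is a minimal sketch, not part of `run-local.sh`; the function names `count_claude` and `preflight` are hypothetical, and it assumes the guard matches processes exactly as `pgrep -x claude` does.

```bash
#!/usr/bin/env bash
# Hypothetical preflight helper (not part of the bench harness).

# Count processes whose name is exactly "claude".
# pgrep prints one PID per line; wc -l counts them.
count_claude() {
  pgrep -x claude | wc -l | tr -d ' '
}

# Report whether a local run would be blocked by the host-isolation check.
preflight() {
  local n
  n="$(count_claude)"
  if [ "$n" -gt 0 ]; then
    echo "blocked: $n claude process(es) running"
    return 1
  fi
  echo "clear: no claude processes"
}
```

Running `preflight` inside a BOI worker session should report the block, since the worker itself is one of the matched processes.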

---

## Unblocking Action (highest priority)

**Option A — Local run from a clean terminal (no code changes needed):**
```bash
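# Close all Claude sessions first. The guard uses if/else rather than `exit`
# so that pasting this into an interactive shell does not close the terminal.
if pgrep -x claude >/dev/null; then
  echo "ABORT: close all Claude sessions first"
else
  cd /Users/mrap/mrap-hex/projects/hex-autonomy/boi-experiments/results/2026-04-29 &&
    bash run-all-experiments.sh
fi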
```

**Option B — GCP run (not blocked by pgrep):**
```bash
export GCP_PROJECT=scav-hunt-2026
cd /Users/mrap/github.com/mrap/boi/tests/bench
# Phase 1: these three are independent and can run in parallel
# (shown sequentially here).
bash run-gcp.sh coldstart
bash run-gcp.sh hygiene
bash run-gcp.sh detverify
# Phase 2 (after Phase 1 completes): modelassign, cacheopt, condphase
# Phase 3: forkvote, timeout
```
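Since the Phase 1 runs are described as parallelizable, they could be launched as background jobs. This is a sketch under assumptions: `run_parallel` and `gcp_phase` are hypothetical helpers, and it assumes concurrent `run-gcp.sh` invocations do not share mutable state (the issue does not confirm this).

```bash
#!/usr/bin/env bash
# Hypothetical helpers for running independent phases concurrently.

# Start one background job per name, then wait for all of them.
# Returns 1 if any job failed, 0 otherwise.
run_parallel() {
  local runner="$1"; shift
  local pids=() status=0 name pid
  for name in "$@"; do
    "$runner" "$name" &
    pids+=("$!")
  done
  for pid in "${pids[@]}"; do
    wait "$pid" || status=1
  done
  return "$status"
}

# Thin wrapper around the GCP runner named in Option B.
gcp_phase() {
  bash run-gcp.sh "$1"
}

# Usage (from tests/bench, with GCP_PROJECT exported):
#   run_parallel gcp_phase coldstart hygiene detverify
```

Waiting on each PID individually, rather than calling a bare `wait`, keeps the per-job exit status so a single failed phase is not silently ignored.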

---

## Recommended Next Experiments (priority order)

| Priority | Experiment | Why First | Blocker |
|----------|-----------|----------|---------|
| 1 | **Exp 7: Deterministic Pre-Verification** | Low cost (~$50), binary question, no model API dependency | 30-min code audit gate (manual) |
| 2 | **Exp 1: Cold-Start Runtime Swap** | Highest expected impact (−80%+ cold start), unblocks Exps 2+3 | pgrep constraint (clean terminal or GCP) |
| 3 | **Exp 4: Prompt Hygiene Bundle** | Independent of Exp 1, moderate cost (~$25) | Token calibration (4 runs, clean terminal) |
| 4 | **Exp 5: Conditional Phase Execution** | Independent, moderate impact (≥20% phase reduction) | pgrep constraint only |
| 5 | **Exp 2: Per-Phase Model Assignment** | High cost impact, but decision depends on Exp 1 runtime | pgrep constraint + Exp 1 result first |
| 6 | **Exp 3: Prompt Cache Optimization** | −30–80% TTFT expected; requires API dispatch path | Exp 1 result first |
| 7 | **Exp 8: Adaptive Timeout** | Low priority — 5.2% failure rate barely clears 5% gate | Historical p95 data per phase type |
| 8 | **Exp 6: Fork-and-Vote** | Highest effort; reserve for when pipeline is stable | ~4 hours of annotation time from Mike |

---

## Open Questions (what the experiments would have answered)

1. Does direct API dispatch (no Claude CLI) reduce per-phase wall time by ≥80% vs. bare-CLI?
2. Does routing critic to gemini-flash or Haiku reduce cost ≥50% without quality degradation?
3. What is the actual cache hit rate for sequential phases? Does TTFT improve ≥30%?
4. Do tool allowlists meaningfully reduce input tokens? (Requires token decomposition calibration first.)
5. Which specs trigger conditional phase skipping, and does skipping cause more execute retries?
6. What fraction of BOI task failures are deterministic (compile/test) vs. semantic?

---

## Current Deployed State (observational evidence only)

The following were deployed to production before experiments ran — no RCT data:

- `bare=true` for critic phase: observed −96.5% cold-start (5,257ms → 184ms). Confounders possible.
- `openrouter/gemini-flash` for plan-critique + spec-critique: live, no regression reported, no formal cost measurement.

Recommendation: keep the deployed hybrid pending formal RCT results.
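
For reference, the −96.5% cold-start figure follows directly from the two observed timings. A small sketch of the arithmetic, where `pct_change` is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Hypothetical helper: percent change from a "before" value to an "after" value.
pct_change() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f%%\n", (after - before) / before * 100 }'
}

# Observed critic-phase cold start: 5,257 ms before, 184 ms after.
pct_change 5257 184   # prints -96.5%
```

This reproduces only the observed figure; it says nothing about the confounders noted above.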

---

*Opened by BOI worker S8253 (T0E44). Infrastructure is ready; the only blocker is running from outside a Claude process.*
