feat: Fly.io remote container dispatch for bench + verify by mrap · Pull Request #18 · mrap/boi

mrap · 2026-04-30T19:08:57Z

Summary

FlyDispatcher (src/remote/fly.rs): create/poll/delete Fly.io Machines via REST API with cost guard
boi bench --remote=fly: dispatch bench runs to Fly.io instead of local Docker; parallel via --concurrency
boi run-spec: new subcommand — container entrypoint that decodes BOI_SPEC_B64, runs the spec, emits JSON summary
Containerized verify (src/worker.rs): remote=fly path routes verify phase to Fly.io machine
Image pipeline: scripts/fly-push.sh, updated Dockerfile + entrypoint.sh, fly.toml
Docs: docs/fly-io-setup.md, docs/remote-dispatch.md, live smoke test results

Live Smoke Test (TCF21 — 2026-04-30)

BATTERY [remote:fly]: 1 specs × 1 pipelines × 1 runs = 1 total runs
  [fly] dispatching [smoke] simple.yaml run 1...
  [fly] done: machine=3287054ec3d548 duration=11.2s cost=$0.0000

Bench Results
  METRIC        smoke
  Avg completion  11s
  Completion rate 100%

machine_id=3287054ec3d548, region=iad, cost=$0.0000 (11.2s × $0.0000026/s).
Full diagnostics: docs/diagnostics/2026-04-30-fly-io-live-verified.md
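
The $0.0000 figure follows from the rate quoted above: 11.2 s at $0.0000026/s is well below the four-decimal display precision. A minimal sketch of that arithmetic follows; the helper names `run_cost_usd` and `format_cost` are hypothetical and are not taken from the PR.

```rust
/// Per-second rate as quoted in the smoke-test summary line.
const RATE_USD_PER_SEC: f64 = 0.0000026;

/// Cost of one run under per-second billing.
/// Hypothetical helper: the PR does not show how cost is computed.
fn run_cost_usd(duration_secs: f64, rate_usd_per_sec: f64) -> f64 {
    duration_secs * rate_usd_per_sec
}

/// Format a cost with four decimal places, matching the bench output above.
fn format_cost(cost_usd: f64) -> String {
    format!("${:.4}", cost_usd)
}
```

Calling `format_cost(run_cost_usd(11.2, RATE_USD_PER_SEC))` gives the same `$0.0000` shown in the log, since the unrounded value is about $0.000029.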

Background

S7E70 surveyed 26 options and recommended Fly.io Machines (scored 4.45/5 vs Depot.dev 4.04,
Hetzner+Nomad 3.86, Buildkite). Key factors: $14–23/month at 900 runs/month, 1–3s warm starts,
per-second billing, standard OCI images (zero lock-in), official Tailscale support.

Test plan

cargo build passes
boi bench --help shows --remote flag
boi run-spec --help shows subcommand
scripts/fly-push.sh is executable
Live re-run: boi bench --remote=fly --spec tests/bench_specs/simple.yaml --runs 1

🤖 Generated with Claude Code

…ied errors, validation, live OpenRouter

Implements axes 1, 2, 7, 8, 9 from the BOI provider architecture design doc:

- Provider trait: name, capabilities, validate_config, invoke, cost_estimate, actual_cost
- ProviderRegistry: built-ins (claude, openrouter, deterministic) registered at startup
- ProviderError: unified enum replacing bespoke per-provider error handling
- Validation lifecycle: registration-time, TOML-load-time (loud startup warnings), pre-invocation
- runner.rs refactored: registry lookup replaces if/else chain + BOI_FORCE_CLAUDE removed
- Unified telemetry: boi.phase.invoked / boi.phase.completed / boi.provider.error
- CodexProvider added as third impl (proves extensibility without touching runner.rs)
- Live verification: OpenRouter confirmed firing via daemon log + cost signature
- docs/providers.md: full Provider trait contract + how-to guide

Hard outcome: the OpenRouter-runtime-drop bug is now impossible. The daemon either honors `runtime = "openrouter"` or fails loudly at startup with an actionable message pointing to the missing env var.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
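
The registry lookup and fail-loudly validation described in this commit message might look roughly like the sketch below. It is illustrative only: the method signatures, the `ProviderError` variants, the `resolve` method, and the `OPENROUTER_API_KEY` variable name are all assumptions, and only three of the six trait methods named above are modeled. No network call is made.

```rust
use std::collections::HashMap;
use std::fmt;

/// Unified error enum; these two variants are assumed for illustration.
#[derive(Debug, PartialEq)]
enum ProviderError {
    MissingEnvVar(&'static str),
    UnknownProvider(String),
}

impl fmt::Display for ProviderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ProviderError::MissingEnvVar(var) => write!(f, "missing env var {}", var),
            ProviderError::UnknownProvider(name) => write!(f, "no provider named {}", name),
        }
    }
}

/// Subset of the trait from the commit message; signatures are guesses.
trait Provider {
    fn name(&self) -> &'static str;
    fn validate_config(&self, env: &HashMap<String, String>) -> Result<(), ProviderError>;
    fn invoke(&self, prompt: &str) -> Result<String, ProviderError>;
}

struct Deterministic;
impl Provider for Deterministic {
    fn name(&self) -> &'static str { "deterministic" }
    fn validate_config(&self, _env: &HashMap<String, String>) -> Result<(), ProviderError> { Ok(()) }
    fn invoke(&self, prompt: &str) -> Result<String, ProviderError> { Ok(format!("echo: {}", prompt)) }
}

struct OpenRouter;
impl Provider for OpenRouter {
    fn name(&self) -> &'static str { "openrouter" }
    fn validate_config(&self, env: &HashMap<String, String>) -> Result<(), ProviderError> {
        // Hypothetical variable name; the source only says "the missing env var".
        if env.contains_key("OPENROUTER_API_KEY") { Ok(()) }
        else { Err(ProviderError::MissingEnvVar("OPENROUTER_API_KEY")) }
    }
    fn invoke(&self, _prompt: &str) -> Result<String, ProviderError> { Ok(String::from("stub response")) }
}

#[derive(Default)]
struct ProviderRegistry {
    providers: HashMap<&'static str, Box<dyn Provider>>,
}

impl ProviderRegistry {
    fn register(&mut self, p: Box<dyn Provider>) {
        self.providers.insert(p.name(), p);
    }

    /// Look up a provider by its runtime name and validate it,
    /// returning an error instead of silently falling back to another provider.
    fn resolve(&self, runtime: &str, env: &HashMap<String, String>) -> Result<&dyn Provider, ProviderError> {
        let p = self
            .providers
            .get(runtime)
            .ok_or_else(|| ProviderError::UnknownProvider(runtime.to_string()))?;
        p.validate_config(env)?;
        Ok(p.as_ref())
    }
}
```

With this shape, adding another provider means one more `register` call and no change to the lookup code, which is the extensibility property the commit message attributes to CodexProvider.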

- FlyDispatcher (src/remote/fly.rs): create/poll/delete Fly.io Machines via REST API
- Cost guard: refuse dispatch when estimated cost > max_cost_usd (default $10)
- boi bench --remote=fly: dispatch bench runs to Fly.io instead of local Docker
- boi run-spec subcommand: container entrypoint that decodes BOI_SPEC_B64, dispatches, emits JSON
- base64 crate replaces hand-rolled encoder (Cargo.toml)
- worker.rs: containerized verify support via remote=fly path
- tests/bench/Dockerfile + entrypoint.sh: updated for Fly.io machine execution
- scripts/fly-push.sh: build + tag + push helper
- fly.toml: app config for boi-workers
- docs/fly-io-setup.md: account, token, app setup guide
- docs/remote-dispatch.md: architecture, cost model, local vs fly guidance
- docs/diagnostics/2026-04-30-fly-io-live-verified.md: live smoke test results

Live smoke test: machine=3287054ec3d548, 11.2s, cost=$0.0000, 100% completion rate.

Ref: S7E70 recommendation (Fly.io scored 4.45/5 in 26-option decision matrix).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
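
The cost guard rule stated in this commit message (refuse dispatch when estimated cost > max_cost_usd, default $10) can be sketched as follows. How FlyDispatcher actually estimates cost is not shown in the PR, so `estimate_cost_usd` and its runs × seconds × rate formula are assumptions, as are the type and function names.

```rust
/// Default limit stated in the commit message.
const DEFAULT_MAX_COST_USD: f64 = 10.0;

/// Returned when a dispatch is refused (hypothetical type).
#[derive(Debug, PartialEq)]
struct CostGuardError {
    estimated_usd: f64,
    max_cost_usd: f64,
}

/// Assumed estimate: runs × expected seconds per run × per-second rate.
fn estimate_cost_usd(runs: u32, est_secs_per_run: f64, rate_usd_per_sec: f64) -> f64 {
    f64::from(runs) * est_secs_per_run * rate_usd_per_sec
}

/// Refuse only when the estimate strictly exceeds the limit,
/// mirroring "estimated cost > max_cost_usd".
fn check_cost_guard(estimated_usd: f64, max_cost_usd: f64) -> Result<(), CostGuardError> {
    if estimated_usd > max_cost_usd {
        Err(CostGuardError { estimated_usd, max_cost_usd })
    } else {
        Ok(())
    }
}
```

For example, 900 runs at an assumed 60 s each and the $0.0000026/s rate from the smoke test estimate to about $0.14, so `check_cost_guard` would let the dispatch through under the default limit.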

…registration

Add two experiment guards to prevent zero-signal spec completions:

1. Artifact-gated completion (Gate 2 in completion handler):
   - Parse key_artifacts[] from spec YAML (path + validate command)
   - For discover/generate specs: validate all artifacts after tasks complete
   - All artifacts valid → COMPLETED; any failure → INCONCLUSIVE
   - Diagnosis written to DB error field via update_spec_with_error()

2. Pre-registration validation (Gate 1 in dispatcher):
   - Reject discover/generate specs missing hypothesis, success_criteria, key_artifacts
   - Optional preconditions[] run as t-0 checks; failure → INCONCLUSIVE
   - execute/challenge mode: experiment fields optional (no regression)

New terminal state: INCONCLUSIVE (tasks ran but spec produced no declared answer).

Tests: 19 new integration tests in tests/test_experiment_guards.rs (all passing).

Resolves: S1511 boi-experiment-validation-guards
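
The two gates described in this commit message reduce to a pair of small decision functions. The sketch below is an assumption-heavy illustration: the `Spec` struct, the enum names, and both function names are hypothetical, a closure stands in for running each artifact's validate command, and the preconditions[] check is not modeled.

```rust
#[derive(Debug, PartialEq)]
enum Mode { Discover, Generate, Execute, Challenge }

#[derive(Debug, PartialEq)]
enum Terminal { Completed, Inconclusive }

/// Simplified spec: artifacts are reduced to their paths.
struct Spec {
    mode: Mode,
    hypothesis: Option<String>,
    success_criteria: Option<String>,
    key_artifacts: Vec<String>,
}

/// Gate 1: discover/generate specs must declare all experiment fields.
/// Returns the names of the missing fields on rejection.
fn preregister(spec: &Spec) -> Result<(), Vec<&'static str>> {
    if !matches!(spec.mode, Mode::Discover | Mode::Generate) {
        return Ok(()); // execute/challenge: experiment fields are optional
    }
    let mut missing = Vec::new();
    if spec.hypothesis.is_none() { missing.push("hypothesis"); }
    if spec.success_criteria.is_none() { missing.push("success_criteria"); }
    if spec.key_artifacts.is_empty() { missing.push("key_artifacts"); }
    if missing.is_empty() { Ok(()) } else { Err(missing) }
}

/// Gate 2: after tasks complete, every declared artifact must validate.
/// `artifact_ok` is a stand-in for the artifact's validate command.
fn gate_completion(spec: &Spec, artifact_ok: impl Fn(&str) -> bool) -> Terminal {
    let gated = matches!(spec.mode, Mode::Discover | Mode::Generate);
    if !gated || spec.key_artifacts.iter().all(|a| artifact_ok(a.as_str())) {
        Terminal::Completed
    } else {
        Terminal::Inconclusive
    }
}
```

Keeping the validator as a closure makes the gate easy to exercise in tests without touching the filesystem, though the real completion handler presumably shells out to the validate command.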

mrap and others added 3 commits April 30, 2026 01:45

mrap wants to merge 3 commits into main from boi/S0723-fly-io-remote-dispatch