feat: Fly.io remote container dispatch for bench + verify #18

Open
mrap wants to merge 3 commits into main from boi/S0723-fly-io-remote-dispatch

@mrap mrap commented Apr 30, 2026

Summary

  • FlyDispatcher (src/remote/fly.rs): create/poll/delete Fly.io Machines via REST API with cost guard
  • boi bench --remote=fly: dispatch bench runs to Fly.io instead of local Docker; parallel via --concurrency
  • boi run-spec: new subcommand — container entrypoint that decodes BOI_SPEC_B64, runs the spec, emits JSON summary
  • Containerized verify (src/worker.rs): remote=fly path routes verify phase to Fly.io machine
  • Image pipeline: scripts/fly-push.sh, updated Dockerfile + entrypoint.sh, fly.toml
  • Docs: docs/fly-io-setup.md, docs/remote-dispatch.md, live smoke test results
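A minimal sketch of the cost guard idea, for reviewers who want the gist without opening src/remote/fly.rs. Only `max_cost_usd` and its $10 default come from this PR; the `CostGuard` and `CostExceeded` types, the `check` signature, and the estimate formula (runs × expected seconds × per-second rate) are assumptions for illustration and may not match the actual code.

```rust
/// Hypothetical guard config. Only the `max_cost_usd` name and the
/// $10 default are taken from the PR; everything else is illustrative.
#[derive(Debug, Clone, Copy)]
pub struct CostGuard {
    pub max_cost_usd: f64,
}

impl Default for CostGuard {
    fn default() -> Self {
        Self { max_cost_usd: 10.0 }
    }
}

/// Returned when the estimate is above the configured ceiling.
#[derive(Debug, PartialEq)]
pub struct CostExceeded {
    pub estimated_usd: f64,
    pub max_cost_usd: f64,
}

impl CostGuard {
    /// Assumed estimate: runs × expected seconds per run × per-second rate.
    /// Returns the estimate when it is within budget, so callers can log it.
    pub fn check(
        &self,
        runs: u32,
        est_secs_per_run: f64,
        usd_per_sec: f64,
    ) -> Result<f64, CostExceeded> {
        let estimated_usd = f64::from(runs) * est_secs_per_run * usd_per_sec;
        if estimated_usd > self.max_cost_usd {
            Err(CostExceeded {
                estimated_usd,
                max_cost_usd: self.max_cost_usd,
            })
        } else {
            Ok(estimated_usd)
        }
    }
}
```

With the default ceiling, `CostGuard::default().check(1, 11.2, 0.0000026)` passes, while a batch of a million 60-second runs at the same rate is refused before any machine is created.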

Live Smoke Test (TCF21 — 2026-04-30)

BATTERY [remote:fly]: 1 specs × 1 pipelines × 1 runs = 1 total runs
  [fly] dispatching [smoke] simple.yaml run 1...
  [fly] done: machine=3287054ec3d548 duration=11.2s cost=$0.0000

Bench Results
  METRIC        smoke
  Avg completion  11s
  Completion rate 100%

machine_id=3287054ec3d548, region=iad, cost=$0.0000 (11.2s × $0.0000026/s).
Full diagnostics: docs/diagnostics/2026-04-30-fly-io-live-verified.md
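
The $0.0000 figure is a rounding effect, not a free run: 11.2s at the quoted per-second rate is well under a hundredth of a cent. A small sketch of the arithmetic, assuming cost is printed with four decimal places (the helper names `run_cost_usd` and `format_cost` are hypothetical, not functions from this PR):

```rust
/// Per-second rate quoted in the smoke test output above.
const USD_PER_SEC: f64 = 0.0000026;

/// Cost of one run under per-second billing.
pub fn run_cost_usd(duration_secs: f64) -> f64 {
    duration_secs * USD_PER_SEC
}

/// Assumed display format: four decimal places, as in the log line.
pub fn format_cost(cost_usd: f64) -> String {
    format!("${:.4}", cost_usd)
}
```

For the smoke run, `run_cost_usd(11.2)` is about 0.00002912 USD, which `format_cost` renders as `$0.0000`.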

Background

S7E70 surveyed 26 options and recommended Fly.io Machines, which scored 4.45/5 against 4.04 for
Depot.dev and 3.86 for Hetzner+Nomad (Buildkite was also compared). Key factors: $14–23/month at
900 runs/month, 1–3s warm starts, per-second billing, standard OCI images (zero lock-in), and
official Tailscale support.

Test plan

  • cargo build passes
  • boi bench --help shows --remote flag
  • boi run-spec --help shows subcommand
  • scripts/fly-push.sh is executable
  • Live re-run: boi bench --remote=fly --spec tests/bench_specs/simple.yaml --runs 1

🤖 Generated with Claude Code

mrap and others added 3 commits April 30, 2026 01:45
…ied errors, validation, live OpenRouter

Implements axes 1, 2, 7, 8, 9 from the BOI provider architecture design doc:

- Provider trait: name, capabilities, validate_config, invoke, cost_estimate, actual_cost
- ProviderRegistry: built-ins (claude, openrouter, deterministic) registered at startup
- ProviderError: unified enum replacing bespoke per-provider error handling
- Validation lifecycle: registration-time, TOML-load-time (loud startup warnings), pre-invocation
- runner.rs refactored: registry lookup replaces if/else chain + BOI_FORCE_CLAUDE removed
- Unified telemetry: boi.phase.invoked / boi.phase.completed / boi.provider.error
- CodexProvider added as third impl (proves extensibility without touching runner.rs)
- Live verification: OpenRouter confirmed firing via daemon log + cost signature
- docs/providers.md: full Provider trait contract + how-to guide

Hard outcome: the OpenRouter-runtime-drop bug is now impossible. The daemon
either honors `runtime = "openrouter"` or fails loudly at startup with an
actionable message pointing to the missing env var.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
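
The "honor the runtime or fail loudly" behaviour described in the commit above could be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the trait is cut down to two of the listed methods, the signatures are guessed, and the `OPENROUTER_API_KEY` variable name and the `resolve` method are hypothetical.

```rust
use std::collections::HashMap;
use std::fmt;

/// Unified error type; the variants here are illustrative.
#[derive(Debug, PartialEq)]
pub enum ProviderError {
    MissingEnvVar { provider: String, var: String },
    UnknownProvider(String),
}

impl fmt::Display for ProviderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ProviderError::MissingEnvVar { provider, var } => write!(
                f,
                "provider `{}` is configured but {} is not set; export it and restart",
                provider, var
            ),
            ProviderError::UnknownProvider(name) => {
                write!(f, "no provider registered for runtime `{}`", name)
            }
        }
    }
}

/// Cut-down trait: only `name` and `validate_config` from the PR's list.
pub trait Provider {
    fn name(&self) -> &str;
    fn validate_config(&self, env: &HashMap<String, String>) -> Result<(), ProviderError>;
}

pub struct OpenRouterProvider;

impl Provider for OpenRouterProvider {
    fn name(&self) -> &str {
        "openrouter"
    }

    fn validate_config(&self, env: &HashMap<String, String>) -> Result<(), ProviderError> {
        // Assumed variable name; the PR only says "the missing env var".
        match env.get("OPENROUTER_API_KEY") {
            Some(value) if !value.is_empty() => Ok(()),
            _ => Err(ProviderError::MissingEnvVar {
                provider: self.name().to_string(),
                var: "OPENROUTER_API_KEY".to_string(),
            }),
        }
    }
}

#[derive(Default)]
pub struct ProviderRegistry {
    providers: HashMap<String, Box<dyn Provider>>,
}

impl ProviderRegistry {
    pub fn register(&mut self, provider: Box<dyn Provider>) {
        let name = provider.name().to_string();
        self.providers.insert(name, provider);
    }

    /// Registry lookup in place of an if/else chain: an unknown or
    /// misconfigured runtime is an error, never a silent fallback.
    pub fn resolve(
        &self,
        runtime: &str,
        env: &HashMap<String, String>,
    ) -> Result<&dyn Provider, ProviderError> {
        let provider = self
            .providers
            .get(runtime)
            .ok_or_else(|| ProviderError::UnknownProvider(runtime.to_string()))?;
        provider.validate_config(env)?;
        Ok(provider.as_ref())
    }
}
```

Passing the environment in as a map rather than reading `std::env` directly keeps the validation step testable without mutating process state.
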
- FlyDispatcher (src/remote/fly.rs): create/poll/delete Fly.io Machines via REST API
- Cost guard: refuse dispatch when estimated cost > max_cost_usd (default $10)
- boi bench --remote=fly: dispatch bench runs to Fly.io instead of local Docker
- boi run-spec subcommand: container entrypoint — decodes BOI_SPEC_B64, dispatches, emits JSON
- base64 crate replaces hand-rolled encoder (Cargo.toml)
- worker.rs: containerized verify support via remote=fly path
- tests/bench/Dockerfile + entrypoint.sh: updated for Fly.io machine execution
- scripts/fly-push.sh: build + tag + push helper
- fly.toml: app config for boi-workers
- docs/fly-io-setup.md: account, token, app setup guide
- docs/remote-dispatch.md: architecture, cost model, local vs fly guidance
- docs/diagnostics/2026-04-30-fly-io-live-verified.md: live smoke test results

Live smoke test: machine=3287054ec3d548, 11.2s, cost=$0.0000, 100% completion rate.
Ref: S7E70 recommendation (Fly.io scored 4.45/5 in 26-option decision matrix).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…registration

Add two experiment guards to prevent zero-signal spec completions:

1. Artifact-gated completion (Gate 2 in completion handler):
   - Parse key_artifacts[] from spec YAML (path + validate command)
   - For discover/generate specs: validate all artifacts after tasks complete
   - All artifacts valid → COMPLETED; any failure → INCONCLUSIVE
   - Diagnosis written to DB error field via update_spec_with_error()

2. Pre-registration validation (Gate 1 in dispatcher):
   - Reject discover/generate specs missing hypothesis, success_criteria, key_artifacts
   - Optional preconditions[] run as t-0 checks; failure → INCONCLUSIVE
   - execute/challenge mode: experiment fields optional (no regression)

New terminal state: INCONCLUSIVE — tasks ran but spec produced no declared answer.
Tests: 19 new integration tests in tests/test_experiment_guards.rs (all passing).
Resolves: S1511 boi-experiment-validation-guards
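
A rough sketch of how the two gates above fit together. The type and function names (`Mode`, `Spec`, `Terminal`, `pre_register`, `gate_completion`) are hypothetical, running each artifact's validate command is replaced by a precomputed boolean, and preconditions[] are left out; the real handler and dispatcher may be structured differently.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Discover,
    Generate,
    Execute,
    Challenge,
}

impl Mode {
    /// Only discover/generate specs carry mandatory experiment fields.
    fn is_experiment(self) -> bool {
        matches!(self, Mode::Discover | Mode::Generate)
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum Terminal {
    Completed,
    /// Tasks ran but the spec produced no declared answer.
    Inconclusive(String),
}

pub struct Spec {
    pub mode: Mode,
    pub hypothesis: Option<String>,
    pub success_criteria: Option<String>,
    pub key_artifacts: Vec<String>,
}

/// Gate 1 (dispatcher): reject experiment specs with missing fields,
/// returning the names of the fields that are absent.
pub fn pre_register(spec: &Spec) -> Result<(), Vec<&'static str>> {
    if !spec.mode.is_experiment() {
        return Ok(()); // execute/challenge: experiment fields optional
    }
    let mut missing = Vec::new();
    if spec.hypothesis.is_none() {
        missing.push("hypothesis");
    }
    if spec.success_criteria.is_none() {
        missing.push("success_criteria");
    }
    if spec.key_artifacts.is_empty() {
        missing.push("key_artifacts");
    }
    if missing.is_empty() {
        Ok(())
    } else {
        Err(missing)
    }
}

/// Gate 2 (completion handler): every artifact must have validated.
/// Each entry is (artifact path, whether its validate command passed).
pub fn gate_completion(mode: Mode, artifact_checks: &[(&str, bool)]) -> Terminal {
    if !mode.is_experiment() {
        return Terminal::Completed;
    }
    match artifact_checks.iter().find(|check| !check.1) {
        Some((path, _)) => {
            Terminal::Inconclusive(format!("artifact failed validation: {}", path))
        }
        None => Terminal::Completed,
    }
}
```

The diagnosis string carried by `Inconclusive` corresponds to what the commit stores in the DB error field.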