feat: Fly.io remote container dispatch for bench + verify #18
Open
mrap wants to merge 3 commits into
…ied errors, validation, live OpenRouter

Implements axes 1, 2, 7, 8, 9 from the BOI provider architecture design doc:

- Provider trait: name, capabilities, validate_config, invoke, cost_estimate, actual_cost
- ProviderRegistry: built-ins (claude, openrouter, deterministic) registered at startup
- ProviderError: unified enum replacing bespoke per-provider error handling
- Validation lifecycle: registration-time, TOML-load-time (loud startup warnings), pre-invocation
- runner.rs refactored: registry lookup replaces if/else chain + BOI_FORCE_CLAUDE removed
- Unified telemetry: boi.phase.invoked / boi.phase.completed / boi.provider.error
- CodexProvider added as third impl (proves extensibility without touching runner.rs)
- Live verification: OpenRouter confirmed firing via daemon log + cost signature
- docs/providers.md: full Provider trait contract + how-to guide

Hard outcome: the OpenRouter-runtime-drop bug is now impossible. The daemon either honors `runtime = "openrouter"` or fails loudly at startup with an actionable message pointing to the missing env var.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
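For reviewers who want a picture of the registry-lookup pattern this commit describes, here is a minimal sketch. Only the method names `name`, `validate_config`, `invoke`, and `cost_estimate` and the type names `Provider`, `ProviderRegistry`, and `ProviderError` come from the commit message. The signatures, the error variants, the `KeyedProvider` stand-in, and the `EXAMPLE_API_KEY` name are assumptions for illustration, not the code in this PR. `capabilities` and `actual_cost` are omitted for brevity.

```rust
use std::collections::HashMap;

/// Unified error type. The variants are assumed, not taken from the PR.
#[derive(Debug, PartialEq)]
pub enum ProviderError {
    UnknownProvider(String),
    MissingConfig(String),
}

/// Method names follow the commit message; signatures are assumptions.
pub trait Provider {
    fn name(&self) -> &str;
    fn validate_config(&self) -> Result<(), ProviderError>;
    fn invoke(&self, prompt: &str) -> Result<String, ProviderError>;
    fn cost_estimate(&self, prompt: &str) -> f64;
}

/// A provider with no external configuration.
pub struct Deterministic;

impl Provider for Deterministic {
    fn name(&self) -> &str { "deterministic" }
    fn validate_config(&self) -> Result<(), ProviderError> { Ok(()) }
    fn invoke(&self, prompt: &str) -> Result<String, ProviderError> {
        Ok(format!("echo: {prompt}"))
    }
    fn cost_estimate(&self, _prompt: &str) -> f64 { 0.0 }
}

/// Hypothetical provider that needs an API key (stand-in for a hosted runtime).
pub struct KeyedProvider { pub api_key: Option<String> }

impl Provider for KeyedProvider {
    fn name(&self) -> &str { "keyed" }
    fn validate_config(&self) -> Result<(), ProviderError> {
        match &self.api_key {
            Some(k) if !k.is_empty() => Ok(()),
            // EXAMPLE_API_KEY is a made-up variable name.
            _ => Err(ProviderError::MissingConfig("EXAMPLE_API_KEY".to_string())),
        }
    }
    fn invoke(&self, prompt: &str) -> Result<String, ProviderError> {
        Ok(format!("[{}] {}", self.name(), prompt))
    }
    fn cost_estimate(&self, prompt: &str) -> f64 { prompt.len() as f64 * 1e-6 }
}

#[derive(Default)]
pub struct ProviderRegistry {
    providers: HashMap<String, Box<dyn Provider>>,
}

impl ProviderRegistry {
    /// Registration-time validation: a misconfigured provider is rejected
    /// here instead of being silently skipped at invocation time.
    pub fn register(&mut self, p: Box<dyn Provider>) -> Result<(), ProviderError> {
        p.validate_config()?;
        self.providers.insert(p.name().to_string(), p);
        Ok(())
    }

    /// Lookup by runtime name replaces an if/else chain; an unknown name is an error.
    pub fn get(&self, runtime: &str) -> Result<&dyn Provider, ProviderError> {
        match self.providers.get(runtime) {
            Some(p) => Ok(p.as_ref()),
            None => Err(ProviderError::UnknownProvider(runtime.to_string())),
        }
    }
}
```

With this shape, adding a provider means one `register` call at startup and no edit to the runner, which is the extensibility claim the CodexProvider bullet makes.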
- FlyDispatcher (src/remote/fly.rs): create/poll/delete Fly.io Machines via REST API
- Cost guard: refuse dispatch when estimated cost > max_cost_usd (default $10)
- boi bench --remote=fly: dispatch bench runs to Fly.io instead of local Docker
- boi run-spec subcommand: container entrypoint; decodes BOI_SPEC_B64, dispatches, emits JSON
- base64 crate replaces hand-rolled encoder (Cargo.toml)
- worker.rs: containerized verify support via remote=fly path
- tests/bench/Dockerfile + entrypoint.sh: updated for Fly.io machine execution
- scripts/fly-push.sh: build + tag + push helper
- fly.toml: app config for boi-workers
- docs/fly-io-setup.md: account, token, app setup guide
- docs/remote-dispatch.md: architecture, cost model, local vs fly guidance
- docs/diagnostics/2026-04-30-fly-io-live-verified.md: live smoke test results

Live smoke test: machine=3287054ec3d548, 11.2s, cost=$0.0000, 100% completion rate.

Ref: S7E70 recommendation (Fly.io scored 4.45/5 in 26-option decision matrix).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
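The cost guard and the smoke-test cost figure can be sketched together. This assumes the estimate is simply runs × seconds per run × per-second rate. The $0.0000026/s rate and the $10 default are taken from this PR; the function names and the error message are hypothetical and may not match `src/remote/fly.rs`.

```rust
/// Per-second rate quoted in the smoke-test line of this PR.
pub const RATE_USD_PER_SEC: f64 = 0.0000026;
/// Default ceiling named in the commit message.
pub const DEFAULT_MAX_COST_USD: f64 = 10.0;

/// Assumed cost model: linear in total machine-seconds.
pub fn estimate_cost_usd(runs: u32, secs_per_run: f64, rate_usd_per_sec: f64) -> f64 {
    f64::from(runs) * secs_per_run * rate_usd_per_sec
}

/// Refuse dispatch when the estimate is strictly above the ceiling.
/// Returns the estimate on success so the caller can log it.
pub fn check_cost_guard(estimated_usd: f64, max_cost_usd: f64) -> Result<f64, String> {
    if estimated_usd > max_cost_usd {
        Err(format!(
            "refusing dispatch: estimated ${estimated_usd:.4} exceeds max_cost_usd ${max_cost_usd:.2}"
        ))
    } else {
        Ok(estimated_usd)
    }
}
```

Under this model the 11.2 s smoke test costs about $0.000029, which is why it prints as $0.0000 at four decimal places.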
…registration

Add two experiment guards to prevent zero-signal spec completions:

1. Artifact-gated completion (Gate 2 in completion handler):
   - Parse key_artifacts[] from spec YAML (path + validate command)
   - For discover/generate specs: validate all artifacts after tasks complete
   - All artifacts valid → COMPLETED; any failure → INCONCLUSIVE
   - Diagnosis written to DB error field via update_spec_with_error()

2. Pre-registration validation (Gate 1 in dispatcher):
   - Reject discover/generate specs missing hypothesis, success_criteria, key_artifacts
   - Optional preconditions[] run as t-0 checks; failure → INCONCLUSIVE
   - execute/challenge mode: experiment fields optional (no regression)

New terminal state: INCONCLUSIVE (tasks ran but spec produced no declared answer).

Tests: 19 new integration tests in tests/test_experiment_guards.rs (all passing).

Resolves: S1511 boi-experiment-validation-guards
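The two gates above can be sketched as a pair of pure functions. The types and function names here are hypothetical and the `valid` flag stands in for running each artifact's validate command; preconditions and the DB write via `update_spec_with_error()` are left out. Only the field names, the modes, and the COMPLETED / INCONCLUSIVE outcomes come from the commit message.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Mode { Discover, Generate, Execute, Challenge }

#[derive(Debug, PartialEq)]
pub enum Terminal {
    Completed,
    /// Carries the diagnosis that would be written to the error field.
    Inconclusive(String),
}

pub struct Artifact {
    pub path: String,
    /// Stand-in for the result of the artifact's validate command.
    pub valid: bool,
}

pub struct Spec {
    pub mode: Mode,
    pub hypothesis: Option<String>,
    pub success_criteria: Option<String>,
    pub key_artifacts: Vec<Artifact>,
}

/// Experiment fields are only mandatory for discover/generate specs.
fn is_experiment(mode: Mode) -> bool {
    matches!(mode, Mode::Discover | Mode::Generate)
}

/// Gate 1: reject experiment specs that do not declare what they will prove.
pub fn pre_register(spec: &Spec) -> Result<(), String> {
    if !is_experiment(spec.mode) {
        return Ok(());
    }
    let mut missing = Vec::new();
    if spec.hypothesis.is_none() { missing.push("hypothesis"); }
    if spec.success_criteria.is_none() { missing.push("success_criteria"); }
    if spec.key_artifacts.is_empty() { missing.push("key_artifacts"); }
    if missing.is_empty() {
        Ok(())
    } else {
        Err(format!("missing fields: {}", missing.join(", ")))
    }
}

/// Gate 2: after tasks finish, every declared artifact must validate.
pub fn complete(spec: &Spec) -> Terminal {
    if !is_experiment(spec.mode) {
        return Terminal::Completed;
    }
    let failed: Vec<&str> = spec
        .key_artifacts
        .iter()
        .filter(|a| !a.valid)
        .map(|a| a.path.as_str())
        .collect();
    if failed.is_empty() {
        Terminal::Completed
    } else {
        Terminal::Inconclusive(format!("artifacts failed validation: {}", failed.join(", ")))
    }
}
```

Keeping both gates free of I/O in this way would make them easy to cover with table-driven tests, though the PR's actual test layout may differ.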
Summary
- FlyDispatcher (`src/remote/fly.rs`): create/poll/delete Fly.io Machines via REST API, with cost guard
- `boi bench --remote=fly`: dispatch bench runs to Fly.io instead of local Docker; parallel via `--concurrency`
- `boi run-spec`: new subcommand; container entrypoint that decodes `BOI_SPEC_B64`, runs the spec, emits JSON summary
- `src/worker.rs`: `remote=fly` path routes verify phase to Fly.io machine
- `scripts/fly-push.sh`, updated `Dockerfile` + `entrypoint.sh`, `fly.toml`
- `docs/fly-io-setup.md`, `docs/remote-dispatch.md`, live smoke test results

Live Smoke Test (TCF21, 2026-04-30)
machine_id=`3287054ec3d548`, region=iad, cost=$0.0000 (11.2s × $0.0000026/s).
Full diagnostics: `docs/diagnostics/2026-04-30-fly-io-live-verified.md`

Background
S7E70 surveyed 26 options and recommended Fly.io Machines (scored 4.45/5 vs Depot.dev 4.04,
Hetzner+Nomad 3.86, Buildkite). Key factors: $14–23/month at 900 runs/month, 1–3s warm starts,
per-second billing, standard OCI images (zero lock-in), official Tailscale support.
Test plan
- `cargo build` passes
- `boi bench --help` shows `--remote` flag
- `boi run-spec --help` shows subcommand
- `scripts/fly-push.sh` is executable
- `boi bench --remote=fly --spec tests/bench_specs/simple.yaml --runs 1`

🤖 Generated with Claude Code