fix(hetzner): retry Claude native installer with backoff + clear error#123
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit af41081. Configure here.
| `This is a transient Cloudflare 403. It usually clears within a few minutes.\n\n` + | ||
| `What to do: wait a moment, then re-run \`agentbox prepare --provider hetzner --force\`.\n` + | ||
| `Full trace: /var/log/agentbox/install.log inside any box made from the resulting snapshot.`, | ||
| ); |
There was a problem hiding this comment.
Exit 71 log path wrong
Low Severity
The new exit 71 error tells users the full install trace lives at /var/log/agentbox/install.log on a box from the resulting snapshot. A failed prepare never creates a snapshot and the temp VPS is deleted in the failure cleanup, so that path is not reachable.
Reviewed by Cursor Bugbot for commit af41081. Configure here.
…r on failure The Claude native installer (`curl https://claude.ai/install.sh | bash`) can get an intermittent HTTP 403 from Cloudflare on cloud-datacenter egress IPs (Hetzner among them) under load. The bare `curl | bash` masked it (pipeline exit = bash's 0), so a 403 silently baked a snapshot with no `claude` — boxes from it had no agent, so the in-box tmux session died instantly and `attach` crash-looped on "no server running on /tmp/tmux-1000/default". Replace the masked install with a `retry_backoff` helper: retry the native installer 3x with 60s then 240s backoff (~5 min), keep `set -o pipefail`, and fold `command -v claude` into the retried command so a "succeeded but absent" result also retries. If all attempts fail, abort the bake (exit 71) rather than ship a claude-less snapshot. Applied to hetzner / vercel / e2b bake scripts and the docker Dockerfile (sh/dash plain-loop variant). No npm fallback: `npm install -g @anthropic-ai/claude-code` lacks native-only features and lands at /usr/bin/claude, mismatching the host-seeded installMethod=native and tripping Claude Code's "missing or broken" doctor warning. `prepareHetzner` now special-cases exit 71 with an actionable message (the generic one showed empty stderr because the install runs `bash -x ... 2>&1 | tee`, merging stderr into stdout). Known gap (deferred): the 403 can outlast the ~5-min retry window, so prepare can still fail and need a manual re-run. The validated reliable fix (host-proxies the native binary download, places it at ~/.local/bin/claude) is documented in docs/hertzner_backlog.md as a follow-up. Claude-Session: https://claude.ai/code/session_019m5WHxP4vmsoXaHUhQdY9e
af41081 to
b38c5f9
Compare


Problem
The rebuilt Hetzner base snapshot shipped with no
claude, so boxes from it had no agent — the in-box tmux session died instantly andattachcrash-looped onno server running on /tmp/tmux-1000/default.Root cause: the bake's
curl -fsSL https://claude.ai/install.sh | bash -s stablegot an intermittent HTTP 403 from Cloudflare (which throttles cloud-datacenter egress IPs, Hetzner among them, under load). The barecurl | bashmasked it —curl -fexits non-zero on the 403, but the pipeline's status isbash's0, soset -enever fired and the step "succeeded" while baking a claude-less snapshot.Fix
retry_backoffbash helper retries the native installer 3× with 60s then 240s backoff (~5 min), keepsset -o pipefail, and foldscommand -v claudeinto the retried command so a "succeeded but absent" result also retries. If all attempts fail, the bake aborts (exit 71) rather than ship a broken snapshot. Applied to hetzner / vercel / e2b bake scripts + the dockerDockerfile.box(sh/dash plain-loop variant).npm install -g @anthropic-ai/claude-codelacks native-only features and lands at/usr/bin/claude, mismatching the host-seededinstallMethod=nativeand tripping Claude Code's "missing or broken" doctor warning.prepareHetznernow special-casesexit 71with an actionable error (the generic one showed empty stderr because the install runsbash -x ... 2>&1 | tee, merging stderr into stdout).Known gap (deferred, documented)
The 403 can outlast the ~5-min retry window, so
preparecan still fail and need a manual re-run — this PR does not fully fix that case. The validated reliable fix (host-proxies the native binary download —claude installhas no offline mode, but the downloaded binary placed directly at~/.local/bin/clauderuns offline) is written up indocs/hertzner_backlog.mdas a follow-up.Verification
retry_backoffunit-tested (sleep stubbed) in bash + dash: succeeds attempt 1, recovers attempt 3 with 60s/240s waits, fails after 3.bash -nclean on all bake scripts; hetzner/vercel/e2b package tests pass (148 total).exit 71message bundled into the CLI dist.Action required after merge
Rebake:
agentbox prepare --provider hetzner --force(the fix only applies to new bakes).https://claude.ai/code/session_019m5WHxP4vmsoXaHUhQdY9e
Note
Low Risk
Changes are confined to image/prepare install scripts and Hetzner prepare error messaging; no runtime auth or data-path changes. Residual risk is prepare still failing if CDN blocks persist beyond the retry window.
Overview
Fixes claude-less base snapshots when Cloudflare intermittently 403s the native installer from datacenter egress (e.g. Hetzner). A bare
curl | bashcould report success whileclaudenever landed, breaking in-box tmux/attach.Bake scripts (
install-box.sh,provision.sh,build-template.sh,Dockerfile.box) now use aretry_backoffhelper (3 tries, 60s / 240s waits),pipefail, and acommand -v claudecheck; all failures abort withexit 71instead of shipping a broken image. No npm fallback — native~/.local/bin/claudestays aligned with hostinstallMethod=native.prepareHetznermapsexit 71to an actionable “transient Cloudflare 403, re-runprepare --force” message (generic errors showed empty stderr because install output is tee’d to stdout).docs/hertzner_backlog.mdrecords the issue as resolved and notes the deferred host-proxied binary download if 403 outlasts ~5 minutes. Rebake required after merge for Hetzner bases.Reviewed by Cursor Bugbot for commit af41081. Configure here.