Skip to content

fix(hetzner): retry Claude native installer with backoff + clear error#123

Merged
madarco merged 1 commit into
nightlyfrom
fix/hetzner-claude-install-retry
Jun 27, 2026
Merged

fix(hetzner): retry Claude native installer with backoff + clear error#123
madarco merged 1 commit into
nightlyfrom
fix/hetzner-claude-install-retry

Conversation

@madarco

@madarco madarco commented Jun 27, 2026

Copy link
Copy Markdown
Owner

Problem

The rebuilt Hetzner base snapshot shipped with no claude, so boxes from it had no agent — the in-box tmux session died instantly and attach crash-looped on no server running on /tmp/tmux-1000/default.

Root cause: the bake's curl -fsSL https://claude.ai/install.sh | bash -s stable got an intermittent HTTP 403 from Cloudflare (which throttles cloud-datacenter egress IPs, Hetzner among them, under load). The bare curl | bash masked it — curl -f exits non-zero on the 403, but the pipeline's status is bash's 0, so set -e never fired and the step "succeeded" while baking a claude-less snapshot.

Fix

  • A retry_backoff bash helper retries the native installer 3× with 60s then 240s backoff (~5 min), keeps set -o pipefail, and folds command -v claude into the retried command so a "succeeded but absent" result also retries. If all attempts fail, the bake aborts (exit 71) rather than ship a broken snapshot. Applied to hetzner / vercel / e2b bake scripts + the docker Dockerfile.box (sh/dash plain-loop variant).
  • No npm fallback (an earlier attempt used one): npm install -g @anthropic-ai/claude-code lacks native-only features and lands at /usr/bin/claude, mismatching the host-seeded installMethod=native and tripping Claude Code's "missing or broken" doctor warning.
  • prepareHetzner now special-cases exit 71 with an actionable error (the generic one showed empty stderr because the install runs bash -x ... 2>&1 | tee, merging stderr into stdout).

Known gap (deferred, documented)

The 403 can outlast the ~5-min retry window, so prepare can still fail and need a manual re-run — this PR does not fully fix that case. The validated reliable fix (host-proxies the native binary download — claude install has no offline mode, but the downloaded binary placed directly at ~/.local/bin/claude runs offline) is written up in docs/hertzner_backlog.md as a follow-up.

Verification

  • retry_backoff unit-tested (sleep stubbed) in bash + dash: succeeds attempt 1, recovers attempt 3 with 60s/240s waits, fails after 3.
  • bash -n clean on all bake scripts; hetzner/vercel/e2b package tests pass (148 total).
  • New exit 71 message bundled into the CLI dist.

Action required after merge

Rebake: agentbox prepare --provider hetzner --force (the fix only applies to new bakes).

https://claude.ai/code/session_019m5WHxP4vmsoXaHUhQdY9e


Note

Low Risk
Changes are confined to image/prepare install scripts and Hetzner prepare error messaging; no runtime auth or data-path changes. Residual risk is prepare still failing if CDN blocks persist beyond the retry window.

Overview
Fixes claude-less base snapshots when Cloudflare intermittently 403s the native installer from datacenter egress (e.g. Hetzner). A bare curl | bash could report success while claude never landed, breaking in-box tmux/attach.

Bake scripts (install-box.sh, provision.sh, build-template.sh, Dockerfile.box) now use a retry_backoff helper (3 tries, 60s / 240s waits), pipefail, and a command -v claude check; all failures abort with exit 71 instead of shipping a broken image. No npm fallback — native ~/.local/bin/claude stays aligned with host installMethod=native.

prepareHetzner maps exit 71 to an actionable “transient Cloudflare 403, re-run prepare --force” message (generic errors showed empty stderr because install output is tee’d to stdout).

docs/hertzner_backlog.md records the issue as resolved and notes the deferred host-proxied binary download if 403 outlasts ~5 minutes. Rebake required after merge for Hetzner bases.

Reviewed by Cursor Bugbot for commit af41081. Configure here.

@vercel

vercel Bot commented Jun 27, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
agentbox-web Ready Ready Preview, Comment Jun 27, 2026 12:30pm

Request Review

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit af41081. Configure here.

`This is a transient Cloudflare 403. It usually clears within a few minutes.\n\n` +
`What to do: wait a moment, then re-run \`agentbox prepare --provider hetzner --force\`.\n` +
`Full trace: /var/log/agentbox/install.log inside any box made from the resulting snapshot.`,
);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exit 71 log path wrong

Low Severity

The new exit 71 error tells users the full install trace lives at /var/log/agentbox/install.log on a box from the resulting snapshot. A failed prepare never creates a snapshot and the temp VPS is deleted in the failure cleanup, so that path is not reachable.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit af41081. Configure here.

…r on failure

The Claude native installer (`curl https://claude.ai/install.sh | bash`) can
get an intermittent HTTP 403 from Cloudflare on cloud-datacenter egress IPs
(Hetzner among them) under load. The bare `curl | bash` masked it (pipeline
exit = bash's 0), so a 403 silently baked a snapshot with no `claude` — boxes
from it had no agent, so the in-box tmux session died instantly and `attach`
crash-looped on "no server running on /tmp/tmux-1000/default".

Replace the masked install with a `retry_backoff` helper: retry the native
installer 3x with 60s then 240s backoff (~5 min), keep `set -o pipefail`, and
fold `command -v claude` into the retried command so a "succeeded but absent"
result also retries. If all attempts fail, abort the bake (exit 71) rather
than ship a claude-less snapshot. Applied to hetzner / vercel / e2b bake
scripts and the docker Dockerfile (sh/dash plain-loop variant).

No npm fallback: `npm install -g @anthropic-ai/claude-code` lacks native-only
features and lands at /usr/bin/claude, mismatching the host-seeded
installMethod=native and tripping Claude Code's "missing or broken" doctor
warning.

`prepareHetzner` now special-cases exit 71 with an actionable message (the
generic one showed empty stderr because the install runs `bash -x ... 2>&1 |
tee`, merging stderr into stdout).

Known gap (deferred): the 403 can outlast the ~5-min retry window, so prepare
can still fail and need a manual re-run. The validated reliable fix
(host-proxies the native binary download, places it at ~/.local/bin/claude) is
documented in docs/hertzner_backlog.md as a follow-up.

Claude-Session: https://claude.ai/code/session_019m5WHxP4vmsoXaHUhQdY9e
@madarco madarco force-pushed the fix/hetzner-claude-install-retry branch from af41081 to b38c5f9 Compare June 27, 2026 12:29
@madarco madarco merged commit 0991c33 into nightly Jun 27, 2026
2 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant