Skip to content

Confirm OAuth success on the agent before reporting login exit#259

Merged
mathisdittrich merged 1 commit into
mainfrom
fix/claude-login-await-success
May 20, 2026
Merged

Confirm OAuth success on the agent before reporting login exit#259
mathisdittrich merged 1 commit into
mainfrom
fix/claude-login-await-success

Conversation

@mathisdittrich
Copy link
Copy Markdown
Collaborator

@mathisdittrich mathisdittrich commented May 20, 2026

Summary

  • The box-agent now confirms a Claude / Codex login succeeded itself, before telling the cloud the pty exited
  • When the login pty closes, the agent polls claude auth status (resp. codex login status) directly at 0.5s intervals for up to 15s
  • On success it emits claude_authed / codex_authed and skips the *_login_exited event; on timeout it emits *_login_exited so the cloud fails fast
  • Each login attempt now emits exactly one terminal event

The bug

Cloud-driven Claude login occasionally showed the user "Try again" even though the sign-in had actually worked.

Root cause is a race between two components that don't share timing:

  1. Agent auth-poll loop (_auth_poll_loop) runs at 1s cadence for the first 60s of agent uptime, then 15s. Every cloud-driven login happens well past that 60s (box has booted + gone through Telegram setup first), so it's on the 15s cadence. It also only emits claude_authed on a state flip.
  2. Cloud exit watchdog starts a 10s timer on claude_login_exited and calls set_failed("…Try again…") if the login state hasn't flipped to done.

Timeline for a typical box: user pastes code → claude writes ~/.claude.json + exits (~2s) → agent sends claude_login_exited → cloud starts 10s watchdog → agent's next auth-poll tick is up to 15s away → watchdog fires set_failed at T+10s, the real claude_authed lands at T+15s. The user sees "Try again" for a few seconds before the dialog advances.

The fix

Move success-detection to the agent, which sits on the same filesystem claude writes its credentials to — the right place to own the decision. The agent's tight 0.5s post-exit poll confirms the credential write within a second or two of it happening, independent of the lazy 15s general poll. The general poll keeps its 15s cadence; it only exists for logout detection, where latency is harmless.

Pairs with a cloud-side PR that drops the now-unnecessary exit watchdog and treats *_login_exited as a definitive failure signal.

Files

  • agent/box_agent.py — new _claude_login_await_success + _codex_login_await_success; called from the two pty-exit handlers

Test plan

  • ruff check agent/ clean, py_compile clean
  • Staging: complete a Claude OAuth on a box that's been up >60s, confirm the dialog advances with no "Try again" flash
  • Staging: same for Codex device-auth
  • Negative: paste a bad code / cancel mid-flow, confirm claude_login_exited still fires within ~15s and the cloud surfaces the failure

🤖 Generated with Claude Code


Summary by cubic

The agent now verifies Claude/Codex OAuth success locally before reporting a login exit. This removes the race that briefly showed “Try again” after a successful sign-in.

  • Bug Fixes
    • After the login pty closes, poll claude auth status / codex login status every 0.5s for up to 15s.
    • On success, emit claude_authed / codex_authed and skip *_login_exited; on timeout, emit *_login_exited to fail fast.
    • Ensure exactly one terminal event per attempt; keep the general 15s auth poll for logout only.
    • Implemented in agent/box_agent.py via _claude_login_await_success and _codex_login_await_success.

Written for commit 7e5430d. Summary will update on new commits. Review in cubic

The cloud's claude_login_exited / codex_login_exited watchdog was
racing the box-agent's general auth-poll loop. That poll runs at a
15s cadence once the box is past its first minute of uptime — which
is every cloud-driven login, since the box has been alive through
provisioning + Telegram setup already. Worst-case ordering put the
`claude_authed` event ~15s after pty exit, past the cloud's 10s
watchdog, so the user saw a spurious "Try again" on a sign-in that
actually worked.

Fix: make the agent confirm success itself. When the login pty
closes, poll `claude auth status` (resp. `codex login status`)
directly at a 0.5s cadence for up to 15s. On success emit
`claude_authed` / `codex_authed` and skip the exited event; on
timeout emit the exited event so the cloud can fail fast.

The agent sits on the same filesystem claude writes ~/.claude.json
to, so it is the right place to own this decision. Each attempt now
emits exactly one terminal event and the cloud has no race to lose.

The general auth-poll loop keeps its lazy 15s cadence — it only
exists for logout detection, where latency is harmless.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 1 file

Re-trigger cubic

@mathisdittrich mathisdittrich merged commit 34b844f into main May 20, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant