Confirm OAuth success on the agent before reporting login exit#259
Merged
Conversation
The cloud's claude_login_exited / codex_login_exited watchdog was racing the box-agent's general auth-poll loop. That poll runs at a 15s cadence once the box is past its first minute of uptime — which is every cloud-driven login, since the box has been alive through provisioning + Telegram setup already. Worst-case ordering put the `claude_authed` event ~15s after pty exit, past the cloud's 10s watchdog, so the user saw a spurious "Try again" on a sign-in that actually worked. Fix: make the agent confirm success itself. When the login pty closes, poll `claude auth status` (resp. `codex login status`) directly at a 0.5s cadence for up to 15s. On success emit `claude_authed` / `codex_authed` and skip the exited event; on timeout emit the exited event so the cloud can fail fast. The agent sits on the same filesystem claude writes ~/.claude.json to, so it is the right place to own this decision. Each attempt now emits exactly one terminal event and the cloud has no race to lose. The general auth-poll loop keeps its lazy 15s cadence — it only exists for logout detection, where latency is harmless. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
claude auth status(resp.codex login status) directly at 0.5s intervals for up to 15sclaude_authed/codex_authedand skips the*_login_exitedevent; on timeout it emits*_login_exitedso the cloud fails fastThe bug
Cloud-driven Claude login occasionally showed the user "Try again" even though the sign-in had actually worked.
Root cause is a race between two components that don't share timing:
_auth_poll_loop) runs at 1s cadence for the first 60s of agent uptime, then 15s. Every cloud-driven login happens well past that 60s (box has booted + gone through Telegram setup first), so it's on the 15s cadence. It also only emitsclaude_authedon a state flip.claude_login_exitedand callsset_failed("…Try again…")if the login state hasn't flipped todone.Timeline for a typical box: user pastes code → claude writes
~/.claude.json+ exits (~2s) → agent sendsclaude_login_exited→ cloud starts 10s watchdog → agent's next auth-poll tick is up to 15s away → watchdog firesset_failedat T+10s, the realclaude_authedlands at T+15s. The user sees "Try again" for a few seconds before the dialog advances.The fix
Move success-detection to the agent, which sits on the same filesystem claude writes its credentials to — the right place to own the decision. The agent's tight 0.5s post-exit poll confirms the credential write within a second or two of it happening, independent of the lazy 15s general poll. The general poll keeps its 15s cadence; it only exists for logout detection, where latency is harmless.
Pairs with a cloud-side PR that drops the now-unnecessary exit watchdog and treats
*_login_exitedas a definitive failure signal.Files
agent/box_agent.py— new_claude_login_await_success+_codex_login_await_success; called from the two pty-exit handlersTest plan
ruff check agent/clean,py_compilecleanclaude_login_exitedstill fires within ~15s and the cloud surfaces the failure🤖 Generated with Claude Code
Summary by cubic
The agent now verifies Claude/Codex OAuth success locally before reporting a login exit. This removes the race that briefly showed “Try again” after a successful sign-in.
claude auth status/codex login statusevery 0.5s for up to 15s.claude_authed/codex_authedand skip*_login_exited; on timeout, emit*_login_exitedto fail fast.agent/box_agent.pyvia_claude_login_await_successand_codex_login_await_success.Written for commit 7e5430d. Summary will update on new commits. Review in cubic