fix(claude): #333 - post-result hang root-cause fix (rc18 follow-up to #549) #553
Merged
…2 stall semantics + Tier 3 limbo telemetry) + Task 4a state machine

Root cause (confirmed via rc17 instrumentation + channelo session 8876c902 reproduction 2026-05-17, 26.6 min limbo): when Claude Code v2.1.143 closes stdout while keeping the subprocess alive, the _post_result_idle_watchdog exited early via task_exited reason=reader_done, bypassing the 600 s post-result-idle countdown. Stall-detector suppression cascades (post_result + active children from MCP heartbeats) hid the limbo from auto-cancel indefinitely.

Tier 1 (claude.py): _post_result_subcountdown
- When reader_done fires while proc.returncode is None, enter a stdout-closed subcountdown instead of returning. Poll for natural exit, defer if pending control_request / ask_question, SIGTERM the process group after timeout_s, 5 s grace, SIGKILL if still alive.
- New task_exited reasons: reader_done_but_alive_timeout, subprocess_exited_during_subcountdown.
- New info logs: claude.post_result_idle.reader_done_but_alive, subcountdown_deferred, sigterm_after_timeout, sigkill_after_grace.

Tier 2 (runner_bridge.py): defense-in-depth
- _POST_RESULT_LIMBO_THRESHOLD_S = 660 s (post-result idle timeout + grace).
- When post-result idle age exceeds the limbo threshold AND no other expected-wait flag (ScheduleWakeup / Monitor / bash) is set, stop suppressing auto-cancel; the watchdog missed an edge case.
- One-shot warning: progress_edits.post_result_limbo_detected.

Tier 3 (claude.py): runner.limbo_detected warning
- Fired 30 s into the subcountdown when subprocess is still alive and no pending state holds the session open.
- Picked up automatically by untether-issue-watcher → auto:error-report GitHub issues for future regressions.

Task 4a (runner.py + claude.py): subprocess lifecycle state machine
- JsonlStreamState.lifecycle_state + lifecycle_state_entered_at.
- JsonlSubprocessRunner._transition_lifecycle() helper emits ``subprocess.state.<name>`` info logs at every transition.
- States emitted by the watchdog: reader_eof, subcountdown, limbo, sigterm_sent, sigkill_sent, exited. Other transitions (streaming, idle_post_result, tool_active) deferred to a future patch.

Tests (7 new):
- test_claude_runner.py:
  - test_333_reader_done_but_alive_triggers_subcountdown
  - test_333_subprocess_exits_during_subcountdown
  - test_333_subcountdown_defers_on_pending_request
  - test_333_lifecycle_state_transitions_logged
- test_exec_bridge.py:
  - test_333_post_result_limbo_lets_auto_cancel_fire
  - test_333_post_result_below_limbo_threshold_still_suppresses
  - test_333_post_result_with_pending_wakeup_keeps_suppression

Full suite: 2678 passed, 2 skipped (no regressions from rc17 baseline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
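The escalation ladder described in this commit (poll for exit, defer while a request is pending, SIGTERM the process group, grace period, SIGKILL) can be sketched in a few lines. This is a simplified, synchronous illustration and not the code in claude.py: the function name, its parameters, the `on_state` callback and the `spawn_demo_child` helper are all hypothetical, and treating a deferral as "restart the countdown" is an assumption. It is POSIX-only and assumes the child runs in its own process group.

```python
import os
import signal
import subprocess
import sys
import time
from typing import Callable


def spawn_demo_child(sleep_s: float) -> subprocess.Popen:
    """Hypothetical helper: a child that closes stdout, then stays alive."""
    code = f"import os, time; os.close(1); time.sleep({sleep_s})"
    return subprocess.Popen(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        start_new_session=True,  # own process group, so killpg is safe
    )


def post_result_subcountdown(
    proc: subprocess.Popen,
    timeout_s: float,
    grace_s: float = 5.0,
    limbo_after_s: float = 30.0,
    poll_s: float = 0.05,
    has_pending_request: Callable[[], bool] = lambda: False,
    on_state: Callable[[str], None] = lambda name: None,
) -> str:
    """Supervise a child whose stdout reached EOF while it is still running.

    Returns a reason string modelled on the task_exited reasons named above.
    """
    on_state("subcountdown")
    started = time.monotonic()
    deadline = started + timeout_s
    limbo_reported = False
    while True:
        if proc.poll() is not None:
            on_state("exited")
            return "subprocess_exited_during_subcountdown"
        now = time.monotonic()
        pending = has_pending_request()
        if not pending and not limbo_reported and now - started >= limbo_after_s:
            on_state("limbo")  # one-shot signal, like runner.limbo_detected
            limbo_reported = True
        if now >= deadline:
            if not pending:
                break
            deadline = now + timeout_s  # assumption: deferral restarts the countdown
        time.sleep(poll_s)

    # Timed out with nothing pending: terminate the whole process group.
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)
    on_state("sigterm_sent")
    try:
        proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)
        on_state("sigkill_sent")
        proc.wait()
    on_state("exited")
    return "reader_done_but_alive_timeout"
```

Passing `on_state=states.append` with a short `timeout_s` against `spawn_demo_child(60.0)` records the same kind of transition trail that the ``subprocess.state.<name>`` logs are meant to provide, which is what makes a limbo visible after the fact.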
In fresh CI containers ``time.monotonic()`` returns small values (~50 s), but the test's fake clock starts at 1000.0. Computing the ScheduleWakeup deadline from real monotonic time made it look already expired against the fake clock, so _has_pending_wakeup returned None in CI, _real_pending went False, and auto-cancel fired (the test asserts that it does not fire). Express the deadline in the fake clock's frame (1010 + 60 = 1070) so the comparison is consistent regardless of the host's monotonic baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
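The frame mismatch is easy to reproduce in isolation. The sketch below is an illustration under assumptions, not the code in runner_bridge.py or the test suite: `FakeClock`, `Session`, `pending_wakeup_remaining` and `suppress_auto_cancel` are hypothetical names, only the ScheduleWakeup flag is modelled (Monitor and bash are left out), and the 660 s threshold is the one value taken from this PR.

```python
from dataclasses import dataclass
from typing import Callable, Optional

POST_RESULT_LIMBO_THRESHOLD_S = 660.0  # value stated in the PR


class FakeClock:
    """Manually advanced stand-in for time.monotonic()."""

    def __init__(self, start: float) -> None:
        self.now = start

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


@dataclass
class Session:
    result_at: Optional[float] = None        # clock reading when the result arrived
    wakeup_deadline: Optional[float] = None  # must be in the same clock frame


def pending_wakeup_remaining(
    session: Session, clock: Callable[[], float]
) -> Optional[float]:
    """Seconds until the wakeup deadline, or None if unset or already past."""
    if session.wakeup_deadline is None:
        return None
    remaining = session.wakeup_deadline - clock()
    return remaining if remaining > 0 else None


def suppress_auto_cancel(session: Session, clock: Callable[[], float]) -> bool:
    """Keep suppressing while a wait is expected or idle age is under the threshold."""
    if session.result_at is None:
        return False  # no post-result state, so this suppression does not apply
    if pending_wakeup_remaining(session, clock) is not None:
        return True   # an expected wait still holds the session open
    return clock() - session.result_at <= POST_RESULT_LIMBO_THRESHOLD_S


clock = FakeClock(1000.0)
clock.advance(10.0)                 # fake "now" is 1010.0
stale_result = clock() - 700.0      # idle for longer than the limbo threshold

# Deadline in the fake clock's frame: 1010 + 60 = 1070, still in the future.
consistent = Session(result_at=stale_result, wakeup_deadline=clock() + 60.0)

# Deadline from a host clock reading of about 50 s: 110, already "expired".
host_monotonic = 50.0               # simulated fresh-container baseline
mismatched = Session(result_at=stale_result, wakeup_deadline=host_monotonic + 60.0)
```

With these values, `suppress_auto_cancel(consistent, clock)` keeps suppressing, while `suppress_auto_cancel(mismatched, clock)` lets auto-cancel fire even though the wakeup was scheduled 60 s ahead. Injecting the clock as a callable is what makes it possible to keep both sides of the comparison in one frame.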
This was referenced May 17, 2026
Summary
PR #549 (rc17) added entry/exit/tick instrumentation to `_post_result_idle_watchdog`. That instrumentation caught the #333 post-result hang in the wild on channelo session `8876c902` (2026-05-17): 26.6 min limbo between passes 5 and 6 of a `/monitor` watch.

Root cause: when Claude Code v2.1.143 closes stdout but keeps the subprocess alive, `_post_result_idle_watchdog` exited early via `task_exited reason=reader_done`, bypassing the 600 s countdown. Stall-detector suppression cascades (post_result + MCP-heartbeat-driven children-active) hid the limbo from auto-cancel indefinitely. Only `/cancel` broke it.

What changed
- Tier 1, `_post_result_subcountdown` (claude.py): when `reader_done` fires while `proc.returncode is None`, enter a stdout-closed subcountdown. Poll for natural exit, defer if pending control_request / ask_question, SIGTERM the process group after `timeout_s`, 5 s grace, SIGKILL if still alive. New `task_exited` reasons: `reader_done_but_alive_timeout`, `subprocess_exited_during_subcountdown`.
- Tier 2, stall-detector semantics (runner_bridge.py): `_POST_RESULT_LIMBO_THRESHOLD_S = 660.0`. When post-result idle age > threshold AND no other expected-wait flag is set, stop suppressing auto-cancel. Defense-in-depth for the case where Tier 1 missed.
- Tier 3, `runner.limbo_detected` warning (claude.py): fired 30 s into the subcountdown when subprocess is still alive and no pending state holds the session open. Auto-filed as `auto:error-report` GitHub issues by `untether-issue-watcher`.
- Task 4a, subprocess state machine (runner.py + claude.py): `JsonlStreamState.lifecycle_state` + `_transition_lifecycle()` helper. States emitted from the watchdog: `reader_eof`, `subcountdown`, `limbo`, `sigterm_sent`, `sigkill_sent`, `exited`. Permanent canary for future hang-class issues.

Files
- `src/untether/runner.py`: JsonlStreamState fields + `_transition_lifecycle` helper
- `src/untether/runners/claude.py`: Tier 1 subcountdown, Tier 3 limbo warning, state machine integration, `signal` import
- `src/untether/runner_bridge.py`: Tier 2 limbo threshold + `_post_result_idle_age_seconds` helper
- `tests/test_claude_runner.py`: 4 new tests for Tier 1 + Task 4a
- `tests/test_exec_bridge.py`: 3 new tests for Tier 2

Test plan
- `uv run pytest`: 2678 passed, 2 skipped (was 2671 + 2 on `dev`; +7 new tests)
- `uv run pytest tests/test_claude_runner.py tests/test_exec_bridge.py tests/test_claude_control.py -x -q` clean
- `uv run ruff format --check src/ tests/` clean
- `uv run ruff check src/` clean
- `@untether_dev_bot`: trigger a long MCP-heavy session, observe `subprocess.state.*` transitions in `journalctl --user -u untether-dev`; deliberate stdout-EOF-while-alive reproduction is the deterministic gate in unit tests. (Pending rc18 build.)

Closes #333.
🤖 Generated with Claude Code