Skip to content

fix(claude): #333 — post-result hang root-cause fix (rc18 follow-up to #549)#553

Merged
Nathan Schram (nathanschram) merged 3 commits into
devfrom
fix/333-post-result-hang
May 17, 2026
Merged

fix(claude): #333 — post-result hang root-cause fix (rc18 follow-up to #549)#553
Nathan Schram (nathanschram) merged 3 commits into
devfrom
fix/333-post-result-hang

Conversation

@nathanschram
Copy link
Copy Markdown
Member

Summary

PR #549 (rc17) added entry/exit/tick instrumentation to _post_result_idle_watchdog. That instrumentation caught the #333 post-result hang in the wild on channelo session 8876c902 (2026-05-17): 26.6 min limbo between passes 5 and 6 of a /monitor watch.

Root cause: when Claude Code v2.1.143 closes stdout but keeps the subprocess alive, _post_result_idle_watchdog exited early via task_exited reason=reader_done, bypassing the 600 s countdown. Stall-detector suppression cascades (post_result + MCP-heartbeat-driven children-active) hid the limbo from auto-cancel indefinitely. Only /cancel broke it.

What changed

Tier 1 — _post_result_subcountdown (claude.py): when reader_done fires while proc.returncode is None, enter a stdout-closed subcountdown. Poll for natural exit, defer if pending control_request / ask_question, SIGTERM the process group after timeout_s, 5 s grace, SIGKILL if still alive. New task_exited reasons: reader_done_but_alive_timeout, subprocess_exited_during_subcountdown.

Tier 2 — stall-detector semantics (runner_bridge.py): _POST_RESULT_LIMBO_THRESHOLD_S = 660.0. When post-result idle age > threshold AND no other expected-wait flag is set, stop suppressing auto-cancel. Defense-in-depth for the case where Tier 1 missed.

Tier 3 — runner.limbo_detected warning (claude.py): fired 30 s into the subcountdown when subprocess is still alive and no pending state holds the session open. Auto-filed as auto:error-report GitHub issues by untether-issue-watcher.

Task 4a — subprocess state machine (runner.py + claude.py): JsonlStreamState.lifecycle_state + _transition_lifecycle() helper. States emitted from the watchdog: reader_eof, subcountdown, limbo, sigterm_sent, sigkill_sent, exited. Permanent canary for future hang-class issues.

Files

  • src/untether/runner.py — JsonlStreamState fields + _transition_lifecycle helper
  • src/untether/runners/claude.py — Tier 1 subcountdown, Tier 3 limbo warning, state machine integration, signal import
  • src/untether/runner_bridge.py — Tier 2 limbo threshold + _post_result_idle_age_seconds helper
  • tests/test_claude_runner.py — 4 new tests for Tier 1 + Task 4a
  • tests/test_exec_bridge.py — 3 new tests for Tier 2

Test plan

  • uv run pytest — 2678 passed, 2 skipped (was 2671 + 2 on dev; +7 new tests)
  • uv run pytest tests/test_claude_runner.py tests/test_exec_bridge.py tests/test_claude_control.py -x -q clean
  • uv run ruff format --check src/ tests/ clean
  • uv run ruff check src/ clean
  • Tier 6 integration on @untether_dev_bot: trigger a long MCP-heavy session, observe subprocess.state.* transitions in journalctl --user -u untether-dev; deliberate stdout-EOF-while-alive reproduction is the deterministic gate in unit tests. (Pending rc18 build.)

Closes #333.

🤖 Generated with Claude Code

…2 stall semantics + Tier 3 limbo telemetry) + Task 4a state machine

Root cause (confirmed via rc17 instrumentation + channelo session
8876c902 reproduction 2026-05-17, 26.6 min limbo): when Claude Code
v2.1.143 closes stdout while keeping the subprocess alive, the
_post_result_idle_watchdog exited early via task_exited reason=reader_done,
bypassing the 600 s post-result-idle countdown. Stall-detector suppression
cascades (post_result + active children from MCP heartbeats) hid the
limbo from auto-cancel indefinitely.

Tier 1 (claude.py) — _post_result_subcountdown:
- When reader_done fires while proc.returncode is None, enter a
  stdout-closed subcountdown instead of returning. Poll for natural
  exit, defer if pending control_request / ask_question, SIGTERM the
  process group after timeout_s, 5 s grace, SIGKILL if still alive.
- New task_exited reasons: reader_done_but_alive_timeout,
  subprocess_exited_during_subcountdown.
- New info logs: claude.post_result_idle.reader_done_but_alive,
  subcountdown_deferred, sigterm_after_timeout, sigkill_after_grace.

Tier 2 (runner_bridge.py) — defense-in-depth:
- _POST_RESULT_LIMBO_THRESHOLD_S = 660 s (post-result idle timeout + grace).
- When post-result idle age exceeds the limbo threshold AND no other
  expected-wait flag (ScheduleWakeup / Monitor / bash) is set, stop
  suppressing auto-cancel — the watchdog missed an edge case.
- One-shot warning: progress_edits.post_result_limbo_detected.

Tier 3 (claude.py) — runner.limbo_detected warning:
- Fired 30 s into the subcountdown when subprocess is still alive and
  no pending state holds the session open.
- Picked up automatically by untether-issue-watcher → auto:error-report
  GitHub issues for future regressions.

Task 4a (runner.py + claude.py) — subprocess lifecycle state machine:
- JsonlStreamState.lifecycle_state + lifecycle_state_entered_at.
- JsonlSubprocessRunner._transition_lifecycle() helper emits
  ``subprocess.state.<name>`` info logs at every transition.
- States emitted by the watchdog: reader_eof, subcountdown, limbo,
  sigterm_sent, sigkill_sent, exited. Other transitions (streaming,
  idle_post_result, tool_active) deferred to a future patch.

Tests (7 new):
- test_claude_runner.py:
  - test_333_reader_done_but_alive_triggers_subcountdown
  - test_333_subprocess_exits_during_subcountdown
  - test_333_subcountdown_defers_on_pending_request
  - test_333_lifecycle_state_transitions_logged
- test_exec_bridge.py:
  - test_333_post_result_limbo_lets_auto_cancel_fire
  - test_333_post_result_below_limbo_threshold_still_suppresses
  - test_333_post_result_with_pending_wakeup_keeps_suppression

Full suite: 2678 passed, 2 skipped (no regressions from rc17 baseline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 17, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 633d2dca-bdf6-4bb3-9bd3-3718b20db45d

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/333-post-result-hang

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

In fresh CI containers ``time.monotonic()`` returns small values (~50s),
but the test's fake clock starts at 1000.0. Computing the ScheduleWakeup
deadline from real monotonic time made it look already-expired against
the fake clock — so _has_pending_wakeup returned None in CI, _real_pending
went False, and auto-cancel fired (test asserted not fired).

Express the deadline in the fake clock's frame (1010 + 60 = 1070) so the
comparison is consistent regardless of host monotonic baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nathanschram Nathan Schram (nathanschram) merged commit 1250bbe into dev May 17, 2026
21 checks passed
@nathanschram Nathan Schram (nathanschram) deleted the fix/333-post-result-hang branch May 17, 2026 05:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant