Symptom
During a live staging session on 2026-04-17, the user perceived my session as "stuck" — no visible response was coming. In reality, I had finished my turn and sent a final summary message at 07:43:35 UTC. The Claude CLI process then sat idle for ~36 minutes until the user sent a new prompt (which arrived at 08:20:27 UTC and triggered the old session to be cancelled first).
From the user's perspective, the UI looked unresponsive. From the logs, everything was "fine" — no crash, no hang, just a long idle window.
Evidence (staging logs, session 8e3fbd04-e7c2-46af-b088-929c18541a71, PID 1854247)
Claude Code session transcript (~/.claude/projects/-home-nathan-untether/8e3fbd04-…jsonl):
- line 884: 2026-04-17T07:43:05.098Z — final TodoWrite tool_use by the model
- line 885: 2026-04-17T07:43:05.101Z — tool_result
- line 886: 2026-04-17T07:43:35.124Z — final assistant text "Done. Local commit ready for your review…"
- line 888: 2026-04-17T07:43:35.305Z — pr-link telemetry entry
- (gap: ~36 minutes 15 seconds)
- line 890: 2026-04-17T08:19:50.337Z — next entry; corresponds to the session being cancelled + resumed
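The idle window can be recomputed from the transcript timestamps listed above. This is a minimal Python sketch: the helper names (`parse_ts`, `largest_gap`) are illustrative and not part of Untether, and the timestamps are copied by hand from the list rather than read from the JSONL file.

```python
from datetime import datetime

# Timestamps copied from transcript lines 884, 885, 886, 888 and 890.
TRANSCRIPT_TIMESTAMPS = [
    "2026-04-17T07:43:05.098Z",
    "2026-04-17T07:43:05.101Z",
    "2026-04-17T07:43:35.124Z",
    "2026-04-17T07:43:35.305Z",
    "2026-04-17T08:19:50.337Z",
]


def parse_ts(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp with a trailing 'Z' (UTC)."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def largest_gap(timestamps):
    """Return (gap_seconds, start, end) for the longest pause between entries."""
    parsed = [parse_ts(t) for t in timestamps]
    return max(((b - a).total_seconds(), a, b) for a, b in zip(parsed, parsed[1:]))


gap_s, start, end = largest_gap(TRANSCRIPT_TIMESTAMPS)
print(f"idle from {start.time()} to {end.time()}: {gap_s:.1f} s")
```

The longest gap is about 2175 s (36 min 15 s), between the pr-link telemetry entry and the cancel/resume entry, which lines up with peak_idle_seconds=2175.2 in the session summary below.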
Untether staging logs (journalctl --user -u untether):
07:58:19.670 progress_edits.stall_detected
cpu_active=True last_event_type=result
seconds_since_last_event=914.5 stall_warn_count=1
process_state=S last_action='note:update todos (done)'
recent_events=[(52963.2,'user'),(52972.0,'assistant'),(52972.1,'user'),(53002.1,'assistant'),(53002.2,'result')]
tcp_established=108 tcp_total=197 rss_kb=403832
child_pids=[1854274, 1854278, 1854286, 1854646, 1854746, 1854774, 1854872, 1855015, 1921232]
...
08:19:50.592 session.summary cancelled=True peak_idle_seconds=2175.2 stall_warnings=8
Eight stall warnings fired over ~36 minutes. Every one was suppressed by progress_edits.stall_children_active_suppressed because child MCP servers (brave-search, trello, chrome-devtools, pal, apify, github-copilot) stayed CPU-active throughout.
Root cause
The Claude Code CLI is invoked with --permission-mode plan --permission-prompt-tool stdio --input-format stream-json --output-format stream-json --verbose — i.e. bidirectional SDK-style. After emitting a result event for a turn, the CLI stays alive waiting for the next user message on stdin. This is intentional so multi-turn sessions don't need to re-spawn.
Untether keeps the subprocess alive to preserve the control-channel plumbing (_SESSION_STDIN, approval registries, etc.). Stall detection exists but is deliberately suppressed while child processes are CPU-active (the active_children_suppressed path in runner_bridge.py), because those children are usually legitimate MCP servers doing background work during long tool turns.
The gap: once the model has emitted a result event, further child-process CPU activity is no longer evidence of forward progress — it's just MCP servers idling with heartbeats/polling/GC. The stall heuristic doesn't distinguish "child CPU during a turn" from "child CPU between turns."
Resource cost during the idle window is non-trivial: PID 1854247 held 395 MB RSS, 124 FDs, and up to 206 TCP sockets (across child MCP servers) while doing no productive work.
Why the user perceived "stuck"
The UX signal is ambiguous. After the final message, the Telegram rendering doesn't clearly distinguish:
- "Turn complete; waiting for your next prompt" (✓ what actually happened)
- "Still processing; wait for me to finish" (✗ what it can look like)
There's no footer change, no explicit "idle" indicator, no change in the progress message — just silence. If the user's expectation is "Claude will continue on its own toward a larger goal," silence feels like stuck.
Scope
This is a Claude-only issue because Claude is the only engine that runs in the bidirectional, permission-mode SDK protocol. Codex/OpenCode/Pi/Gemini/AMP currently run in one-shot modes where the CLI exits after the result event.
Related to but distinct from #322 (stuck-after-MCP-tool_result). That was a genuine hang inside the CLI; here the CLI is idle, the process is alive, and everything is behaving as designed.
Candidate fixes (pick after more investigation)
Option A — post-result idle timeout (preferred): after a result event, if no new user input arrives within post_result_idle_timeout (e.g. default 10–15 min), close stdin on the Claude subprocess so it exits gracefully. User can still send follow-up messages; a new session spawns for the next turn. Trade-off: loses in-memory session state (approval registries, _SESSION_STDIN), but Claude's --resume token survives so continuity is preserved. Configurable via [watchdog] post_result_idle_timeout.
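A rough sketch of how Option A could look with asyncio. Everything here is hypothetical: `close_stdin_after_idle`, `grace_s` and the `new_input` event are illustrative names, and the child is a stand-in Python one-liner that exits on stdin EOF, not the Claude CLI. Whether the real CLI exits cleanly on EOF is exactly the assumption Option A depends on and still needs confirming.

```python
import asyncio
import sys


async def close_stdin_after_idle(proc, new_input, idle_timeout_s, grace_s=5.0):
    """After a result event, wait for new user input; on timeout close the
    child's stdin and wait for it to exit. Returns True if it was shut down."""
    try:
        await asyncio.wait_for(new_input.wait(), timeout=idle_timeout_s)
        return False  # input arrived in time; keep the session alive
    except asyncio.TimeoutError:
        pass
    proc.stdin.close()  # EOF on stdin is the graceful shutdown signal
    try:
        await asyncio.wait_for(proc.wait(), timeout=grace_s)
    except asyncio.TimeoutError:
        proc.terminate()  # escalate to SIGTERM if EOF alone is not enough
        await proc.wait()
    return True


async def demo(idle_timeout_s=0.2):
    # Stand-in child (assumption): reads stdin until EOF, then exits with 0.
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "import sys; sys.stdin.read()",
        stdin=asyncio.subprocess.PIPE,
    )
    new_input = asyncio.Event()  # would be set when the user sends a prompt
    shut_down = await close_stdin_after_idle(proc, new_input, idle_timeout_s)
    return shut_down, proc.returncode
```

Running `asyncio.run(demo())` lets the timeout expire, closes stdin, and reaps the stand-in child. The SIGTERM fallback only triggers if the child ignores EOF for longer than `grace_s`.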
Option B — active-children-aware stall semantics: after the most recent event is result (or completion), drop the active_children_suppressed pass and let the normal stall path fire. Treats post-turn idle as a genuine stall after the existing tool_timeout threshold. Lower risk of interrupting legitimate in-turn work.
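Option B boils down to making the suppression decision depend on the last event type. A minimal sketch, assuming the stall check has access to the fields seen in the stall_detected log (last_event_type, seconds_since_last_event, and whether children are CPU-active). `StallSnapshot`, `should_suppress_stall` and `stall_fires` are hypothetical names, not the actual code in runner_bridge.py.

```python
from dataclasses import dataclass

# Event types that mark the end of a turn (taken from the Option B description).
TURN_END_EVENTS = {"result", "completion"}


@dataclass
class StallSnapshot:
    last_event_type: str
    seconds_since_last_event: float
    children_cpu_active: bool


def should_suppress_stall(snap: StallSnapshot) -> bool:
    """Child CPU activity only counts as progress while a turn is in flight."""
    if snap.last_event_type in TURN_END_EVENTS:
        return False  # between turns: MCP heartbeats/polling are not progress
    return snap.children_cpu_active


def stall_fires(snap: StallSnapshot, tool_timeout_s: float) -> bool:
    """True when the idle time exceeds the threshold and nothing suppresses it."""
    if snap.seconds_since_last_event < tool_timeout_s:
        return False
    return not should_suppress_stall(snap)


# Values from the 07:58:19 stall_detected log entry; the 600 s threshold is
# a placeholder, not the real tool_timeout default.
post_turn = StallSnapshot("result", 914.5, True)
in_turn = StallSnapshot("assistant", 914.5, True)
print(stall_fires(post_turn, 600.0), stall_fires(in_turn, 600.0))
```

With this rule the observed post-result state is no longer suppressed, while the same child activity during an in-flight turn still is.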
Option C — UX signal: edit the final message (or add a small suffix) to indicate "turn complete, waiting for your next prompt." Doesn't reclaim resources but removes the ambiguity. Could be gated by a [footer] show_idle_indicator option. Cheap to implement, doesn't address the resource-waste side.
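For Option C, the rendering change could be as small as a suffix on the footer once the last event is result. A sketch only: `render_footer` and `IDLE_SUFFIX` are made-up names, the separator is arbitrary, and the real Telegram rendering code may be structured quite differently.

```python
# Hypothetical suffix text, taken from the Option C wording.
IDLE_SUFFIX = "turn complete, waiting for your next prompt"


def render_footer(footer: str, last_event_type: str,
                  show_idle_indicator: bool = True) -> str:
    """Append an idle indicator once the turn has ended with a result event."""
    if not show_idle_indicator or last_event_type != "result":
        return footer  # mid-turn, or indicator disabled: leave footer as-is
    return f"{footer} | {IDLE_SUFFIX}" if footer else IDLE_SUFFIX


print(render_footer("claude", "result"))
```

The `show_idle_indicator` argument stands in for the proposed [footer] show_idle_indicator option.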
Option D — hybrid: A + C. Visible "waiting" indicator immediately after result; SIGTERM/stdin-close at post_result_idle_timeout. Best UX but two moving parts.
Research before implementing
- Measure actual resource cost of an idle post-result session across a representative sample (RSS, FD count, TCP sockets, CPU ticks). Determines how urgent the fix is.
- Confirm --resume reliably survives a full CLI exit including lingering MCP servers — if not, Option A has caveats around control-channel state.
- Check whether upstream Claude Code has any idle_timeout / exit_after_turn flag we could leverage instead of closing stdin from our side.
- Confirm the observed behaviour reproduces with simpler MCP configurations (isolate whether mcp-remote or any specific MCP is artificially keeping the process "CPU-active").
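For the resource-cost measurement, a small sampler over /proc would probably be enough to start with. The sketch below is Linux-only and covers just RSS and FD count; TCP sockets and CPU ticks would need extra parsing and are left out. `parse_vm_rss_kb` and `sample_process` are illustrative helpers, not existing Untether functions.

```python
import os
from pathlib import Path
from typing import Optional


def parse_vm_rss_kb(status_text: str) -> Optional[int]:
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # kernel threads and zombies have no VmRSS line


def sample_process(pid: int) -> dict:
    """Snapshot RSS and open-FD count for one process (Linux /proc only)."""
    proc_dir = Path("/proc") / str(pid)
    rss_kb = parse_vm_rss_kb((proc_dir / "status").read_text())
    fd_count = len(os.listdir(proc_dir / "fd"))
    return {"pid": pid, "rss_kb": rss_kb, "fd_count": fd_count}
```

Calling `sample_process` on the CLI PID and each entry of child_pids at a fixed interval during an idle post-result window would give per-process numbers to compare against the single-session figures quoted above.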
Acceptance criteria
- A post-turn idle window (10–15 min default, configurable) after which Untether actively ends the Claude subprocess.
- Stall warnings no longer fire for idle-after-result sessions — they're replaced by a clean shutdown log.
- User-visible signal in Telegram when the turn is genuinely complete (footer indicator, explicit "waiting" state, or similar).
- Regression test covering the exact scenario: emit a result event, advance the clock past the configured idle timeout with no new user input, assert the subprocess is terminated cleanly.
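The regression test in the last criterion could be shaped roughly like this, using an injectable clock so no real waiting is needed. `PostResultIdleWatchdog`, `FakeClock` and the 900 s timeout are all hypothetical; the real test would drive whatever component ends up owning the timeout.

```python
class PostResultIdleWatchdog:
    """Hypothetical watchdog: terminates once idle after a result event."""

    def __init__(self, idle_timeout_s, clock, terminate):
        self._idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._terminate = terminate
        self._idle_since = None
        self._terminated = False

    def on_event(self, event_type):
        # Only a result event starts the idle window; anything else resets it.
        self._idle_since = self._clock() if event_type == "result" else None

    def tick(self):
        """Return True if this tick terminated the subprocess."""
        if self._terminated or self._idle_since is None:
            return False
        if self._clock() - self._idle_since < self._idle_timeout_s:
            return False
        self._terminate()
        self._terminated = True
        return True


class FakeClock:
    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


def test_idle_after_result_terminates_subprocess():
    clock, calls = FakeClock(), []
    wd = PostResultIdleWatchdog(900.0, clock, lambda: calls.append("terminate"))
    wd.on_event("assistant")
    wd.on_event("result")
    clock.advance(899.0)
    assert wd.tick() is False      # still inside the idle window
    clock.advance(2.0)
    assert wd.tick() is True       # past the timeout with no new input
    assert calls == ["terminate"]
    assert wd.tick() is False      # never terminates twice
```

A companion case where a user event arrives before the timeout (and no termination happens) would cover the other half of the behaviour.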
Related
- BotResponseTimeoutError (another Claude-session UX issue)
Investigation date: 2026-04-17
Staging version: v0.35.1 (@hetz_lba1_bot)
Dev version: feature/330-per-cron-permission-mode (@untether_dev_bot)
Session that triggered: 8e3fbd04-e7c2-46af-b088-929c18541a71