Skip to content

investigate: Claude CLI session stays alive for ~36 min after final result event (idle-but-alive UX gap) #333

@nathanschram

Description

Symptom

During a live staging session on 2026-04-17, the user perceived my session as "stuck" — no visible response was coming. In reality, I had finished my turn and sent a final summary message at 07:43:35 UTC. The Claude CLI process then sat idle for ~36 minutes until the user sent a new prompt (which arrived at 08:20:27 UTC and triggered the old session to be cancelled first).

From the user's perspective, the UI looked unresponsive. From the logs, everything was "fine" — no crash, no hang, just a long idle window.

Evidence (staging logs, session 8e3fbd04-e7c2-46af-b088-929c18541a71, PID 1854247)

Claude Code session transcript (~/.claude/projects/-home-nathan-untether/8e3fbd04-…jsonl):

  • line 884 2026-04-17T07:43:05.098Z — final TodoWrite tool_use by the model
  • line 885 2026-04-17T07:43:05.101Z — tool_result
  • line 886 2026-04-17T07:43:35.124Z — final assistant text "Done. Local commit ready for your review…"
  • line 888 2026-04-17T07:43:35.305Zpr-link telemetry entry
  • (gap: ~36 minutes 15 seconds)
  • line 890 2026-04-17T08:19:50.337Z — next entry; corresponds to the session being cancelled + resumed

Untether staging logs (journalctl --user -u untether):

07:58:19.670 progress_edits.stall_detected
  cpu_active=True last_event_type=result
  seconds_since_last_event=914.5 stall_warn_count=1
  process_state=S last_action='note:update todos (done)'
  recent_events=[(52963.2,'user'),(52972.0,'assistant'),(52972.1,'user'),(53002.1,'assistant'),(53002.2,'result')]
  tcp_established=108 tcp_total=197 rss_kb=403832
  child_pids=[1854274, 1854278, 1854286, 1854646, 1854746, 1854774, 1854872, 1855015, 1921232]
...
08:19:50.592 session.summary cancelled=True peak_idle_seconds=2175.2 stall_warnings=8

Eight stall warnings fired over ~36 minutes. Every one was suppressed by progress_edits.stall_children_active_suppressed because child MCP servers (brave-search, trello, chrome-devtools, pal, apify, github-copilot) stayed CPU-active throughout.

Root cause

The Claude Code CLI is invoked with --permission-mode plan --permission-prompt-tool stdio --input-format stream-json --output-format stream-json --verbose — i.e. bidirectional SDK-style. After emitting a result event for a turn, the CLI stays alive waiting for the next user message on stdin. This is intentional so multi-turn sessions don't need to re-spawn.

Untether keeps the subprocess alive to preserve the control-channel plumbing (_SESSION_STDIN, approval registries, etc.). Stall detection exists but is deliberately suppressed while child processes are CPU-active (active_children_suppressed at runner_bridge.py), because those children are usually legitimate MCP servers doing background work during long tool turns.

The gap: once the model has emitted a result event, further child-process CPU activity is no longer evidence of forward progress — it's just MCP servers idling with heartbeats/polling/GC. The stall heuristic doesn't distinguish "child CPU during a turn" from "child CPU between turns."

Resource cost during the idle window is non-trivial: PID 1854247 held 395 MB RSS, 124 FDs, and up to 206 TCP sockets (across child MCP servers) while doing no productive work.

Why the user perceived "stuck"

The UX signal is ambiguous. After the final message, the Telegram rendering doesn't clearly distinguish:

  • "Turn complete; waiting for your next prompt" (✓ what actually happened)
  • "Still processing; wait for me to finish" (✗ what it can look like)

There's no footer change, no explicit "idle" indicator, no change in the progress message — just silence. If the user's expectation is "Claude will continue on its own toward a larger goal," silence feels like stuck.

Scope

This is a Claude-only issue because Claude is the only engine that runs in permission-mode bidirectional SDK protocol. Codex/OpenCode/Pi/Gemini/AMP currently run in one-shot modes where the CLI exits after the result event.

Related to but distinct from #322 (stuck-after-MCP-tool_result). That was a genuine hang inside the CLI; this is a "CLI idle, process alive, behaving as designed."

Candidate fixes (pick after more investigation)

Option A — post-result idle timeout (preferred): after a result event, if no new user input arrives within post_result_idle_timeout (e.g. default 10–15 min), close stdin on the Claude subprocess so it exits gracefully. User can still send follow-up messages; a new session spawns for the next turn. Trade-off: loses in-memory session state (approval registries, _SESSION_STDIN), but Claude's --resume token survives so continuity is preserved. Configurable via [watchdog] post_result_idle_timeout.

Option B — active-children-aware stall semantics: after the most recent event is result (or completion), drop the active_children_suppressed pass and let the normal stall path fire. Treats post-turn idle as a genuine stall after the existing tool_timeout threshold. Lower risk of interrupting legitimate in-turn work.

Option C — UX signal: edit the final message (or add a small suffix) to indicate "turn complete, waiting for your next prompt." Doesn't reclaim resources but removes the ambiguity. Could be gated by a [footer] show_idle_indicator option. Cheap to implement, doesn't address the resource-waste side.

Option D — hybrid: A + C. Visible "waiting" indicator immediately after result; SIGTERM/stdin-close at post_result_idle_timeout. Best UX but two moving parts.

Research before implementing

  1. Measure actual resource cost of an idle post-result session across a representative sample (RSS, FD count, TCP sockets, CPU ticks). Determines how urgent the fix is.
  2. Confirm --resume reliably survives a full CLI exit including lingering MCP servers — if not, Option A has caveats around control-channel state.
  3. Check whether upstream Claude Code has any idle_timeout / exit_after_turn flag we could leverage instead of closing stdin from our side.
  4. Confirm the observed behaviour reproduces with simpler MCP configurations (isolate whether mcp-remote or any specific MCP is artificially keeping the process "CPU-active").

Acceptance criteria

  • A post-turn idle window (10–15 min default, configurable) after which Untether actively ends the Claude subprocess.
  • Stall warnings no longer fire for idle-after-result sessions — they're replaced by a clean shutdown log.
  • User-visible signal in Telegram when the turn is genuinely complete (footer indicator, explicit "waiting" state, or similar).
  • Regression test covering the exact scenario: emit a result event, advance the clock past the configured idle timeout with no new user input, assert the subprocess is terminated cleanly.

Related


Investigation date: 2026-04-17
Staging version: v0.35.1 (@hetz_lba1_bot)
Dev version: feature/330-per-cron-permission-mode (@untether_dev_bot)
Session that triggered: 8e3fbd04-e7c2-46af-b088-929c18541a71

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingengine:claudeClaude Code CLI (Anthropic)

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions