investigate: Claude CLI session stays alive for ~36 min after final `result` event (idle-but-alive UX gap)

## Symptom

During a live staging session on 2026-04-17, the user perceived my session as "stuck" — no visible response was coming. In reality, I had finished my turn and sent a final summary message at 07:43:35 UTC. The Claude CLI process then sat idle for ~36 minutes until the user sent a new prompt (which arrived at 08:20:27 UTC and triggered the old session to be cancelled first).

From the user's perspective, the UI looked unresponsive. From the logs, everything was "fine" — no crash, no hang, just a long idle window.

## Evidence (staging logs, session `8e3fbd04-e7c2-46af-b088-929c18541a71`, PID 1854247)

Claude Code session transcript (~/.claude/projects/-home-nathan-untether/8e3fbd04-…jsonl):

- **line 884** `2026-04-17T07:43:05.098Z` — final `TodoWrite` tool_use by the model
- **line 885** `2026-04-17T07:43:05.101Z` — tool_result
- **line 886** `2026-04-17T07:43:35.124Z` — final assistant text "Done. Local commit ready for your review…"
- **line 888** `2026-04-17T07:43:35.305Z` — `pr-link` telemetry entry
- (gap: ~36 minutes 15 seconds)
- **line 890** `2026-04-17T08:19:50.337Z` — next entry; corresponds to the session being cancelled + resumed

Untether staging logs (`journalctl --user -u untether`):

```
07:58:19.670 progress_edits.stall_detected
  cpu_active=True last_event_type=result
  seconds_since_last_event=914.5 stall_warn_count=1
  process_state=S last_action='note:update todos (done)'
  recent_events=[(52963.2,'user'),(52972.0,'assistant'),(52972.1,'user'),(53002.1,'assistant'),(53002.2,'result')]
  tcp_established=108 tcp_total=197 rss_kb=403832
  child_pids=[1854274, 1854278, 1854286, 1854646, 1854746, 1854774, 1854872, 1855015, 1921232]
...
08:19:50.592 session.summary cancelled=True peak_idle_seconds=2175.2 stall_warnings=8
```

Eight stall warnings fired over ~36 minutes. Every one was suppressed by `progress_edits.stall_children_active_suppressed` because child MCP servers (brave-search, trello, chrome-devtools, pal, apify, github-copilot) stayed CPU-active throughout.

## Root cause

The Claude Code CLI is invoked with `--permission-mode plan --permission-prompt-tool stdio --input-format stream-json --output-format stream-json --verbose` — i.e. bidirectional SDK-style. After emitting a `result` event for a turn, the CLI **stays alive** waiting for the next user message on stdin. This is intentional so multi-turn sessions don't need to re-spawn.

Untether keeps the subprocess alive to preserve the control-channel plumbing (`_SESSION_STDIN`, approval registries, etc.). Stall detection exists but is deliberately suppressed while child processes are CPU-active (`active_children_suppressed` at `runner_bridge.py`), because those children are usually legitimate MCP servers doing background work during long tool turns.

The gap: **once the model has emitted a `result` event, further child-process CPU activity is no longer evidence of forward progress — it's just MCP servers idling with heartbeats/polling/GC**. The stall heuristic doesn't distinguish "child CPU during a turn" from "child CPU between turns."

Resource cost during the idle window is non-trivial: PID 1854247 held 395 MB RSS, 124 FDs, and up to 206 TCP sockets (across child MCP servers) while doing no productive work.

## Why the user perceived "stuck"

The UX signal is ambiguous. After the final message, the Telegram rendering doesn't clearly distinguish:

- "Turn complete; waiting for your next prompt" (✓ what actually happened)
- "Still processing; wait for me to finish" (✗ what it can look like)

There's no footer change, no explicit "idle" indicator, no change in the progress message — just silence. If the user's expectation is "Claude will continue on its own toward a larger goal," silence feels like stuck.

## Scope

This is a Claude-only issue because Claude is the only engine that runs in permission-mode bidirectional SDK protocol. Codex/OpenCode/Pi/Gemini/AMP currently run in one-shot modes where the CLI exits after the `result` event.

Related to but distinct from #322 (stuck-after-MCP-tool_result). That was a genuine hang inside the CLI; this is a "CLI idle, process alive, behaving as designed."

## Candidate fixes (pick after more investigation)

Option A — **post-result idle timeout** (preferred): after a `result` event, if no new user input arrives within `post_result_idle_timeout` (e.g. default 10–15 min), close stdin on the Claude subprocess so it exits gracefully. User can still send follow-up messages; a new session spawns for the next turn. Trade-off: loses in-memory session state (approval registries, `_SESSION_STDIN`), but Claude's `--resume` token survives so continuity is preserved. Configurable via `[watchdog] post_result_idle_timeout`.

Option B — **active-children-aware stall semantics**: after the most recent event is `result` (or `completion`), drop the `active_children_suppressed` pass and let the normal stall path fire. Treats post-turn idle as a genuine stall after the existing `tool_timeout` threshold. Lower risk of interrupting legitimate in-turn work.

Option C — **UX signal**: edit the final message (or add a small suffix) to indicate "turn complete, waiting for your next prompt." Doesn't reclaim resources but removes the ambiguity. Could be gated by a `[footer] show_idle_indicator` option. Cheap to implement, doesn't address the resource-waste side.

Option D — **hybrid**: A + C. Visible "waiting" indicator immediately after `result`; SIGTERM/stdin-close at `post_result_idle_timeout`. Best UX but two moving parts.

## Research before implementing

1. Measure actual resource cost of an idle post-result session across a representative sample (RSS, FD count, TCP sockets, CPU ticks). Determines how urgent the fix is.
2. Confirm `--resume` reliably survives a full CLI exit including lingering MCP servers — if not, Option A has caveats around control-channel state.
3. Check whether upstream Claude Code has any `idle_timeout` / `exit_after_turn` flag we could leverage instead of closing stdin from our side.
4. Confirm the observed behaviour reproduces with simpler MCP configurations (isolate whether `mcp-remote` or any specific MCP is artificially keeping the process "CPU-active").

## Acceptance criteria

- A post-turn idle window (10–15 min default, configurable) after which Untether actively ends the Claude subprocess.
- Stall warnings no longer fire for idle-after-result sessions — they're replaced by a clean shutdown log.
- User-visible signal in Telegram when the turn is genuinely complete (footer indicator, explicit "waiting" state, or similar).
- Regression test covering the exact scenario: emit a `result` event, advance the clock past the configured idle timeout with no new user input, assert the subprocess is terminated cleanly.

## Related

- #322 — Claude stuck after MCP tool_result (different root cause, similar symptom)
- #247 — approval callback `BotResponseTimeoutError` (another Claude-session UX issue)
- #275 — orphaned subprocess cleanup (same class of resource-accounting problem)

---

**Investigation date:** 2026-04-17
**Staging version:** v0.35.1 (@hetz_lba1_bot)
**Dev version:** feature/330-per-cron-permission-mode (@untether_dev_bot)
**Session that triggered:** `8e3fbd04-e7c2-46af-b088-929c18541a71`


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

investigate: Claude CLI session stays alive for ~36 min after final `result` event (idle-but-alive UX gap) #333

Symptom

Evidence (staging logs, session `8e3fbd04-e7c2-46af-b088-929c18541a71`, PID 1854247)

Root cause

Why the user perceived "stuck"

Scope

Candidate fixes (pick after more investigation)

Research before implementing

Acceptance criteria

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

investigate: Claude CLI session stays alive for ~36 min after final result event (idle-but-alive UX gap) #333

Description

Symptom

Evidence (staging logs, session 8e3fbd04-e7c2-46af-b088-929c18541a71, PID 1854247)

Root cause

Why the user perceived "stuck"

Scope

Candidate fixes (pick after more investigation)

Research before implementing

Acceptance criteria

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

investigate: Claude CLI session stays alive for ~36 min after final `result` event (idle-but-alive UX gap) #333

Evidence (staging logs, session `8e3fbd04-e7c2-46af-b088-929c18541a71`, PID 1854247)