fix: Guard createAnchor against resurrecting a killed tmux server#234
Conversation
…rrection) resolveBootstrap now runs a side-effect-free `tmux -L <socket> list-sessions` liveness probe FIRST and declines a genuinely dead server before createAnchor runs. Previously createAnchor's `new-session` implicitly started (resurrected) a killed server on every reconnect dial, so UI-initiated kill-server never stuck — the reconnect FSM re-created the dead server ~250ms later. The probe is folded into the existing first-session listing (new probeAndFirstSession helper), so net tmux round-trips stay flat at 4. A live server — including the alive-but-zero-real-session exit-empty floor case (Constitution VI) — still passes the probe and gets its _rk-ctl anchor. Dead-vs-alive classification is mirrored locally in tmuxctl via named-constant stderr matching; no cross-package export. Backend-only; api/servers.go and the kill path are untouched.
There was a problem hiding this comment.
Pull request overview
This PR fixes a tmux server “resurrection” bug where tmuxctl.resolveBootstrap → createAnchor could implicitly restart a tmux server after a UI-initiated kill-server, preventing kills from “sticking”. The change makes bootstrapping join-only by probing server liveness with a side-effect-free list-sessions before attempting anchor creation.
Changes:
- Reorders
resolveBootstrapto run alist-sessionsliveness probe first (viaprobeAndFirstSession) and decline with a sentinel error when the server is dead, preventingcreateAnchorfrom restarting tmux. - Adds dead-server stderr classification helpers (
isServerDeadError/matchesServerDeadText) pluserrServerDeadsentinel behavior. - Adds/updates real-tmux integration tests (dead-server decline, live-server success) and updates tmux anchor documentation to reflect the new invariant.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| app/backend/internal/tmuxctl/client.go | Implements probe-first bootstrap + dead-server decline sentinel to prevent tmux resurrection. |
| app/backend/internal/tmuxctl/integration_test.go | Adds/updates integration coverage for dead-server decline and live-server success cases. |
| docs/memory/run-kit/tmux-sessions.md | Documents the “join-only” anchor invariant and new probe-first behavior. |
| fab/changes/260602-poka-guard-anchor-no-resurrect/intake.md | Captures investigation context, root cause, and intended behavioral guarantees. |
| fab/changes/260602-poka-guard-anchor-no-resurrect/plan.md | Records requirements/tasks/acceptance criteria for the fix and tests. |
| fab/changes/260602-poka-guard-anchor-no-resurrect/.status.yaml | Tracks fab pipeline status/metrics for the change. |
| fab/changes/260602-poka-guard-anchor-no-resurrect/.history.jsonl | Records fab stage-transition history for the change. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // duty — it both distinguishes a dead server (exit 1, "no server running" / | ||
| // "failed to connect") from a live one (exit 0, including the alive-but-zero-real- | ||
| // session floor case), AND yields the first real session for the attach target. | ||
| // Folding the probe into the existing listing keeps net tmux round-trips flat at |
There was a problem hiding this comment.
Fixed — resolveBootstrap comment now lists all three dead-server fragments (added No such file or directory) to match matchesServerDeadText. (c7fbf04)
| **Floor vs. attach target are decoupled** (since `260602-a1wo-prevent-exit-empty-server-death`). `resolveBootstrap` (`client.go`) does two separate things on every dial: (1) **always** `createAnchor` + `setAnchorKeepalive` (the floor — R1), then (2) pick the attach target — the first real session if one exists, else `_rk-ctl` (R2). The attach target is the first name returned by `firstSessionName`, which runs `tmux [-L <socket>] list-sessions -F '#{session_name}'` and now **skips `_rk-ctl`** so the always-present anchor is never picked when a real session exists (`_rk-ctl` sorts ahead of a lowercase name like `runkit`, so an unskipped listing would wrongly select it and regress the "prefer a real session" contract). (Pre-`260602-qn62`, `rk-relay-*` ephemerals were valid attach targets and not skipped; they no longer exist. `_rk-pin-*` pin-sessions are single-window board sessions — `firstSessionName` is in the untouched tmuxctl path, so whether it prefers or skips them is out of scope for `260602-qn62`.) Attaching control mode to `_rk-ctl` would also be correct (`%session-window-changed` is global on tmux 3.6a — see [[tmux-control-mode-event-scope]]); preferring a real session is a minimal-diff, zero-event-scope-risk choice, not a correctness requirement. | ||
| **Floor vs. attach target are decoupled** (since `260602-a1wo-prevent-exit-empty-server-death`). On every dial `resolveBootstrap` (`client.go`) does two separate things **once the server is confirmed alive** (see § Anchor is join-only below): (1) `createAnchor` + `setAnchorKeepalive` (the floor — R1), then (2) pick the attach target — the first real session if one exists, else `_rk-ctl` (R2). The attach target is the first real name surfaced by `probeAndFirstSession`, which runs `tmux [-L <socket>] list-sessions -F '#{session_name}'` and **skips `_rk-ctl`** so the always-present anchor is never picked when a real session exists (`_rk-ctl` sorts ahead of a lowercase name like `runkit`, so an unskipped listing would wrongly select it and regress the "prefer a real session" contract). (Pre-`260602-qn62`, `rk-relay-*` ephemerals were valid attach targets and not skipped; they no longer exist. `_rk-pin-*` pin-sessions are single-window board sessions; whether the picker prefers or skips them is out of scope here.) Attaching control mode to `_rk-ctl` would also be correct (`%session-window-changed` is global on tmux 3.6a — see [[tmux-control-mode-event-scope]]); preferring a real session is a minimal-diff, zero-event-scope-risk choice, not a correctness requirement. | ||
|
|
||
| **Anchor is join-only — never resurrects a killed server** (since `260602-poka-guard-anchor-no-resurrect`). `createAnchor`'s `tmux new-session -d -s _rk-ctl` has an implicit side effect: it *starts* a server when none is listening on the socket. Because `resolveBootstrap` is also the reconnect FSM's dial (`productionDial`, ~250ms→5s backoff), a UI-initiated `kill-server` used to lose a race — the server died, the FSM re-dialed ~250ms later, and `createAnchor` **recreated the dead server** (the new socket file also fired an fsnotify `CREATE` that re-opened a fresh Client — a self-reinforcing resurrection loop). The fix makes `createAnchor` **only ever join an already-running server**: `resolveBootstrap` runs a side-effect-free `list-sessions` probe FIRST via `probeAndFirstSession` (which doubles as the first-real-session selector — folding in what the former `firstSessionName` did, so net tmux round-trips stay flat at **4**: `SetExitEmptyOff`, the unified probe+first-session, `createAnchor`, `setAnchorKeepalive`). A genuinely dead server — `list-sessions` exit 1 whose stderr matches `no server running` / `failed to connect` / `No such file or directory` (classified by `isServerDeadError` → `matchesServerDeadText`, the same text `internal/tmux` and `board.go`'s `isAbsentOption` key off) — is declined with the `errServerDead` sentinel **before** `createAnchor` runs. `productionDial` propagates it, the reconnect FSM backs off instead of resurrecting, and the kill sticks. **Constitution VI floor is preserved**: an alive-but-zero-session server (probe exit 0, empty output) still passes and still gets its `_rk-ctl` anchor — only a truly dead server (exit 1) is declined. The probe lives locally in `tmuxctl` (mirroring `internal/tmux.probeServerAlive`'s `exec.CommandContext` + 2s-timeout pattern rather than calling cross-package) because `doc.go` documents `tmuxctl` as the sanctioned self-contained bypass of the `internal/tmux/` boundary, and `probeServerAlive` collapses the dead-vs-empty distinction this guard must surface. |
There was a problem hiding this comment.
Fixed — parenthetical no longer overstates the shared match set: internal/tmux's ListKeys/KillServer match all three including the socket-file variant, while board.go's isAbsentOption matches only no server running / failed to connect. Verified against the source. (c7fbf04)
Correct two doc-accuracy comments flagged in PR review: - client.go resolveBootstrap comment now lists all three dead-server stderr fragments (adds "No such file or directory") to match matchesServerDeadText. - tmux-sessions.md no longer overstates the shared match set: internal/tmux's ListKeys/KillServer match all three including the socket-file variant, while board.go's isAbsentOption matches only "no server running" / "failed to connect".
Meta
Pipeline: intake ✓ → apply ✓ → review ✓ → hydrate ✓ → ship → review-pr
Impact: +299/−40 code (excluding
fab/,docs/) · +594/−42 totalSummary
Killing any tmux server from run-kit never stuck — the server reappeared within ~250ms whether or not it had sessions. Root cause:
tmuxctl.createAnchorrunstmux new-session -d -s _rk-ctlon every dial and reconnect, andnew-sessionimplicitly starts a server when none is listening. After a UI kill, the still-alive Client's reconnect FSM re-dialed andcreateAnchorresurrected the dead server, kicking off a self-reinforcing fsnotify re-open loop. This guardscreateAnchorso it only ever joins an already-running server, making UI-initiatedkill-serveractually stick.Changes
createAnchorto never start a server — the anchor's only legitimate job is to add the_rk-ctlkeepalive to a server that is already alive; it now declines (returnserrServerDead) when no server is listening.resolveBootstrapto fold the liveness probe into the existinglist-sessions(no added round-trip) — a single side-effect-freetmux -L <socket> list-sessionsruns FIRST via the newprobeAndFirstSessionhelper, serving double duty: it distinguishes a dead server (exit 1,no server running/failed to connect) from a live one (exit 0, including the alive-but-zero-session floor) AND yields the first real session for the attach target.createAnchoris reached only after the probe confirms the server is up, so it can never resurrect. Net tmux round-trips stay flat at 4.Behavior after the fix
kill-server→ server dies → reconnect FSM re-dials →resolveBootstrapprobes, finds no server, declines → FSM backs off (250ms→5s) without recreating the server. Stays dead.list-sessionsreturns exit 0 / empty, probe passes, anchor still created — Constitution VI floor preserved.🤖 Generated with Claude Code