Skip to content

fix: Guard createAnchor against resurrecting a killed tmux server#234

Merged
sahil-noon merged 3 commits into
mainfrom
260602-poka-guard-anchor-no-resurrect
Jun 3, 2026
Merged

fix: Guard createAnchor against resurrecting a killed tmux server#234
sahil-noon merged 3 commits into
mainfrom
260602-poka-guard-anchor-no-resurrect

Conversation

@sahil-noon
Copy link
Copy Markdown
Collaborator

Meta

ID Type Confidence Plan Review
poka fix 3.7/5.0 4/4 tasks, 12/12 acceptance ✓ ✓ 1 cycle

Pipeline: intake ✓ → apply ✓ → review ✓ → hydrate ✓ → ship → review-pr

Impact: +299/−40 code (excluding fab/, docs/) · +594/−42 total

Summary

Killing any tmux server from run-kit never stuck — the server reappeared within ~250ms whether or not it had sessions. Root cause: tmuxctl.createAnchor runs tmux new-session -d -s _rk-ctl on every dial and reconnect, and new-session implicitly starts a server when none is listening. After a UI kill, the still-alive Client's reconnect FSM re-dialed and createAnchor resurrected the dead server, kicking off a self-reinforcing fsnotify re-open loop. This guards createAnchor so it only ever joins an already-running server, making UI-initiated kill-server actually stick.

Changes

  • Guard createAnchor to never start a server — the anchor's only legitimate job is to add the _rk-ctl keepalive to a server that is already alive; it now declines (returns errServerDead) when no server is listening.
  • Restructure resolveBootstrap to fold the liveness probe into the existing list-sessions (no added round-trip) — a single side-effect-free tmux -L <socket> list-sessions runs FIRST via the new probeAndFirstSession helper, serving double duty: it distinguishes a dead server (exit 1, no server running / failed to connect) from a live one (exit 0, including the alive-but-zero-session floor) AND yields the first real session for the attach target. createAnchor is reached only after the probe confirms the server is up, so it can never resurrect. Net tmux round-trips stay flat at 4.

Behavior after the fix

  • Kill flow: kill-server → server dies → reconnect FSM re-dials → resolveBootstrap probes, finds no server, declines → FSM backs off (250ms→5s) without recreating the server. Stays dead.
  • First connect to a fresh live server: probe succeeds, anchor created, control mode attaches. Unchanged.
  • Alive-but-zero-session server (exit-empty floor): list-sessions returns exit 0 / empty, probe passes, anchor still created — Constitution VI floor preserved.

🤖 Generated with Claude Code

sahil87 added 2 commits June 3, 2026 05:30
…rrection)

resolveBootstrap now runs a side-effect-free `tmux -L <socket> list-sessions`
liveness probe FIRST and declines a genuinely dead server before createAnchor
runs. Previously createAnchor's `new-session` implicitly started (resurrected)
a killed server on every reconnect dial, so UI-initiated kill-server never
stuck — the reconnect FSM re-created the dead server ~250ms later.

The probe is folded into the existing first-session listing (new
probeAndFirstSession helper), so net tmux round-trips stay flat at 4. A live
server — including the alive-but-zero-real-session exit-empty floor case
(Constitution VI) — still passes the probe and gets its _rk-ctl anchor.
Dead-vs-alive classification is mirrored locally in tmuxctl via named-constant
stderr matching; no cross-package export.

Backend-only; api/servers.go and the kill path are untouched.
@sahil-noon sahil-noon requested a review from Copilot June 3, 2026 03:43
@sahil-noon sahil-noon marked this pull request as ready for review June 3, 2026 03:43
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a tmux server “resurrection” bug where tmuxctl.resolveBootstrap → createAnchor could implicitly restart a tmux server after a UI-initiated kill-server, preventing kills from “sticking”. The change makes bootstrapping join-only by probing server liveness with a side-effect-free list-sessions before attempting anchor creation.

Changes:

  • Reorders resolveBootstrap to run a list-sessions liveness probe first (via probeAndFirstSession) and decline with a sentinel error when the server is dead, preventing createAnchor from restarting tmux.
  • Adds dead-server stderr classification helpers (isServerDeadError / matchesServerDeadText) plus errServerDead sentinel behavior.
  • Adds/updates real-tmux integration tests (dead-server decline, live-server success) and updates tmux anchor documentation to reflect the new invariant.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
app/backend/internal/tmuxctl/client.go Implements probe-first bootstrap + dead-server decline sentinel to prevent tmux resurrection.
app/backend/internal/tmuxctl/integration_test.go Adds/updates integration coverage for dead-server decline and live-server success cases.
docs/memory/run-kit/tmux-sessions.md Documents the “join-only” anchor invariant and new probe-first behavior.
fab/changes/260602-poka-guard-anchor-no-resurrect/intake.md Captures investigation context, root cause, and intended behavioral guarantees.
fab/changes/260602-poka-guard-anchor-no-resurrect/plan.md Records requirements/tasks/acceptance criteria for the fix and tests.
fab/changes/260602-poka-guard-anchor-no-resurrect/.status.yaml Tracks fab pipeline status/metrics for the change.
fab/changes/260602-poka-guard-anchor-no-resurrect/.history.jsonl Records fab stage-transition history for the change.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +417 to +420
// duty — it both distinguishes a dead server (exit 1, "no server running" /
// "failed to connect") from a live one (exit 0, including the alive-but-zero-real-
// session floor case), AND yields the first real session for the attach target.
// Folding the probe into the existing listing keeps net tmux round-trips flat at
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — resolveBootstrap comment now lists all three dead-server fragments (added No such file or directory) to match matchesServerDeadText. (c7fbf04)

Comment thread docs/memory/run-kit/tmux-sessions.md Outdated
**Floor vs. attach target are decoupled** (since `260602-a1wo-prevent-exit-empty-server-death`). `resolveBootstrap` (`client.go`) does two separate things on every dial: (1) **always** `createAnchor` + `setAnchorKeepalive` (the floor — R1), then (2) pick the attach target — the first real session if one exists, else `_rk-ctl` (R2). The attach target is the first name returned by `firstSessionName`, which runs `tmux [-L <socket>] list-sessions -F '#{session_name}'` and now **skips `_rk-ctl`** so the always-present anchor is never picked when a real session exists (`_rk-ctl` sorts ahead of a lowercase name like `runkit`, so an unskipped listing would wrongly select it and regress the "prefer a real session" contract). (Pre-`260602-qn62`, `rk-relay-*` ephemerals were valid attach targets and not skipped; they no longer exist. `_rk-pin-*` pin-sessions are single-window board sessions — `firstSessionName` is in the untouched tmuxctl path, so whether it prefers or skips them is out of scope for `260602-qn62`.) Attaching control mode to `_rk-ctl` would also be correct (`%session-window-changed` is global on tmux 3.6a — see [[tmux-control-mode-event-scope]]); preferring a real session is a minimal-diff, zero-event-scope-risk choice, not a correctness requirement.
**Floor vs. attach target are decoupled** (since `260602-a1wo-prevent-exit-empty-server-death`). On every dial `resolveBootstrap` (`client.go`) does two separate things **once the server is confirmed alive** (see § Anchor is join-only below): (1) `createAnchor` + `setAnchorKeepalive` (the floor — R1), then (2) pick the attach target — the first real session if one exists, else `_rk-ctl` (R2). The attach target is the first real name surfaced by `probeAndFirstSession`, which runs `tmux [-L <socket>] list-sessions -F '#{session_name}'` and **skips `_rk-ctl`** so the always-present anchor is never picked when a real session exists (`_rk-ctl` sorts ahead of a lowercase name like `runkit`, so an unskipped listing would wrongly select it and regress the "prefer a real session" contract). (Pre-`260602-qn62`, `rk-relay-*` ephemerals were valid attach targets and not skipped; they no longer exist. `_rk-pin-*` pin-sessions are single-window board sessions; whether the picker prefers or skips them is out of scope here.) Attaching control mode to `_rk-ctl` would also be correct (`%session-window-changed` is global on tmux 3.6a — see [[tmux-control-mode-event-scope]]); preferring a real session is a minimal-diff, zero-event-scope-risk choice, not a correctness requirement.

**Anchor is join-only — never resurrects a killed server** (since `260602-poka-guard-anchor-no-resurrect`). `createAnchor`'s `tmux new-session -d -s _rk-ctl` has an implicit side effect: it *starts* a server when none is listening on the socket. Because `resolveBootstrap` is also the reconnect FSM's dial (`productionDial`, ~250ms→5s backoff), a UI-initiated `kill-server` used to lose a race — the server died, the FSM re-dialed ~250ms later, and `createAnchor` **recreated the dead server** (the new socket file also fired an fsnotify `CREATE` that re-opened a fresh Client — a self-reinforcing resurrection loop). The fix makes `createAnchor` **only ever join an already-running server**: `resolveBootstrap` runs a side-effect-free `list-sessions` probe FIRST via `probeAndFirstSession` (which doubles as the first-real-session selector — folding in what the former `firstSessionName` did, so net tmux round-trips stay flat at **4**: `SetExitEmptyOff`, the unified probe+first-session, `createAnchor`, `setAnchorKeepalive`). A genuinely dead server — `list-sessions` exit 1 whose stderr matches `no server running` / `failed to connect` / `No such file or directory` (classified by `isServerDeadError` → `matchesServerDeadText`, the same text `internal/tmux` and `board.go`'s `isAbsentOption` key off) — is declined with the `errServerDead` sentinel **before** `createAnchor` runs. `productionDial` propagates it, the reconnect FSM backs off instead of resurrecting, and the kill sticks. **Constitution VI floor is preserved**: an alive-but-zero-session server (probe exit 0, empty output) still passes and still gets its `_rk-ctl` anchor — only a truly dead server (exit 1) is declined. The probe lives locally in `tmuxctl` (mirroring `internal/tmux.probeServerAlive`'s `exec.CommandContext` + 2s-timeout pattern rather than calling cross-package) because `doc.go` documents `tmuxctl` as the sanctioned self-contained bypass of the `internal/tmux/` boundary, and `probeServerAlive` collapses the dead-vs-empty distinction this guard must surface.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — parenthetical no longer overstates the shared match set: internal/tmux's ListKeys/KillServer match all three including the socket-file variant, while board.go's isAbsentOption matches only no server running / failed to connect. Verified against the source. (c7fbf04)

Correct two doc-accuracy comments flagged in PR review:
- client.go resolveBootstrap comment now lists all three dead-server
  stderr fragments (adds "No such file or directory") to match
  matchesServerDeadText.
- tmux-sessions.md no longer overstates the shared match set:
  internal/tmux's ListKeys/KillServer match all three including the
  socket-file variant, while board.go's isAbsentOption matches only
  "no server running" / "failed to connect".
@sahil-noon sahil-noon merged commit 242b5b6 into main Jun 3, 2026
5 checks passed
@sahil-noon sahil-noon deleted the 260602-poka-guard-anchor-no-resurrect branch June 3, 2026 03:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants