fix: add keepalive, idle timeout, and session cleanup to gateway#223

Open
x032205 wants to merge 1 commit into main from PAM-205

Conversation

@x032205 x032205 commented May 8, 2026

Description 📣

Add keepalive, idle timeout, and session cleanup to gateway

Type ✨

  • Bug fix
  • New feature
  • Improvement
  • Breaking change
  • Documentation

Tests 🛠️

  1. Connect SSH through the gateway, kill -STOP the client process, watch gateway logs for "SSH keepalive to client failed" within ~45s, confirm session flips to inactive in the UI
  2. kill -STOP the relay process while a session is active, watch gateway logs for "Relay SSH keepalive failed" within ~45s, then kill -CONT the relay and confirm the gateway reconnects
  3. Connect SSH through the gateway, leave it idle for 30+ minutes (or temporarily set pamIdleTimeout to 2 * time.Minute in gateway.go:278), watch for "Reaping idle PAM session", confirm session flips to inactive
  4. Connect SSH through the gateway, type commands for a few minutes, confirm the session stays alive and the reaper does not fire

@x032205 x032205 requested a review from bernie-g May 8, 2026 08:10

linear Bot commented May 8, 2026

PAM-205

@infisical-review-police

💬 Discussion in Slack: #pr-review-cli-223-fix-add-keepalive-idle-timeout-and-session-cleanup-to-gateway

Posted by Review Police — reviews, comments, new commits, and CI failures will stream into this channel.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 71aba63cc3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread packages/pam/pam-proxy.go

        }
    })
}
defer cleanupSession("connection_closed")

P1: Avoid terminating PAM session on single connection close

When a PAM session has multiple connections under the same session ID (the gateway registry explicitly stores []*pamSessionEntry, and MongoDB/mongosh retries share a session-level proxy), each HandlePAMProxy invocation gets its own cleanupOnce; the first connection that returns now runs this defer and calls CleanupPAMSession. That cleanup unregisters the session, may delete the recording/secrets, and calls the termination API (packages/pam/session/uploader.go:736-763) while sibling connections are still active or while the client is about to open another connection, so multi-connection PAM sessions can be marked ended and lose credentials/log state prematurely.
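One way to defer cleanup until the last connection for a session ID returns is a per-session reference count. The sketch below is an assumption about shape only — `sessionRefs`, `acquire`, and `release` are hypothetical names, not the gateway registry's actual API.

```go
package main

import (
	"fmt"
	"sync"
)

// sessionRefs tracks how many live connections share each session ID.
type sessionRefs struct {
	mu     sync.Mutex
	counts map[string]int
}

func newSessionRefs() *sessionRefs {
	return &sessionRefs{counts: make(map[string]int)}
}

// acquire records a new connection joining the session.
func (r *sessionRefs) acquire(id string) {
	r.mu.Lock()
	r.counts[id]++
	r.mu.Unlock()
}

// release decrements the count and reports whether the caller was the
// last connection, i.e. whether CleanupPAMSession should run now.
func (r *sessionRefs) release(id string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.counts[id]--
	if r.counts[id] <= 0 {
		delete(r.counts, id)
		return true
	}
	return false
}

func main() {
	refs := newSessionRefs()
	refs.acquire("s1")
	refs.acquire("s1")              // second connection under the same session ID
	fmt.Println(refs.release("s1")) // false: a sibling is still active
	fmt.Println(refs.release("s1")) // true: last one out triggers cleanup
}
```

Each HandlePAMProxy invocation would acquire on entry and defer a release; only the release that returns true runs the cleanup pipeline.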

Useful? React with 👍 / 👎.

Comment on lines 525 to 534

for {
    n, err := src.Read(buf)
    if n > 0 {
        if p.config.OnActivity != nil {
            p.config.OnActivity()
        }

        // Check if this channel is a binary session (SFTP/SCP)
        chState.mutex.Lock()
        isBinary := chState.isBinarySession

🔴 When command blocking is enabled (BlockedCommandPatterns non-empty), client→server keystrokes/data flow through proxyClientToServerWithBlocking which never invokes p.config.OnActivity(), so they don't reset the gateway's idle reaper. SSH bypasses the activityConn wrapper (pam-proxy.go:386 passes raw conn, not handlerConn), making OnActivity the only activity-tracking path for SSH; sessions with continuous client-only data (e.g., ssh host 'cat > file' piping a long upload, paste/non-echoing modes, exec commands consuming stdin silently) can be reaped after 30 minutes despite active client traffic. Fix: mirror the n>0 nil-check from proxyData (proxy.go:527-530) at the top of the n>0 branch in proxyClientToServerWithBlocking.

Extended reasoning...

What the bug is

proxyClientToServerWithBlocking (proxy.go:759-862, replacing the call site at proxy.go:773) reads bytes from the client channel and forwards them to the server, but never invokes p.config.OnActivity() on either the binary (isBinary) branch or the interactive branch. Its non-blocking sibling proxyData does call OnActivity at proxy.go:527-530, immediately after src.Read returns n>0. The result is asymmetric activity tracking depending on whether command blocking is configured for a session.

Why this matters specifically for SSH

In pam-proxy.go the SSH branch passes the raw TLS conn to proxy.HandleConnection (line 386: return proxy.HandleConnection(ctx, conn)), not the handlerConn activityConn wrapper that other resource types use. Every other resource (postgres/mysql/mssql/redis/kubernetes/mongodb) gets handlerConn, so the wrapper's Read/Write paths fire OnActivity even if the inner protocol handler doesn't. SSH has no such fallback — OnActivity has to be invoked by the SSH proxy code itself. proxyData does that for both directions when blocking is off; with blocking on, only the server→client direction (still using proxyData) does, while client→server goes silent.

Code path triggering the bug

  1. Operator configures a PAM policy with CommandBlocking.Patterns non-empty.
  2. HandlePAMProxy compiles those into BlockedCommandPatterns and constructs the SSH proxy with them set (pam-proxy.go:362).
  3. In handleChannel (proxy.go around line 234), len(p.config.BlockedCommandPatterns) > 0 selects proxyClientToServerWithBlocking for the client→server goroutine.
  4. The function loops on src.Read(buf), parses/forwards bytes, but never touches OnActivity. lastActivity for that session entry is updated only when server→client traffic flows.

Why existing code doesn't prevent it

  • The activityConn wrapper added in this PR would have caught it via underlying-conn reads/writes, but SSH is the one resource type that intentionally bypasses the wrapper (probably because the SSH proxy already has its own per-direction read loops).
  • The server→client proxyData goroutine does fire OnActivity, which masks the bug for typical interactive shells: terminal echo means every client keystroke triggers a server→client byte that resets lastActivity. So the 'leave a shell idle for 30+ minutes' test in this PR's test plan still fires the reaper correctly, and the 'type commands and stay alive' test stays alive — both because of echo.

Impact

Idle-reaper false-positives during legitimate continuous client activity:

  • ssh host 'cat > /tmp/upload' with the client piping stdin from a large file: client→server data flows continuously, server is silent until EOF. Reaped at 30 min mid-upload.
  • Similar patterns: ssh host 'tar xf - -C /dst', ssh host 'mysql' < big.sql, scripted pipelines that feed stdin to a remote command without intermediate output.
  • Long paste into a non-echoing prompt (password entry, sudo prompts), though usually too short to actually trigger.

When the reaper fires, CancelPAMSession closes the TLS conn and cancels the context, dropping in-flight uploads.

Fix

Mirror proxyData's pattern. Add at the top of the n > 0 branch in proxyClientToServerWithBlocking (proxy.go ~line 766):

if p.config.OnActivity != nil {
    p.config.OnActivity()
}

This is exactly the snippet at proxy.go:527-530 in proxyData.

Step-by-step proof

Setup: PAM policy has commandBlocking.patterns = ['^rm\s'] (any non-empty list triggers the path). User runs cat /large/local/file | ssh -o ServerAliveInterval=0 user@host 'cat > /remote/file' through the gateway. Client side keeps the SSH session quiet on stdout/stderr — only stdin flows to the server.

  1. T=0: Session registers via RegisterPAMSession; lastActivity set to now (gateway.go:177).
  2. T=0+: HandlePAMProxy enters the SSH branch, builds SSHProxyConfig with BlockedCommandPatterns non-empty and OnActivity = pamConfig.OnActivity, calls proxy.HandleConnection(ctx, conn) — note conn, not handlerConn.
  3. handleChannel opens the session channel and dispatches the goroutines. Because patterns are non-empty, client→server takes proxyClientToServerWithBlocking.
  4. Client streams 100 MB to stdin of the remote cat. src.Read returns n>0 repeatedly. The bytes are forwarded to dst.Write(buf[:n]) (binary branch, since this is exec stdin) but OnActivity is never called.
  5. Remote cat > /remote/file produces no stdout/stderr until EOF, so proxyData (server→client) reads no data and does not call OnActivity either.
  6. T=30 min: reapIdleSessions (gateway.go:295) sees lastActivity still ≈ T=0, marks the session stale, calls CancelPAMSession which calls e.conn.Close() and e.cancel().
  7. The TLS conn closing propagates to the SSH proxy; the upload aborts mid-stream and the gateway logs "Reaping idle PAM session" despite continuous traffic.

With the one-line fix, step 4 calls OnActivity on every read, lastActivity advances each iteration, and the reaper leaves the session alone.

Why this doesn't show up in the listed test plan

Test 3 ("leave it idle for 30+ minutes") works as expected — no traffic, reaper fires.
Test 4 ("type commands for a few minutes") survives because terminal echo means every keystroke produces a server→client byte that calls OnActivity via proxyData. The asymmetry only manifests in non-echoing client→server-dominated traffic, which isn't covered.

Comment thread packages/pam/pam-proxy.go

Comment on lines +165 to +174

// Cleanup must run exactly once regardless of how the session ends
var cleanupOnce sync.Once
cleanupSession := func(reason string) {
    cleanupOnce.Do(func() {
        if err := pamConfig.SessionUploader.CleanupPAMSession(pamConfig.SessionId, reason); err != nil {
            log.Error().Err(err).Str("sessionId", pamConfig.SessionId).Str("reason", reason).Msg("Failed to cleanup PAM session")
        }
    })
}
defer cleanupSession("connection_closed")

🔴 Cancellation now triggers two concurrent CleanupPAMSession calls for the same session: HandlePAMCancellation invokes it directly with reason "cancellation", while the new defer cleanupSession("connection_closed") in HandlePAMProxy fires when the cancellation closes the conn. The cleanupOnce is per-HandlePAMProxy-invocation and cannot coordinate with the separate cancellation goroutine, so the server sees two POST /v1/pam/sessions/{id}/end calls and the cleanup pipeline (flush, unregister/re-register, file delete, credentials cleanup) races with itself. Fix by sharing the once with HandlePAMCancellation (e.g. pass cleanupSession through, or skip the cancellation-side cleanup when the proxy goroutine is active), or make CleanupPAMSession idempotent at the uploader level.

Extended reasoning...

Bug: The new defer cleanupSession("connection_closed") at packages/pam/pam-proxy.go:174 regresses the cancellation flow by causing two concurrent CleanupPAMSession calls for the same session.

Cancellation path (after this PR):

  1. A user-initiated cancellation arrives over ALPN infisical-pam-session-cancellation. gateway.go::handleIncomingChannel dispatches each channel on its own goroutine (go g.handleIncomingChannel(newChannel)), so cancellation runs concurrently with the active ForwardModePAM goroutine.
  2. pam.HandlePAMCancellation (pam-proxy.go:95) calls cancelSession(sessionId) → Gateway.CancelPAMSession, which closes the active session's tlsConn (e.conn.Close()) and cancels its sessionCtx (e.cancel()).
  3. It then synchronously calls pamConfig.SessionUploader.CleanupPAMSession(sessionId, "cancellation") at pam-proxy.go:105.
  4. In parallel, closing the tlsConn wakes the active HandlePAMProxy goroutine — proxy.HandleConnection returns and the new deferred cleanupSession("connection_closed") (pam-proxy.go:174) fires, calling CleanupPAMSession(sessionId, "connection_closed").

Two independent goroutines now enter SessionUploader.CleanupPAMSession for the same sessionID. The cleanupOnce sync.Once introduced at pam-proxy.go:166 is captured in HandlePAMProxy's closure — it has no visibility into the cancellation goroutine, so it does not gate the second invocation.

Why CleanupPAMSession is not race-safe (packages/pam/session/uploader.go:690-768): it has no per-session lock guarding the full sequence — RegisterSession (if missing) → flushSession → optional legacy bulk upload → UnregisterSession → chunkUploader.ReconcileSession + CleanupSession → deletePersistedOffset + os.Remove of the recording file → credentialsManager.CleanupSessionCredentials → CleanupSessionMutex → api.CallPAMSessionTermination. The sessionMutex in logger.go only protects per-session file writes, not the cleanup pipeline.

Concrete fallout:

  • Two POST /v1/pam/sessions/{id}/end calls hit the platform — server-visible duplicate termination.
  • flushSession runs twice; the second call may upload events twice or hit a session being re-registered.
  • TOCTOU at uploader.go:695-700: A may UnregisterSession after B's RUnlock but before B's RegisterSession, so B re-creates state for a session A just removed; both then UnregisterSession again.
  • os.Remove race on the recording file (handled gracefully via os.IsNotExist, but symptomatic of the race).
  • Credentials cleanup races, and the audit log misleadingly attributes the termination to connection_closed for what was a user-initiated cancellation (whichever wins logs its reason).

Pre-PR behavior (verified at HEAD~ — commit 71aba63): HandlePAMProxy had no deferred cleanup. On cancellation, only HandlePAMCancellation called CleanupPAMSession, so the cancellation path had a single cleaner. The PR added the defer to fix the orthogonal "no cleanup on normal disconnect" gap, but inadvertently introduced this concurrency regression on the cancellation path.

Step-by-step proof:

  1. T=0: SSH session active. HandlePAMProxy goroutine G_proxy is blocked inside proxy.HandleConnection(ctx, handlerConn). cleanupOnce (in G_proxy's frame) is unfired.
  2. T=1: User cancels. New channel arrives with ALPN infisical-pam-session-cancellation; handleIncomingChannel dispatches on goroutine G_cancel. Gateway.CancelPAMSession runs at gateway.go:179: removes the entry from pamSessions, then e.conn.Close() + e.cancel().
  3. T=2: G_cancel proceeds to pam-proxy.go:105 and calls SessionUploader.CleanupPAMSession(sessionId, "cancellation"). Enters the function, takes RLock, sees the session is registered, releases RLock, calls flushSession.
  4. T=2 (concurrent): The closed tlsConn causes proxy.HandleConnection to return in G_proxy. G_proxy starts running deferred functions. The defer at pam-proxy.go:174 fires: cleanupSession("connection_closed") → cleanupOnce.Do(...). Do enters because cleanupOnce has not been fired in this frame. It calls SessionUploader.CleanupPAMSession(sessionId, "connection_closed").
  5. T=3: Both G_cancel and G_proxy are now inside CleanupPAMSession(sessionId, ...). They race on flushSession, UnregisterSession, file delete, and api.CallPAMSessionTermination. The server-side end endpoint receives two POSTs for the same session.
  6. Whichever finishes last logs its reason; the audit trail shows the wrong reason approximately half the time.

Fix options:

  • Share the once: pass cleanupSession (or a *sync.Once keyed on sessionID) into HandlePAMCancellation so the cancellation goroutine and the proxy goroutine's defer share gate state. Simplest is to skip the explicit CleanupPAMSession in HandlePAMCancellation and rely on the proxy's deferred cleanup once the conn close trips it (the cancellation goroutine just has to ensure the conn is closed and cancel the ctx).
  • Make CleanupPAMSession idempotent at the uploader: maintain an in-progress/completed set under a mutex, return early on second entry for the same session.

The first option is cheapest and matches the existing design (single owner per session lifecycle).
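The second option — uploader-level idempotency — can be sketched with a per-session sync.Once map. `cleanupGate` and its methods are hypothetical names for illustration; the real guard would live inside the SessionUploader and wrap the full cleanup pipeline.

```go
package main

import (
	"fmt"
	"sync"
)

// cleanupGate makes cleanup idempotent per session ID, so the
// cancellation goroutine and the proxy's defer cannot both run the
// pipeline for the same session.
type cleanupGate struct {
	mu    sync.Mutex
	onces map[string]*sync.Once
}

func newCleanupGate() *cleanupGate {
	return &cleanupGate{onces: make(map[string]*sync.Once)}
}

// cleanup runs fn at most once per session ID, even across goroutines.
func (g *cleanupGate) cleanup(sessionID string, fn func()) {
	g.mu.Lock()
	once, ok := g.onces[sessionID]
	if !ok {
		once = &sync.Once{}
		g.onces[sessionID] = once
	}
	g.mu.Unlock()
	once.Do(fn)
}

func main() {
	gate := newCleanupGate()
	runs := 0
	var wg sync.WaitGroup
	// Simulate the race: "cancellation" and "connection_closed" arrive
	// concurrently for the same session.
	for _, reason := range []string{"cancellation", "connection_closed"} {
		wg.Add(1)
		go func(r string) {
			defer wg.Done()
			gate.cleanup("session-1", func() { runs++ })
		}(reason)
	}
	wg.Wait()
	fmt.Println(runs) // prints 1: only one goroutine ran the pipeline
}
```

A production version would also evict the entry after the once fires (or record the winning reason for the audit log), but the gate alone is enough to collapse the duplicate POST /v1/pam/sessions/{id}/end calls into one.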
