Conversation
💬 Discussion in Slack: #pr-review-cli-223-fix-add-keepalive-idle-timeout-and-session-cleanup-to-gateway. Posted by Review Police — reviews, comments, new commits, and CI failures will stream into this channel.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 71aba63cc3
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```go
		}
	})
}
defer cleanupSession("connection_closed")
```
Avoid terminating PAM session on single connection close
When a PAM session has multiple connections under the same session ID (the gateway registry explicitly stores `[]*pamSessionEntry`, and MongoDB/mongosh retries share a session-level proxy), each `HandlePAMProxy` invocation gets its own `cleanupOnce`, so the first connection that returns now runs this defer and calls `CleanupPAMSession`. That cleanup unregisters the session, may delete the recording/secrets, and calls the termination API (packages/pam/session/uploader.go:736-763) while sibling connections are still active, or while the client is about to open another connection. As a result, multi-connection PAM sessions can be marked ended and lose credentials/log state prematurely.
Useful? React with 👍 / 👎.
```go
for {
	n, err := src.Read(buf)
	if n > 0 {
		if p.config.OnActivity != nil {
			p.config.OnActivity()
		}

		// Check if this channel is a binary session (SFTP/SCP)
		chState.mutex.Lock()
		isBinary := chState.isBinarySession
```
🔴 When command blocking is enabled (BlockedCommandPatterns non-empty), client→server keystrokes/data flow through proxyClientToServerWithBlocking which never invokes p.config.OnActivity(), so they don't reset the gateway's idle reaper. SSH bypasses the activityConn wrapper (pam-proxy.go:386 passes raw conn, not handlerConn), making OnActivity the only activity-tracking path for SSH; sessions with continuous client-only data (e.g., ssh host 'cat > file' piping a long upload, paste/non-echoing modes, exec commands consuming stdin silently) can be reaped after 30 minutes despite active client traffic. Fix: mirror the n>0 nil-check from proxyData (proxy.go:527-530) at the top of the n>0 branch in proxyClientToServerWithBlocking.
Extended reasoning...
What the bug is
proxyClientToServerWithBlocking (proxy.go:759-862, replacing the call site at proxy.go:773) reads bytes from the client channel and forwards them to the server, but never invokes p.config.OnActivity() on either the binary (isBinary) branch or the interactive branch. Its non-blocking sibling proxyData does call OnActivity at proxy.go:527-530, immediately after src.Read returns n>0. The result is asymmetric activity tracking depending on whether command blocking is configured for a session.
Why this matters specifically for SSH
In pam-proxy.go the SSH branch passes the raw TLS conn to proxy.HandleConnection (line 386: return proxy.HandleConnection(ctx, conn)), not the handlerConn activityConn wrapper that other resource types use. Every other resource (postgres/mysql/mssql/redis/kubernetes/mongodb) gets handlerConn, so the wrapper's Read/Write paths fire OnActivity even if the inner protocol handler doesn't. SSH has no such fallback — OnActivity has to be invoked by the SSH proxy code itself. proxyData does that for both directions when blocking is off; with blocking on, only the server→client direction (still using proxyData) does, while client→server goes silent.
Code path triggering the bug
- Operator configures a PAM policy with `CommandBlocking.Patterns` non-empty. `HandlePAMProxy` compiles those into `BlockedCommandPatterns` and constructs the SSH proxy with them set (pam-proxy.go:362).
- In `handleChannel` (proxy.go around line 234), `len(p.config.BlockedCommandPatterns) > 0` selects `proxyClientToServerWithBlocking` for the client→server goroutine.
- The function loops on `src.Read(buf)`, parses/forwards bytes, but never touches OnActivity. `lastActivity` for that session entry is updated only when server→client traffic flows.
Why existing code doesn't prevent it
- The `activityConn` wrapper added in this PR would have caught it via underlying-conn reads/writes, but SSH is the one resource type that intentionally bypasses the wrapper (probably because the SSH proxy already has its own per-direction read loops).
- The server→client `proxyData` goroutine does fire OnActivity, which masks the bug for typical interactive shells: terminal echo means every client keystroke triggers a server→client byte that resets `lastActivity`. So the "leave a shell idle for 30+ minutes" test in this PR's test plan still fires the reaper correctly, and the "type commands and stay alive" test stays alive, both because of echo.
Impact
Idle-reaper false positives during legitimate continuous client activity:
- `ssh host 'cat > /tmp/upload'` with the client piping stdin from a large file: client→server data flows continuously, the server is silent until EOF. Reaped at 30 min mid-upload.
- Similar patterns: `ssh host 'tar xf - -C /dst'`, `ssh host 'mysql' < big.sql`, scripted pipelines that feed stdin to a remote command without intermediate output.
- Long paste into a non-echoing prompt (password entry, sudo prompts), though usually too short to actually trigger.
When the reaper fires, CancelPAMSession closes the TLS conn and cancels the context, dropping in-flight uploads.
Fix
Mirror proxyData's pattern. Add at the top of the `n > 0` branch in `proxyClientToServerWithBlocking` (proxy.go ~line 766):

```go
if p.config.OnActivity != nil {
	p.config.OnActivity()
}
```

This is exactly the snippet at proxy.go:527-530 in `proxyData`.
Step-by-step proof
Setup: PAM policy has `commandBlocking.patterns = ['^rm\s']` (any non-empty list triggers the path). User runs `cat /large/local/file | ssh -o ServerAliveInterval=0 user@host 'cat > /remote/file'` through the gateway. The client side keeps the SSH session quiet on stdout/stderr; only stdin flows to the server.
1. T=0: Session registers via `RegisterPAMSession`; `lastActivity` set to now (gateway.go:177).
2. T=0+: `HandlePAMProxy` enters the SSH branch, builds `SSHProxyConfig` with `BlockedCommandPatterns` non-empty and `OnActivity = pamConfig.OnActivity`, then calls `proxy.HandleConnection(ctx, conn)` (note `conn`, not `handlerConn`).
3. `handleChannel` opens the session channel and dispatches the goroutines. Because patterns are non-empty, client→server takes `proxyClientToServerWithBlocking`.
4. Client streams 100 MB to stdin of the remote `cat`. `src.Read` returns `n>0` repeatedly. The bytes are forwarded to `dst.Write(buf[:n])` (binary branch, since this is exec stdin) but `OnActivity` is never called.
5. Remote `cat > /remote/file` produces no stdout/stderr until EOF, so `proxyData` (server→client) reads no data and does not call OnActivity either.
6. T=30 min: `reapIdleSessions` (gateway.go:295) sees `lastActivity` still ≈ T=0, marks the session stale, and calls `CancelPAMSession`, which calls `e.conn.Close()` and `e.cancel()`.
7. The TLS conn closing propagates to the SSH proxy; the upload aborts mid-stream and the gateway logs "Reaping idle PAM session" despite continuous traffic.
With the one-line fix, step 4 calls OnActivity on every read, lastActivity advances each iteration, and the reaper leaves the session alone.
Why this doesn't show up in the listed test plan
Test 3 ("leave it idle for 30+ minutes") works as expected — no traffic, reaper fires.
Test 4 ("type commands for a few minutes") survives because terminal echo means every keystroke produces a server→client byte that calls OnActivity via proxyData. The asymmetry only manifests in non-echoing client→server-dominated traffic, which isn't covered.
```go
// Cleanup must run exactly once regardless of how the session ends
var cleanupOnce sync.Once
cleanupSession := func(reason string) {
	cleanupOnce.Do(func() {
		if err := pamConfig.SessionUploader.CleanupPAMSession(pamConfig.SessionId, reason); err != nil {
			log.Error().Err(err).Str("sessionId", pamConfig.SessionId).Str("reason", reason).Msg("Failed to cleanup PAM session")
		}
	})
}
defer cleanupSession("connection_closed")
```
🔴 Cancellation now triggers two concurrent CleanupPAMSession calls for the same session: HandlePAMCancellation invokes it directly with reason "cancellation", while the new defer cleanupSession("connection_closed") in HandlePAMProxy fires when the cancellation closes the conn. The cleanupOnce is per-HandlePAMProxy-invocation and cannot coordinate with the separate cancellation goroutine, so the server sees two POST /v1/pam/sessions/{id}/end calls and the cleanup pipeline (flush, unregister/re-register, file delete, credentials cleanup) races with itself. Fix by sharing the once with HandlePAMCancellation (e.g. pass cleanupSession through, or skip the cancellation-side cleanup when the proxy goroutine is active), or make CleanupPAMSession idempotent at the uploader level.
Extended reasoning...
Bug: The new defer cleanupSession("connection_closed") at packages/pam/pam-proxy.go:174 regresses the cancellation flow by causing two concurrent CleanupPAMSession calls for the same session.
Cancellation path (after this PR):
- A user-initiated cancellation arrives over ALPN `infisical-pam-session-cancellation`. `gateway.go::handleIncomingChannel` dispatches each channel on its own goroutine (`go g.handleIncomingChannel(newChannel)`), so cancellation runs concurrently with the active `ForwardModePAM` goroutine.
- `pam.HandlePAMCancellation` (pam-proxy.go:95) calls `cancelSession(sessionId)` → `Gateway.CancelPAMSession`, which closes the active session's `tlsConn` (`e.conn.Close()`) and cancels its `sessionCtx` (`e.cancel()`).
- It then synchronously calls `pamConfig.SessionUploader.CleanupPAMSession(sessionId, "cancellation")` at pam-proxy.go:105.
- In parallel, closing the `tlsConn` wakes the active `HandlePAMProxy` goroutine: `proxy.HandleConnection` returns and the new deferred `cleanupSession("connection_closed")` (pam-proxy.go:174) fires, calling `CleanupPAMSession(sessionId, "connection_closed")`.
Two independent goroutines now enter SessionUploader.CleanupPAMSession for the same sessionID. The cleanupOnce sync.Once introduced at pam-proxy.go:166 is captured in HandlePAMProxy's closure — it has no visibility into the cancellation goroutine, so it does not gate the second invocation.
Why CleanupPAMSession is not race-safe (packages/pam/session/uploader.go:690-768): it has no per-session lock guarding the full sequence — RegisterSession (if missing) → flushSession → optional legacy bulk upload → UnregisterSession → chunkUploader.ReconcileSession + CleanupSession → deletePersistedOffset + os.Remove of the recording file → credentialsManager.CleanupSessionCredentials → CleanupSessionMutex → api.CallPAMSessionTermination. The sessionMutex in logger.go only protects per-session file writes, not the cleanup pipeline.
Concrete fallout:
- Two `POST /v1/pam/sessions/{id}/end` calls hit the platform: a server-visible duplicate termination.
- `flushSession` runs twice; the second call may upload events twice or hit a session being re-registered.
- TOCTOU at uploader.go:695-700: A may `UnregisterSession` after B's `RUnlock` but before B's `RegisterSession`, so B re-creates state for a session A just removed; both then `UnregisterSession` again.
- `os.Remove` race on the recording file (handled gracefully via `os.IsNotExist`, but symptomatic of the race).
- Credentials cleanup races, and the audit log misleadingly attributes the termination to `connection_closed` for what was a user-initiated cancellation (whichever wins logs its reason).
Pre-PR behavior (verified at HEAD~ — commit 71aba63): HandlePAMProxy had no deferred cleanup. On cancellation, only HandlePAMCancellation called CleanupPAMSession, so the cancellation path had a single cleaner. The PR added the defer to fix the orthogonal "no cleanup on normal disconnect" gap, but inadvertently introduced this concurrency regression on the cancellation path.
Step-by-step proof:
- T=0: SSH session active. `HandlePAMProxy` goroutine G_proxy is blocked inside `proxy.HandleConnection(ctx, handlerConn)`. `cleanupOnce` (in G_proxy's frame) is unfired.
- T=1: User cancels. New channel arrives with ALPN `infisical-pam-session-cancellation`; `handleIncomingChannel` dispatches on goroutine G_cancel. `Gateway.CancelPAMSession` runs at gateway.go:179: removes the entry from `pamSessions`, then `e.conn.Close()` + `e.cancel()`.
- T=2: G_cancel proceeds to pam-proxy.go:105 and calls `SessionUploader.CleanupPAMSession(sessionId, "cancellation")`. It enters the function, takes `RLock`, sees the session is registered, releases `RLock`, calls `flushSession`.
- T=2 (concurrent): The closed `tlsConn` causes `proxy.HandleConnection` to return in G_proxy. G_proxy starts running deferred functions. The defer at pam-proxy.go:174 fires: `cleanupSession("connection_closed")` → `cleanupOnce.Do(...)`. `Do` enters because `cleanupOnce` has not been fired in this frame. It calls `SessionUploader.CleanupPAMSession(sessionId, "connection_closed")`.
- T=3: Both G_cancel and G_proxy are now inside `CleanupPAMSession(sessionId, ...)`. They race on `flushSession`, `UnregisterSession`, file delete, and `api.CallPAMSessionTermination`. The server-side end endpoint receives two POSTs for the same session.
- Whichever finishes last logs its reason; the audit trail shows the wrong reason approximately half the time.
Fix options:
- Share the once: pass `cleanupSession` (or a `*sync.Once` keyed on sessionID) into `HandlePAMCancellation` so the cancellation goroutine and the proxy goroutine's defer share gate state. Simplest is to skip the explicit `CleanupPAMSession` in `HandlePAMCancellation` and rely on the proxy's deferred cleanup once the conn close trips it (the cancellation goroutine just has to ensure the conn is closed and cancel the ctx).
- Make `CleanupPAMSession` idempotent at the uploader: maintain an in-progress/completed set under a mutex, return early on second entry for the same session.
The first option is cheapest and matches the existing design (single owner per session lifecycle).
Description 📣
Add keepalive, idle timeout, and session cleanup to gateway
Type ✨
Tests 🛠️
- `kill -STOP` the client process, watch gateway logs for "SSH keepalive to client failed" within ~45s, confirm the session flips to inactive in the UI
- `kill -STOP` the relay process while a session is active, watch gateway logs for "Relay SSH keepalive failed" within ~45s, then `kill -CONT` the relay and confirm the gateway reconnects
- Leave a session idle (temporarily set `pamIdleTimeout` to `2 * time.Minute` in gateway.go:278), watch for "Reaping idle PAM session", confirm the session flips to inactive