
fix: KEEP-569 Detect upstream WebSocket silence on proxied subscriptions#35

Open
suisuss wants to merge 2 commits into project-aethermesh:main from suisuss:fix/improve-ws-upstream-idle-detection

Conversation

@suisuss

@suisuss suisuss commented May 15, 2026

Problem

Long-lived WebSocket subscriptions through aetherlay (e.g. eth_subscribe for newHeads) could stop delivering messages while the proxied connection stayed open. Clients ended up with a valid socket and no signal to reconnect — block notifications silently stopped flowing.

Root cause: the proxy was a transparent byte-pipe with no liveness detection.

  • defaultProxyWebSocket had no read deadline on either conn, so ReadMessage() would block forever on a half-open upstream TCP.
  • gorilla/websocket auto-pongs inbound pings inside ReadMessage by default. A client ping sent to aetherlay was answered locally by aetherlay rather than reaching the upstream, so the client's ping/pong loop only proved the client↔aetherlay leg was alive — it said nothing about aetherlay↔upstream.
  • The recent close-frame handling fixes (fix: Improve how websockets deal with timeouts and closed connections #33) only fire on graceful close. They cannot detect silent failure.

Approach

Two changes, both in internal/server/server.go:

  1. Forward WS control frames end-to-end. Override SetPingHandler / SetPongHandler on both legs so a client ping reaches the upstream RPC node and the upstream's pong reaches the client, payload bytes preserved. Aetherlay generates no pings of its own — it relies on whatever ping cadence the client (or upstream) chooses.
  2. Passive idle-timeout backstop on the upstream conn. SetReadDeadline(now + 90s) on the backend conn, reset on every received frame (data, ping, pong). If upstream sends nothing for 90s, ReadMessage returns a timeout, both legs are torn down, and the client receives a close. This catches the case where the client doesn't ping or where forwarded pings are absorbed somewhere upstream.

The teardown path explicitly marks the endpoint unhealthy on a net.Error.Timeout() from the backend leg (timeouts can only originate from the backend conn since the client conn has no deadline, so attribution is unambiguous).
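In gorilla/websocket terms, the two changes could look roughly like the following. This is an illustrative sketch, not the actual diff: `installControlForwarders` and the timeout vars are named after the PR description, but the exact wiring in internal/server/server.go may differ.

```go
package server

import (
	"time"

	"github.com/gorilla/websocket"
)

var (
	wsIdleTimeout         = 90 * time.Second
	wsControlWriteTimeout = 10 * time.Second
)

// installControlForwarders overrides gorilla's default handlers (which
// auto-pong locally inside ReadMessage) so control frames cross the proxy
// with their payload bytes preserved.
func installControlForwarders(src, dst *websocket.Conn) {
	src.SetPingHandler(func(appData string) error {
		// Relay the ping received on src to the other leg.
		return dst.WriteControl(websocket.PingMessage, []byte(appData),
			time.Now().Add(wsControlWriteTimeout))
	})
	src.SetPongHandler(func(appData string) error {
		return dst.WriteControl(websocket.PongMessage, []byte(appData),
			time.Now().Add(wsControlWriteTimeout))
	})
}

// bumpIdleDeadline is called whenever a frame (data, ping, pong) arrives
// from the backend; if the upstream then stays silent for wsIdleTimeout,
// ReadMessage returns a timeout and the proxy tears down both legs.
func bumpIdleDeadline(backend *websocket.Conn) error {
	return backend.SetReadDeadline(time.Now().Add(wsIdleTimeout))
}
```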

Verification

Tested ping/pong against every reachable upstream WSS endpoint in production (Alchemy, dRPC, QuickNode, Tempo, Solana). All replied within 100–355 ms with the original payload echoed, confirming forwarding works end-to-end (or at least well past the CDN edge).

Two new tests in internal/server/server_test.go:

  • TestProxyWebSocket_UpstreamIdleTimeout — silent upstream → client torn down within wsIdleTimeout (overridden to 300ms for the test) and endpoint marked unhealthy.
  • TestProxyWebSocket_PingPongForwarded — asserts a client ping reaches the upstream with the original payload and the upstream's pong reaches the client unchanged.

Full test suite + go vet clean.

Tunables

wsIdleTimeout = 90 * time.Second and wsControlWriteTimeout = 10 * time.Second are package-level vars so tests can override them. Hard-coded for now; can promote to env vars if needed.
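Because they are plain package-level vars, a test can shrink them with a save/restore pattern. A hypothetical helper (the actual tests may set the var directly):

```go
package server

import (
	"testing"
	"time"
)

var wsIdleTimeout = 90 * time.Second // production default

// withIdleTimeout temporarily shrinks the idle timeout for one test and
// restores the original value via t.Cleanup, so tests stay independent.
func withIdleTimeout(t *testing.T, d time.Duration) {
	t.Helper()
	old := wsIdleTimeout
	wsIdleTimeout = d
	t.Cleanup(func() { wsIdleTimeout = old })
}
```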

Test plan

  • CI passes
  • Deploy to staging and verify a long-running eth_subscribe survives a forced upstream stall (block source paused at the provider) and triggers client reconnect within ~90s

Summary by CodeRabbit

  • Bug Fixes

    • WebSocket ping/pong control frames are now properly forwarded between clients and servers.
    • Idle WebSocket connections are automatically closed after timeout.
    • Upstream servers that stop responding are detected and marked unhealthy.
  • Tests

    • Added regression tests for upstream idle timeout and ping/pong forwarding.

Review Change Stack

Long-lived WebSocket subscriptions (eth_subscribe newHeads) could stop
delivering messages while the proxied connection stayed open, leaving
clients with a valid socket and no signal to reconnect. The previous
proxy used a transparent byte-pipe with no read deadline and let gorilla
auto-pong inbound pings, so a half-open upstream TCP or a wedged
provider subscription dispatcher never surfaced as an error on either
leg.

This change:

- Forwards WS ping/pong frames end-to-end in both directions instead of
  auto-ponging at the proxy. A client ping reaches the upstream RPC
  node and the upstream's pong reaches the client, with payload bytes
  preserved for any client-side correlation.
- Sets a 90s read deadline on the upstream conn that resets on every
  received frame (data, ping, pong). If upstream stops sending anything
  for that long, ReadMessage returns a timeout, both legs are torn
  down, and the client receives a close so its existing reconnect
  logic can run.
- Marks the endpoint unhealthy on idle-timeout teardown, since the
  read deadline only fires on the backend conn so the timeout is
  unambiguously an upstream failure.

Tests added for the upstream-silence teardown path and for end-to-end
ping/pong forwarding (asserts payload round-trips, proving the proxy
is not auto-replying locally).
@coderabbitai

coderabbitai Bot commented May 15, 2026

Walkthrough

Adds WebSocket liveness detection and control-frame forwarding to the proxy path. Introduces configurable timeouts (wsIdleTimeout, wsWriteTimeout, wsControlWriteTimeout) and a wsBackendIdleError to mark backend read-deadline timeouts. Implements installControlForwarders to forward ping/pong frames and extends proxyWebSocketCopy with an optional bumpSrcDeadline behavior. defaultProxyWebSocket seeds backend deadlines, installs asymmetric control forwarders, proxies both directions concurrently, and treats backend idle timeouts as endpoint unhealthy. Two tests cover upstream idle timeout and ping/pong forwarding.

🚥 Pre-merge checks | ✅ 5 passed
  • Description Check — ✅ Passed (skipped; CodeRabbit's high-level summary is enabled)
  • Docstring Coverage — ✅ Passed (100.00%; required threshold 80.00%)
  • Linked Issues check — ✅ Passed (skipped; no linked issues found for this pull request)
  • Out of Scope Changes check — ✅ Passed (skipped; no linked issues found for this pull request)
  • Title check — ✅ Passed (the title accurately summarizes the main change: detecting WebSocket silence on proxied subscriptions, KEEP-569)



@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3


Inline comments:
In `@internal/server/server_test.go`:
- Around lines 644-650: websocket.DefaultDialer.Dial returns (*websocket.Conn, *http.Response, error), and the response body must be closed to avoid leaking resources. At the Dial call site that uses proxyWSURL, capture the second return value (resp) and, when resp != nil, ensure resp.Body.Close() is called — e.g. defer it after a successful dial, or close it immediately on dial error.
- Around lines 548-552: the *http.Response from websocket.DefaultDialer.Dial is currently discarded. Capture it (e.g. client, resp, err := websocket.DefaultDialer.Dial(...)) and, when resp != nil, defer resp.Body.Close() before returning or calling t.Fatalf on err, so the response body is always closed.
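The pattern both comments ask for is small. A sketch of a test-side dial helper (dialProxy is a hypothetical name; the PR's tests may inline this instead):

```go
package server_test

import (
	"testing"

	"github.com/gorilla/websocket"
)

// dialProxy dials the proxy's WS endpoint and guarantees the handshake
// http.Response body is closed on every path (golangci-lint bodyclose).
func dialProxy(t *testing.T, proxyWSURL string) *websocket.Conn {
	t.Helper()
	client, resp, err := websocket.DefaultDialer.Dial(proxyWSURL, nil)
	if resp != nil {
		// Close the handshake response body whether or not the dial succeeded.
		resp.Body.Close()
	}
	if err != nil {
		t.Fatalf("dial %s: %v", proxyWSURL, err)
	}
	t.Cleanup(func() { client.Close() })
	return client
}
```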

In `@internal/server/server.go`:
- Around lines 1249-1252: the code bumps src's read deadline (src.SetReadDeadline with wsIdleTimeout) and then performs an unbounded dst.WriteMessage, which can block long enough that the next read times out immediately after a slow write completes. Fix this in defaultProxyWebSocket by setting a separate, shorter write deadline on dst before WriteMessage — dst.SetWriteDeadline(time.Now().Add(wsWriteTimeout)) with wsWriteTimeout < wsIdleTimeout — and clearing or resetting it afterwards. When WriteMessage fails, treat write-timeout errors (net.Error with Timeout()) as downstream backpressure rather than marking the upstream unhealthy; only non-timeout write failures should count as endpoint failures.
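The ordering the comment describes — cap the write, then bump the read deadline only after a successful forward — could be sketched as below. Names (forwardFrame, wsWriteTimeout, bumpSrcDeadline) follow the review text and walkthrough; the body is illustrative, not the actual diff.

```go
package server

import (
	"net"
	"time"

	"github.com/gorilla/websocket"
)

var (
	wsIdleTimeout  = 90 * time.Second
	wsWriteTimeout = 30 * time.Second // must be < wsIdleTimeout
)

// forwardFrame relays one frame from src's read loop to dst, refreshing
// src's idle deadline only after the end-to-end forward succeeded.
func forwardFrame(src, dst *websocket.Conn, msgType int, payload []byte, bumpSrcDeadline bool) error {
	// Bound the write so a slow peer can't stall the proxy indefinitely.
	dst.SetWriteDeadline(time.Now().Add(wsWriteTimeout))
	if err := dst.WriteMessage(msgType, payload); err != nil {
		if ne, ok := err.(net.Error); ok && ne.Timeout() {
			// Downstream backpressure, not an upstream failure: the caller
			// should tear down without marking the endpoint unhealthy.
		}
		return err
	}
	if bumpSrcDeadline {
		// Semantics: "no upstream activity since the last successful
		// end-to-end forward," which avoids the false positive when the
		// downstream peer drains slowly.
		return src.SetReadDeadline(time.Now().Add(wsIdleTimeout))
	}
	return nil
}
```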

📥 Commits

Reviewing files that changed from the base of the PR and between 59f3d52 and 7e59ce4.

📒 Files selected for processing (2)
  • internal/server/server.go
  • internal/server/server_test.go

Addresses CodeRabbit feedback on PR project-aethermesh#35.

Without a write deadline, dst.WriteMessage on the backend leg could block
arbitrarily long when the client was slow draining its socket. Bumping the
src read deadline before the write meant the next ReadMessage would time
out instantly after a slow write completed, marking a healthy upstream
unhealthy.

Three changes in proxyWebSocketCopy / defaultProxyWebSocket:

- Cap each forwarded write with wsWriteTimeout (30s) via SetWriteDeadline,
  so a slow peer can't stall the proxy.
- Move the src read-deadline bump to AFTER the write succeeds. The
  semantic becomes "no upstream activity since the last successful
  end-to-end forward," which removes the false positive when downstream
  is slow.
- Tag backend-leg read-deadline timeouts with a wsBackendIdleError
  sentinel. Only this error marks the upstream endpoint unhealthy. Other
  timeouts (write deadlines, control-frame forwarding failures, client
  leg read errors) fall through to the existing isExpectedWSClose path
  and don't cause endpoint health changes.

Also closed *http.Response bodies returned from websocket.DefaultDialer.Dial
in the two new tests (golangci-lint bodyclose).
@suisuss
Author

suisuss commented May 15, 2026

@sanbotto please review

@suisuss suisuss changed the title fix: Detect upstream WebSocket silence on proxied subscriptions fix: KEEP-569 Detect upstream WebSocket silence on proxied subscriptions May 15, 2026
