Skip to content

Fix silent stalls in native_subscribe WS subscriptions (Solana, eth pending tx)#827

Open
a10zn8 wants to merge 2 commits into
masterfrom
fix-solana-ws
Open

Fix silent stalls in native_subscribe WS subscriptions (Solana, eth pending tx)#827
a10zn8 wants to merge 2 commits into
masterfrom
fix-solana-ws

Conversation

@a10zn8
Copy link
Copy Markdown
Contributor

@a10zn8 a10zn8 commented May 12, 2026

Summary

  • Surface idle WS subscription stalls as TimeoutException (instead of silent completion) so the wrapping DurableFlux re-issues subscribe with a fresh subId.
  • Stop swallowing subscription errors with onErrorResume { Mono.empty() } — let them propagate to DurableFlux, which already handles reconnect with exponential backoff.
  • Applied to both subscription paths: GenericSubscriptionConnect (Solana slotSubscribe, plus all chains that go through GenericIngressSubscription: Polkadot, Ripple, etc.) and WebsocketPendingTxes (Ethereum newPendingTransactions).

Root cause

Solana RPC nodes commonly stop delivering events on a specific subscription while keeping the WS TCP/frame layer fully alive (no ping/pong on our side either — WsConnectionImpl runs with handlePing(false) and no idle handler). The previous code in GenericSubscriptionConnect.createConnection() had:

.timeout(Duration.ofSeconds(85), Mono.empty<ByteArray>().doOnEach { log.warn(...) })
.onErrorResume { log.error(...); Mono.empty() }

When the upstream stalled for 85 s:

  1. .timeout(d, Mono.empty()) switched to an empty fallback → Flux completed (no error).
  2. DurableFlux.connect() only retries on onError, not onComplete → no resubscribe.
  3. SharedFluxHolder propagated onComplete to all downstream subscribers and cleared current.
  4. The 30 s heartbeat in NativeSubscribe.nativeSubscribe kept firing → the client connection looked healthy but never received another data event.

WebsocketPendingTxes.createConnection() had the identical anti-pattern.

The healthy reference is GenericWsHead.listenNewHeads() which already uses .timeout(d, Mono.error(...)) and resubscribes on the error.

Test plan

  • GenericSubscriptionConnectTest.emits TimeoutException when upstream goes silent past idle timeout — virtual-time test pinning the new error signal.
  • GenericSubscriptionConnectTest.idle timeout triggers resubscribe via DurableFlux retry — end-to-end: first subscription stalls (Flux.never), virtual time advances 90 s, second subscribe() delivers a "recovered" event; verifies subscribe(...) was called twice.
  • WebsocketPendingTxesSpec.Emits TimeoutException when upstream goes silent past idle timeout — same pattern for the Ethereum pending-tx path.
  • Existing Produces values test still passes (happy path unchanged).
  • ./gradlew compileKotlin compileTestKotlin clean.

Follow-up (out of scope)

WsConnectionImpl has no WS-level ping/idle handler, so a half-open TCP connection still relies on OS TCP timeouts (~minutes). With this PR, the subscription layer at least heals on its own; the connection layer is a separate hardening task.

🤖 Generated with Claude Code

a10zn8 and others added 2 commits May 12, 2026 17:13
…ing tx)

When an upstream WebSocket subscription stops emitting events while the
TCP/WS connection stays alive (common on Solana RPCs), the subscription
Flux silently completed: `.timeout(85s, Mono.empty())` switched to an
empty fallback (no error), and `.onErrorResume { Mono.empty() }` swallowed
any real errors. The wrapping DurableFlux only reconnects on `onError`,
so no resubscribe happened; clients kept receiving heartbeats from
NativeSubscribe but no data.

Surface stalls as TimeoutException and let real errors propagate so
DurableFlux re-invokes createConnection() and issues a fresh subscribe
with a new subId. Fixes both:
- GenericSubscriptionConnect (Solana / Polkadot / Ripple / generic WS subs)
- WebsocketPendingTxes (Ethereum newPendingTransactions)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant