fix(relay): stop cloning whole slots under the lock + leaking dead SSE subs #347
Merged
…E subs

The actual root cause of the read-path saturation that #342 mitigated by raising the fly concurrency ceiling. Three relay_server.rs contention/leak fixes from the bug-hunt's relay-locks + sse-stream dimensions.

#2 (HIGH) list_events `.cloned()`'d the ENTIRE slot Vec (bounded only by MAX_SLOT_BYTES = 64 MiB) under the single global mutex on every pull, then sliced. Under the concurrent `?limit=1000` pulls that caused the outage, each pull performed a multi-MB memcpy while holding the lock that post_event and every other handler contend on. Now borrows the slot, clones ONLY the `[start..end]` window (<= limit <= 1000 events), and drops the lock before serializing. unix-now is also computed before taking the lock.

#4 disconnected SSE subscribers leaked: post_event prunes dead senders only lazily on its next broadcast, so a slot that goes silent kept dead senders forever (and over-counted them against MAX_STREAMS_PER_SLOT). stream_events now prunes at admission via `retain_live_subscribers` (tx.is_closed()).

#13 empty per-slot Vecs in `streams` were never removed → one map key per ever-streamed slot. post_event + stream_events now drop the key when it empties.

(#8, broadcast cloning per-subscriber under the lock, is deferred: the safe fix needs an Arc channel-type change / lock dance with event-loss edges, and it's dominated by #2.)

Test `retain_live_subscribers_drops_disconnected`; the 21 relay integration tests (list_events round-trips + SSE) confirm #2 is behaviour-preserving. 607 lib tests; fmt + clippy clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
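The #2 change can be illustrated with a minimal sketch. This is not the code from relay_server.rs: `Inner`, `SlotId`, `Event`, `MAX_LIMIT`, and `list_window` are hypothetical stand-ins, and the real handler's types and signature may differ. It only shows the shape of the fix: borrow the slot under the mutex, copy just the requested window, and release the lock before any further work.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical stand-ins; the real relay types are not shown in this PR text.
type SlotId = u64;
type Event = Vec<u8>;

struct Inner {
    slots: HashMap<SlotId, Vec<Event>>,
}

// Assumed cap on events per pull, mirroring the `limit <= 1000` bound above.
const MAX_LIMIT: usize = 1000;

/// Copy at most `limit` events starting at `start`, holding the lock only
/// for the duration of that bounded copy.
fn list_window(state: &Mutex<Inner>, slot_id: SlotId, start: usize, limit: usize) -> Vec<Event> {
    let limit = limit.min(MAX_LIMIT);

    let guard = state.lock().unwrap();
    let window = match guard.slots.get(&slot_id) {
        Some(slot) => {
            // Clamp both ends so an out-of-range request yields an empty window.
            let start = start.min(slot.len());
            let end = start.saturating_add(limit).min(slot.len());
            // Clones only the window, never the whole slot.
            slot[start..end].to_vec()
        }
        None => Vec::new(),
    };
    // Release the lock before the caller serializes the response.
    drop(guard);

    window
}
```

With this shape the time spent holding the lock scales with `limit` rather than with the slot's total size, which is the property the fix relies on.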
Deploying wireup-landing with
| Latest commit: | d6729ee |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://68bd302a.wireup-landing.pages.dev |
| Branch Preview URL: | https://fix-relay-saturation-root.wireup-landing.pages.dev |
This was referenced Jun 27, 2026
Fixes the actual root cause of the read-path saturation that #342 only mitigated (by raising the fly concurrency ceiling). Three `relay_server.rs` contention/leak fixes from the bug-hunt relay-locks + sse-stream dimensions.

#2 (HIGH): `list_events` cloned the whole slot under the global lock

`inner.slots.get(&slot_id).cloned()` deep-cloned the entire slot Vec (bounded only by `MAX_SLOT_BYTES` = 64 MiB) under the single global mutex on every pull, then sliced. Under the concurrent `?limit=1000` pulls that caused the outage, each pull performed a multi-MB memcpy while holding the lock that `post_event` and every other handler block on. Now borrows the slot, clones only the `[start..end]` window (≤ limit ≤ 1000), and drops the lock before serializing.

#4: disconnected SSE subscribers leaked

`post_event` prunes dead senders only lazily on its next broadcast, so a slot that goes silent kept dead senders forever (and over-counted them against `MAX_STREAMS_PER_SLOT`). `stream_events` now prunes at admission via `retain_live_subscribers` (`tx.is_closed()`).

#13: empty per-slot `streams` keys never removed

One map key accumulated per ever-streamed slot. `post_event` + `stream_events` now drop the key when it empties.

Deferred

#8 (broadcast cloning per-subscriber under the lock): the safe fix needs an `Arc` channel-type change / lock dance with event-loss edge cases, and it's dominated by #2. Tracked separately.

Verify

`retain_live_subscribers_drops_disconnected`; the 21 relay integration tests (list_events round-trips + SSE) confirm #2 is behaviour-preserving. `cargo test --lib` → 607; fmt + clippy clean.

🤖 Generated with Claude Code
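For readers following #4 and #13, here is a rough, std-only sketch of pruning at admission and dropping the emptied map key. Only the name `retain_live_subscribers` and the `tx.is_closed()` check come from this PR. Everything else is an assumption for illustration: the `Tx`/`Rx` pair is a hypothetical stand-in for the relay's real channel type, and `prune_slot`, `SlotId`, and all signatures are invented here.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Weak};

// Hypothetical stand-in for a channel sender: "closed" means the receiver
// side has been dropped. The relay's actual channel type is not shown here.
struct Tx {
    alive: Weak<()>,
}

impl Tx {
    fn is_closed(&self) -> bool {
        self.alive.strong_count() == 0
    }
}

// Hypothetical receiver handle; dropping it marks the paired `Tx` as closed.
struct Rx {
    _alive: Arc<()>,
}

fn channel() -> (Tx, Rx) {
    let alive = Arc::new(());
    (Tx { alive: Arc::downgrade(&alive) }, Rx { _alive: alive })
}

type SlotId = u64;

/// Keep only senders whose receiver is still connected (assumed signature).
fn retain_live_subscribers(subs: &mut Vec<Tx>) {
    subs.retain(|tx| !tx.is_closed());
}

/// Prune one slot's subscribers and remove the map key if none remain.
/// Returns the number of live subscribers left for that slot.
fn prune_slot(streams: &mut HashMap<SlotId, Vec<Tx>>, slot_id: SlotId) -> usize {
    let live = match streams.get_mut(&slot_id) {
        Some(subs) => {
            retain_live_subscribers(subs);
            subs.len()
        }
        None => return 0,
    };
    if live == 0 {
        // Avoid accumulating one empty Vec per ever-streamed slot.
        streams.remove(&slot_id);
    }
    live
}
```

In a sketch like this, an admission path would call `prune_slot` before comparing the live count against `MAX_STREAMS_PER_SLOT`, so disconnected clients no longer count toward the cap.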