fix(relay): stop cloning whole slots under the lock + leaking dead SSE subs #347

Merged
laulpogan merged 2 commits into main from fix/relay-saturation-root
Jun 20, 2026
@laulpogan

Fixes the actual root cause of the read-path saturation that #342 only mitigated (by raising the fly concurrency ceiling). Three relay_server.rs contention/leak fixes from the bug-hunt's relay-locks + sse-stream dimensions.

⚠️ Deploys to the live wireup.net relay on merge (via fly-deploy.yml).

#2 — HIGH: list_events cloned the whole slot under the global lock

inner.slots.get(&slot_id).cloned() deep-cloned the entire slot Vec (bounded only by MAX_SLOT_BYTES = 64 MiB) under the single global mutex on every pull, then sliced. Under the concurrent ?limit=1000 pulls that caused the outage, each pull serialized a multi-MB memcpy under the lock that post_event and every other handler block on. Now borrows the slot, clones only the [start..end] window (≤ limit ≤ 1000), and drops the lock before serializing.
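
A minimal sketch of the borrow-then-clone-the-window pattern described above, not the actual relay_server.rs code. The `Event` and `Inner` types, the `list_window` name, the `start` offset parameter, and the `MAX_LIMIT` constant are all assumptions made for illustration; only the `slots` map and the 1000-event cap come from the description.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical stand-ins for the relay's event and state types.
#[derive(Clone, Debug, PartialEq)]
struct Event {
    seq: u64,
    body: String,
}

struct Inner {
    slots: HashMap<String, Vec<Event>>,
}

// Assumed name for the per-pull cap; the description only states limit <= 1000.
const MAX_LIMIT: usize = 1000;

/// Returns at most `limit` events starting at offset `start`.
/// The mutex is held only while the window is copied, never for the whole slot.
fn list_window(
    state: &Mutex<Inner>,
    slot_id: &str,
    start: usize,
    limit: usize,
) -> Option<Vec<Event>> {
    let limit = limit.min(MAX_LIMIT);
    let window = {
        let inner = state.lock().unwrap();
        // Borrow the slot instead of `.cloned()`-ing the whole Vec.
        let slot = inner.slots.get(slot_id)?;
        let start = start.min(slot.len());
        let end = start.saturating_add(limit).min(slot.len());
        // Clone only the [start..end] window.
        slot[start..end].to_vec()
    }; // MutexGuard dropped here, before any serialization work.
    Some(window)
}
```

With this shape the work done under the lock scales with `limit` rather than with the slot size, so a large slot no longer stalls writers for the duration of a full copy.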

#4 — disconnected SSE subscribers leaked

post_event prunes dead senders only lazily on its next broadcast → a slot that goes silent kept dead senders forever (and over-counted them against MAX_STREAMS_PER_SLOT). stream_events now prunes at admission via retain_live_subscribers (tx.is_closed()).
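
A rough sketch of prune-at-admission, assuming a stand-in `Subscriber` type rather than the relay's real channel sender. Here "closed" is modelled with a `Weak<()>` whose `Arc` is held by the receiving side; `try_admit` and the `cap` parameter (standing in for MAX_STREAMS_PER_SLOT, whose value the description does not give) are hypothetical.

```rust
use std::sync::{Arc, Weak};

// Stand-in for a channel sender. The real code calls `tx.is_closed()` on its
// sender type; here liveness is modelled by whether the receiver's Arc exists.
struct Subscriber {
    receiver_alive: Weak<()>,
}

impl Subscriber {
    fn is_closed(&self) -> bool {
        self.receiver_alive.strong_count() == 0
    }
}

/// Drops every subscriber whose receiving side has gone away.
fn retain_live_subscribers(subs: &mut Vec<Subscriber>) {
    subs.retain(|tx| !tx.is_closed());
}

/// Hypothetical admission check: prune first, so dead senders are not
/// counted against the cap, then admit if there is room.
/// Returns the token the receiver must keep alive, or None if the slot is full.
fn try_admit(subs: &mut Vec<Subscriber>, cap: usize) -> Option<Arc<()>> {
    retain_live_subscribers(subs);
    if subs.len() >= cap {
        return None;
    }
    let token = Arc::new(());
    subs.push(Subscriber {
        receiver_alive: Arc::downgrade(&token),
    });
    Some(token)
}
```

Pruning on the admission path means a silent slot is cleaned up by the next connect attempt instead of waiting for a broadcast that may never come.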

#13 — empty per-slot streams keys never removed

One map key accumulated per ever-streamed slot. post_event + stream_events now drop the key when it empties.
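
One way the empty-key cleanup could look, as a sketch under assumptions: the `streams` map name comes from the description, while `prune_slot_streams`, its generic sender type `T`, and the `is_closed` predicate are hypothetical.

```rust
use std::collections::HashMap;

/// Prunes closed senders for one slot and removes the map key if nothing is left,
/// so the map does not keep one entry per ever-streamed slot.
fn prune_slot_streams<T>(
    streams: &mut HashMap<String, Vec<T>>,
    slot_id: &str,
    is_closed: impl Fn(&T) -> bool,
) {
    if let Some(subs) = streams.get_mut(slot_id) {
        subs.retain(|tx| !is_closed(tx));
        if subs.is_empty() {
            // Drop the key itself, not just the Vec's contents.
            streams.remove(slot_id);
        }
    }
}
```

Calling the same helper from both the post and stream paths would keep the two cleanup sites consistent.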

Deferred

#8 (broadcast cloning per-subscriber under the lock) — the safe fix needs an Arc channel-type change / lock dance with event-loss edge cases, and it's dominated by #2. Tracked separately.

Verify

Test: retain_live_subscribers_drops_disconnected. The 21 relay integration tests (list_events round-trips + SSE) confirm #2 is behaviour-preserving. cargo test --lib → 607 tests; fmt + clippy clean.

🤖 Generated with Claude Code

…E subs

The actual root cause of the read-path saturation #342 mitigated by raising the
fly concurrency ceiling. Three relay_server.rs contention/leak fixes from the
bug-hunt's relay-locks + sse-stream dimensions.

#2 (HIGH) list_events `.cloned()`'d the ENTIRE slot Vec (bounded only by
MAX_SLOT_BYTES = 64 MiB) under the single global mutex on every pull, then
sliced. Under the concurrent `?limit=1000` pulls that caused the outage, each
pull serialized a multi-MB memcpy under the lock that post_event and every other
handler contend on. Now borrows the slot, clones ONLY the `[start..end]` window
(<= limit <= 1000 events), and drops the lock before serializing. unix-now is
also computed before taking the lock.

#4 disconnected SSE subscribers leaked: post_event prunes dead senders only
lazily on its next broadcast, so a slot that goes silent kept dead senders
forever (and over-counted them against MAX_STREAMS_PER_SLOT). stream_events now
prunes at admission via `retain_live_subscribers` (tx.is_closed()).

#13 empty per-slot Vecs in `streams` were never removed → one map key per
ever-streamed slot. post_event + stream_events now drop the key when it empties.

(#8 — broadcast cloning per-subscriber under the lock — deferred: the safe fix
needs an Arc channel-type change / lock dance with event-loss edges, and it's
dominated by #2.)

Test `retain_live_subscribers_drops_disconnected`; the 21 relay integration
tests (list_events round-trips + SSE) confirm #2 is behaviour-preserving.
607 lib tests; fmt + clippy clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

cloudflare-workers-and-pages Bot commented Jun 20, 2026

Deploying wireup-landing with Cloudflare Pages

Latest commit: d6729ee
Status: ✅  Deploy successful!
Preview URL: https://68bd302a.wireup-landing.pages.dev
Branch Preview URL: https://fix-relay-saturation-root.wireup-landing.pages.dev

@laulpogan laulpogan merged commit 8904fe3 into main Jun 20, 2026
13 checks passed
@laulpogan laulpogan deleted the fix/relay-saturation-root branch June 20, 2026 02:13