fix(storage_handler): honor QueryTimeout at the send<-msg boundary#76
Merged
Merged
Conversation
Before this change, a REQ could stall for hours despite a configured
QueryTimeout: production logs on salmon showed subscriptions with
send_wait_ms > 3h while conn.Write stayed at ~5 ms per call and
ping/pong continued to round-trip. Traced to a context scoping bug.
The prior implementation wrapped ctx.WithTimeout inside handleReq's
iter.Seq closure, so the timeout ctx was *local* to that closure. The
closure yielded events to SimpleHandlerBase's ServeNostr, which
forwarded them to send<-msg under the *outer* (un-timed) ctx. Once a
receiver stalled, the outer ctx had no deadline and the inner scan
loop's ctx.Err check never ran again (it's only polled at event
boundaries, and yield blocks before the next event is fetched). Result:
QueryTimeout only covered Query's scan phase, not the send-wait phase
that the option was specifically introduced to address.
This commit reimplements StorageHandler as a direct Handler (not
SimpleHandlerBase) so it can own its ServeNostr loop and attach
queryCtx to the exact send sites that can stall. The REQ / COUNT flow
derives queryCtx = ctx.WithTimeout(ctx, QueryTimeout) and threads it
through:
- storage.Query(queryCtx, filters) — scan respects the deadline.
- select { send<-msg; <-queryCtx.Done() } for every event emission.
- queryCtx.Err() polled at the top of each iteration.
- terminal CLOSED emitted on the outer ctx so it can outlive queryCtx.
When queryCtx fires with the outer ctx still alive, the REQ is aborted
with CLOSED "error: query timeout" as the docs already promised. When
the outer ctx is cancelled (connection ending), we surface that error
instead so ServeNostr returns cleanly.
Adds TestStorageHandler_QueryTimeout_ReqAbortsOnStalledConsumer which
reproduces the live-but-slow-client case via an unbuffered send: the
handler delivers one event, stalls on the second send; the test
advances the fake clock past QueryTimeout and asserts that the next
message is CLOSED, not another EVENT. Verified failing on main (prior
to this commit) and passing now.
Side effect: StorageHandler now implements Handler directly rather than
routing through SimpleHandlerBase. No public API change — NewStorageHandler
still returns Handler — but the internal wrapping is gone.
Follows up on #73 (QueryTimeout introduction).
🐛 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Sakura 🌸 <noreply@anthropic.com>
…ock spring Documents the "deadlock spring" shape applied to storageHandler. Since #76 reimplemented StorageHandler as a direct Handler (no longer wrapped by SimpleHandlerBase), the CLAUDE.md rule introduced in #78 — "Handlers that implement Handler directly and consume iter.Seq responses must follow the same shape" — applies verbatim: a blocked send<-msg inside handleReq's yield loop must not stop recv drain, otherwise an upstream MergeHandler (childRecvs cap 10) wedges until BroadcastTimeout retires the child. The new test pushes a REQ with two stored events into a handler whose send channel has only one consumer; the second yield blocks on send<-msg. A second client message is then pushed from a helper goroutine and the test asserts that the push completes — i.e. that recv drain continues even while the main loop is parked on the first response's stalled send. Currently FAILS: storageHandler shares one goroutine for recv read and yield forwarding, so drain stops the moment yield blocks. The next commit restructures ServeNostr to mirror simpleHandler's drain-goroutine + cap-10 internal queue and this test should pass as-written. 🐛 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Sakura 🌸 <noreply@anthropic.com>
storageHandler implements Handler directly (since #76) and forwards REQ responses through handleReq's iter.Seq yield loop on the same goroutine that reads recv. A blocked send<-msg therefore stops recv drain entirely. Under an upstream MergeHandler (childRecvs cap 10) this wedges into the "deadlock spring" shape recorded in CLAUDE.md and the diary entry for 2026-04-24: childSends fills, yield blocks, recv stops draining, childRecvs fills, broadcastAll trips BroadcastTimeout, child retired. QueryTimeout (#76) bounds the worst case but does not eliminate it — the symptom still occurs, scaled to the timeout duration, every time a slow client trips the deadline. simpleHandler addressed the same problem in #78. The CLAUDE.md rule "Handlers that implement Handler directly and consume iter.Seq responses must follow the same shape" applies verbatim here. Fix: spawn a dedicated recv-drain goroutine that forwards recv into an internal cap-10 queue (storageHandlerMsgQueueBuffer, mirroring simpleHandlerMsgQueueBuffer and MergeHandler.childRecvs); the main loop consumes from msgQueue, dispatches to handleEvent / handleReq / handleCount, and forwards responses. A blocked yield no longer stops recv drain. Lifecycle: - drain goroutine exits on drainCtx cancel or recv close; recv close closes msgQueue, which drives the main loop to return nil. - main loop exits on ctx cancel, msgQueue close, or any handle* error. - drainCtx is a sub-ctx of the connection ctx with `defer drainCancel()` at the handler scope, so the drain goroutine is reliably reaped on every main-loop return path without depending on caller-side cancellation — matching simpleHandler's self-contained shape (the one tightened in #78 review feedback). Cap 10 is mirrored from simpleHandlerMsgQueueBuffer and MergeHandler.childRecvs so all three layers share the same backpressure budget; only ≥1 is required for correctness. A sustained slow consumer is still backstopped by Relay's 10s write timeout long before cap 10 matters in practice. Now passes TestStorageHandler_DrainsRecvWhileYieldBlocks (added in the preceding commit). All existing StorageHandler tests, the full suite, and `go test -race ./...` continue to pass. 🐛 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Sakura 🌸 <noreply@anthropic.com>
ea131db to
225b560
Compare
…ted semantics Spirits review feedback (🦍 + 🍍 independently): Storage.Query's errFn contract does not promise idempotence across calls. handleReq and handleCount were each invoking errFn() twice — once inside the abort-decision predicate (errors.Is(errFn(), context.DeadlineExceeded)) and once again in the post-loop `if err := errFn(); err != nil` branch. InMemory and Pebble happen to be idempotent today, but a future Storage implementation that surfaces transient errors only on the first call would silently break. Capture the result once into queryErr := errFn() and use it for both the predicate and the subsequent log/branch. No behaviour change for the existing implementations; the contract surface is now defensive against future ones. Spirits review feedback (🦍, low priority): Document the `completed` flag's meaning in maybeLogSlowReq's final call site. The flag is `err == nil` on the EOSE send, so it tracks "the relay finished the subscription cleanly" — not "the client received EOSE", which can't be observed from the relay side. An EOSE that fails because the outer ctx was cancelled mid-send is logged with completed=false, matching the abort path. Tests: TestStorageHandler / full suite / `go test -race ./...` all clean. 🦍🍍 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Sakura 🌸 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
send<-msgboundary, not just during scan.StorageHandleras a directHandler(no longer wrapped bySimpleHandlerBase) so the timeout ctx can cover the exact send sites that stall.send<-msgcan no longer stop recv drain entirely. Without this, the new direct-Handlerform would violate the rule and re-introduce the "deadlock spring" shape for the duration ofQueryTimeoutevery time it fires.TestStorageHandler_QueryTimeout_ReqAbortsOnStalledConsumer— the original regression test (failing onmain, passing here).TestStorageHandler_DrainsRecvWhileYieldBlocks— new recv-drain test, mirrorsTestSimpleHandler_DrainsRecvWhileYieldBlocksfrom fix: decouple recv drain from response forwarding in simpleHandler #78.Why (QueryTimeout scope)
Production logs on salmon (over a recent 24 h window) showed REQ subscriptions with
send_wait_ms > 3 hwhile:conn.Writestayed at ~5 ms per call (no write_timeout metric trips),ping_timeoutnever fired (client kept round-tripping pongs),io.EOF(client-initiated TCP close).That is the exact "live but slow client" case #73 introduced QueryTimeout for — and yet REQs were surviving 19× the configured 10 min timeout. Tracing the flow showed why.
Bug 1: QueryTimeout scope
In the prior implementation,
handleReqwrapped the timeout inside itsiter.Seqclosure:The closure-local
ctxonly reachesstorage.Query. The yield itself — which forwards throughSimpleHandlerBase.ServeNostr'ssend<-msg— uses the outer connection ctx, which has no deadline. Once a receiver stalls, the innerctx.Errcheck never re-runs because yield blocks before the next iteration starts. Result: QueryTimeout covered only the scan phase, not the send-wait phase it was specifically introduced to address.Fix
Reimplement
StorageHandleras a directHandlerso it can own itsServeNostrloop and attachqueryCtxto the exact send sites that can stall. Key points:queryCtx := ctx.WithTimeout(ctx, QueryTimeout)derived once per REQ / COUNT.storage.Query(queryCtx, filters)— scan still respects the deadline.select { send<-msg; <-queryCtx.Done() }.queryCtx.Err()polled at the top of each iteration (fast-path abort).CLOSEDemitted on the outer ctx so it can outlivequeryCtx.ctx.Err()throughServeNostr, same as before.Bug 2: iter.Seq recv-drain — surfaced by #78
While Bug 1 was on the wire as a draft, #78 landed on
mainwith a new CLAUDE.md rule:Bug 1's fix moves
StorageHandlerfromSimpleHandlerBaseto a directHandler— exactly the case the rule targets. After rebasing on main I confirmed it: the newstorageHandler.ServeNostrreadsrecvand forwards REQ responses throughhandleReq's yield loop on the same goroutine, so a stalled receiver halts recv drain entirely. Under an upstreamMergeHandler(childRecvs cap 10) this wedges into the deadlock-spring shape recorded in the diary for 2026-04-24.The interaction with Bug 1 is the operationally important part: without this second fix, every
QueryTimeoutfiring would produce the spring shape forQueryTimeoutms before the abort kicks in. For the salmon configuration that is up to 10 minutes ofMergeHandlerbroadcast wedging per slow REQ — a strict regression compared to the pre-#76 behaviour.Fix
Port simpleHandler's pattern (#78) verbatim:
storageHandlerMsgQueueBuffer, mirroringsimpleHandlerMsgQueueBufferandMergeHandler.childRecvs— all three layers share the same backpressure budget; only ≥1 is required for correctness).msgQueue, dispatches tohandleEvent/handleReq/handleCount, forwards responses onsend.drainCtxis a sub-ctx of the connection ctx withdefer drainCancel()at handler scope, so the drain goroutine is reliably reaped on every main-loop return path without depending on caller-side cancellation — matches the self-contained shape the fix: decouple recv drain from response forwarding in simpleHandler #78 review feedback tightened.Public API impact
None visible.
NewStorageHandlerstill returnsHandler. The internal wrapping viaSimpleHandler(&storageHandler{...})is gone; now it returns&storageHandler{...}directly. Anything type-asserting on the internal shape was already relying on unexported implementation.Tests
TestStorageHandler_QueryTimeout_ReqAbortsOnStalledConsumerUnbuffered send channel → handler delivers 1 EVENT, stalls on the 2nd send → test advances synctest's fake clock past QueryTimeout → asserts the next message is
CLOSED "error: query timeout", not another EVENT.main(before this change): test fails with the 2nd EVENT being delivered (handler ignores the timeout at the send boundary).TestStorageHandler_DrainsRecvWhileYieldBlocks(new)Direct mirror of
TestSimpleHandler_DrainsRecvWhileYieldBlocksfrom #78, applied to storageHandler. Pushes a REQ with two stored events, drains one EVENT to park the main loop on the second yield'ssend<-msg, then pushes a second client message from a helper goroutine and asserts the push completes — i.e. recv drain continues even while yield is stalled.All existing StorageHandler tests — including the existing QueryTimeout tests that exercise the scan path — continue to pass. Full suite +
go test -race ./...clean.Out of scope (but discovered during investigation)
go test -raceflags a pre-existing DATA RACE in MergeHandler / metrics tests (TestMergeHandler_Broadcast_REQTimeoutAdvances,TestMergeHandler_Broadcast_OKForcedFalseOnTimeout,TestMetrics_MergeHandler_LostChildAndBroadcastTimeout). Originally reproducible onmainas of 6d25455 — not introduced by this PR. Worth a separate ticket. (Re-verifying against the currentmainhead after the post-fix(storage_handler): honor QueryTimeout at the send<-msg boundary #76 churn is itself a follow-up.)Follow-ups
completed=falseat ~QueryTimeout (expected: yes) rather than scaling with receiver stall time, and whetherMergeHandlerbroadcast-timeout logs no longer correlate with slow REQs (expected: no correlation, since recv drain now survives independently of yield state).🐛 Generated with Claude Code