fix(storage_handler): honor QueryTimeout at the send<-msg boundary by high-moctane · Pull Request #76 · high-moctane/mocrelay

high-moctane · 2026-04-21T14:05:15Z

Summary

Fix the QueryTimeout context-scoping bug so REQ / COUNT stalls can actually be aborted at the send<-msg boundary, not just during scan.
Reimplement StorageHandler as a direct Handler (no longer wrapped by SimpleHandlerBase) so the timeout ctx can cover the exact send sites that stall.
Adopt the iter.Seq recv-drain pattern recorded in CLAUDE.md (rule introduced by fix: decouple recv drain from response forwarding in simpleHandler #78): decouple recv drain from response forwarding so a stalled send<-msg can no longer stop recv drain entirely. Without this, the new direct-Handler form would violate the rule and re-introduce the "deadlock spring" shape for the duration of QueryTimeout every time it fires.
Tests:
- TestStorageHandler_QueryTimeout_ReqAbortsOnStalledConsumer — the original regression test (failing on main, passing here).
- TestStorageHandler_DrainsRecvWhileYieldBlocks — new recv-drain test, mirrors TestSimpleHandler_DrainsRecvWhileYieldBlocks from fix: decouple recv drain from response forwarding in simpleHandler #78.

Why (QueryTimeout scope)

Production logs on salmon (over a recent 24 h window) showed REQ subscriptions with send_wait_ms > 3 h while:

conn.Write stayed at ~5 ms per call (no write_timeout metric trips),
ping_timeout never fired (client kept round-tripping pongs),
the connection eventually closed with io.EOF (client-initiated TCP close).

That is the exact "live but slow client" case #73 introduced QueryTimeout for — and yet REQs were surviving 19× the configured 10 min timeout. Tracing the flow showed why.

Bug 1: QueryTimeout scope

In the prior implementation, handleReq wrapped the timeout inside its iter.Seq closure:

return func(yield func(*ServerMsg) bool) {
    if h.queryTimeout > 0 {
        ctx, cancel = context.WithTimeout(ctx, h.queryTimeout)
        defer cancel()
    }
    events, errFn, closeFn := h.storage.Query(ctx, msg.Filters)
    for event := range events {
        if ctx.Err() != nil { ... break }       // only polled here
        if !yieldTracked(NewServerEventMsg(...)) { return }
        eventsSent++
    }
}

The closure-local ctx only reaches storage.Query. The yield itself — which forwards through SimpleHandlerBase.ServeNostr's send<-msg — uses the outer connection ctx, which has no deadline. Once a receiver stalls, the inner ctx.Err check never re-runs because yield blocks before the next iteration starts. Result: QueryTimeout covered only the scan phase, not the send-wait phase it was specifically introduced to address.

Fix

Reimplement StorageHandler as a direct Handler so it can own its ServeNostr loop and attach queryCtx to the exact send sites that can stall. Key points:

queryCtx := ctx.WithTimeout(ctx, QueryTimeout) derived once per REQ / COUNT.
storage.Query(queryCtx, filters) — scan still respects the deadline.
Every event emission uses select { send<-msg; <-queryCtx.Done() }.
queryCtx.Err() polled at the top of each iteration (fast-path abort).
Terminal CLOSED emitted on the outer ctx so it can outlive queryCtx.
Outer ctx cancellation (connection ending) surfaces as ctx.Err() through ServeNostr, same as before.

Bug 2: iter.Seq recv-drain — surfaced by #78

While Bug 1 was on the wire as a draft, #78 landed on main with a new CLAUDE.md rule:

Handlers that implement Handler directly and consume iter.Seq responses must follow the same shape.

Bug 1's fix moves StorageHandler from SimpleHandlerBase to a direct Handler — exactly the case the rule targets. After rebasing on main I confirmed it: the new storageHandler.ServeNostr reads recv and forwards REQ responses through handleReq's yield loop on the same goroutine, so a stalled receiver halts recv drain entirely. Under an upstream MergeHandler (childRecvs cap 10) this wedges into the deadlock-spring shape recorded in the diary for 2026-04-24.

The interaction with Bug 1 is the operationally important part: without this second fix, every QueryTimeout firing would produce the spring shape for QueryTimeout ms before the abort kicks in. For the salmon configuration that is up to 10 minutes of MergeHandler broadcast wedging per slow REQ — a strict regression compared to the pre-#76 behaviour.

Fix

Port simpleHandler's pattern (#78) verbatim:

Spawn a dedicated recv-drain goroutine fed by a cap-10 internal queue (storageHandlerMsgQueueBuffer, mirroring simpleHandlerMsgQueueBuffer and MergeHandler.childRecvs — all three layers share the same backpressure budget; only ≥1 is required for correctness).
Main loop consumes from msgQueue, dispatches to handleEvent / handleReq / handleCount, forwards responses on send.
drainCtx is a sub-ctx of the connection ctx with defer drainCancel() at handler scope, so the drain goroutine is reliably reaped on every main-loop return path without depending on caller-side cancellation — matches the self-contained shape the fix: decouple recv drain from response forwarding in simpleHandler #78 review feedback tightened.

Public API impact

None visible. NewStorageHandler still returns Handler. The internal wrapping via SimpleHandler(&storageHandler{...}) is gone; now it returns &storageHandler{...} directly. Anything type-asserting on the internal shape was already relying on unexported implementation.

Tests

`TestStorageHandler_QueryTimeout_ReqAbortsOnStalledConsumer`

Unbuffered send channel → handler delivers 1 EVENT, stalls on the 2nd send → test advances synctest's fake clock past QueryTimeout → asserts the next message is CLOSED "error: query timeout", not another EVENT.

On main (before this change): test fails with the 2nd EVENT being delivered (handler ignores the timeout at the send boundary).
On this branch: test passes.

`TestStorageHandler_DrainsRecvWhileYieldBlocks` (new)

Direct mirror of TestSimpleHandler_DrainsRecvWhileYieldBlocks from #78, applied to storageHandler. Pushes a REQ with two stored events, drains one EVENT to park the main loop on the second yield's send<-msg, then pushes a second client message from a helper goroutine and asserts the push completes — i.e. recv drain continues even while yield is stalled.

On the post-rebase pre-fix commit: fails with the deadlock-spring message (recv drain stops the moment yield blocks).
On this branch: passes — drain goroutine keeps recv flowing independently of the main loop's yield state.

All existing StorageHandler tests — including the existing QueryTimeout tests that exercise the scan path — continue to pass. Full suite + go test -race ./... clean.

Out of scope (but discovered during investigation)

go test -race flags a pre-existing DATA RACE in MergeHandler / metrics tests (TestMergeHandler_Broadcast_REQTimeoutAdvances, TestMergeHandler_Broadcast_OKForcedFalseOnTimeout, TestMetrics_MergeHandler_LostChildAndBroadcastTimeout). Originally reproducible on main as of 6d25455 — not introduced by this PR. Worth a separate ticket. (Re-verifying against the current main head after the post-fix(storage_handler): honor QueryTimeout at the send<-msg boundary #76 churn is itself a follow-up.)
The salmon logs also show cases where connections stayed alive for 3 h despite the stall, even though ping/pong should have caught a dead receiver. A separate investigation is in progress using the block profile infrastructure deployed in moctane-mocrelay 67d36ad + mocvps e04bff7. That problem is orthogonal to the QueryTimeout bug fixed here.

Follow-ups

After merge + release, bump moctane-mocrelay's mocrelay dep to pick up both fixes.
Observe whether new slow-REQ logs show completed=false at ~QueryTimeout (expected: yes) rather than scaling with receiver stall time, and whether MergeHandler broadcast-timeout logs no longer correlate with slow REQs (expected: no correlation, since recv drain now survives independently of yield state).

🐛 Generated with Claude Code

Before this change, a REQ could stall for hours despite a configured QueryTimeout: production logs on salmon showed subscriptions with send_wait_ms > 3h while conn.Write stayed at ~5 ms per call and ping/pong continued to round-trip. Traced to a context scoping bug. The prior implementation wrapped ctx.WithTimeout inside handleReq's iter.Seq closure, so the timeout ctx was *local* to that closure. The closure yielded events to SimpleHandlerBase's ServeNostr, which forwarded them to send<-msg under the *outer* (un-timed) ctx. Once a receiver stalled, the outer ctx had no deadline and the inner scan loop's ctx.Err check never ran again (it's only polled at event boundaries, and yield blocks before the next event is fetched). Result: QueryTimeout only covered Query's scan phase, not the send-wait phase that the option was specifically introduced to address. This commit reimplements StorageHandler as a direct Handler (not SimpleHandlerBase) so it can own its ServeNostr loop and attach queryCtx to the exact send sites that can stall. The REQ / COUNT flow derives queryCtx = ctx.WithTimeout(ctx, QueryTimeout) and threads it through: - storage.Query(queryCtx, filters) — scan respects the deadline. - select { send<-msg; <-queryCtx.Done() } for every event emission. - queryCtx.Err() polled at the top of each iteration. - terminal CLOSED emitted on the outer ctx so it can outlive queryCtx. When queryCtx fires with the outer ctx still alive, the REQ is aborted with CLOSED "error: query timeout" as the docs already promised. When the outer ctx is cancelled (connection ending), we surface that error instead so ServeNostr returns cleanly. Adds TestStorageHandler_QueryTimeout_ReqAbortsOnStalledConsumer which reproduces the live-but-slow-client case via an unbuffered send: the handler delivers one event, stalls on the second send; the test advances the fake clock past QueryTimeout and asserts that the next message is CLOSED, not another EVENT. Verified failing on main (prior to this commit) and passing now. Side effect: StorageHandler now implements Handler directly rather than routing through SimpleHandlerBase. No public API change — NewStorageHandler still returns Handler — but the internal wrapping is gone. Follows up on #73 (QueryTimeout introduction). 🐛 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Sakura 🌸 <noreply@anthropic.com>

…ock spring Documents the "deadlock spring" shape applied to storageHandler. Since #76 reimplemented StorageHandler as a direct Handler (no longer wrapped by SimpleHandlerBase), the CLAUDE.md rule introduced in #78 — "Handlers that implement Handler directly and consume iter.Seq responses must follow the same shape" — applies verbatim: a blocked send<-msg inside handleReq's yield loop must not stop recv drain, otherwise an upstream MergeHandler (childRecvs cap 10) wedges until BroadcastTimeout retires the child. The new test pushes a REQ with two stored events into a handler whose send channel has only one consumer; the second yield blocks on send<-msg. A second client message is then pushed from a helper goroutine and the test asserts that the push completes — i.e. that recv drain continues even while the main loop is parked on the first response's stalled send. Currently FAILS: storageHandler shares one goroutine for recv read and yield forwarding, so drain stops the moment yield blocks. The next commit restructures ServeNostr to mirror simpleHandler's drain-goroutine + cap-10 internal queue and this test should pass as-written. 🐛 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Sakura 🌸 <noreply@anthropic.com>

storageHandler implements Handler directly (since #76) and forwards REQ responses through handleReq's iter.Seq yield loop on the same goroutine that reads recv. A blocked send<-msg therefore stops recv drain entirely. Under an upstream MergeHandler (childRecvs cap 10) this wedges into the "deadlock spring" shape recorded in CLAUDE.md and the diary entry for 2026-04-24: childSends fills, yield blocks, recv stops draining, childRecvs fills, broadcastAll trips BroadcastTimeout, child retired. QueryTimeout (#76) bounds the worst case but does not eliminate it — the symptom still occurs, scaled to the timeout duration, every time a slow client trips the deadline. simpleHandler addressed the same problem in #78. The CLAUDE.md rule "Handlers that implement Handler directly and consume iter.Seq responses must follow the same shape" applies verbatim here. Fix: spawn a dedicated recv-drain goroutine that forwards recv into an internal cap-10 queue (storageHandlerMsgQueueBuffer, mirroring simpleHandlerMsgQueueBuffer and MergeHandler.childRecvs); the main loop consumes from msgQueue, dispatches to handleEvent / handleReq / handleCount, and forwards responses. A blocked yield no longer stops recv drain. Lifecycle: - drain goroutine exits on drainCtx cancel or recv close; recv close closes msgQueue, which drives the main loop to return nil. - main loop exits on ctx cancel, msgQueue close, or any handle* error. - drainCtx is a sub-ctx of the connection ctx with `defer drainCancel()` at the handler scope, so the drain goroutine is reliably reaped on every main-loop return path without depending on caller-side cancellation — matching simpleHandler's self-contained shape (the one tightened in #78 review feedback). Cap 10 is mirrored from simpleHandlerMsgQueueBuffer and MergeHandler.childRecvs so all three layers share the same backpressure budget; only ≥1 is required for correctness. A sustained slow consumer is still backstopped by Relay's 10s write timeout long before cap 10 matters in practice. Now passes TestStorageHandler_DrainsRecvWhileYieldBlocks (added in the preceding commit). All existing StorageHandler tests, the full suite, and `go test -race ./...` continue to pass. 🐛 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Sakura 🌸 <noreply@anthropic.com>

…ted semantics Spirits review feedback (🦍 + 🍍 independently): Storage.Query's errFn contract does not promise idempotence across calls. handleReq and handleCount were each invoking errFn() twice — once inside the abort-decision predicate (errors.Is(errFn(), context.DeadlineExceeded)) and once again in the post-loop `if err := errFn(); err != nil` branch. InMemory and Pebble happen to be idempotent today, but a future Storage implementation that surfaces transient errors only on the first call would silently break. Capture the result once into queryErr := errFn() and use it for both the predicate and the subsequent log/branch. No behaviour change for the existing implementations; the contract surface is now defensive against future ones. Spirits review feedback (🦍, low priority): Document the `completed` flag's meaning in maybeLogSlowReq's final call site. The flag is `err == nil` on the EOSE send, so it tracks "the relay finished the subscription cleanly" — not "the client received EOSE", which can't be observed from the relay side. An EOSE that fails because the outer ctx was cancelled mid-send is logged with completed=false, matching the abort path. Tests: TestStorageHandler / full suite / `go test -race ./...` all clean. 🦍🍍 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Sakura 🌸 <noreply@anthropic.com>

high-moctane mentioned this pull request Apr 24, 2026

fix: decouple recv drain from response forwarding in simpleHandler #78

Merged

high-moctane and others added 3 commits April 26, 2026 20:26

high-moctane force-pushed the fix-storage-handler-query-timeout-scope branch from ea131db to 225b560 Compare April 26, 2026 11:36

high-moctane marked this pull request as ready for review April 26, 2026 11:48

high-moctane merged commit 86af7e2 into main Apr 26, 2026
2 checks passed

high-moctane deleted the fix-storage-handler-query-timeout-scope branch April 26, 2026 11:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(storage_handler): honor QueryTimeout at the send<-msg boundary#76

fix(storage_handler): honor QueryTimeout at the send<-msg boundary#76
high-moctane merged 4 commits into
mainfrom
fix-storage-handler-query-timeout-scope

high-moctane commented Apr 21, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

high-moctane commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why (QueryTimeout scope)

Bug 1: QueryTimeout scope

Fix

Bug 2: iter.Seq recv-drain — surfaced by #78

Fix

Public API impact

Tests

TestStorageHandler_QueryTimeout_ReqAbortsOnStalledConsumer

TestStorageHandler_DrainsRecvWhileYieldBlocks (new)

Out of scope (but discovered during investigation)

Follow-ups

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

high-moctane commented Apr 21, 2026 •

edited

Loading

`TestStorageHandler_QueryTimeout_ReqAbortsOnStalledConsumer`

`TestStorageHandler_DrainsRecvWhileYieldBlocks` (new)