Skip to content

fix(storage_handler): honor QueryTimeout at the send<-msg boundary#76

Merged
high-moctane merged 4 commits into
mainfrom
fix-storage-handler-query-timeout-scope
Apr 26, 2026
Merged

fix(storage_handler): honor QueryTimeout at the send<-msg boundary#76
high-moctane merged 4 commits into
mainfrom
fix-storage-handler-query-timeout-scope

Conversation

@high-moctane
Copy link
Copy Markdown
Owner

@high-moctane high-moctane commented Apr 21, 2026

Summary

  • Fix the QueryTimeout context-scoping bug so REQ / COUNT stalls can actually be aborted at the send<-msg boundary, not just during scan.
  • Reimplement StorageHandler as a direct Handler (no longer wrapped by SimpleHandlerBase) so the timeout ctx can cover the exact send sites that stall.
  • Adopt the iter.Seq recv-drain pattern recorded in CLAUDE.md (rule introduced by fix: decouple recv drain from response forwarding in simpleHandler #78): decouple recv drain from response forwarding so a stalled send<-msg can no longer stop recv drain entirely. Without this, the new direct-Handler form would violate the rule and re-introduce the "deadlock spring" shape for the duration of QueryTimeout every time it fires.
  • Tests:

Why (QueryTimeout scope)

Production logs on salmon (over a recent 24 h window) showed REQ subscriptions with send_wait_ms > 3 h while:

  • conn.Write stayed at ~5 ms per call (no write_timeout metric trips),
  • ping_timeout never fired (client kept round-tripping pongs),
  • the connection eventually closed with io.EOF (client-initiated TCP close).

That is the exact "live but slow client" case #73 introduced QueryTimeout for — and yet REQs were surviving 19× the configured 10 min timeout. Tracing the flow showed why.

Bug 1: QueryTimeout scope

In the prior implementation, handleReq wrapped the timeout inside its iter.Seq closure:

return func(yield func(*ServerMsg) bool) {
    if h.queryTimeout > 0 {
        ctx, cancel = context.WithTimeout(ctx, h.queryTimeout)
        defer cancel()
    }
    events, errFn, closeFn := h.storage.Query(ctx, msg.Filters)
    for event := range events {
        if ctx.Err() != nil { ... break }       // only polled here
        if !yieldTracked(NewServerEventMsg(...)) { return }
        eventsSent++
    }
}

The closure-local ctx only reaches storage.Query. The yield itself — which forwards through SimpleHandlerBase.ServeNostr's send<-msg — uses the outer connection ctx, which has no deadline. Once a receiver stalls, the inner ctx.Err check never re-runs because yield blocks before the next iteration starts. Result: QueryTimeout covered only the scan phase, not the send-wait phase it was specifically introduced to address.

Fix

Reimplement StorageHandler as a direct Handler so it can own its ServeNostr loop and attach queryCtx to the exact send sites that can stall. Key points:

  • queryCtx := ctx.WithTimeout(ctx, QueryTimeout) derived once per REQ / COUNT.
  • storage.Query(queryCtx, filters) — scan still respects the deadline.
  • Every event emission uses select { send<-msg; <-queryCtx.Done() }.
  • queryCtx.Err() polled at the top of each iteration (fast-path abort).
  • Terminal CLOSED emitted on the outer ctx so it can outlive queryCtx.
  • Outer ctx cancellation (connection ending) surfaces as ctx.Err() through ServeNostr, same as before.

Bug 2: iter.Seq recv-drain — surfaced by #78

While Bug 1 was on the wire as a draft, #78 landed on main with a new CLAUDE.md rule:

Handlers that implement Handler directly and consume iter.Seq responses must follow the same shape.

Bug 1's fix moves StorageHandler from SimpleHandlerBase to a direct Handler — exactly the case the rule targets. After rebasing on main I confirmed it: the new storageHandler.ServeNostr reads recv and forwards REQ responses through handleReq's yield loop on the same goroutine, so a stalled receiver halts recv drain entirely. Under an upstream MergeHandler (childRecvs cap 10) this wedges into the deadlock-spring shape recorded in the diary for 2026-04-24.

The interaction with Bug 1 is the operationally important part: without this second fix, every QueryTimeout firing would produce the spring shape for QueryTimeout ms before the abort kicks in. For the salmon configuration that is up to 10 minutes of MergeHandler broadcast wedging per slow REQ — a strict regression compared to the pre-#76 behaviour.

Fix

Port simpleHandler's pattern (#78) verbatim:

  • Spawn a dedicated recv-drain goroutine fed by a cap-10 internal queue (storageHandlerMsgQueueBuffer, mirroring simpleHandlerMsgQueueBuffer and MergeHandler.childRecvs — all three layers share the same backpressure budget; only ≥1 is required for correctness).
  • Main loop consumes from msgQueue, dispatches to handleEvent / handleReq / handleCount, forwards responses on send.
  • drainCtx is a sub-ctx of the connection ctx with defer drainCancel() at handler scope, so the drain goroutine is reliably reaped on every main-loop return path without depending on caller-side cancellation — matches the self-contained shape the fix: decouple recv drain from response forwarding in simpleHandler #78 review feedback tightened.

Public API impact

None visible. NewStorageHandler still returns Handler. The internal wrapping via SimpleHandler(&storageHandler{...}) is gone; now it returns &storageHandler{...} directly. Anything type-asserting on the internal shape was already relying on unexported implementation.

Tests

TestStorageHandler_QueryTimeout_ReqAbortsOnStalledConsumer

Unbuffered send channel → handler delivers 1 EVENT, stalls on the 2nd send → test advances synctest's fake clock past QueryTimeout → asserts the next message is CLOSED "error: query timeout", not another EVENT.

  • On main (before this change): test fails with the 2nd EVENT being delivered (handler ignores the timeout at the send boundary).
  • On this branch: test passes.

TestStorageHandler_DrainsRecvWhileYieldBlocks (new)

Direct mirror of TestSimpleHandler_DrainsRecvWhileYieldBlocks from #78, applied to storageHandler. Pushes a REQ with two stored events, drains one EVENT to park the main loop on the second yield's send<-msg, then pushes a second client message from a helper goroutine and asserts the push completes — i.e. recv drain continues even while yield is stalled.

  • On the post-rebase pre-fix commit: fails with the deadlock-spring message (recv drain stops the moment yield blocks).
  • On this branch: passes — drain goroutine keeps recv flowing independently of the main loop's yield state.

All existing StorageHandler tests — including the existing QueryTimeout tests that exercise the scan path — continue to pass. Full suite + go test -race ./... clean.

Out of scope (but discovered during investigation)

  • go test -race flags a pre-existing DATA RACE in MergeHandler / metrics tests (TestMergeHandler_Broadcast_REQTimeoutAdvances, TestMergeHandler_Broadcast_OKForcedFalseOnTimeout, TestMetrics_MergeHandler_LostChildAndBroadcastTimeout). Originally reproducible on main as of 6d25455 — not introduced by this PR. Worth a separate ticket. (Re-verifying against the current main head after the post-fix(storage_handler): honor QueryTimeout at the send<-msg boundary #76 churn is itself a follow-up.)
  • The salmon logs also show cases where connections stayed alive for 3 h despite the stall, even though ping/pong should have caught a dead receiver. A separate investigation is in progress using the block profile infrastructure deployed in moctane-mocrelay 67d36ad + mocvps e04bff7. That problem is orthogonal to the QueryTimeout bug fixed here.

Follow-ups

  • After merge + release, bump moctane-mocrelay's mocrelay dep to pick up both fixes.
  • Observe whether new slow-REQ logs show completed=false at ~QueryTimeout (expected: yes) rather than scaling with receiver stall time, and whether MergeHandler broadcast-timeout logs no longer correlate with slow REQs (expected: no correlation, since recv drain now survives independently of yield state).

🐛 Generated with Claude Code

high-moctane and others added 3 commits April 26, 2026 20:26
Before this change, a REQ could stall for hours despite a configured
QueryTimeout: production logs on salmon showed subscriptions with
send_wait_ms > 3h while conn.Write stayed at ~5 ms per call and
ping/pong continued to round-trip. Traced to a context scoping bug.

The prior implementation wrapped ctx.WithTimeout inside handleReq's
iter.Seq closure, so the timeout ctx was *local* to that closure. The
closure yielded events to SimpleHandlerBase's ServeNostr, which
forwarded them to send<-msg under the *outer* (un-timed) ctx. Once a
receiver stalled, the outer ctx had no deadline and the inner scan
loop's ctx.Err check never ran again (it's only polled at event
boundaries, and yield blocks before the next event is fetched). Result:
QueryTimeout only covered Query's scan phase, not the send-wait phase
that the option was specifically introduced to address.

This commit reimplements StorageHandler as a direct Handler (not
SimpleHandlerBase) so it can own its ServeNostr loop and attach
queryCtx to the exact send sites that can stall. The REQ / COUNT flow
derives queryCtx = ctx.WithTimeout(ctx, QueryTimeout) and threads it
through:

  - storage.Query(queryCtx, filters) — scan respects the deadline.
  - select { send<-msg; <-queryCtx.Done() } for every event emission.
  - queryCtx.Err() polled at the top of each iteration.
  - terminal CLOSED emitted on the outer ctx so it can outlive queryCtx.

When queryCtx fires with the outer ctx still alive, the REQ is aborted
with CLOSED "error: query timeout" as the docs already promised. When
the outer ctx is cancelled (connection ending), we surface that error
instead so ServeNostr returns cleanly.

Adds TestStorageHandler_QueryTimeout_ReqAbortsOnStalledConsumer which
reproduces the live-but-slow-client case via an unbuffered send: the
handler delivers one event, stalls on the second send; the test
advances the fake clock past QueryTimeout and asserts that the next
message is CLOSED, not another EVENT. Verified failing on main (prior
to this commit) and passing now.

Side effect: StorageHandler now implements Handler directly rather than
routing through SimpleHandlerBase. No public API change — NewStorageHandler
still returns Handler — but the internal wrapping is gone.

Follows up on #73 (QueryTimeout introduction).

🐛 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Sakura 🌸 <noreply@anthropic.com>
…ock spring

Documents the "deadlock spring" shape applied to storageHandler. Since #76
reimplemented StorageHandler as a direct Handler (no longer wrapped by
SimpleHandlerBase), the CLAUDE.md rule introduced in #78 — "Handlers that
implement Handler directly and consume iter.Seq responses must follow the
same shape" — applies verbatim: a blocked send<-msg inside handleReq's
yield loop must not stop recv drain, otherwise an upstream MergeHandler
(childRecvs cap 10) wedges until BroadcastTimeout retires the child.

The new test pushes a REQ with two stored events into a handler whose
send channel has only one consumer; the second yield blocks on send<-msg.
A second client message is then pushed from a helper goroutine and the
test asserts that the push completes — i.e. that recv drain continues
even while the main loop is parked on the first response's stalled send.

Currently FAILS: storageHandler shares one goroutine for recv read and
yield forwarding, so drain stops the moment yield blocks. The next commit
restructures ServeNostr to mirror simpleHandler's drain-goroutine + cap-10
internal queue and this test should pass as-written.

🐛 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Sakura 🌸 <noreply@anthropic.com>
storageHandler implements Handler directly (since #76) and forwards REQ
responses through handleReq's iter.Seq yield loop on the same goroutine
that reads recv. A blocked send<-msg therefore stops recv drain entirely.
Under an upstream MergeHandler (childRecvs cap 10) this wedges into the
"deadlock spring" shape recorded in CLAUDE.md and the diary entry for
2026-04-24: childSends fills, yield blocks, recv stops draining,
childRecvs fills, broadcastAll trips BroadcastTimeout, child retired.
QueryTimeout (#76) bounds the worst case but does not eliminate it —
the symptom still occurs, scaled to the timeout duration, every time a
slow client trips the deadline.

simpleHandler addressed the same problem in #78. The CLAUDE.md rule
"Handlers that implement Handler directly and consume iter.Seq
responses must follow the same shape" applies verbatim here.

Fix: spawn a dedicated recv-drain goroutine that forwards recv into an
internal cap-10 queue (storageHandlerMsgQueueBuffer, mirroring
simpleHandlerMsgQueueBuffer and MergeHandler.childRecvs); the main loop
consumes from msgQueue, dispatches to handleEvent / handleReq /
handleCount, and forwards responses. A blocked yield no longer stops
recv drain.

Lifecycle:
  - drain goroutine exits on drainCtx cancel or recv close; recv close
    closes msgQueue, which drives the main loop to return nil.
  - main loop exits on ctx cancel, msgQueue close, or any handle*
    error.
  - drainCtx is a sub-ctx of the connection ctx with `defer
    drainCancel()` at the handler scope, so the drain goroutine is
    reliably reaped on every main-loop return path without depending
    on caller-side cancellation — matching simpleHandler's
    self-contained shape (the one tightened in #78 review feedback).

Cap 10 is mirrored from simpleHandlerMsgQueueBuffer and
MergeHandler.childRecvs so all three layers share the same backpressure
budget; only ≥1 is required for correctness. A sustained slow consumer
is still backstopped by Relay's 10s write timeout long before cap 10
matters in practice.

Now passes TestStorageHandler_DrainsRecvWhileYieldBlocks (added in the
preceding commit). All existing StorageHandler tests, the full suite,
and `go test -race ./...` continue to pass.

🐛 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Sakura 🌸 <noreply@anthropic.com>
@high-moctane high-moctane force-pushed the fix-storage-handler-query-timeout-scope branch from ea131db to 225b560 Compare April 26, 2026 11:36
…ted semantics

Spirits review feedback (🦍 + 🍍 independently):

  Storage.Query's errFn contract does not promise idempotence across
  calls. handleReq and handleCount were each invoking errFn() twice —
  once inside the abort-decision predicate (errors.Is(errFn(),
  context.DeadlineExceeded)) and once again in the post-loop
  `if err := errFn(); err != nil` branch. InMemory and Pebble happen
  to be idempotent today, but a future Storage implementation that
  surfaces transient errors only on the first call would silently
  break.

  Capture the result once into queryErr := errFn() and use it for
  both the predicate and the subsequent log/branch. No behaviour
  change for the existing implementations; the contract surface is
  now defensive against future ones.

Spirits review feedback (🦍, low priority):

  Document the `completed` flag's meaning in maybeLogSlowReq's final
  call site. The flag is `err == nil` on the EOSE send, so it tracks
  "the relay finished the subscription cleanly" — not "the client
  received EOSE", which can't be observed from the relay side. An
  EOSE that fails because the outer ctx was cancelled mid-send is
  logged with completed=false, matching the abort path.

Tests: TestStorageHandler / full suite / `go test -race ./...` all
clean.

🦍🍍 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Sakura 🌸 <noreply@anthropic.com>
@high-moctane high-moctane marked this pull request as ready for review April 26, 2026 11:48
@high-moctane high-moctane merged commit 86af7e2 into main Apr 26, 2026
2 checks passed
@high-moctane high-moctane deleted the fix-storage-handler-query-timeout-scope branch April 26, 2026 11:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant