ds4-server: SSE keepalive during decode #245
Open
Allen091080 wants to merge 1 commit into
The prefill keepalive added in f027269 and refined in f91c12b (`prefill_display` events) keeps the connection alive while the model is processing input. Once decode starts, the connection can still go quiet for tens of seconds at a time:

* the model is mid-thinking: `<think>...</think>` is open and no visible text has been flushed to the client yet;
* the model is accumulating a large tool_use input JSON, which is held back until the block closes.

`sse_chunk` only fires when there is actual streamable text, so during those stretches no bytes go to the client. Once a client-side TCP idle-timeout (10-60 s on most HTTP libraries) elapses, the socket is torn down and the next `sse_chunk` call records `client stream write failed`, ending the turn with an error.

Add a small wall-clock keepalive in the decode loop: when `j->req.stream` is set, emit a `: decode\n\n` SSE comment line at most every 15 seconds. The 15 s cadence matches the prefill keepalive and sits comfortably inside common 30-60 s client idle thresholds. A failed write here ends the turn with the same `client stream write failed` reason the regular event writer uses, so callers see no new failure mode.

This is intentionally a small follow-up to f91c12b: no watchdog thread, no `_exit`, no new state outside the local variable `decode_last_keepalive`. It only addresses decode silence, not GPU stalls inside `ds4_session_*` calls.

Verified on macOS Metal, q2-imatrix GGUF:

- clean `make` build, no new warnings;
- `./ds4_test --server` passes;
- streamed completions during long thinking phases now see a `: decode\n\n` comment every 15 s on the wire instead of silence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
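As a rough illustration of the keepalive described above, the throttling logic could look something like the sketch below. It is not the actual patch. Only the 15 s cadence, the `: decode\n\n` payload and the idea of a single "last keepalive" timestamp come from the description; the function names, the file-descriptor interface and the `write_all` helper are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical names. Only the cadence and payload are taken from the PR text. */
#define DECODE_KEEPALIVE_SEC 15.0
static const char DECODE_KEEPALIVE[] = ": decode\n\n";

/* True once at least DECODE_KEEPALIVE_SEC seconds have passed since `last`. */
static bool keepalive_due(double now, double last) {
    return now - last >= DECODE_KEEPALIVE_SEC;
}

/* Write the whole buffer, retrying on short writes and EINTR.
 * Returns false on any other write error. */
static bool write_all(int fd, const char *buf, size_t len) {
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR) continue;
            return false;
        }
        buf += n;
        len -= (size_t)n;
    }
    return true;
}

/* One keepalive step, intended to run once per decode iteration.
 * `stream` stands in for j->req.stream and `*last` for decode_last_keepalive.
 * Returns false only when the keepalive write itself fails. */
static bool decode_keepalive_tick(int fd, bool stream, double now, double *last) {
    if (!stream || !keepalive_due(now, *last)) return true;
    if (!write_all(fd, DECODE_KEEPALIVE, sizeof(DECODE_KEEPALIVE) - 1)) return false;
    *last = now;
    return true;
}
```

In a decode loop, this would presumably be called once per generated token with a timestamp in seconds supplied by the caller. A `false` return would map to the existing `client stream write failed` path.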
Summary

Small follow-up to f91c12b (prefill keepalive via `prefill_display`). The prefill side is now well-covered. Decode can still go quiet for tens of seconds when:

- the model is mid-thinking: `<think>...</think>` is open and no visible text has been flushed;
- the model is accumulating a large `tool_use` input JSON, which is held back until the block closes.

`sse_chunk` only fires when there is actual streamable text, so the socket sees nothing during those stretches. Past the client's TCP idle threshold (10-60 s on most HTTP libraries), the next `sse_chunk` call records `client stream write failed` and the turn errors out. This is the same failure shape that prompted issue #222 on the prefill side, just on the other end of the turn.

Fix

In the decode loop, when `j->req.stream` is set, emit a `: decode\n\n` SSE comment line at most every 15 s.

15 s mirrors the prefill cadence and sits inside common 30-60 s client idle thresholds. A failed write ends the turn via the existing `client stream write failed` path, so there is no new failure mode for callers.

Scope (intentionally minimal)

- No watchdog thread, no `_exit`.
- No new state outside one local variable (`decode_last_keepalive`).
- GPU stalls inside `ds4_session_*` calls: out of scope.

Verification

Machine: MacBook Pro M5 Max, 128 GiB RAM
Backend: Metal
Model: `DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf` (q2-imatrix)
Server: `./ds4-server --host 0.0.0.0 --port 8000 --ctx 500000 --kv-disk-dir … --kv-disk-space-mb 204800`

- `make` clean, no new warnings.
- `./ds4_test --server` passes (`server: OK` / `ds4 tests: ok`).
- Streamed completions during long thinking phases show `: decode` comment lines on the wire every 15 s, no client disconnect.

Test plan

- `./ds4_test --server`
- Stream a completion with a long `<think>` phase or large tool_use input; observe `: decode` comment lines on the SSE wire and no `client stream write failed` errors at the end.
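For the second test-plan item, one way to check a captured SSE stream is to count the keepalive comment lines in it. In the server-sent events format, a line that starts with a colon is a comment and is ignored by compliant parsers, so these lines should not surface as events in client code. The helper below is a minimal sketch with a hypothetical name, and it assumes the captured stream is available as a NUL-terminated string.

```c
#include <stddef.h>
#include <string.h>

/* Count lines that are exactly ": decode" in a captured SSE byte stream.
 * Lines are split on '\n'; a trailing '\r' on a line is tolerated.
 * Hypothetical helper for inspecting a capture, not part of ds4-server. */
static size_t count_decode_keepalives(const char *stream) {
    static const char marker[] = ": decode";
    const size_t mlen = sizeof(marker) - 1;
    size_t count = 0;
    const char *line = stream;

    while (*line != '\0') {
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) : strlen(line);
        if (len > 0 && line[len - 1] == '\r') len--;   /* strip CR of CRLF */
        if (len == mlen && memcmp(line, marker, mlen) == 0) count++;
        if (!eol) break;                               /* last line, no newline */
        line = eol + 1;
    }
    return count;
}
```

Since the keepalive is sent at most once every 15 s, a capture that includes a silent thinking phase of roughly a minute would be expected to contain a handful of such lines.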