ds4-server: SSE keepalive during decode #245

Open
Allen091080 wants to merge 1 commit into antirez:main from Allen091080:decode-stream-keepalive

Summary

Small follow-up to f91c12b (prefill keepalive via prefill_display).

The prefill side is now well-covered. Decode can still go quiet for tens of seconds when:

  • the model is mid-thinking — <think>...</think> is open and no visible text has been flushed;
  • the model is accumulating a large tool_use input JSON, which is held back until the block closes.

`sse_chunk` only fires when there is actual streamable text, so the socket sees nothing during those stretches. Once the client's idle (read) timeout elapses (10-60 s on most HTTP libraries), the client drops the connection, and the next `sse_chunk` call records `client stream write failed` and the turn errors out. This is the same failure shape that prompted issue #222 on the prefill side, just on the other end of the turn.

Fix

In the decode loop, when `j->req.stream` is set, emit a `: decode\n\n` SSE comment line at most every 15 s:

if (j->req.stream) {
    double now_kp = now_sec();
    if (now_kp - decode_last_keepalive >= 15.0) {
        static const char ka[] = ": decode\n\n";
        if (!send_all(j->fd, ka, sizeof(ka) - 1)) {
            finish = "error";
            snprintf(err, sizeof(err),
                     "client stream write failed during decode heartbeat");
            break;
        }
        decode_last_keepalive = now_kp;
    }
}

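The 15 s interval mirrors the prefill cadence and sits inside common 30-60 s client idle thresholds; a client configured with a timeout shorter than 15 s can still disconnect during a silent stretch. A failed write ends the turn via the existing `client stream write failed` path, so callers see no new failure mode.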
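The rate limiting described above can be sketched in isolation. This is a minimal illustration, not the code in this PR: the `keepalive_timer` type and its three functions are hypothetical names, and the caller is assumed to supply timestamps in seconds (ideally from a monotonic clock, so wall-clock adjustments do not trigger or suppress a heartbeat).

```c
#include <stdbool.h>

/* Hypothetical helper, not part of ds4-server: remembers when the last
 * keepalive went out and decides whether another one is due. */
typedef struct {
    double last_sent;  /* timestamp of the last keepalive, in seconds */
    double interval;   /* minimum gap between keepalives, in seconds */
} keepalive_timer;

/* Start the timer at `now`, so the first keepalive is due one full
 * interval after the decode loop begins. */
static void keepalive_init(keepalive_timer *t, double now, double interval) {
    t->last_sent = now;
    t->interval = interval;
}

/* True when at least `interval` seconds have passed since the last
 * keepalive. The caller writes the comment line, then calls
 * keepalive_mark() only if the write succeeded. */
static bool keepalive_due(const keepalive_timer *t, double now) {
    return now - t->last_sent >= t->interval;
}

/* Record a successful keepalive write. */
static void keepalive_mark(keepalive_timer *t, double now) {
    t->last_sent = now;
}
```

Marking the timer only after a successful write mirrors the order in the snippet above, where `decode_last_keepalive` is updated after `send_all` returns success.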

Scope (intentionally minimal)

  • No watchdog thread.
  • No _exit.
  • No struct fields added.
  • Only one new local variable (decode_last_keepalive).
  • Does not cover GPU/Metal kernel hangs inside ds4_session_* — out of scope.

Verification

Machine: MacBook Pro M5 Max, 128 GiB RAM
Backend: Metal
Model: DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf (q2-imatrix)
Server: ./ds4-server --host 0.0.0.0 --port 8000 --ctx 500000 --kv-disk-dir … --kv-disk-space-mb 204800

  • Clean `make` build, no new warnings.
  • ./ds4_test --server passes (server: OK / ds4 tests: ok).
  • Streamed completions where the model thinks for 30+ s now show `: decode` comment lines on the wire every 15 s, with no client disconnect.

Test plan

  • CI runs ./ds4_test --server
  • Manual: send a chat request that forces a long `<think>` phase or a large tool_use input; observe `: decode` comment lines on the SSE wire and no `client stream write failed` errors at the end.
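For the manual check, one way to confirm the heartbeats in a captured response body is to count SSE comment lines, that is, lines whose first byte is `:`. The helper below is a hypothetical sketch (the name `count_sse_comments` is not from this repository) and assumes LF line endings, as in the `: decode\n\n` keepalive.

```c
#include <stddef.h>

/* Hypothetical test helper: counts SSE comment lines (lines that begin
 * with ':') in a captured stream buffer. Assumes LF line endings. */
static size_t count_sse_comments(const char *buf, size_t len) {
    size_t count = 0;
    int at_line_start = 1;  /* the first byte of the buffer starts a line */

    for (size_t i = 0; i < len; i++) {
        if (at_line_start && buf[i] == ':')
            count++;
        /* The byte after a newline begins a new line. */
        at_line_start = (buf[i] == '\n');
    }
    return count;
}
```

A colon inside a `data:` line is not counted, because only the first byte of each line is inspected. SSE clients ignore comment lines, so the count reflects keepalives that never surface as events.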

Commit message

The prefill keepalive added in f027269 and refined in f91c12b
(`prefill_display` events) keeps the connection alive while the model
is processing input.  Once decode starts, the connection can still go
quiet for tens of seconds at a time:

  * the model is mid-thinking — `<think>...</think>` is open and no
    visible text has been flushed to the client yet;
  * the model is accumulating a large tool_use input JSON, which is
    held back until the block closes.

`sse_chunk` only fires when there is actual streamable text, so during
those stretches no bytes go to the client.  Once a client-side
idle timeout (10-60 s on most HTTP libraries) elapses, the socket is
torn down and the next `sse_chunk` call records `client stream write
failed`, ending the turn with an error.

Add a small wall-clock keepalive in the decode loop: when
`j->req.stream` is set, emit a `: decode\n\n` SSE comment line at most
every 15 seconds.  The 15 s cadence matches the prefill keepalive and
sits comfortably inside common 30-60 s client idle thresholds.  A
failed write here ends the turn with the same `client stream write
failed` reason the regular event writer uses, so callers see no new
failure mode.
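
The heartbeat depends on a write helper that reports failure once the peer has gone away. The sketch below shows the general shape of such a helper; it is a hypothetical stand-in named `write_all_sketch`, and the real `send_all` in ds4-server may differ. It assumes a POSIX file descriptor and retries on partial writes and `EINTR`.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical write-everything loop, not the project's send_all.
 * Returns true only if all `len` bytes were written to `fd`. */
static bool write_all_sketch(int fd, const char *buf, size_t len) {
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;      /* interrupted by a signal: retry */
            return false;      /* real error, e.g. peer closed the socket */
        }
        if (n == 0)
            return false;      /* no progress: treat as failure */
        buf += n;              /* partial write: advance and keep going */
        len -= (size_t)n;
    }
    return true;
}
```

On a real socket, writing to a closed peer can raise `SIGPIPE` unless the process ignores that signal or disables it for the socket, so a helper like this only sees the error return when that is handled elsewhere.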

This is intentionally a small follow-up to f91c12b — no watchdog
thread, no `_exit`, no new state outside the local variable
`decode_last_keepalive`.  It only addresses decode silence, not GPU
stalls inside `ds4_session_*` calls.

Verified on macOS Metal, q2-imatrix GGUF:

- clean `make` build, no new warnings;
- `./ds4_test --server` passes;
- streamed completions during long thinking phases now see a
  `: decode\n\n` comment every 15 s on the wire instead of silence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>