fix(server): concurrent-load OOM, startup memcheck, and inference timeout#157
fix(server): concurrent-load OOM, startup memcheck, and inference timeout#157gburd wants to merge 6 commits into
Conversation
|
Really nice set of production fixes — and good incident write-up. Two of the three are solid as-is:
One that needs a tweak before it's safe to rely on — the memcheck estimate. Process nit: these are three genuinely independent fixes (#1 and #3 are ready, #2 needs a round). If it's easy to split memcheck into its own PR, the two solid fixes can land immediately rather than waiting on the estimate work. Heads-up: |
…infer-deadlock.md)
Three root causes were tangled together in the original report; this
commit fixes all three. Architectural items (\u00a75.4 quantized
inference, \u00a75.5 memory pre-flight, \u00a75.6 timeout semantics) stay
deferred to follow-on work.
1. Sessions writer held across forward pass (BUG \u00a74.3, \u00a75.3).
----------------------------------------------------------------
routes/infer.rs:130 took `sessions_blocking_write()` and held it
across the multi-second `run_walk(&session.patched)` call. On
the multi-thread tokio runtime this serialised every concurrent
/v1/infer globally and queued any apply_patch / list_patches /
remove_patch behind it. The fix is one-character minimal:
reader instead of writer. This mirrors session.rs::apply_patch
which already documents the same anti-pattern (lines 99\u2013106).
- session.rs: new `sessions_blocking_read()` helper alongside
the existing `sessions_blocking_write()`.
- routes/infer.rs: switch the sessioned-walk branch to a reader.
Test: routes/infer.rs::sessions_reader_does_not_serialize_
concurrent_callers \u2014 8 threads each holding the reader across a
100ms sleep. Pre-fix wall time would be 800ms (writer
serialised); post-fix is ~100ms. Asserted < 400ms.
2. Lazy weight-load race double-allocates ~5 GB (BUG \u00a74.1, \u00a75.1).
----------------------------------------------------------------
state.rs::ensure_weights_cell did:
if let Some(cell) = self.weights.get() { return Ok(cell); }
// ... allocate ~5 GB ModelWeights ...
let _ = self.weights.set(...);
With concurrent first-callers both observe `get() == None`, both
run `load_model_weights_with_opts`, both produce a ~5 GB
allocation, and only the first wins via OnceLock::set \u2014 but
during the load BOTH are live, so peak heap is 2x weights. On a
2 B BitNet vindex that's ~10 GB and OOMs the cgroup.
Fix: add a `weights_init: std::sync::Mutex<()>` field to
LoadedModel. The fast path stays lock-free (OnceLock::get
short-circuits before the mutex is touched). The slow path
takes the init mutex, double-checks OnceLock under the mutex,
then runs the load \u2014 single-flight. Poison recovery uses
`unwrap_or_else(|p| p.into_inner())` so a panic during load
does not permanently wedge the model.
Test: state.rs::ensure_weights_cell_single_flights_concurrent_
loaders \u2014 8 threads racing the slow path; max concurrent
occupants observed must be 1. Plus
weights_init_mutex_is_unpoisonable_recoverable for the panic
path.
3. Lazy load on the request thread is mis-sized for cgroup hosts
(BUG \u00a74.1, \u00a75.2).
----------------------------------------------------------------
Even with single-flight, allocating ~5 GB on a request handler
under HTTP backpressure is the wrong shape on memory-bounded
hosts: the OOM-kill happens during request processing (with
pg_infer's HTTP client timing out after 30s, no useful error
surfaced). Fix: load weights eagerly at `bootstrap::serve`
*before* axum binds the listener. Failure becomes a clean
startup error rather than a SIGKILL during the first request.
- state.rs: new `force_load_weights()` method that calls
ensure_weights_cell when weights are loadable, short-circuits
when infer_disabled or browse-only.
- bootstrap.rs: new --lazy-weights flag (default off) for
backward-compat / opt-out. When unset, every model with
loadable weights is force-loaded once before the listener
binds. Logs per-model load wall time. Failure returns from
serve() with a contextual error rather than panicking later.
Tests: state.rs::force_load_weights_skips_when_infer_disabled
and force_load_weights_skips_browse_only_vindex \u2014 ensures the
eager path is a no-op for models that have nothing to load,
so --no-infer / --ffn-only / --embed-only / browse vindexes
don't pay the cost (and don't crash trying to mmap absent
files).
Test status:
cargo test -p larql-server --lib: 311 passed, 0 failed.
Out of scope for this commit (tracked separately):
- \u00a75.4 quantized inference for BitNet \u2014 architectural, needs
a CPU kernel for ternary x int8 matmul, sized as a major PR.
- \u00a75.5 memory-headroom self-check \u2014 nice-to-have guardrail; the
eager-load fix already converts runtime OOMs to startup OOMs,
which is the operational improvement the bug report wanted.
- \u00a75.6 timeout semantics \u2014 dropping the inference future on
client cancel is a separate restructure of `spawn_blocking`
surface area; the deadlock fix means a single timeout no
longer wedges the whole handler regardless.
Read /sys/fs/cgroup/<self>/memory.max at startup and refuse to start when the configured cgroup leaves no room to load weights. Converts the 10-second OOM-kill loop documented in BUG-infer-deadlock into a one-line, actionable startup error. larql-vindex config/index.rs: new VindexConfig::estimate_resident_bytes() and has_inference_weights() helpers. Estimate models per-layer attn (4 * hidden * hidden * elem) + per-layer FFN (3 * hidden * inter * elem) + embed + lm_head + 12 % overhead. Browse-only vindexes fall through to a gate-vector + embeddings sum. Tested against the BitNet b1.58 2 B 4 T shape (30 layers, 2560/6912/128 256, f16): estimate lands in the 4–8 GB band that matches the production triage RSS curve in the bug report (peak ~5.0 GB). larql-server memcheck.rs: new module reading /proc/self/cgroup → resolving the cgroup v2 unified path → reading memory.max. parse_memory_max recognises 'max' as unlimited (returns Ok(None)). When the hierarchy is cgroup v1 or memory.max is unreadable, the check falls through with a Skipped outcome (no spurious failures on developer workstations). Three outcomes: Ok / Skipped / Tight. explain_tight_outcome formats a single-line operator-friendly error including the required MemoryMax bump, --lazy-weights opt-out, and --no-memcheck override. bootstrap.rs: new --no-memcheck and --memcheck-headroom-mib=512 flags. Pre-flight runs after model load (so estimate sums over every loaded vindex) and before the eager-load loop (so a tight cgroup never reaches the 5 GB allocation that would OOM). Models marked infer_disabled are excluded from the estimate. Tests larql-vindex: 4 new resident_size_tests, including a regression test pinned to the BitNet b1.58 2 B 4 T shape so future loader changes can't silently drift the estimator off the production baseline. larql-server: 7 new memcheck::tests covering parse_memory_max unlimited / numeric / garbage paths, the Tight branch's message format (must contain --lazy-weights and --no-memcheck guidance), zero-estimate short-circuit (–-no-infer / --ffn-only / --embed-only paths must never trip the check), and Ok-outcome explanation returns empty. Build clean, 318 server tests + 4 new vindex tests = all green.
When an inference exceeds the server-side timeout, the handler
responds 504 Gateway Timeout and drops the spawn_blocking
JoinHandle. The blocking thread runs to completion in the
background and its result is discarded. The handler is unblocked
immediately, so the next /v1/infer succeeds without waiting for
the timed-out inference to finish.
Combined with the earlier deadlock fix, a single client-side
timeout no longer wedges the entire HTTP handler -- that was the
'every subsequent call hangs in futex_wait' symptom in the bug
report (S3.4).
error.rs
New ServerError::Timeout(String) variant -> HTTP 504.
routes/openai/error.rs picks up an OpenAIError::timeout(...)
constructor (status 504, error_type 'timeout_error') so
/v1/chat/completions and /v1/completions surface the same
semantics through the OpenAI-compat error envelope.
state.rs
AppState gains an infer_timeout Duration field plumbed from
the new --infer-timeout-secs CLI flag (default 60s; 0 disables).
routes/infer.rs
New pub(crate) run_infer_with_timeout() that wraps the existing
run_infer in tokio::time::timeout against the spawn_blocking
JoinHandle. On timeout: log a warn-level message with the
observed elapsed time, drop the handle, return
ServerError::Timeout. When timeout.is_zero(), pass through to
plain handle.await for backward compat.
Both handle_infer and handle_infer_multi route through it.
Tests (cargo test -p larql-server --lib: 321/321 green, +3):
- timeout_drops_handler_without_blocking_subsequent_requests
Multi-thread runtime, slow blocking task (800ms), 100ms
timeout. Asserts: returns within 300ms of start (not after
the slow task finishes), is Err(Timeout), and a fresh
spawn_blocking after the timeout completes within 200ms (the
handler is not wedged).
- timeout_zero_passes_through_slow_inference
Pins backward-compat: timeout=0 returns the actual inference
result instead of a synthetic timeout.
- timeout_error_maps_to_504
StatusCode::GATEWAY_TIMEOUT contract for ServerError::Timeout.
bootstrap.rs / tests/ / routes/stream.rs
All AppState construction sites updated to set
infer_timeout to Duration::from_secs(60) (or infer_timeout_secs
from the CLI in production). Logs the active timeout at
startup, including 'disabled' when 0.
… fmt Addresses PR chrishayuk#157 review: - estimate_resident_bytes() now returns 0 for quantized vindexes (quant != None). It sized weights at bytes_per_float(dtype)\n (f16=2/f32=4), which over-counts a Q4_K vindex (~4.5 bits/elem)\n 3-4x \u2014 memcheck would refuse a quantized model that actually\n fits, defeating the feature for the models that most need RSS\n care. A 0 estimate makes the startup pre-flight skip it (the\n bootstrap loop already treats total_estimate==0 as skip). Dense\n f16/f32 vindexes \u2014 what the estimator was validated against \u2014\n still get a real estimate. New test estimate_skips_quantized.\n\n- Documented that --lazy-weights also skips the startup memcheck\n (nothing to size before a deferred load) in the flag help.\n\n- cargo fmt on the new files (clears the CI fmt failure).\n\ncargo test -p larql-server -p larql-vindex --lib: 326 + 1087 passed.
a40f910 to
f5d273f
Compare
|
Thanks — fixed the memcheck estimate and rebased + memcheck estimate — I went with the skip-when-quantized option rather than modelling each format's exact resident layout — that's fragile (and for the BitNet path the weights stay packed at ~1.6 bpw, so there's no single bits/elem to plug in). Happy to switch to a per-format estimate later if you'd prefer real numbers there.
On the split: the estimate's now fixed so all three are ready as-is, but if you'd still rather land the two solid fixes (single-flight + timeout) separately from the memcheck work, say the word and I'll peel memcheck into its own PR.
|
|
awesome, thanks. i'll merge this in tonight, and then do any cleanups i need to post merge. thanks again for the contribution |
…ctors The reliability changes added `weights_init` (single-flight load guard) to `LoadedModel` and `infer_timeout` to `AppState`, but the test helpers and the two examples still built those structs by literal, so `cargo test -p larql-server` failed to compile on all platforms. Add the missing fields (mirroring the production constructors).
clippy::manual_range_contains fires under -D warnings on the `gb >= 4.0 && gb <= 8.0` assert added with the memcheck estimator test.
Three independent larql-server reliability fixes from a production incident where the first
POST /v1/inferagainst an inference-level vindex on a memory-constrained host OOM-killed the process, and subsequent calls wedged the entire HTTP handler infutex_wait.1. Concurrent-load OOM single-flight (
79254a94)ensure_weights_celllet two concurrent first-callers both run the multi-GB weight load — peak heap was 2x the weights. Now single-flighted behind an init mutex with a double-checkedOnceLockfast path; one thread loads, the rest wait and share. Eager weight load moved intobootstrap::serve(before the listener binds) so a load OOM is a clean startup failure, not a first-request kill.--lazy-weightsopt-out preserved.2. Startup memory pre-flight (
5412f42e)New
memcheckmodule +VindexConfig::estimate_resident_bytes(). At startup, compares the estimated resident size against the cgroupmemory.maxand refuses to bind with a one-line actionable error rather than entering a 10-second OOM-kill loop.--no-memcheck/--memcheck-headroom-mibflags.3. Inference timeout + 504 (
a40f9108)/v1/infernow races the blocking forward pass against a server-side timeout (--infer-timeout-secs, default 60, 0 disables). On timeout it returns 504 and drops the join handle instead of leaving the handler wedged forever.ServerError::Timeout.All three are BitNet-agnostic and apply to any inference-level vindex.
cargo test -p larql-server -p larql-vindex --lib: 321 + 1081 passed.(The lock-discipline wedge fix and the BitNet native-ternary inference path are separate follow-ups; this PR is the host-reliability subset that stands alone.)