fix(server): concurrent-load OOM, startup memcheck, and inference timeout by gburd · Pull Request #157 · chrishayuk/larql

gburd · 2026-06-17T20:01:38Z

Three independent larql-server reliability fixes from a production incident where the first POST /v1/infer against an inference-level vindex on a memory-constrained host OOM-killed the process, and subsequent calls wedged the entire HTTP handler in futex_wait.

1. Concurrent-load OOM single-flight (`79254a94`)

ensure_weights_cell let two concurrent first-callers both run the multi-GB weight load — peak heap was 2x the weights. Now single-flighted behind an init mutex with a double-checked OnceLock fast path; one thread loads, the rest wait and share. Eager weight load moved into bootstrap::serve (before the listener binds) so a load OOM is a clean startup failure, not a first-request kill. --lazy-weights opt-out preserved.

2. Startup memory pre-flight (`5412f42e`)

New memcheck module + VindexConfig::estimate_resident_bytes(). At startup, compares the estimated resident size against the cgroup memory.max and refuses to bind with a one-line actionable error rather than entering a 10-second OOM-kill loop. --no-memcheck / --memcheck-headroom-mib flags.

3. Inference timeout + 504 (`a40f9108`)

/v1/infer now races the blocking forward pass against a server-side timeout (--infer-timeout-secs, default 60, 0 disables). On timeout it returns 504 and drops the join handle instead of leaving the handler wedged forever. ServerError::Timeout.

All three are BitNet-agnostic and apply to any inference-level vindex. cargo test -p larql-server -p larql-vindex --lib: 321 + 1081 passed.

(The lock-discipline wedge fix and the BitNet native-ternary inference path are separate follow-ups; this PR is the host-reliability subset that stands alone.)

chrishayuk · 2026-06-18T22:58:23Z

Really nice set of production fixes — and good incident write-up. Two of the three are solid as-is:

✅ Single-flight load — the double-checked OnceLock + init-mutex is correct: lock-free fast path, re-check under the lock, clean poison recovery (a failed load leaves the cell empty so a retry re-runs). Moving the eager load ahead of bind is the right call.
✅ Inference timeout → 504 — semantics are honest, including the explicit note that spawn_blocking isn't cancellable so the thread runs to completion. timeout=0 bypass is clean. (The sessions read-lock change is a good independent fix too.)

One that needs a tweak before it's safe to rely on — the memcheck estimate.
estimate_resident_bytes() sizes from bytes_per_float(dtype) (f16=2/f32=4) but ignores the quant: QuantFormat field — so a Q4_K/Q6_K vindex (~4.5 bits/elem) is over-counted ~3–4×, and memcheck would refuse to start a quantized model that actually fits. Since BitNet/quantized is the main reason to care about RSS, that undercuts the feature. Either factor QuantFormat into the estimate, or auto-skip memcheck (or document it as f16/f32-only) when quant != None. It's fine behind --no-memcheck for now, but worth fixing before it gates quantized deployments. (Also: --lazy-weights silently skips memcheck too — fine, just flag it in the help text.)

Process nit: these are three genuinely independent fixes (#1 and #3 are ready, #2 needs a round). If it's easy to split memcheck into its own PR, the two solid fixes can land immediately rather than waiting on the estimate work.

Heads-up: main is green again now — a rebase + cargo fmt on this PR's new files clears the CI fmt failures.

…infer-deadlock.md) Three root causes were tangled together in the original report; this commit fixes all three. Architectural items (\u00a75.4 quantized inference, \u00a75.5 memory pre-flight, \u00a75.6 timeout semantics) stay deferred to follow-on work. 1. Sessions writer held across forward pass (BUG \u00a74.3, \u00a75.3). ---------------------------------------------------------------- routes/infer.rs:130 took `sessions_blocking_write()` and held it across the multi-second `run_walk(&session.patched)` call. On the multi-thread tokio runtime this serialised every concurrent /v1/infer globally and queued any apply_patch / list_patches / remove_patch behind it. The fix is one-character minimal: reader instead of writer. This mirrors session.rs::apply_patch which already documents the same anti-pattern (lines 99\u2013106). - session.rs: new `sessions_blocking_read()` helper alongside the existing `sessions_blocking_write()`. - routes/infer.rs: switch the sessioned-walk branch to a reader. Test: routes/infer.rs::sessions_reader_does_not_serialize_ concurrent_callers \u2014 8 threads each holding the reader across a 100ms sleep. Pre-fix wall time would be 800ms (writer serialised); post-fix is ~100ms. Asserted < 400ms. 2. Lazy weight-load race double-allocates ~5 GB (BUG \u00a74.1, \u00a75.1). ---------------------------------------------------------------- state.rs::ensure_weights_cell did: if let Some(cell) = self.weights.get() { return Ok(cell); } // ... allocate ~5 GB ModelWeights ... let _ = self.weights.set(...); With concurrent first-callers both observe `get() == None`, both run `load_model_weights_with_opts`, both produce a ~5 GB allocation, and only the first wins via OnceLock::set \u2014 but during the load BOTH are live, so peak heap is 2x weights. On a 2 B BitNet vindex that's ~10 GB and OOMs the cgroup. Fix: add a `weights_init: std::sync::Mutex<()>` field to LoadedModel. The fast path stays lock-free (OnceLock::get short-circuits before the mutex is touched). The slow path takes the init mutex, double-checks OnceLock under the mutex, then runs the load \u2014 single-flight. Poison recovery uses `unwrap_or_else(|p| p.into_inner())` so a panic during load does not permanently wedge the model. Test: state.rs::ensure_weights_cell_single_flights_concurrent_ loaders \u2014 8 threads racing the slow path; max concurrent occupants observed must be 1. Plus weights_init_mutex_is_unpoisonable_recoverable for the panic path. 3. Lazy load on the request thread is mis-sized for cgroup hosts (BUG \u00a74.1, \u00a75.2). ---------------------------------------------------------------- Even with single-flight, allocating ~5 GB on a request handler under HTTP backpressure is the wrong shape on memory-bounded hosts: the OOM-kill happens during request processing (with pg_infer's HTTP client timing out after 30s, no useful error surfaced). Fix: load weights eagerly at `bootstrap::serve` *before* axum binds the listener. Failure becomes a clean startup error rather than a SIGKILL during the first request. - state.rs: new `force_load_weights()` method that calls ensure_weights_cell when weights are loadable, short-circuits when infer_disabled or browse-only. - bootstrap.rs: new --lazy-weights flag (default off) for backward-compat / opt-out. When unset, every model with loadable weights is force-loaded once before the listener binds. Logs per-model load wall time. Failure returns from serve() with a contextual error rather than panicking later. Tests: state.rs::force_load_weights_skips_when_infer_disabled and force_load_weights_skips_browse_only_vindex \u2014 ensures the eager path is a no-op for models that have nothing to load, so --no-infer / --ffn-only / --embed-only / browse vindexes don't pay the cost (and don't crash trying to mmap absent files). Test status: cargo test -p larql-server --lib: 311 passed, 0 failed. Out of scope for this commit (tracked separately): - \u00a75.4 quantized inference for BitNet \u2014 architectural, needs a CPU kernel for ternary x int8 matmul, sized as a major PR. - \u00a75.5 memory-headroom self-check \u2014 nice-to-have guardrail; the eager-load fix already converts runtime OOMs to startup OOMs, which is the operational improvement the bug report wanted. - \u00a75.6 timeout semantics \u2014 dropping the inference future on client cancel is a separate restructure of `spawn_blocking` surface area; the deadlock fix means a single timeout no longer wedges the whole handler regardless.

Read /sys/fs/cgroup/<self>/memory.max at startup and refuse to start when the configured cgroup leaves no room to load weights. Converts the 10-second OOM-kill loop documented in BUG-infer-deadlock into a one-line, actionable startup error. larql-vindex config/index.rs: new VindexConfig::estimate_resident_bytes() and has_inference_weights() helpers. Estimate models per-layer attn (4 * hidden * hidden * elem) + per-layer FFN (3 * hidden * inter * elem) + embed + lm_head + 12 % overhead. Browse-only vindexes fall through to a gate-vector + embeddings sum. Tested against the BitNet b1.58 2 B 4 T shape (30 layers, 2560/6912/128 256, f16): estimate lands in the 4–8 GB band that matches the production triage RSS curve in the bug report (peak ~5.0 GB). larql-server memcheck.rs: new module reading /proc/self/cgroup → resolving the cgroup v2 unified path → reading memory.max. parse_memory_max recognises 'max' as unlimited (returns Ok(None)). When the hierarchy is cgroup v1 or memory.max is unreadable, the check falls through with a Skipped outcome (no spurious failures on developer workstations). Three outcomes: Ok / Skipped / Tight. explain_tight_outcome formats a single-line operator-friendly error including the required MemoryMax bump, --lazy-weights opt-out, and --no-memcheck override. bootstrap.rs: new --no-memcheck and --memcheck-headroom-mib=512 flags. Pre-flight runs after model load (so estimate sums over every loaded vindex) and before the eager-load loop (so a tight cgroup never reaches the 5 GB allocation that would OOM). Models marked infer_disabled are excluded from the estimate. Tests larql-vindex: 4 new resident_size_tests, including a regression test pinned to the BitNet b1.58 2 B 4 T shape so future loader changes can't silently drift the estimator off the production baseline. larql-server: 7 new memcheck::tests covering parse_memory_max unlimited / numeric / garbage paths, the Tight branch's message format (must contain --lazy-weights and --no-memcheck guidance), zero-estimate short-circuit (–-no-infer / --ffn-only / --embed-only paths must never trip the check), and Ok-outcome explanation returns empty. Build clean, 318 server tests + 4 new vindex tests = all green.

When an inference exceeds the server-side timeout, the handler responds 504 Gateway Timeout and drops the spawn_blocking JoinHandle. The blocking thread runs to completion in the background and its result is discarded. The handler is unblocked immediately, so the next /v1/infer succeeds without waiting for the timed-out inference to finish. Combined with the earlier deadlock fix, a single client-side timeout no longer wedges the entire HTTP handler -- that was the 'every subsequent call hangs in futex_wait' symptom in the bug report (S3.4). error.rs New ServerError::Timeout(String) variant -> HTTP 504. routes/openai/error.rs picks up an OpenAIError::timeout(...) constructor (status 504, error_type 'timeout_error') so /v1/chat/completions and /v1/completions surface the same semantics through the OpenAI-compat error envelope. state.rs AppState gains an infer_timeout Duration field plumbed from the new --infer-timeout-secs CLI flag (default 60s; 0 disables). routes/infer.rs New pub(crate) run_infer_with_timeout() that wraps the existing run_infer in tokio::time::timeout against the spawn_blocking JoinHandle. On timeout: log a warn-level message with the observed elapsed time, drop the handle, return ServerError::Timeout. When timeout.is_zero(), pass through to plain handle.await for backward compat. Both handle_infer and handle_infer_multi route through it. Tests (cargo test -p larql-server --lib: 321/321 green, +3): - timeout_drops_handler_without_blocking_subsequent_requests Multi-thread runtime, slow blocking task (800ms), 100ms timeout. Asserts: returns within 300ms of start (not after the slow task finishes), is Err(Timeout), and a fresh spawn_blocking after the timeout completes within 200ms (the handler is not wedged). - timeout_zero_passes_through_slow_inference Pins backward-compat: timeout=0 returns the actual inference result instead of a synthetic timeout. - timeout_error_maps_to_504 StatusCode::GATEWAY_TIMEOUT contract for ServerError::Timeout. bootstrap.rs / tests/ / routes/stream.rs All AppState construction sites updated to set infer_timeout to Duration::from_secs(60) (or infer_timeout_secs from the CLI in production). Logs the active timeout at startup, including 'disabled' when 0.

… fmt Addresses PR chrishayuk#157 review: - estimate_resident_bytes() now returns 0 for quantized vindexes (quant != None). It sized weights at bytes_per_float(dtype)\n (f16=2/f32=4), which over-counts a Q4_K vindex (~4.5 bits/elem)\n 3-4x \u2014 memcheck would refuse a quantized model that actually\n fits, defeating the feature for the models that most need RSS\n care. A 0 estimate makes the startup pre-flight skip it (the\n bootstrap loop already treats total_estimate==0 as skip). Dense\n f16/f32 vindexes \u2014 what the estimator was validated against \u2014\n still get a real estimate. New test estimate_skips_quantized.\n\n- Documented that --lazy-weights also skips the startup memcheck\n (nothing to size before a deferred load) in the flag help.\n\n- cargo fmt on the new files (clears the CI fmt failure).\n\ncargo test -p larql-server -p larql-vindex --lib: 326 + 1087 passed.

gburd · 2026-06-19T14:24:16Z

Thanks — fixed the memcheck estimate and rebased + cargo fmt (clears the CI fmt failures).

memcheck estimate — estimate_resident_bytes() now returns 0 for quant != None, so the startup pre-flight skips quantized vindexes instead of over-counting them ~3-4× and refusing a model that fits. The bootstrap loop already treats a 0 total as skip. Dense f16/f32 — the case the estimator was validated against — still gets a real estimate. New test estimate_skips_quantized covers both (Q4_K → 0, dense f16 → non-zero).

I went with the skip-when-quantized option rather than modelling each format's exact resident layout — that's fragile (and for the BitNet path the weights stay packed at ~1.6 bpw, so there's no single bits/elem to plug in). Happy to switch to a per-format estimate later if you'd prefer real numbers there.

--lazy-weights — documented in the flag help that it also skips memcheck (nothing to size before a deferred load).

On the split: the estimate's now fixed so all three are ready as-is, but if you'd still rather land the two solid fixes (single-flight + timeout) separately from the memcheck work, say the word and I'll peel memcheck into its own PR.

cargo test -p larql-server -p larql-vindex --lib: 326 + 1087 passed.

chrishayuk · 2026-06-19T15:14:03Z

awesome, thanks. i'll merge this in tonight, and then do any cleanups i need to post merge. thanks again for the contribution

…ctors The reliability changes added `weights_init` (single-flight load guard) to `LoadedModel` and `infer_timeout` to `AppState`, but the test helpers and the two examples still built those structs by literal, so `cargo test -p larql-server` failed to compile on all platforms. Add the missing fields (mirroring the production constructors).

clippy::manual_range_contains fires under -D warnings on the `gb >= 4.0 && gb <= 8.0` assert added with the memcheck estimator test.

gburd added 4 commits June 19, 2026 09:40

gburd force-pushed the pr/server-reliability branch from a40f910 to f5d273f Compare June 19, 2026 14:22

chrishayuk added 2 commits June 19, 2026 19:05

fix(vindex): use RangeInclusive::contains in resident-estimate test

a97a7eb

clippy::manual_range_contains fires under -D warnings on the `gb >= 4.0 && gb <= 8.0` assert added with the memcheck estimator test.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(server): concurrent-load OOM, startup memcheck, and inference timeout#157

fix(server): concurrent-load OOM, startup memcheck, and inference timeout#157
gburd wants to merge 6 commits into
chrishayuk:mainfrom
gburd:pr/server-reliability

gburd commented Jun 17, 2026

Uh oh!

chrishayuk commented Jun 18, 2026

Uh oh!

gburd commented Jun 19, 2026

Uh oh!

chrishayuk commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

gburd commented Jun 17, 2026

1. Concurrent-load OOM single-flight (79254a94)

2. Startup memory pre-flight (5412f42e)

3. Inference timeout + 504 (a40f9108)

Uh oh!

chrishayuk commented Jun 18, 2026

Uh oh!

gburd commented Jun 19, 2026

Uh oh!

chrishayuk commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

1. Concurrent-load OOM single-flight (`79254a94`)

2. Startup memory pre-flight (`5412f42e`)

3. Inference timeout + 504 (`a40f9108`)