Skip to content

fix(server): concurrent-load OOM, startup memcheck, and inference timeout#157

Open
gburd wants to merge 6 commits into
chrishayuk:mainfrom
gburd:pr/server-reliability
Open

fix(server): concurrent-load OOM, startup memcheck, and inference timeout#157
gburd wants to merge 6 commits into
chrishayuk:mainfrom
gburd:pr/server-reliability

Conversation

@gburd

@gburd gburd commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Three independent larql-server reliability fixes from a production incident where the first POST /v1/infer against an inference-level vindex on a memory-constrained host OOM-killed the process, and subsequent calls wedged the entire HTTP handler in futex_wait.

1. Concurrent-load OOM single-flight (79254a94)

ensure_weights_cell let two concurrent first-callers both run the multi-GB weight load — peak heap was 2x the weights. Now single-flighted behind an init mutex with a double-checked OnceLock fast path; one thread loads, the rest wait and share. Eager weight load moved into bootstrap::serve (before the listener binds) so a load OOM is a clean startup failure, not a first-request kill. --lazy-weights opt-out preserved.

2. Startup memory pre-flight (5412f42e)

New memcheck module + VindexConfig::estimate_resident_bytes(). At startup, compares the estimated resident size against the cgroup memory.max and refuses to bind with a one-line actionable error rather than entering a 10-second OOM-kill loop. --no-memcheck / --memcheck-headroom-mib flags.

3. Inference timeout + 504 (a40f9108)

/v1/infer now races the blocking forward pass against a server-side timeout (--infer-timeout-secs, default 60, 0 disables). On timeout it returns 504 and drops the join handle instead of leaving the handler wedged forever. ServerError::Timeout.

All three are BitNet-agnostic and apply to any inference-level vindex. cargo test -p larql-server -p larql-vindex --lib: 321 + 1081 passed.

(The lock-discipline wedge fix and the BitNet native-ternary inference path are separate follow-ups; this PR is the host-reliability subset that stands alone.)

@chrishayuk

Copy link
Copy Markdown
Owner

Really nice set of production fixes — and good incident write-up. Two of the three are solid as-is:

  • Single-flight load — the double-checked OnceLock + init-mutex is correct: lock-free fast path, re-check under the lock, clean poison recovery (a failed load leaves the cell empty so a retry re-runs). Moving the eager load ahead of bind is the right call.
  • Inference timeout → 504 — semantics are honest, including the explicit note that spawn_blocking isn't cancellable so the thread runs to completion. timeout=0 bypass is clean. (The sessions read-lock change is a good independent fix too.)

One that needs a tweak before it's safe to rely on — the memcheck estimate.
estimate_resident_bytes() sizes from bytes_per_float(dtype) (f16=2/f32=4) but ignores the quant: QuantFormat field — so a Q4_K/Q6_K vindex (~4.5 bits/elem) is over-counted ~3–4×, and memcheck would refuse to start a quantized model that actually fits. Since BitNet/quantized is the main reason to care about RSS, that undercuts the feature. Either factor QuantFormat into the estimate, or auto-skip memcheck (or document it as f16/f32-only) when quant != None. It's fine behind --no-memcheck for now, but worth fixing before it gates quantized deployments. (Also: --lazy-weights silently skips memcheck too — fine, just flag it in the help text.)

Process nit: these are three genuinely independent fixes (#1 and #3 are ready, #2 needs a round). If it's easy to split memcheck into its own PR, the two solid fixes can land immediately rather than waiting on the estimate work.

Heads-up: main is green again now — a rebase + cargo fmt on this PR's new files clears the CI fmt failures.

gburd added 4 commits June 19, 2026 09:40
…infer-deadlock.md)

Three root causes were tangled together in the original report; this
commit fixes all three.  Architectural items (\u00a75.4 quantized
inference, \u00a75.5 memory pre-flight, \u00a75.6 timeout semantics) stay
deferred to follow-on work.

1. Sessions writer held across forward pass (BUG \u00a74.3, \u00a75.3).
   ----------------------------------------------------------------
   routes/infer.rs:130 took `sessions_blocking_write()` and held it
   across the multi-second `run_walk(&session.patched)` call.  On
   the multi-thread tokio runtime this serialised every concurrent
   /v1/infer globally and queued any apply_patch / list_patches /
   remove_patch behind it.  The fix is one-character minimal:
   reader instead of writer.  This mirrors session.rs::apply_patch
   which already documents the same anti-pattern (lines 99\u2013106).

   - session.rs: new `sessions_blocking_read()` helper alongside
     the existing `sessions_blocking_write()`.
   - routes/infer.rs: switch the sessioned-walk branch to a reader.

   Test: routes/infer.rs::sessions_reader_does_not_serialize_
   concurrent_callers \u2014 8 threads each holding the reader across a
   100ms sleep.  Pre-fix wall time would be 800ms (writer
   serialised); post-fix is ~100ms.  Asserted < 400ms.

2. Lazy weight-load race double-allocates ~5 GB (BUG \u00a74.1, \u00a75.1).
   ----------------------------------------------------------------
   state.rs::ensure_weights_cell did:
     if let Some(cell) = self.weights.get() { return Ok(cell); }
     // ... allocate ~5 GB ModelWeights ...
     let _ = self.weights.set(...);
   With concurrent first-callers both observe `get() == None`, both
   run `load_model_weights_with_opts`, both produce a ~5 GB
   allocation, and only the first wins via OnceLock::set \u2014 but
   during the load BOTH are live, so peak heap is 2x weights.  On a
   2 B BitNet vindex that's ~10 GB and OOMs the cgroup.

   Fix: add a `weights_init: std::sync::Mutex<()>` field to
   LoadedModel.  The fast path stays lock-free (OnceLock::get
   short-circuits before the mutex is touched).  The slow path
   takes the init mutex, double-checks OnceLock under the mutex,
   then runs the load \u2014 single-flight.  Poison recovery uses
   `unwrap_or_else(|p| p.into_inner())` so a panic during load
   does not permanently wedge the model.

   Test: state.rs::ensure_weights_cell_single_flights_concurrent_
   loaders \u2014 8 threads racing the slow path; max concurrent
   occupants observed must be 1.  Plus
   weights_init_mutex_is_unpoisonable_recoverable for the panic
   path.

3. Lazy load on the request thread is mis-sized for cgroup hosts
   (BUG \u00a74.1, \u00a75.2).
   ----------------------------------------------------------------
   Even with single-flight, allocating ~5 GB on a request handler
   under HTTP backpressure is the wrong shape on memory-bounded
   hosts: the OOM-kill happens during request processing (with
   pg_infer's HTTP client timing out after 30s, no useful error
   surfaced).  Fix: load weights eagerly at `bootstrap::serve`
   *before* axum binds the listener.  Failure becomes a clean
   startup error rather than a SIGKILL during the first request.

   - state.rs: new `force_load_weights()` method that calls
     ensure_weights_cell when weights are loadable, short-circuits
     when infer_disabled or browse-only.
   - bootstrap.rs: new --lazy-weights flag (default off) for
     backward-compat / opt-out.  When unset, every model with
     loadable weights is force-loaded once before the listener
     binds.  Logs per-model load wall time.  Failure returns from
     serve() with a contextual error rather than panicking later.

   Tests: state.rs::force_load_weights_skips_when_infer_disabled
   and force_load_weights_skips_browse_only_vindex \u2014 ensures the
   eager path is a no-op for models that have nothing to load,
   so --no-infer / --ffn-only / --embed-only / browse vindexes
   don't pay the cost (and don't crash trying to mmap absent
   files).

Test status:
  cargo test -p larql-server --lib: 311 passed, 0 failed.

Out of scope for this commit (tracked separately):
  - \u00a75.4 quantized inference for BitNet \u2014 architectural, needs
    a CPU kernel for ternary x int8 matmul, sized as a major PR.
  - \u00a75.5 memory-headroom self-check \u2014 nice-to-have guardrail; the
    eager-load fix already converts runtime OOMs to startup OOMs,
    which is the operational improvement the bug report wanted.
  - \u00a75.6 timeout semantics \u2014 dropping the inference future on
    client cancel is a separate restructure of `spawn_blocking`
    surface area; the deadlock fix means a single timeout no
    longer wedges the whole handler regardless.
Read /sys/fs/cgroup/<self>/memory.max at startup and refuse to start
when the configured cgroup leaves no room to load weights.  Converts
the 10-second OOM-kill loop documented in BUG-infer-deadlock into a
one-line, actionable startup error.

larql-vindex
  config/index.rs: new VindexConfig::estimate_resident_bytes() and
  has_inference_weights() helpers.  Estimate models per-layer attn
  (4 * hidden * hidden * elem) + per-layer FFN (3 * hidden * inter
  * elem) + embed + lm_head + 12 % overhead.  Browse-only vindexes
  fall through to a gate-vector + embeddings sum.  Tested against
  the BitNet b1.58 2 B 4 T shape (30 layers, 2560/6912/128 256, f16):
  estimate lands in the 4–8 GB band that matches the production
  triage RSS curve in the bug report (peak ~5.0 GB).

larql-server
  memcheck.rs: new module reading /proc/self/cgroup → resolving the
  cgroup v2 unified path → reading memory.max.  parse_memory_max
  recognises 'max' as unlimited (returns Ok(None)).  When the
  hierarchy is cgroup v1 or memory.max is unreadable, the check
  falls through with a Skipped outcome (no spurious failures on
  developer workstations).

  Three outcomes: Ok / Skipped / Tight.  explain_tight_outcome
  formats a single-line operator-friendly error including the
  required MemoryMax bump, --lazy-weights opt-out, and
  --no-memcheck override.

  bootstrap.rs: new --no-memcheck and --memcheck-headroom-mib=512
  flags.  Pre-flight runs after model load (so estimate sums over
  every loaded vindex) and before the eager-load loop (so a tight
  cgroup never reaches the 5 GB allocation that would OOM).  Models
  marked infer_disabled are excluded from the estimate.

Tests
  larql-vindex: 4 new resident_size_tests, including a regression
  test pinned to the BitNet b1.58 2 B 4 T shape so future loader
  changes can't silently drift the estimator off the production
  baseline.

  larql-server: 7 new memcheck::tests covering parse_memory_max
  unlimited / numeric / garbage paths, the Tight branch's message
  format (must contain --lazy-weights and --no-memcheck guidance),
  zero-estimate short-circuit (–-no-infer / --ffn-only / --embed-only
  paths must never trip the check), and Ok-outcome explanation
  returns empty.

Build clean, 318 server tests + 4 new vindex tests = all green.
When an inference exceeds the server-side timeout, the handler
responds 504 Gateway Timeout and drops the spawn_blocking
JoinHandle.  The blocking thread runs to completion in the
background and its result is discarded.  The handler is unblocked
immediately, so the next /v1/infer succeeds without waiting for
the timed-out inference to finish.

Combined with the earlier deadlock fix, a single client-side
timeout no longer wedges the entire HTTP handler -- that was the
'every subsequent call hangs in futex_wait' symptom in the bug
report (S3.4).

error.rs
  New ServerError::Timeout(String) variant -> HTTP 504.
  routes/openai/error.rs picks up an OpenAIError::timeout(...)
  constructor (status 504, error_type 'timeout_error') so
  /v1/chat/completions and /v1/completions surface the same
  semantics through the OpenAI-compat error envelope.

state.rs
  AppState gains an infer_timeout Duration field plumbed from
  the new --infer-timeout-secs CLI flag (default 60s; 0 disables).

routes/infer.rs
  New pub(crate) run_infer_with_timeout() that wraps the existing
  run_infer in tokio::time::timeout against the spawn_blocking
  JoinHandle.  On timeout: log a warn-level message with the
  observed elapsed time, drop the handle, return
  ServerError::Timeout.  When timeout.is_zero(), pass through to
  plain handle.await for backward compat.
  Both handle_infer and handle_infer_multi route through it.

Tests (cargo test -p larql-server --lib: 321/321 green, +3):
  - timeout_drops_handler_without_blocking_subsequent_requests
    Multi-thread runtime, slow blocking task (800ms), 100ms
    timeout.  Asserts: returns within 300ms of start (not after
    the slow task finishes), is Err(Timeout), and a fresh
    spawn_blocking after the timeout completes within 200ms (the
    handler is not wedged).
  - timeout_zero_passes_through_slow_inference
    Pins backward-compat: timeout=0 returns the actual inference
    result instead of a synthetic timeout.
  - timeout_error_maps_to_504
    StatusCode::GATEWAY_TIMEOUT contract for ServerError::Timeout.

bootstrap.rs / tests/ / routes/stream.rs
  All AppState construction sites updated to set
  infer_timeout to Duration::from_secs(60) (or infer_timeout_secs
  from the CLI in production).  Logs the active timeout at
  startup, including 'disabled' when 0.
… fmt

Addresses PR chrishayuk#157 review:

- estimate_resident_bytes() now returns 0 for quantized vindexes
  (quant != None).  It sized weights at bytes_per_float(dtype)\n  (f16=2/f32=4), which over-counts a Q4_K vindex (~4.5 bits/elem)\n  3-4x \u2014 memcheck would refuse a quantized model that actually\n  fits, defeating the feature for the models that most need RSS\n  care.  A 0 estimate makes the startup pre-flight skip it (the\n  bootstrap loop already treats total_estimate==0 as skip).  Dense\n  f16/f32 vindexes \u2014 what the estimator was validated against \u2014\n  still get a real estimate.  New test estimate_skips_quantized.\n\n- Documented that --lazy-weights also skips the startup memcheck\n  (nothing to size before a deferred load) in the flag help.\n\n- cargo fmt on the new files (clears the CI fmt failure).\n\ncargo test -p larql-server -p larql-vindex --lib: 326 + 1087 passed.
@gburd gburd force-pushed the pr/server-reliability branch from a40f910 to f5d273f Compare June 19, 2026 14:22
@gburd

gburd commented Jun 19, 2026

Copy link
Copy Markdown
Contributor Author

Thanks — fixed the memcheck estimate and rebased + cargo fmt (clears the CI fmt failures).

memcheck estimateestimate_resident_bytes() now returns 0 for quant != None, so the startup pre-flight skips quantized vindexes instead of over-counting them ~3-4× and refusing a model that fits. The bootstrap loop already treats a 0 total as skip. Dense f16/f32 — the case the estimator was validated against — still gets a real estimate. New test estimate_skips_quantized covers both (Q4_K → 0, dense f16 → non-zero).

I went with the skip-when-quantized option rather than modelling each format's exact resident layout — that's fragile (and for the BitNet path the weights stay packed at ~1.6 bpw, so there's no single bits/elem to plug in). Happy to switch to a per-format estimate later if you'd prefer real numbers there.

--lazy-weights — documented in the flag help that it also skips memcheck (nothing to size before a deferred load).

On the split: the estimate's now fixed so all three are ready as-is, but if you'd still rather land the two solid fixes (single-flight + timeout) separately from the memcheck work, say the word and I'll peel memcheck into its own PR.

cargo test -p larql-server -p larql-vindex --lib: 326 + 1087 passed.

@chrishayuk

Copy link
Copy Markdown
Owner

awesome, thanks. i'll merge this in tonight, and then do any cleanups i need to post merge. thanks again for the contribution

…ctors

The reliability changes added `weights_init` (single-flight load guard)
to `LoadedModel` and `infer_timeout` to `AppState`, but the test helpers
and the two examples still built those structs by literal, so
`cargo test -p larql-server` failed to compile on all platforms.
Add the missing fields (mirroring the production constructors).
clippy::manual_range_contains fires under -D warnings on the
`gb >= 4.0 && gb <= 8.0` assert added with the memcheck estimator test.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants