Optimize LenVM guided sampling latency by namezhenzhang · Pull Request #3 · UCSB-AI/Length-Value-Model

namezhenzhang · 2026-05-23T21:50:12Z

Summary

Follow-up to the LenVM timing analysis in PR #2. This PR optimizes the in-process LenVM guided sampling hot path and adds a written latency summary in docs/lenvm-guided-sampling-optimization.md.

Main changes:

- Skip in-process LenVM initialization for batches without active value guidance, and skip neutral guidance settings such as centered_exp scale 0.
- Keep compacted candidate ids/probs/masks on GPU for expectation-style guidance, avoiding per-token CPU candidate materialization in the common path.
- Apply guidance in place and only return changed rows, preserving sampling filters for unmodified rows without cloning the full [batch, vocab] probability tensor.
- Add a fused LenVM prefix-extend + candidate-scoring path for tiny decode deltas.
- Avoid .tolist() GPU syncs in Qwen2/Qwen3/Qwen2.5-VL LenVM value slicing by carrying prefix/candidate metadata in the tree-value spec.
- Cache request EOS ids and add opt-in timing logs through SGLANG_LVM_TIMING.
- Add focused tests for the fast path, no-op skips, baseline precheck, mixed GPU guidance modes, and tree-value mask/prefill args.

Benchmark Summary

Reference config: 1x H100 SXM, Qwen/Qwen2.5-7B-Instruct + namezz/lvm-math-0402-a-qwen2.5-7b-instruct-b-qwen2.5-1.5b-instruct, GSM8K 50 questions x 16 samples, max_tokens=6000, T=1.0, top-p=1.0, min-p=0.01. Baseline uses top_k=-1; LenVM uses top_k=5, value_mode=centered_exp, value_scale=0.001, gamma=0.997.

run	wall clock	avg completion tokens / choice	acc_first	acc_any
baseline	19.84 s	295.96	0.88	1.00
optimized LenVM	80.02 s	298.36	0.90	0.96
ratio	4.03x slower	+0.81%	-	-

Compared with the PR #2 reference result, 19.22 s -> 87.44 s (4.55x slower), this reduces the guided run wall clock by 8.5%, the slowdown ratio by 11.4%, and the incremental LenVM overhead by 11.8%.
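
The three percentages above can be reproduced from the quoted wall-clock numbers. The short check below uses only figures stated in this description; the variable names are illustrative. The slowdown-ratio figure is computed from the ratios rounded to two decimals (4.55x and 4.03x), as quoted; with unrounded ratios it comes out at about 11.3%.

```python
# Wall-clock numbers quoted in this PR description (seconds).
pr2_baseline, pr2_guided = 19.22, 87.44   # PR #2 reference run
pr3_baseline, pr3_guided = 19.84, 80.02   # this PR


def pct_drop(old, new):
    """Relative reduction from old to new, in percent."""
    return 100.0 * (old - new) / old


# Slowdown ratios, rounded to two decimals as they are quoted in the text.
pr2_ratio = round(pr2_guided / pr2_baseline, 2)   # 4.55
pr3_ratio = round(pr3_guided / pr3_baseline, 2)   # 4.03

# Incremental LenVM overhead = guided wall clock minus baseline wall clock.
pr2_overhead = pr2_guided - pr2_baseline
pr3_overhead = pr3_guided - pr3_baseline

wall_clock_drop = pct_drop(pr2_guided, pr3_guided)
ratio_drop = pct_drop(pr2_ratio, pr3_ratio)
overhead_drop = pct_drop(pr2_overhead, pr3_overhead)

print(f"guided wall clock: -{wall_clock_drop:.1f}%")
print(f"slowdown ratio:    -{ratio_drop:.1f}%")
print(f"LenVM overhead:    -{overhead_drop:.1f}%")
```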

Model memory from the same run: base 7B bf16 weights 14.30 GB; LenVM 1.5B bf16 weights 3.03 GB.

Profiling (SGLANG_LVM_TIMING=1, q10/n4/topk5/scale=0.001) shows baseline Sampler.forward at about 0.70-0.77 ms. With LenVM active, Sampler.forward is about 13-15 ms after warmup and LvmGuidedSampler.apply accounts for about 97% of it. The remaining bottleneck is still rows that fall back to the two-forward LenVM path: prefix extend plus candidate launch.

Artifacts from the Slurm runs:

results/timing/pr7b_q50n16_s001_84065_20260523_211347/
results/timing/pr7b_prof_s001_84067_20260523_212103/

Tests

.venv-infer/bin/python -m compileall -q ... on modified SGLang files and focused tests
git diff --check
Direct focused test invocation through .venv-infer/bin/python because this workspace does not currently have pytest installed:
- test_lvm_guided_sampling_fast_path.py: 7 tests passed
- test_tree_value_spec.py: 2 tests passed

Follow-up

The fused path is still often all-or-nothing at the batch level. The next high-value optimization is to split a batch into fusible and fallback rows, then run fused extend+candidate scoring for eligible rows and keep the two-phase path only for the rest.
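
A minimal sketch of the proposed row split, assuming the only per-row eligibility criterion is a prefix delta of at most one token (the limit described for the fused path). `split_fusible_rows` and its arguments are hypothetical; batch-wide conditions such as VLM inputs or page size are not modeled here.

```python
def split_fusible_rows(prefix_deltas, max_fused_delta=1):
    """Partition batch row indices into fused-eligible and fallback rows.

    prefix_deltas[i] is the number of new prefix tokens row i must extend
    before its candidates can be scored (an assumed representation).
    """
    fusible, fallback = [], []
    for row, delta in enumerate(prefix_deltas):
        if delta <= max_fused_delta:
            fusible.append(row)    # one fused extend + candidate forward
        else:
            fallback.append(row)   # keep the two-phase path for this row
    return fusible, fallback


fusible_rows, fallback_rows = split_fusible_rows([1, 0, 3, 1, 2])
print(fusible_rows, fallback_rows)  # [0, 1, 3] [2, 4]
```

With such a split, one slow row would no longer force the whole batch onto the two-forward path.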

When the requested (mode, scale) is mathematically equivalent to vanilla sampling (centered_exp/value_bias scale=0, mul scale<=0 or scale==1, or other expectation modes at scale==1), _req_wants_value_guidance now returns False so the entire LVM path is skipped for that request.

The paper's neutral config (centered_exp scale=0) is exactly such a no-op tilt: exp(0 * value) == 1 for all candidates, so before this change the LVM forward ran but contributed nothing to token selection. Skipping it should drop wall_clock from ~72 s back to vanilla (~20 s) at scale=0, with zero change to token-selection behavior.

Hard-constraint requests (target_value/target_length/value_constraint/cmp/op) always need LVM, regardless of scale.

The per-request result is cached on the Req via the existing _lvm_wants_guidance slot. This commit also adds a _get_req_value_mode_and_scale helper that caches (mode, scale) parsing per request, keyed on custom_params identity + entries.

Co-Authored-By: Zhen Zhang <namezhenzhang@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
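
The neutrality rules in this commit message can be sketched as a small predicate. This is an illustration, not the code in the PR: `is_neutral_guidance`, `has_hard_constraint`, and `exp_tilt` are hypothetical names, and `exp_tilt` is a plain exponential reweighting that does not model whatever centering centered_exp applies. It only demonstrates that a scale of 0 leaves the distribution unchanged.

```python
import math

# Modes the commit message describes as neutral at scale == 0.
ZERO_NEUTRAL_MODES = {"centered_exp", "value_bias"}


def is_neutral_guidance(mode, scale, has_hard_constraint=False):
    """Return True when (mode, scale) is equivalent to vanilla sampling."""
    if has_hard_constraint:
        return False                      # hard constraints always need LVM
    if mode in ZERO_NEUTRAL_MODES:
        return scale == 0
    if mode == "mul":
        return scale <= 0 or scale == 1
    return scale == 1                     # other expectation modes


def exp_tilt(probs, values, scale):
    """Reweight probs by exp(scale * value) and renormalize (simplified)."""
    weights = [p * math.exp(scale * v) for p, v in zip(probs, values)]
    total = sum(weights)
    return [w / total for w in weights]


probs = [0.5, 0.3, 0.2]
values = [10.0, 200.0, 3000.0]
# At scale 0 every weight is p * 1, so the output matches the input.
print(exp_tilt(probs, values, 0.0))
print(is_neutral_guidance("centered_exp", 0))      # True
print(is_neutral_guidance("centered_exp", 0.001))  # False
```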

Cherry-picked from upstream PR UCSB-AI#3 (namezhenzhang):

* lvm_value_utils.get_eos_token_ids: cache result on req._lvm_eos_token_ids; was re-walking sets every candidate filter call.
* qwen2_lvm.py / qwen3_lvm.py / qwen2_5_vl_lvm.py: drop two forward_batch.{extend_seq_lens,extend_prefix_lens}.tolist() calls per LVM forward by carrying tree_value_cached_prefix_lens in the TreeValueSpecInput. With ~1300 LVM forwards in the paper run, that removes ~2600 GPU->CPU syncs.
* tree_value_spec.py: vectorize the candidate self-attention diagonal using numpy fancy indexing instead of a per-token Python loop.

No algorithm change; same forward outputs.

Co-Authored-By: Zhen Zhang <namezhenzhang@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
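
To illustrate the diagonal vectorization mentioned for tree_value_spec.py, the sketch below compares a per-token loop with numpy fancy indexing. The mask layout is an assumption made for the example (k candidate rows, prefix_len + k columns, candidate i owning column prefix_len + i); the actual layout in tree_value_spec.py may differ.

```python
import numpy as np


def diagonal_loop(mask, prefix_len, k):
    """Per-token Python loop: candidate i attends to its own column."""
    for i in range(k):
        mask[i, prefix_len + i] = True
    return mask


def diagonal_vectorized(mask, prefix_len, k):
    """Same result with one fancy-indexed assignment."""
    rows = np.arange(k)
    mask[rows, prefix_len + rows] = True
    return mask


prefix_len, k = 4, 3
base = np.zeros((k, prefix_len + k), dtype=bool)
base[:, :prefix_len] = True  # every candidate attends to the shared prefix

looped = diagonal_loop(base.copy(), prefix_len, k)
vectorized = diagonal_vectorized(base.copy(), prefix_len, k)
print(np.array_equal(looped, vectorized))  # True
```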

…CSB-AI#3)

lvm_inproc_runner.py:
- eval_candidates_batch_gpu now accepts candidate_lens_per_req so the runner can skip re-walking candidate_ids_per_req lists.
- Add extend_and_eval_candidates_batch_gpu(): in steady-state decode each request adds at most 1 prefix token, so we can run prefix-extend + candidate-scoring in ONE forward (Q_len = 1 + k) with the right tree mask. Returns None for unsupported configurations (VLM, page_size != 1, or any per-req prefix delta > 1 token) so callers can fall back.

lvm_guided_sampling.py:
- PendingLvmResult gains candidate_lens_send so build_pending can pass per-row valid counts down without recomputing.
- _Inproc adds a tree_value_extend_and_launch_gpu wrapper that schedules the fused kernel on lvm_stream.
- The apply() GPU path now tries the fused kernel first; if it returns None, apply() falls back to the original two-phase tree_value_extend + tree_value_launch_gpu.
- timer.set_meta(lvm_fused_path=int(fused)) so we can quantify how often fusion actually triggers.

Our build_pending CPU/sync trim + vectorized scatter + per-req cache are all preserved; the fused path is opt-in on the GPU branch only.

Co-Authored-By: Zhen Zhang <namezhenzhang@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
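
The try-fused-then-fall-back control flow described in this commit can be sketched as follows. All function names and return values here are hypothetical stand-ins; only the three unsupported conditions (VLM, page_size != 1, any per-request prefix delta above one token) and the None-means-fall-back convention come from the commit message.

```python
def fused_extend_and_eval(prefix_deltas, is_vlm=False, page_size=1):
    """Stand-in for the fused path: returns None when it cannot run."""
    if is_vlm or page_size != 1 or any(d > 1 for d in prefix_deltas):
        return None
    # One forward covers the prefix token plus k candidates (Q_len = 1 + k).
    return {"path": "fused", "forwards": 1}


def two_phase_extend_then_eval(prefix_deltas):
    """Stand-in for the original path: prefix extend, then candidate launch."""
    return {"path": "two_phase", "forwards": 2}


def score_candidates(prefix_deltas, is_vlm=False, page_size=1):
    result = fused_extend_and_eval(prefix_deltas, is_vlm, page_size)
    fused = result is not None
    if not fused:
        result = two_phase_extend_then_eval(prefix_deltas)
    # Recording the flag mirrors the lvm_fused_path metadata idea.
    result["lvm_fused_path"] = int(fused)
    return result


print(score_candidates([1, 0, 1]))  # fused path
print(score_candidates([1, 2]))     # one row has delta 2, so two-phase
```

Because a single ineligible row returns None for the whole call, this sketch also shows why the fused path is all-or-nothing at the batch level.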

Optimize LenVM guided sampling latency

374565b

namezhenzhang wants to merge 1 commit into main from optimize-lenvm-guided-sampling
