Optimize LenVM guided sampling latency #3

Open. namezhenzhang wants to merge 1 commit into main from optimize-lenvm-guided-sampling

@namezhenzhang
Collaborator

Summary

Follow-up to the LenVM timing analysis in PR #2. This PR optimizes the in-process LenVM guided sampling hot path and adds a written latency summary in docs/lenvm-guided-sampling-optimization.md.

Main changes:

  • Skip in-process LenVM initialization for batches without active value guidance, and skip neutral guidance settings such as centered_exp scale 0.
  • Keep compacted candidate ids/probs/masks on GPU for expectation-style guidance, avoiding per-token CPU candidate materialization in the common path.
  • Apply guidance in place and only return changed rows, preserving sampling filters for unmodified rows without cloning the full [batch, vocab] probability tensor.
  • Add a fused LenVM prefix-extend + candidate-scoring path for tiny decode deltas.
  • Avoid .tolist() GPU syncs in Qwen2/Qwen3/Qwen2.5-VL LenVM value slicing by carrying prefix/candidate metadata in the tree-value spec.
  • Cache request EOS ids and add opt-in timing logs through SGLANG_LVM_TIMING.
  • Add focused tests for the fast path, no-op skips, baseline precheck, mixed GPU guidance modes, and tree-value mask/prefill args.
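The in-place update from the list above can be pictured with a small pure-Python sketch. It uses nested lists instead of GPU tensors, and `apply_guidance_in_place` and its arguments are hypothetical names chosen for illustration, not the PR's API. The point is only that unguided rows are never copied or touched, and that the caller learns which rows changed.

```python
from typing import Dict, List


def apply_guidance_in_place(
    probs: List[List[float]],
    guided_rows: Dict[int, Dict[int, float]],
) -> List[int]:
    """Reweight guided rows in place and return the indices of changed rows.

    probs: [batch][vocab] probabilities, mutated in place.
    guided_rows: row index -> {token id: multiplicative weight}.
    """
    changed: List[int] = []
    for row, weights in guided_rows.items():
        # Weights that are all 1.0 are a no-op, so the row is left alone.
        if all(w == 1.0 for w in weights.values()):
            continue
        dist = probs[row]
        reweighted = [p * weights.get(tok, 1.0) for tok, p in enumerate(dist)]
        total = sum(reweighted)
        if total <= 0.0:
            continue  # degenerate row: keep the original distribution
        # Slice assignment writes into the existing row object (no new row).
        dist[:] = [p / total for p in reweighted]
        changed.append(row)
    return changed
```

Rows that are absent from `guided_rows` keep whatever top-k / top-p / min-p filtering was already applied to them, which is the property the PR description calls out.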

Benchmark Summary

Reference config: 1x H100 SXM, Qwen/Qwen2.5-7B-Instruct + namezz/lvm-math-0402-a-qwen2.5-7b-instruct-b-qwen2.5-1.5b-instruct, GSM8K 50 questions x 16 samples, max_tokens=6000, T=1.0, top-p=1.0, min-p=0.01. Baseline uses top_k=-1; LenVM uses top_k=5, value_mode=centered_exp, value_scale=0.001, gamma=0.997.

| run | wall clock | avg completion tokens / choice | acc_first | acc_any |
|---|---|---|---|---|
| baseline | 19.84 s | 295.96 | 0.88 | 1.00 |
| optimized LenVM | 80.02 s | 298.36 | 0.90 | 0.96 |
| ratio | 4.03x slower | +0.81% | - | - |

Compared with the PR #2 reference result (baseline 19.22 s -> guided 87.44 s, 4.55x slower), this PR reduces the guided run wall clock by 8.5%, the slowdown ratio by 11.4%, and the incremental LenVM overhead (guided minus baseline wall clock) by 11.8%.
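The three percentages can be reproduced from the wall-clock numbers quoted here. The sketch below assumes that 19.22 s is the PR #2 baseline and 87.44 s the PR #2 guided run, and that the slowdown-ratio figure was computed from the two-decimal ratios (4.55x and 4.03x); with unrounded ratios that figure comes out slightly lower, at about 11.3%.

```python
# Wall-clock seconds quoted in this PR description.
baseline_pr2, guided_pr2 = 19.22, 87.44   # PR #2 reference (assumed baseline/guided)
baseline_new, guided_new = 19.84, 80.02   # this PR


def pct_reduction(old: float, new: float) -> float:
    """Relative reduction from old to new, in percent."""
    return 100.0 * (old - new) / old


# Slowdown ratios, rounded to two decimals as they are reported.
ratio_pr2 = round(guided_pr2 / baseline_pr2, 2)   # 4.55
ratio_new = round(guided_new / baseline_new, 2)   # 4.03

wall_clock_cut = pct_reduction(guided_pr2, guided_new)
ratio_cut = pct_reduction(ratio_pr2, ratio_new)
# Incremental overhead = guided wall clock minus baseline wall clock.
overhead_cut = pct_reduction(guided_pr2 - baseline_pr2, guided_new - baseline_new)

print(f"wall clock: -{wall_clock_cut:.1f}%")    # -8.5%
print(f"slowdown ratio: -{ratio_cut:.1f}%")     # -11.4%
print(f"LenVM overhead: -{overhead_cut:.1f}%")  # -11.8%
```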

Model memory from the same run: base 7B bf16 weights 14.30 GB; LenVM 1.5B bf16 weights 3.03 GB.

Profiling (SGLANG_LVM_TIMING=1, q10/n4/topk5/scale=0.001) shows baseline Sampler.forward at about 0.70-0.77 ms. With LenVM active, Sampler.forward takes about 13-15 ms after warmup, and LvmGuidedSampler.apply accounts for about 97% of that time. The remaining bottleneck is rows that still fall back to the two-forward LenVM path: a prefix-extend forward followed by a separate candidate-scoring launch.
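As a rough illustration of opt-in timing of this kind, the sketch below gates a timer on the `SGLANG_LVM_TIMING` environment variable. Only the variable name comes from this PR; the helper names (`timing_enabled`, `timed`), the "exactly `1` enables it" parsing, and the dict-based sink are assumptions made for the example and may differ from the real implementation.

```python
import os
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List


def timing_enabled() -> bool:
    # Assumption: timing is on only when the variable is exactly "1".
    return os.environ.get("SGLANG_LVM_TIMING", "0") == "1"


@contextmanager
def timed(name: str, sink: Dict[str, List[float]]) -> Iterator[None]:
    """Record the elapsed milliseconds of the block under `name` when enabled."""
    if not timing_enabled():
        # Disabled: no clock reads and nothing recorded.
        yield
        return
    start = time.perf_counter()
    try:
        yield
    finally:
        sink.setdefault(name, []).append((time.perf_counter() - start) * 1e3)
```

Because CUDA kernels launch asynchronously, a host-side timer like this only measures GPU work accurately if the device is synchronized (or CUDA events are used) around the timed region, which is one reason to keep such instrumentation opt-in.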

Artifacts from the Slurm runs:

  • results/timing/pr7b_q50n16_s001_84065_20260523_211347/
  • results/timing/pr7b_prof_s001_84067_20260523_212103/

Tests

  • .venv-infer/bin/python -m compileall -q ... on modified SGLang files and focused tests
  • git diff --check
  • Direct focused test invocation through .venv-infer/bin/python because this workspace does not currently have pytest installed:
    • test_lvm_guided_sampling_fast_path.py: 7 tests passed
    • test_tree_value_spec.py: 2 tests passed

Follow-up

The fused path is still often all-or-nothing at the batch level. The next high-value optimization is to split a batch into fusible and fallback rows, then run fused extend+candidate scoring for eligible rows and keep the two-phase path only for the rest.
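A minimal sketch of the proposed row-level split follows. The eligibility rule reuses the conditions that the fused-path commit message lists as unsupported (VLM inputs, `page_size != 1`, or a per-request prefix delta above one token); `RowState` and `split_fusible` are hypothetical names for illustration and do not exist in the PR.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class RowState:
    row: int                     # index of the request within the batch
    prefix_delta: int            # new prefix tokens since the last LenVM forward
    is_multimodal: bool = False  # VLM requests are not fusible


def split_fusible(
    rows: Sequence[RowState], page_size: int = 1
) -> Tuple[List[int], List[int]]:
    """Partition batch rows into (fusible, fallback) index lists."""
    fusible: List[int] = []
    fallback: List[int] = []
    for r in rows:
        ok = page_size == 1 and not r.is_multimodal and r.prefix_delta <= 1
        (fusible if ok else fallback).append(r.row)
    return fusible, fallback
```

With such a split, one slow row (for example a request that just finished a long prefill) would no longer force the whole batch onto the two-phase path.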

ChangyiYang added a commit to ChangyiYang/Length-Value-Model that referenced this pull request on May 24, 2026:
When the requested (mode, scale) is mathematically equivalent to vanilla
sampling -- centered_exp/value_bias scale=0, mul scale<=0 or scale==1,
or other expectation modes at scale==1 -- _req_wants_value_guidance now
returns False so the entire LVM path is skipped for that request.

The paper's neutral config (centered_exp scale=0) is exactly such a
no-op tilt: exp(0 * value) == 1 for all candidates, so the LVM forward
runs but contributes nothing to token selection. Skipping it should
drop wall_clock from ~72 s back to vanilla (~20 s) at scale=0, with
zero change to token-selection behavior.

Hard-constraint requests (target_value/target_length/value_constraint/
cmp/op) always need LVM, regardless of scale. Per-req result is cached
on the Req via the existing _lvm_wants_guidance slot.

Adds _get_req_value_mode_and_scale helper that caches (mode, scale)
parsing per req, keyed on custom_params identity + entries.
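The neutrality rule described in this commit message can be sketched as a pure function. The mode names and thresholds are taken from the message; treating every other mode as an "expectation mode" that is neutral at scale 1, and the function names themselves, are assumptions for illustration. `exp_tilt` is a simplified, uncentered version of the exponential tilt, included only to show why scale 0 changes nothing.

```python
import math
from typing import List, Sequence


def is_neutral(mode: str, scale: float) -> bool:
    """True when (mode, scale) is equivalent to vanilla sampling."""
    if mode in ("centered_exp", "value_bias"):
        return scale == 0.0
    if mode == "mul":
        return scale <= 0.0 or scale == 1.0
    # Assumption: any other mode is an expectation mode, neutral at scale 1.
    return scale == 1.0


def wants_value_guidance(
    mode: str, scale: float, has_hard_constraint: bool = False
) -> bool:
    # Hard constraints always need the LVM, regardless of scale.
    return has_hard_constraint or not is_neutral(mode, scale)


def exp_tilt(
    probs: Sequence[float], values: Sequence[float], scale: float
) -> List[float]:
    """Reweight probs by exp(scale * value) and renormalize."""
    weights = [p * math.exp(scale * v) for p, v in zip(probs, values)]
    total = sum(weights)
    return [w / total for w in weights]
```

At `scale=0` every weight is `exp(0) == 1`, so `exp_tilt` returns the input distribution and the LVM forward that produced `values` was wasted work.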

Co-Authored-By: Zhen Zhang <namezhenzhang@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ChangyiYang added a commit to ChangyiYang/Length-Value-Model that referenced this pull request on May 24, 2026:
Cherry-picked from upstream PR UCSB-AI#3 (namezhenzhang):

* lvm_value_utils.get_eos_token_ids: cache result on req._lvm_eos_token_ids;
  was re-walking sets every candidate filter call.
* qwen2_lvm.py / qwen3_lvm.py / qwen2_5_vl_lvm.py: drop two
  forward_batch.{extend_seq_lens,extend_prefix_lens}.tolist() calls per
  LVM forward by carrying tree_value_cached_prefix_lens in the
  TreeValueSpecInput. With ~1300 LVM forwards in the paper run, that
  removes ~2600 GPU->CPU syncs.
* tree_value_spec.py: vectorize the candidate self-attention diagonal
  using numpy fancy indexing instead of a per-token Python loop.

No algorithm change; same forward outputs.
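The per-request EOS cache in the first bullet amounts to memoizing on the request object. In the sketch below, the attribute name `_lvm_eos_token_ids` comes from the commit message, while `get_eos_token_ids_cached` and the `compute` callback are hypothetical stand-ins for the real lookup in `lvm_value_utils.get_eos_token_ids`.

```python
from typing import Any, Callable, FrozenSet, Iterable


def get_eos_token_ids_cached(
    req: Any, compute: Callable[[Any], Iterable[int]]
) -> FrozenSet[int]:
    """Return the request's EOS ids, computing them only on first use."""
    cached = getattr(req, "_lvm_eos_token_ids", None)
    if cached is None:
        # First call for this request: do the expensive walk once and store
        # an immutable copy on the request itself.
        cached = frozenset(compute(req))
        req._lvm_eos_token_ids = cached
    return cached
```

Storing the result on the request ties the cache lifetime to the request, so no separate eviction logic is needed.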

Co-Authored-By: Zhen Zhang <namezhenzhang@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ChangyiYang added a commit to ChangyiYang/Length-Value-Model that referenced this pull request on May 24, 2026:
…CSB-AI#3)

lvm_inproc_runner.py:
  - eval_candidates_batch_gpu now accepts candidate_lens_per_req so
    the runner can skip re-walking candidate_ids_per_req lists.
  - Add extend_and_eval_candidates_batch_gpu(): in steady-state decode
    each request adds at most 1 prefix token, so we can run prefix-extend
    + candidate-scoring in ONE forward (Q_len = 1 + k) with the right
    tree mask. Returns None for unsupported configurations (VLM, page_size
    != 1, or any per-req prefix delta > 1 token) so callers can fall back.
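One plausible reading of "the right tree mask" for a single request is sketched below, assuming exactly one new prefix token and k sibling candidates. Query 0 is the new prefix token and queries 1..k are the candidates. Every query sees the cached prefix and the new prefix token, and each candidate additionally sees only itself, never its siblings. `fused_tree_mask` is a hypothetical helper; the actual mask layout in `tree_value_spec.py` may differ.

```python
from typing import List


def fused_tree_mask(prefix_len: int, k: int) -> List[List[bool]]:
    """Attention mask for one fused forward with Q_len = 1 + k.

    Rows are queries (new prefix token, then k candidates); columns are keys
    (prefix_len cached tokens, then the same 1 + k new tokens).
    """
    q_len = 1 + k
    kv_len = prefix_len + q_len
    mask = [[False] * kv_len for _ in range(q_len)]
    for q in range(q_len):
        # Cached prefix plus the new prefix token are visible to every query.
        for j in range(prefix_len + 1):
            mask[q][j] = True
        # Self-attention diagonal: a candidate sees itself but no sibling.
        mask[q][prefix_len + q] = True
    return mask
```

For `prefix_len=2, k=2` this yields three rows of length five, where the two candidate rows differ only in which of the last two columns is set.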

lvm_guided_sampling.py:
  - PendingLvmResult gains candidate_lens_send so build_pending can pass
    per-row valid counts down without recomputing.
  - _Inproc adds tree_value_extend_and_launch_gpu wrapper that schedules
    the fused kernel on lvm_stream.
  - apply() GPU path now tries the fused kernel first; on None fall-back
    runs the original two-phase tree_value_extend + tree_value_launch_gpu.
  - timer.set_meta(lvm_fused_path=int(fused)) so we can quantify how
    often fusion actually triggers.

Our build_pending CPU/sync trim + vectorized scatter + per-req cache
are all preserved; the fused path is opt-in on the GPU branch only.
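The fused-first control flow described above reduces to a "try, then fall back on `None`" pattern. The sketch below uses plain callables in place of the real runner methods; `score_candidates` and its parameters are hypothetical, and the returned flag plays the role of the `lvm_fused_path` timing metadata.

```python
from typing import Any, Callable, List, Optional, Tuple

Values = List[List[float]]


def score_candidates(
    batch: Any,
    fused: Callable[[Any], Optional[Values]],
    extend: Callable[[Any], None],
    launch: Callable[[Any], Values],
) -> Tuple[Values, bool]:
    """Return (candidate values, used_fused_path)."""
    values = fused(batch)
    if values is not None:
        return values, True
    # The fused path declined (unsupported configuration), so run the
    # original two-phase path: prefix extend, then candidate scoring.
    extend(batch)
    return launch(batch), False
```

Returning `None` rather than raising keeps the unsupported cases cheap and lets the caller record how often fusion actually triggers.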

Co-Authored-By: Zhen Zhang <namezhenzhang@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>