Add LenVM inference timing analysis (baseline vs LenVM-guided) #2

Open
ChangyiYang wants to merge 5 commits into UCSB-AI:main from ChangyiYang:lenvm-timing-analysis

@ChangyiYang ChangyiYang commented May 22, 2026

Summary

Adds an end-to-end + per-decoding-step latency comparison between vanilla SGLang sampling and LenVM-guided sampling, motivated by the LenVM paper review's repeated ask for inference overhead numbers (VVLr / Msb9 / iyxs: wall-clock cost, candidate-set / value-scoring breakdown, FLOPs, batching behavior).

  • sglang-LenVM/python/sglang/srt/lvm/timing.py: env-gated (SGLANG_LVM_TIMING_LOG) per-step JSONL logger; no-op when unset.
  • Instruments Sampler.forward (pre-LVM / LVM apply / sample sections) and LvmGuidedSampler.apply (build_pending / LVM forward / apply_guidance).
  • scripts/inference/lenvm_timing.sh: orchestrator that runs two server lifecycles (baseline → LenVM in-proc) against the same GSM8K prompt set.
  • inference/timing/run_timing.py: wraps sample_eval and records wall-clock + nvidia-smi samples + token-count summary.
  • inference/timing/flops.py: layer-level theoretical FLOPs estimator that reads each model's HF config.json and counts Q/K/V/O projections (GQA-aware), SwiGLU MLP, position-dependent attention (Q@K^T + attn@V), and the top-of-stack head separately. ModelConfig.head_type distinguishes the base model's lm_head (2*d*V FLOPs for the vocab projection) from a LenVM checkpoint's MLP2SiLUValueHead (2*(d*d + d*1) FLOPs for a d→d Linear plus a d→1 Linear); it is autodetected by looking for value_head.safetensors next to config.json or a LengthValueModel-style architecture in config.json. Runs are split into prefill (charged once per unique prompt; assumes the SGLang prefix cache is on) and decode (charged per sample). LenVM-guided runs add one tree_value_extend + k candidate forwards per generated token. candidate_cost_multiplier defaults to 1.0 (current sglang-LenVM in-proc behavior, where each candidate is a separate single-token forward sharing only the extended KV cache); set it lower if a future implementation batches the k candidates.
  • inference/timing/analyze.py: aggregates both JSONL streams + theoretical FLOPs; emits summary.csv / summary.json, three stdout tables, and three plots.
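
For readers unfamiliar with env-gated instrumentation, the following is a minimal sketch of how a per-step JSONL timer of this kind could look. It is not the contents of timing.py: the class name `StepTimer`, the `section` / `flush` methods, and the `sections_ms` record field are hypothetical; only the environment variable name comes from the description above.

```python
import json
import os
import time
from contextlib import contextmanager

ENV_VAR = "SGLANG_LVM_TIMING_LOG"  # name taken from the PR description


class StepTimer:
    """Hypothetical per-step timer: accumulates section durations, writes JSONL."""

    def __init__(self):
        # Resolve the log path once; None means timing is disabled.
        self.path = os.environ.get(ENV_VAR) or None
        self.sections = {}

    @contextmanager
    def section(self, name):
        if self.path is None:
            yield  # disabled: no clock reads, no bookkeeping
            return
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1e3
            self.sections[name] = self.sections.get(name, 0.0) + elapsed_ms

    def flush(self, **extra):
        """Append one JSON line for the current step, then reset the sections."""
        if self.path is None:
            return
        record = {"sections_ms": self.sections, **extra}
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        self.sections = {}
```

A caller would wrap each region, for example `with timer.section("lvm_forward"): ...`, and call `timer.flush(lvm_active=True)` once per decoding step. Because this measures Python wall-clock time, it absorbs any implicit GPU syncs inside the timed region rather than isolating kernel time.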

Results

Setup: 1× H100 SXM, Qwen2.5-7B-Instruct + lvm-math-0402-a-qwen2.5-7b-instruct-b-qwen2.5-1.5b-instruct, GSM8K 50 questions × 16 samples, max_tokens=6000, T=1.0, top-p=1.0, baseline top-k=-1, LenVM top-k=5 value-scale=0 (paper config).

End-to-end + sampler-side per-step

| metric | baseline | LenVM | ratio |
|---|---|---|---|
| end-to-end wall clock (s) | 19.22 | 87.44 | 4.55× slower |
| throughput (output tok/s) | 12,366 | 2,719 | 4.55× drop |
| total output tokens | 237,712 | 237,724 | ~same |
| mean tokens / question | 4,754 | 4,754 | ~same |
| sampler steps | 1,463 | 1,386 | ~same |
| mean Sampler.forward (ms / step) | 0.19 | 50.31 | 268× |
| ├ pre-LVM (preprocess + softmax) | 0.07 | 0.11 | ~same |
| ├ LenVM apply() outer | 0.00 | 50.03 | |
| └ sample kernel | 0.12 | 0.17 | ~same |

LenVM apply() internal decomposition (mean ms / step):

| section | mean ms | share of apply() |
|---|---|---|
| build_pending (CPU candidate + prefix prep) | 12.25 | 23% |
| LenVM forward (extend + launch + collect on 1.5B value model) | 36.58 | 67% |
| apply_guidance (CPU value → probs writeback) | 5.29 | 10% |
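
The per-section means and shares in the table above can be recomputed from the per-step records with a small aggregation. The sketch below assumes each JSONL record carries a `sections_ms` dict keyed by section name; that field name and the short section keys are hypothetical and may not match what analyze.py actually reads. Shares are taken relative to the sum of the three section means.

```python
from collections import defaultdict


def mean_sections(records):
    """Return {section: mean ms} over the records that contain that section."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        for name, ms in rec.get("sections_ms", {}).items():
            totals[name] += ms
            counts[name] += 1
    return {name: totals[name] / counts[name] for name in totals}


def shares(means):
    """Return each section's fraction of the summed section means."""
    total = sum(means.values())
    return {name: ms / total for name, ms in means.items()}


# Reported means from the table above (ms / step); keys are shortened labels.
reported = {"build_pending": 12.25, "lvm_forward": 36.58, "apply_guidance": 5.29}
for name, frac in shares(reported).items():
    print(f"{name}: {frac:.1%}")
```

Steps where the LenVM hook is inactive simply lack the LenVM sections, so averaging only over records that contain a section keeps greedy or baseline steps from diluting the means.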

Layer-level theoretical FLOPs vs measured wall-clock (k=5)

flops.py reads each model's HF config.json and counts matmuls layer by layer: Q/K/V/O projections, SwiGLU MLP, position-dependent attention (Q@K^T + attn@V), and head (vocab projection for the base model, MLP2SiLUValueHead for the LenVM checkpoint — d*d + d*1, not d*V). LenVM-extra FLOPs = LenVM prefill (per unique prompt) + 1 tree_value_extend + k = 5 candidate forwards per output token (matching LENVM_TOP_K=5 in the run config above).

| metric | baseline | LenVM | ratio |
|---|---|---|---|
| per-output-token cost (GFLOPs) | 14.24 | 30.24 | 2.12× more FLOPs |
| total compute (PFLOPs) | 3.46 | 7.28 | 2.10× |
| achieved throughput (TFLOPs/s) | 179.9 | 83.2 | 0.46× |
| H100 bf16 peak utilization | ~18% | ~9% | |
| wall-clock | 19.22 s | 87.44 s | 4.55× |
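
The layer-level accounting described above can be sketched as follows. This is an illustration, not flops.py itself: the `Dims` class and function names are hypothetical, and the hard-coded dimensions are assumed values for the Qwen2.5-7B and Qwen2.5-1.5B configs; in practice they should be read from each model's config.json. Attention-projection biases are ignored.

```python
from dataclasses import dataclass


@dataclass
class Dims:
    hidden: int    # hidden_size
    layers: int    # num_hidden_layers
    q_heads: int   # num_attention_heads
    kv_heads: int  # num_key_value_heads
    inter: int     # intermediate_size
    vocab: int     # vocab_size
    head_type: str = "lm_head"  # or "value_head"


def linear_flops(c: Dims) -> int:
    """Per-token FLOPs of all weight matmuls inside the transformer layers."""
    head_dim = c.hidden // c.q_heads
    kv_dim = c.kv_heads * head_dim                 # GQA: K/V are narrower than Q
    attn_proj = 2 * c.hidden * c.hidden + 2 * c.hidden * kv_dim  # Q, O and K, V
    mlp = 3 * c.hidden * c.inter                   # SwiGLU: gate, up, down
    return 2 * c.layers * (attn_proj + mlp)        # 2 FLOPs per multiply-add


def attention_flops(c: Dims, seq_len: int) -> int:
    """Position-dependent part: Q@K^T and attn@V, each 2 * hidden * seq_len per layer."""
    return c.layers * 2 * (2 * c.hidden * seq_len)


def head_flops(c: Dims) -> int:
    if c.head_type == "value_head":
        return 2 * (c.hidden * c.hidden + c.hidden)  # d->d Linear, then d->1 Linear
    return 2 * c.hidden * c.vocab                    # vocab projection


def lvm_extra_flops(c: Dims, k: int, candidate_cost_multiplier: float = 1.0) -> float:
    """One extend forward plus k candidate forwards per token (attention term omitted)."""
    per_forward = linear_flops(c) + head_flops(c)
    return per_forward * (1 + k * candidate_cost_multiplier)


# Assumed dimensions; verify against the actual config.json files.
BASE_7B = Dims(3584, 28, 28, 4, 18944, 152064)
LVM_1P5B = Dims(1536, 28, 12, 2, 8960, 151936, head_type="value_head")

base = linear_flops(BASE_7B) + head_flops(BASE_7B)
print(f"base, no attention: {base / 1e9:.2f} GFLOPs/token")
print(f"LenVM extra, k=5:   {lvm_extra_flops(LVM_1P5B, 5) / 1e9:.2f} GFLOPs/token")
```

Under these assumed dimensions the base model comes to about 14.1 GFLOPs per token before the position-dependent attention term, and the LenVM extra to about 15.8 GFLOPs per token at k=5, which is in line with the 14.24 and 30.24 GFLOPs figures reported in the table once attention is added.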

Component decomposition (PFLOPs):

| | base.linear | base.attention | base.lm_head | lvm.extend | lvm.candidates | lvm.prefill |
|---|---|---|---|---|---|---|
| baseline | 3.17 | 0.02 | 0.26 | | | |
| LenVM | 3.17 | 0.02 | 0.26 | 0.63 | 3.17 | 0.01 |

The theoretical FLOPs ratio is 2.10×: each generated token drives k=5 LenVM candidate evaluations + 1 LenVM extend in addition to the base forward, and ~83% of the LenVM-only FLOPs are candidate forwards through the 1.5B value model with a tiny MLP2SiLUValueHead. The measured wall-clock ratio is 4.55×, so the slowdown factors into roughly 2.10× of genuine extra compute and a further ~2.17× (4.55 / 2.10) of GPU underutilization: the base sampler stream stalls on CPU candidate prep, the LenVM forward runs on a separate CUDA stream with sync overhead, and the LenVM forward's effective batch is smaller than the base model's. Achieved throughput drops from ~18% of H100 bf16 peak to ~9%.
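
The arithmetic behind this split can be reproduced directly from the rounded table values. The helper names below are illustrative only; small differences from the reported 179.9 TFLOPs/s are expected because the inputs here are already rounded.

```python
def achieved_tflops(total_pflops: float, wall_s: float) -> float:
    """Achieved throughput in TFLOPs/s from total PFLOPs and wall-clock seconds."""
    return total_pflops * 1e3 / wall_s


def utilization_gap(wall_ratio: float, flops_ratio: float) -> float:
    """How much slower the run is than its extra FLOPs alone would explain."""
    return wall_ratio / flops_ratio


base_tflops = achieved_tflops(3.46, 19.22)    # baseline row
lenvm_tflops = achieved_tflops(7.28, 87.44)   # LenVM row
wall_ratio = 87.44 / 19.22
flops_ratio = 7.28 / 3.46
gap = utilization_gap(wall_ratio, flops_ratio)

print(f"baseline: {base_tflops:.1f} TFLOPs/s, LenVM: {lenvm_tflops:.1f} TFLOPs/s")
print(f"wall {wall_ratio:.2f}x = compute {flops_ratio:.2f}x * underutilization {gap:.2f}x")
```

The gap is the same quantity as wall_ratio / flops_ratio used in the ablation sections, so it can be compared across k values and value-model sizes.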

Top-k ablation (k ∈ {1,2,3,4,5})

Same setup as above; only LENVM_TOP_K varies.

| k | baseline wall (s) | LenVM wall (s) | wall_ratio | theoretical FLOPs ratio | LenVM apply (ms/step) | LenVM forward (ms/step) |
|---|---|---|---|---|---|---|
| 1 | 24.19 | 23.19 | 0.96 | 1.30 | | |
| 2 | 23.22 | 80.37 | 3.46 | 1.55 | 45.93 | 32.18 |
| 3 | 27.21 | 81.30 | 2.99 | 1.74 | 51.15 | 34.16 |
| 4 | 22.30 | 83.53 | 3.75 | 1.91 | 46.23 | 33.87 |
| 5 | 18.19 | 80.50 | 4.43 | 2.08 | 48.37 | 33.49 |

Two findings from the sweep:

  • k=1 is degenerate. With top_k=1 and temperature=1.0 SGLang takes the is_all_greedy fast path inside Sampler.forward and never enters the LenVM apply() hook (all 1193 recorded decode steps have is_greedy=true, lvm_active=false). The reported "k=1" row above is therefore not LenVM with a one-element candidate set; it is the base 7B running greedy decoding while the LenVM weights sit idle on the GPU. Empirical LenVM measurements start at k ≥ 2.
  • k=2 → k=5 LenVM wall-clock is essentially flat (~80–84 s), even though theoretical FLOPs grow from 1.55× to 2.08×. Per-step LenVM apply latency is 46–51 ms across the range; LenVM forward latency is 32–34 ms. Growing the candidate set 2.5× (from 2 to 5) moves wall-clock by only a few percent, so the bottleneck inside LvmGuidedSampler.apply is per-step sync + CPU candidate prep, not the LenVM forward batch. Practically: the paper-configured k=5 is no more expensive in wall clock than k=2, so increasing k inside this range is roughly free.
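
One way to quantify the plateau is to fit a line, apply_ms ≈ fixed + per_candidate · k, to the four measured points from the sweep table. The sketch below uses plain least squares; with only four noisy points it is indicative at best, and the linear cost model itself is an assumption rather than something the PR establishes.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


ks = [2, 3, 4, 5]
apply_ms = [45.93, 51.15, 46.23, 48.37]  # LenVM apply (ms/step) from the table

per_candidate_ms, fixed_ms = fit_line(ks, apply_ms)
print(f"fixed cost ~{fixed_ms:.1f} ms/step, ~{per_candidate_ms:.2f} ms per extra candidate")
```

On these numbers the fitted per-candidate cost is a fraction of a millisecond against a fixed cost of roughly 47 ms per step, which is consistent with the claim that per-step sync and CPU prep dominate over the candidate batch size.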

Aggregator: inference/timing/sweep_analyze.py reads multiple result dirs and emits sweep_summary.csv/json + topk_sweep.png. The cluster-side wrapper that drives the sweep (scripts/_run_timing_topk_sweep.sh) is not in this PR because it bakes in local CUDA paths.

0.5B vs 1.5B LenVM ablation

Re-running the same k ∈ {1..5} sweep with the smaller lvm-a-qwen2.5-7b-instruct-b-qwen2.5-0.5b-instruct value model to isolate model-size vs sync-overhead contributions.

| metric | 1.5B (k=5) | 0.5B (k=5) | 0.5B / 1.5B |
|---|---|---|---|
| LenVM wall-clock (s) | 80.50 | 65.27 | 0.81× |
| theoretical FLOPs ratio over baseline | 2.08 | 1.31 | |
| LenVM apply (ms/step) | 48.37 | 38.45 | 0.79× |
| LenVM forward (ms/step) | 33.49 | 23.14 | 0.69× |
| achieved TFLOPs/s (LenVM) | 90.78 | 69.42 | 0.77× |

Per-k LenVM wall-clock for both models (k=1 omitted; degenerate greedy path):

| k | 1.5B lvm_s | 1.5B apply ms | 0.5B lvm_s | 0.5B apply ms |
|---|---|---|---|---|
| 2 | 80.4 | 45.9 | 63.3 | 38.9 |
| 3 | 81.3 | 51.2 | 60.2 | 37.6 |
| 4 | 83.5 | 46.2 | 62.2 | 38.6 |
| 5 | 80.5 | 48.4 | 65.3 | 38.5 |

Findings:

  • Switching the value model from 1.5B → 0.5B (~3× fewer params) saves only ~19% LenVM wall-clock (80 → 65 s) and ~21% per-step apply latency (48 → 38 ms). The per-step LenVM forward itself drops more (33 → 23 ms, ~30%), but the CPU-side build_pending + apply_guidance cost (~15 ms) is unchanged across model sizes, so it takes up a larger share of each step with the smaller model.
  • The 0.5B utilization gap is wider than the 1.5B's (wall_ratio / flops_ratio = 3.23 / 1.31 = 2.46× for 0.5B vs 4.43 / 2.08 = 2.13× for 1.5B). Reducing the value-model size moves the workload further into memory-bound / sync-bound territory; raw compute is not the limiting factor.
  • The k=2 → k=5 plateau holds at 0.5B too: apply latency ~38 ms across k, wall-clock 60–65 s.
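
The "~15 ms unchanged" figure follows from subtracting the forward latency from the apply latency at k=5 for each value model. The sketch below does that subtraction on the table values; it assumes apply() time is just the LenVM forward plus CPU-side work, which is a simplification, and the dictionary layout is illustrative.

```python
# k=5 per-step means (ms) from the 0.5B vs 1.5B table above.
runs = {
    "1.5B": {"apply_ms": 48.37, "forward_ms": 33.49},
    "0.5B": {"apply_ms": 38.45, "forward_ms": 23.14},
}


def cpu_side_ms(run):
    """Residual apply() time not spent in the LenVM forward (assumed CPU-side)."""
    return run["apply_ms"] - run["forward_ms"]


def cpu_share(run):
    """Fraction of apply() taken by the residual CPU-side work."""
    return cpu_side_ms(run) / run["apply_ms"]


for name, run in runs.items():
    print(f"{name}: CPU-side ~{cpu_side_ms(run):.2f} ms ({cpu_share(run):.0%} of apply)")
```

On these numbers the residual stays near 15 ms for both models while its share of apply() grows from about 31% to about 40%, which is the sense in which shrinking the value model shifts the step toward the fixed overhead.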

Practical implication: shrinking the value model from 1.5B to 0.5B trades ~37% theoretical compute for ~20% wall-clock — not a great deal. The system is bottlenecked by per-step sync and CPU prep, so the cheapest production setup is "larger value model, modest k" if quality holds.

Reviewer questions this addresses

  • VVLr: "inference FLOPs are significantly higher" / "for each candidate we need to run k inferences" → answered with layer-level FLOPs (2.10× more in total; ~83% of the LenVM-only FLOPs are candidate forwards through the value model) vs wall-clock (4.55×).
  • VVLr / Msb9: "wall-clock or throughput analysis" / "end-to-end latency/throughput overhead" → answered in the timing tables.
  • Msb9: "candidate-set construction" → build_pending = 12.25 ms / step; "value-scoring overhead" → apply_guidance = 5.29 ms / step; "value-scoring frequency" → recorded per-step via lvm_active in *.timing.jsonl.
  • Msb9: "batching strategy" → the baseline under SGLang continuous batching reaches 180 TFLOPs/s ≈ 18% of H100 bf16 peak. Batching is still active in the LenVM run, but stream sync drops achieved throughput to 83 TFLOPs/s (~9%).

Out of scope for this PR (follow-up notes)

  • Per-step base-model forward time (currently inferred from e2e - sampler_total; instrumenting ModelRunner.forward_decode or Scheduler.run_batch would split base / LenVM directly).
  • Batch-size sweeps (not run). The top-k sweep over LENVM_TOP_K is now covered by the top-k ablation above.
  • CUDA-event timing for async-stream overlap (Python wall-clock currently absorbs implicit syncs).
  • Stronger length-control baselines (prompt-based, EOS calibration, RLVR length training) — separate PR if needed.

Test plan

  • bash scripts/inference/lenvm_timing.sh end-to-end smoke (3 questions / n=4 / max_tokens=1500) — done on 1× H100, produces both JSONL streams + plots.
  • Full 50-question run as reported above — done.
  • Layer-level FLOPs cross-checked against 2*N rule (14.24 vs 15.22 GFLOPs/tok baseline; difference is attention compute + lm_head being separated out).
  • Value-head FLOPs (d*d + d*1) verified against sglang/srt/models/qwen2_lvm.py::MLP2SiLUValueHead.
  • With SGLANG_LVM_TIMING_LOG unset, confirm timer is a no-op and there is no regression in vanilla SGLang behavior.

🤖 Generated with Claude Code

ChangyiYang and others added 3 commits May 22, 2026 20:52
Instruments Sampler.forward and LvmGuidedSampler.apply with a lightweight
Python wall-clock timer that flushes one JSONL record per decoding step when
SGLANG_LVM_TIMING_LOG is set, no-op otherwise.

scripts/inference/lenvm_timing.sh drives two server lifecycles (baseline,
then LenVM in-proc) against the same GSM8K prompt set. inference/timing/
contains the client wrapper and the analysis that emits a CSV table plus
stacked-bar plots for the sampler-side and apply()-internal breakdowns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewer VVLr asked for an "inference FLOPs" comparison (paper currently
shows wall-clock implications only). flops.py estimates total FLOPs from
the standard 2*N_params forward-pass rule for the base model + (1+k)
LenVM forward passes per generated token.

analyze.py now emits a second table with GFLOPs/token, total PFLOPs,
achieved TFLOPs/s, and ratios. On the 50-question GSM8K run the
theoretical FLOPs ratio is 2.17x but the measured wall-clock ratio is
4.59x, so half of the LenVM slowdown comes from GPU underutilization
(CPU candidate prep, separate-stream sync) rather than raw extra compute.

Also fixes _summarize_responses to dedupe per-question usage (sample_eval
writes the request's usage on every choice row, so summing every row
inflated total tokens 16x).
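
A minimal sketch of the dedupe idea, assuming each choice row carries a request identifier and a copy of that request's usage. The field names `request_id`, `usage`, and `completion_tokens` are hypothetical and may differ from what sample_eval writes; this is not the body of _summarize_responses.

```python
def total_output_tokens(rows):
    """Sum completion tokens once per request, not once per choice row."""
    usage_by_request = {}
    for row in rows:
        # Keep the first usage seen for each request; later rows repeat it.
        usage_by_request.setdefault(
            row["request_id"], row["usage"]["completion_tokens"]
        )
    return sum(usage_by_request.values())


# One request with n=16 choices: every row repeats the request-level usage.
rows = [
    {"request_id": "q0", "choice": i, "usage": {"completion_tokens": 4800}}
    for i in range(16)
]
naive = sum(r["usage"]["completion_tokens"] for r in rows)
print(naive, total_output_tokens(rows))
```

Here the naive sum is 16 times the deduplicated total, which is the inflation described in the commit message.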

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the 2*N rule of thumb with a component-level accounting:
* per-layer linear matmuls: Q/K/V/O projections (GQA-aware via num_key_value_heads)
  + SwiGLU MLP (gate, up, down).
* per-layer attention compute: 2*H_q*h*seq_len for each of Q@K^T and attn@V,
  so attention scales with position over the generation trajectory.
* lm_head: 2 * hidden_size * vocab_size, charged per token.

ModelConfig.load(name_or_path) resolves the HF cache or local model dir for
config.json and falls back to hardcoded Qwen2.5 dims when neither is present.
Runs are split into prefill (charged once per unique prompt assuming SGLang
prefix cache is on) and decode (per sample). LenVM-guided runs add one
tree_value_extend + k candidate forwards through the value model per
generated token.

analyze.py prints three tables (timing / FLOPs headline / FLOPs by component)
and emits a new flops_breakdown.png stacked-bar plot. README updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ChangyiYang
Author

@codex review

ChangyiYang and others added 2 commits May 22, 2026 21:57
Two corrections to the FLOPs estimator after PR review feedback:

1. LenVM checkpoints use MLP2SiLUValueHead (d->d Linear + d->1 Linear, see
   sglang/srt/models/qwen2_lvm.py::MLP2SiLUValueHead), not the base model's
   lm_head (d * vocab_size). For Qwen2.5-1.5B that's ~4.7M FLOPs per token
   instead of ~467M, so the previous accounting was overcharging each LenVM
   forward by ~15%. ModelConfig.head_type is now "lm_head" / "value_head",
   and ModelConfig.load(...) autodetects by checking for value_head.safetensors
   or LengthValueModel-style architectures in config.json.

2. lvm_extra_flops gains a candidate_cost_multiplier=1.0 knob. The default
   matches the current sglang-LenVM in-proc path where each candidate is a
   separate single-token forward sharing only the extended KV cache. A future
   implementation that batches k candidates into one forward and amortizes
   some compute can pass a value < 1.0.

Also fix _find_config_json to look under ./models/<name>/ so the analyzer
can resolve checkpoints downloaded with `hf download --local-dir`.

Re-running on the same 50-q dataset: theoretical LenVM extra drops from
4.48 PFLOPs to 3.82 PFLOPs, ratio drops from 2.30x to 2.10x. Wall-clock
ratio is unchanged at 4.55x, so the GPU-utilization gap widens slightly.

README gains a TL;DR Results section + caveats covering both knobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
inference/timing/sweep_analyze.py reads multiple lenvm_timing.sh result
directories (one per top-k setting) and emits:
- sweep_summary.csv / sweep_summary.json: per-k row with baseline + LenVM
  wall clock, achieved TFLOPs/s, theoretical FLOPs ratio, utilization gap,
  and LenVM apply / forward latency means.
- topk_sweep.png: theoretical FLOPs ratio vs measured wall-clock ratio vs
  achieved TFLOPs/s ratio, plotted against k.

Run on k ∈ {1,2,3,4,5} with the existing 50 q × 16 sample setup:

  k | base_s | lvm_s | wall_ratio | flops_ratio | apply_ms
  --+--------+-------+------------+-------------+---------
  1 |  24.19 | 23.19 |       0.96 |        1.30 |       —    (greedy fast path, LenVM skipped)
  2 |  23.22 | 80.37 |       3.46 |        1.55 |   45.93
  3 |  27.21 | 81.30 |       2.99 |        1.74 |   51.15
  4 |  22.30 | 83.53 |       3.75 |        1.91 |   46.23
  5 |  18.19 | 80.50 |       4.43 |        2.08 |   48.37

Two takeaways from the sweep:
- top-k=1 with temperature=1.0 hits SGLang's is_all_greedy branch in
  Sampler.forward and the LenVM apply() hook is never invoked. The paper
  config really starts at k=2.
- From k=2 to k=5 the LenVM apply latency is flat (~46-51 ms/step) and the
  LenVM forward latency is flat (~32-34 ms/step), so the measured wall-clock
  is essentially constant in k while theoretical FLOPs grows linearly. The
  bottleneck is the per-step sync/CPU-prep cost, not the candidate-set
  matmul, so increasing k inside this range is roughly free.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>