Add LenVM inference timing analysis (baseline vs LenVM-guided) by ChangyiYang · Pull Request #2 · UCSB-AI/Length-Value-Model

ChangyiYang · 2026-05-22T20:53:23Z

Summary

Adds an end-to-end + per-decoding-step latency comparison between vanilla SGLang sampling and LenVM-guided sampling, motivated by the LenVM paper review's repeated ask for inference overhead numbers (VVLr / Msb9 / iyxs: wall-clock cost, candidate-set / value-scoring breakdown, FLOPs, batching behavior).

sglang-LenVM/python/sglang/srt/lvm/timing.py: env-gated (SGLANG_LVM_TIMING_LOG) per-step JSONL logger; no-op when unset.
Instruments Sampler.forward (pre-LVM / LVM apply / sample sections) and LvmGuidedSampler.apply (build_pending / LVM forward / apply_guidance).
scripts/inference/lenvm_timing.sh: orchestrator that runs two server lifecycles (baseline → LenVM in-proc) against the same GSM8K prompt set.
inference/timing/run_timing.py: wraps sample_eval and records wall-clock + nvidia-smi samples + token-count summary.
inference/timing/flops.py: layer-level theoretical FLOPs estimator that reads each model's HF config.json and counts Q/K/V/O projections (GQA-aware), SwiGLU MLP, position-dependent attention (Q@K^T + attn@V), and the top-of-stack head separately. ModelConfig.head_type distinguishes the base model's lm_head (2*d*V vocab projection) from a LenVM checkpoint's MLP2SiLUValueHead (d*d + d*1); autodetected by looking for value_head.safetensors or LengthValueModel-style architectures next to config.json. Runs are split into prefill (charged once per unique prompt; assumes SGLang prefix cache is on) and decode (per sample). LenVM-guided runs add one tree_value_extend + k candidate forwards per generated token. candidate_cost_multiplier defaults to 1.0 (current sglang-LenVM in-proc behavior, where each candidate is a separate single-token forward sharing only the extended KV cache); set lower if a future implementation batches the k candidates.
inference/timing/analyze.py: aggregates both JSONL streams + theoretical FLOPs; emits summary.csv / summary.json, three stdout tables, and three plots.
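As a rough illustration of the env-gated logger described in the first bullet, the sketch below shows one way a per-step JSONL timer can become a no-op when SGLANG_LVM_TIMING_LOG is unset. The class name `StepTimer`, its methods, and the record layout are hypothetical and are not taken from timing.py; only the environment variable name comes from this PR.

```python
import json
import os
import time
from contextlib import contextmanager

ENV_VAR = "SGLANG_LVM_TIMING_LOG"  # env var named in the PR


class StepTimer:
    """Hypothetical per-step timer; not the actual timing.py implementation."""

    def __init__(self):
        # Read the path once; an unset or empty variable disables the timer.
        self.path = os.environ.get(ENV_VAR) or None
        self.sections = {}

    @property
    def enabled(self):
        return self.path is not None

    @contextmanager
    def section(self, name):
        if not self.enabled:
            yield  # disabled path: no clock reads, nothing recorded
            return
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1e3
            self.sections[name] = self.sections.get(name, 0.0) + elapsed_ms

    def flush(self, **extra):
        """Append one JSON record for the current step, then reset."""
        if not self.enabled:
            return
        record = {"sections_ms": self.sections, **extra}
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        self.sections = {}
```

Because this uses Python wall-clock time, any asynchronous GPU work is only charged to a section when something forces a sync inside it, which is the same caveat listed under the follow-up notes.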

Results

Setup: 1× H100 SXM, Qwen2.5-7B-Instruct + lvm-math-0402-a-qwen2.5-7b-instruct-b-qwen2.5-1.5b-instruct, GSM8K 50 questions × 16 samples, max_tokens=6000, T=1.0, top-p=1.0, baseline top-k=-1, LenVM top-k=5 value-scale=0 (paper config).

End-to-end + sampler-side per-step

metric	baseline	LenVM	ratio
end-to-end wall clock (s)	19.22	87.44	4.55× slower
throughput (output tok/s)	12,366	2,719	4.55× drop
total output tokens	237,712	237,724	~same
mean tokens / question	4,754	4,754	~same
sampler steps	1,463	1,386	~same
mean `Sampler.forward` (ms / step)	0.19	50.31	268×
├ pre-LVM (preprocess + softmax)	0.07	0.11	~same
├ LenVM `apply()` outer	0.00	50.03	—
└ sample kernel	0.12	0.17	~same

LenVM apply() internal decomposition (mean ms / step):

section	mean ms	share of apply()
`build_pending` (CPU candidate + prefix prep)	12.25	23%
LenVM forward (extend + launch + collect on 1.5B value model)	36.58	67%
`apply_guidance` (CPU value → probs writeback)	5.29	10%

Layer-level theoretical FLOPs vs measured wall-clock (k=5)

flops.py reads each model's HF config.json and counts matmuls layer by layer: Q/K/V/O projections, SwiGLU MLP, position-dependent attention (Q@K^T + attn@V), and head (vocab projection for the base model, MLP2SiLUValueHead for the LenVM checkpoint — d*d + d*1, not d*V). LenVM-extra FLOPs = LenVM prefill (per unique prompt) + 1 tree_value_extend + k = 5 candidate forwards per output token (matching LENVM_TOP_K=5 in the run config above).
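The per-layer accounting above can be sketched as follows. This is not the code in flops.py; the `Dims` class and function names are hypothetical, and the Qwen2.5-7B dimensions are assumed from the public config.json and should be checked against the checkpoint actually used. The sequence position of 250 is also an assumption, chosen only to show the order of magnitude of the attention term.

```python
from dataclasses import dataclass


@dataclass
class Dims:
    hidden: int        # hidden_size (d)
    layers: int        # num_hidden_layers
    q_heads: int       # num_attention_heads
    kv_heads: int      # num_key_value_heads (GQA)
    intermediate: int  # intermediate_size of the SwiGLU MLP
    vocab: int         # vocab_size


def linear_flops_per_token(c: Dims) -> int:
    head_dim = c.hidden // c.q_heads
    kv_dim = c.kv_heads * head_dim
    qo = 2 * c.hidden * c.hidden          # Q and O projections (MACs)
    kv = 2 * c.hidden * kv_dim            # K and V projections, GQA-sized (MACs)
    mlp = 3 * c.hidden * c.intermediate   # gate, up, down (MACs)
    return 2 * (qo + kv + mlp) * c.layers  # 2 FLOPs per MAC


def attention_flops_per_token(c: Dims, seq_len: int) -> int:
    head_dim = c.hidden // c.q_heads
    # 2 * H_q * h * seq_len for each of Q@K^T and attn@V, per layer
    per_layer = 2 * (2 * c.q_heads * head_dim * seq_len)
    return per_layer * c.layers


def lm_head_flops(c: Dims) -> int:
    return 2 * c.hidden * c.vocab


# Assumed Qwen2.5-7B dimensions; verify against the model's config.json.
QWEN25_7B = Dims(hidden=3584, layers=28, q_heads=28, kv_heads=4,
                 intermediate=18944, vocab=152064)

total = (linear_flops_per_token(QWEN25_7B)
         + attention_flops_per_token(QWEN25_7B, seq_len=250)
         + lm_head_flops(QWEN25_7B))
print(f"{total / 1e9:.2f} GFLOPs per decoded token at position 250")
```

Under these assumptions the linear layers contribute about 13.05 GFLOPs and the lm_head about 1.09 GFLOPs per token, which is in the same range as the 14.24 GFLOPs per output token reported in the table below.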

metric	baseline	LenVM	ratio
per-output-token cost (GFLOPs)	14.24	30.24	2.12× more FLOPs
total compute (PFLOPs)	3.46	7.28	2.10×
achieved throughput (TFLOPs/s)	179.9	83.2	0.46×
H100 bf16 peak utilization	~18%	~9%	—
wall-clock	19.22 s	87.44 s	4.55×

Component decomposition (PFLOPs):

	base.linear	base.attention	base.lm_head	lvm.extend	lvm.candidates	lvm.prefill
baseline	3.17	0.02	0.26	—	—	—
LenVM	3.17	0.02	0.26	0.63	3.17	0.01

The theoretical FLOPs ratio is 2.10× (each generated token drives k=5 LenVM candidate evaluations + 1 LenVM extend in addition to the base forward; ~83% of the LenVM-only FLOPs are candidate forwards through the 1.5B value model with a tiny MLP2SiLUValueHead). The measured wall-clock ratio is 4.55×. Roughly half of the LenVM slowdown is genuine extra compute; the other half is GPU underutilization: the base sampler stream stalls on CPU candidate prep, the LenVM forward runs on a separate CUDA stream with sync overhead, and the LenVM forward's effective batch is smaller than the base model's. Achieved TFLOPs/s drops from ~18% of H100 bf16 peak to ~9%.
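The split between extra compute and underutilization can be re-derived from the headline table with a few lines of arithmetic. The sketch below only uses the rounded numbers reported above, so the last digit can differ slightly from values computed on unrounded data.

```python
# Rounded values copied from the FLOPs headline table (k=5).
runs = {
    "baseline": {"pflops": 3.46, "wall_s": 19.22},
    "lenvm": {"pflops": 7.28, "wall_s": 87.44},
}


def achieved_tflops(pflops: float, wall_s: float) -> float:
    """Theoretical FLOPs divided by wall-clock, in TFLOPs/s."""
    return pflops * 1e3 / wall_s


flops_ratio = runs["lenvm"]["pflops"] / runs["baseline"]["pflops"]
wall_ratio = runs["lenvm"]["wall_s"] / runs["baseline"]["wall_s"]
# Slowdown not explained by extra FLOPs: the "utilization gap".
utilization_gap = wall_ratio / flops_ratio

for name, r in runs.items():
    print(name, round(achieved_tflops(r["pflops"], r["wall_s"]), 1), "TFLOPs/s")
print("flops_ratio", round(flops_ratio, 2))
print("wall_ratio", round(wall_ratio, 2))
print("utilization_gap", round(utilization_gap, 2))
```

Since the wall-clock ratio factors as flops_ratio × utilization_gap, and both factors are close to 2.1 to 2.2 here, the two effects contribute about equally on a multiplicative scale.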

Top-k ablation (k ∈ {1,2,3,4,5})

Same setup as above; only LENVM_TOP_K varies.

k	baseline wall (s)	LenVM wall (s)	wall_ratio	theoretical FLOPs ratio	LenVM apply (ms/step)	LenVM forward (ms/step)
1	24.19	23.19	0.96	1.30	—	—
2	23.22	80.37	3.46	1.55	45.93	32.18
3	27.21	81.30	2.99	1.74	51.15	34.16
4	22.30	83.53	3.75	1.91	46.23	33.87
5	18.19	80.50	4.43	2.08	48.37	33.49

Two findings from the sweep:

k=1 is degenerate. With top_k=1 and temperature=1.0 SGLang takes the is_all_greedy fast path inside Sampler.forward and never enters the LenVM apply() hook (all 1193 recorded decode steps have is_greedy=true, lvm_active=false). The reported "k=1" row above is therefore not LenVM with a one-element candidate set; it is the base 7B running greedy decoding while the LenVM weights sit idle on the GPU. Empirical LenVM measurements start at k ≥ 2.
k=2 → k=5 LenVM wall-clock is essentially flat (~80–84 s), even though theoretical FLOPs grow from 1.55× to 2.08×. Per-step LenVM apply latency is 46–51 ms across the range; LenVM forward latency is 32–34 ms. Growing the candidate set 2.5× (from 2 to 5) moves LenVM wall-clock by only a few percent. The bottleneck inside LvmGuidedSampler.apply is per-step sync + CPU candidate prep, not the LenVM forward batch. Practically: paper-configured k=5 is no more expensive in wall clock than k=2, so increasing k inside this range is roughly free.
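The derived columns in the sweep table can be recomputed from the raw wall-clock values. The sketch below does that, and also compares run-to-run spread of the baseline and LenVM wall-clock for k ≥ 2. The spread metric (max minus min, divided by the mean) is an illustrative choice, not something the PR's scripts compute.

```python
# (k, baseline wall s, LenVM wall s, theoretical FLOPs ratio) from the sweep table.
SWEEP = [
    (1, 24.19, 23.19, 1.30),
    (2, 23.22, 80.37, 1.55),
    (3, 27.21, 81.30, 1.74),
    (4, 22.30, 83.53, 1.91),
    (5, 18.19, 80.50, 2.08),
]


def derive(rows):
    """Recompute wall_ratio and the wall/FLOPs utilization gap per k."""
    out = []
    for k, base_s, lvm_s, flops_ratio in rows:
        wall_ratio = lvm_s / base_s
        out.append({
            "k": k,
            "wall_ratio": round(wall_ratio, 2),
            "gap": round(wall_ratio / flops_ratio, 2),
        })
    return out


def relative_spread(values):
    """(max - min) / mean, a crude measure of variation across runs."""
    return (max(values) - min(values)) / (sum(values) / len(values))


derived = derive(SWEEP)
active = [r for r in SWEEP if r[0] >= 2]  # k=1 never enters the LenVM hook
base_spread = relative_spread([r[1] for r in active])
lvm_spread = relative_spread([r[2] for r in active])
print(derived)
print(f"baseline spread {base_spread:.0%}, LenVM spread {lvm_spread:.0%}")
```

On these numbers the baseline wall-clock varies by roughly 40% across the four runs while the LenVM wall-clock varies by about 4%, so most of the movement in wall_ratio across k appears to come from the baseline denominator rather than from LenVM itself.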

Aggregator: inference/timing/sweep_analyze.py reads multiple result dirs and emits sweep_summary.csv/json + topk_sweep.png. The cluster-side wrapper that drives the sweep (scripts/_run_timing_topk_sweep.sh) is not in this PR because it bakes in local CUDA paths.
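A minimal sketch of this kind of aggregation is shown below. It is not sweep_analyze.py: the function names and the summary.json keys (`top_k`, `baseline_wall_s`, `lenvm_wall_s`, `flops_ratio`) are hypothetical, and the real files may use different field names.

```python
import csv
import json
from pathlib import Path

FIELDS = ["k", "baseline_wall_s", "lenvm_wall_s",
          "wall_ratio", "flops_ratio", "utilization_gap"]


def load_row(result_dir):
    """Read one result dir's summary.json (hypothetical key names)."""
    path = Path(result_dir) / "summary.json"
    s = json.loads(path.read_text(encoding="utf-8"))
    wall_ratio = s["lenvm_wall_s"] / s["baseline_wall_s"]
    return {
        "k": s["top_k"],
        "baseline_wall_s": s["baseline_wall_s"],
        "lenvm_wall_s": s["lenvm_wall_s"],
        "wall_ratio": round(wall_ratio, 2),
        "flops_ratio": s["flops_ratio"],
        "utilization_gap": round(wall_ratio / s["flops_ratio"], 2),
    }


def write_sweep_summary(result_dirs, out_csv):
    """Write one CSV row per result dir, sorted by k; return the rows."""
    rows = sorted((load_row(d) for d in result_dirs), key=lambda r: r["k"])
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Sorting by k rather than by directory name keeps the output stable when result directories are passed in arbitrary order.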

0.5B vs 1.5B LenVM ablation

Re-running the same k ∈ {1..5} sweep with the smaller lvm-a-qwen2.5-7b-instruct-b-qwen2.5-0.5b-instruct value model to isolate model-size vs sync-overhead contributions.

metric	1.5B (k=5)	0.5B (k=5)	0.5B / 1.5B
LenVM wall-clock (s)	80.50	65.27	0.81×
theoretical FLOPs ratio over baseline	2.08	1.31	—
LenVM apply (ms/step)	48.37	38.45	0.79×
LenVM forward (ms/step)	33.49	23.14	0.69×
achieved TFLOPs/s (LenVM)	90.78	69.42	0.77×

Per-k LenVM wall-clock for both models (k=1 omitted; degenerate greedy path):

k	1.5B lvm_s	1.5B apply ms	0.5B lvm_s	0.5B apply ms
2	80.4	45.9	63.3	38.9
3	81.3	51.2	60.2	37.6
4	83.5	46.2	62.2	38.6
5	80.5	48.4	65.3	38.5

Findings:

Switching the value model from 1.5B → 0.5B (~3× fewer params) only saves ~19% LenVM wall-clock (80 → 65 s) and ~21% per-step apply latency (48 → 38 ms). The per-step LenVM forward itself drops more (33 → 23 ms, ~30%), but the CPU-side build_pending + apply_guidance (~15 ms) is unchanged across model sizes, so it takes up a larger share of apply() with the smaller model.
The 0.5B utilization gap is wider than the 1.5B's (wall_ratio / flops_ratio = 3.23 / 1.31 = 2.46× for 0.5B vs 4.43 / 2.08 = 2.13× for 1.5B). Reducing the value-model size moves the workload further into memory-bound / sync-bound territory; raw compute is not the limiting factor.
The k=2 → k=5 plateau holds at 0.5B too: apply latency ~38 ms across k, wall-clock 60–65 s.
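The fixed CPU-side cost in the first finding can be read off the k=5 table by subtracting the forward latency from the apply latency. The sketch below does that for both model sizes; treating the residual as pure CPU work is an approximation, since it may also contain sync time.

```python
# k=5 means (ms / step) copied from the 0.5B vs 1.5B table.
measured = {
    "1.5B": {"apply_ms": 48.37, "forward_ms": 33.49},
    "0.5B": {"apply_ms": 38.45, "forward_ms": 23.14},
}

residuals = {}
for name, m in measured.items():
    residual_ms = m["apply_ms"] - m["forward_ms"]  # build_pending + apply_guidance + sync
    residuals[name] = {
        "residual_ms": residual_ms,
        "share_of_apply": residual_ms / m["apply_ms"],
    }
    print(f"{name}: residual {residual_ms:.2f} ms, "
          f"{residuals[name]['share_of_apply']:.0%} of apply()")
```

The residual is about 15 ms in both cases, which is roughly 31% of apply() for the 1.5B model and 40% for the 0.5B model. Even a value model with zero forward cost would therefore leave about 15 ms per step under this reading.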

Practical implication: shrinking the value model from 1.5B to 0.5B cuts theoretical compute by ~37% but wall-clock by only ~20%, which is a poor trade. The system is bottlenecked by per-step sync and CPU prep rather than by compute, so the larger value model costs little extra wall-clock, and "larger value model, modest k" is the better-value production setup if quality holds.

Reviewer questions this addresses

VVLr: "inference FLOPs are significantly higher" / "for each candidate we need to run k inferences" → answered with layer-level FLOPs (2.10× the baseline; ~83% of the LenVM-only FLOPs are candidate forwards through the value model) vs wall-clock (4.55×).
VVLr / Msb9: "wall-clock or throughput analysis" / "end-to-end latency/throughput overhead" → answered in the timing tables.
Msb9: "candidate-set construction" → build_pending = 12.25 ms / step; "value-scoring overhead" → apply_guidance = 5.29 ms / step; "value-scoring frequency" → recorded per-step via lvm_active in *.timing.jsonl.
Msb9: "batching strategy" → the baseline, under SGLang continuous batching, reaches 180 TFLOPs/s ≈ 18% of H100 bf16 peak. Batching is still active under LenVM, but stream sync drops achieved throughput to 83 TFLOPs/s (~9%).

Out of scope for this PR (follow-up notes)

Per-step base-model forward time (currently inferred from e2e - sampler_total; instrumenting ModelRunner.forward_decode or Scheduler.run_batch would split base / LenVM directly).
Batch-size sweeps (the top-k sweep over LENVM_TOP_K is reported above; batch-size runs are not).
CUDA-event timing for async-stream overlap (Python wall-clock currently absorbs implicit syncs).
Stronger length-control baselines (prompt-based, EOS calibration, RLVR length training) — separate PR if needed.
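For the first follow-up item, the current "e2e minus sampler total" inference can be written out with the numbers from the end-to-end table. This is a back-of-the-envelope sketch: it assumes sampler steps run sequentially and do not overlap with the base-model forward, which is exactly what direct instrumentation would need to confirm.

```python
def residual_seconds(e2e_s, sampler_steps, mean_sampler_ms):
    """Return (wall-clock outside Sampler.forward, total Sampler.forward time) in s."""
    sampler_total_s = sampler_steps * mean_sampler_ms / 1e3
    return e2e_s - sampler_total_s, sampler_total_s


# Values copied from the end-to-end + sampler-side table.
baseline_rest, baseline_sampler = residual_seconds(19.22, 1463, 0.19)
lenvm_rest, lenvm_sampler = residual_seconds(87.44, 1386, 50.31)

print(f"baseline: sampler {baseline_sampler:.2f} s, rest {baseline_rest:.2f} s")
print(f"LenVM:    sampler {lenvm_sampler:.2f} s, rest {lenvm_rest:.2f} s")
```

Under that assumption the time outside the sampler is about 18.9 s for the baseline and about 17.7 s for LenVM, which would be consistent with the base-model forward being largely unchanged and nearly all of the added wall-clock sitting inside Sampler.forward.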

Test plan

bash scripts/inference/lenvm_timing.sh end-to-end smoke (3 questions / n=4 / max_tokens=1500) — done on 1× H100, produces both JSONL streams + plots.
Full 50-question run as reported above — done.
Layer-level FLOPs cross-checked against 2*N rule (14.24 vs 15.22 GFLOPs/tok baseline; difference is attention compute + lm_head being separated out).
Value-head FLOPs (d*d + d*1) verified against sglang/srt/models/qwen2_lvm.py::MLP2SiLUValueHead.
With SGLANG_LVM_TIMING_LOG unset, confirm timer is a no-op and there is no regression in vanilla SGLang behavior.
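The value-head check in the list above can be reproduced by hand. The sketch assumes Qwen2.5-1.5B dimensions of hidden_size=1536 and vocab_size=151936 (taken from the public config, to be verified locally) and counts 2 FLOPs per multiply-accumulate, so the d*d + d*1 term from the PR text is doubled. That doubling is an interpretation chosen to match the "~4.7M vs ~467M" figures quoted in the commit notes.

```python
def lm_head_flops(d: int, vocab: int) -> int:
    """Vocab projection: one d x vocab matmul, 2 FLOPs per MAC."""
    return 2 * d * vocab


def value_head_flops(d: int) -> int:
    """Two linear layers, d->d then d->1; the SiLU activation is ignored."""
    return 2 * (d * d + d * 1)


D, VOCAB = 1536, 151936  # assumed Qwen2.5-1.5B dims; check config.json

print(f"lm_head:    {lm_head_flops(D, VOCAB) / 1e6:.1f} MFLOPs/token")
print(f"value head: {value_head_flops(D) / 1e6:.1f} MFLOPs/token")
print(f"ratio:      {lm_head_flops(D, VOCAB) / value_head_flops(D):.0f}x")
```

With these dimensions the value head is about two orders of magnitude cheaper than a vocab projection, so charging a LenVM forward with an lm_head overstates its cost noticeably.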

🤖 Generated with Claude Code

Instruments Sampler.forward and LvmGuidedSampler.apply with a lightweight Python wall-clock timer that flushes one JSONL record per decoding step when SGLANG_LVM_TIMING_LOG is set, no-op otherwise. scripts/inference/lenvm_timing.sh drives two server lifecycles (baseline, then LenVM in-proc) against the same GSM8K prompt set. inference/timing/ contains the client wrapper and the analysis that emits a CSV table plus stacked-bar plots for the sampler-side and apply()-internal breakdowns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reviewer VVLr asked for an "inference FLOPs" comparison (paper currently shows wall-clock implications only). flops.py estimates total FLOPs from the standard 2*N_params forward-pass rule for the base model + (1+k) LenVM forward passes per generated token. analyze.py now emits a second table with GFLOPs/token, total PFLOPs, achieved TFLOPs/s, and ratios. On the 50-question GSM8K run the theoretical FLOPs ratio is 2.17x but the measured wall-clock ratio is 4.59x, so half of the LenVM slowdown comes from GPU underutilization (CPU candidate prep, separate-stream sync) rather than raw extra compute.

Also fixes _summarize_responses to dedupe per-question usage (sample_eval writes the request's usage on every choice row, so summing every row inflated total tokens 16x).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
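The usage dedupe described in this commit can be illustrated with a small sketch. It is not the actual _summarize_responses code: the row layout (`question_id`, `usage.completion_tokens`) is a hypothetical stand-in for whatever sample_eval writes.

```python
def total_completion_tokens(rows):
    """Sum completion tokens once per question, not once per choice row."""
    per_question = {}
    for row in rows:
        # Every choice row of a request repeats the same request-level usage,
        # so keep only the first value seen for each question.
        per_question.setdefault(row["question_id"], row["usage"]["completion_tokens"])
    return sum(per_question.values())


def naive_total(rows):
    """The inflated count: sums the repeated usage on every row."""
    return sum(row["usage"]["completion_tokens"] for row in rows)


# Two questions, 16 samples each; usage is request-level and repeated per row.
rows = [
    {"question_id": q, "choice_index": i, "usage": {"completion_tokens": tokens}}
    for q, tokens in (("q0", 100), ("q1", 200))
    for i in range(16)
]
print(total_completion_tokens(rows), naive_total(rows))
```

With n=16 samples per request the naive sum is exactly 16 times the deduplicated total, which matches the inflation factor named in the commit message.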

Replaces the 2*N rule of thumb with a component-level accounting:

* per-layer linear matmuls: Q/K/V/O projections (GQA-aware via num_key_value_heads) + SwiGLU MLP (gate, up, down).
* per-layer attention compute: 2*H_q*h*seq_len for each of Q@K^T and attn@V, so attention scales with position over the generation trajectory.
* lm_head: 2 * hidden_size * vocab_size, charged per token.

ModelConfig.load(name_or_path) resolves the HF cache or local model dir for config.json and falls back to hardcoded Qwen2.5 dims when neither is present. Runs are split into prefill (charged once per unique prompt assuming SGLang prefix cache is on) and decode (per sample). LenVM-guided runs add one tree_value_extend + k candidate forwards through the value model per generated token.

analyze.py prints three tables (timing / FLOPs headline / FLOPs by component) and emits a new flops_breakdown.png stacked-bar plot. README updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ChangyiYang · 2026-05-22T21:44:17Z

@codex review

Two corrections to the FLOPs estimator after PR review feedback:

1. LenVM checkpoints use MLP2SiLUValueHead (d->d Linear + d->1 Linear, see sglang/srt/models/qwen2_lvm.py::MLP2SiLUValueHead), not the base model's lm_head (d * vocab_size). For Qwen2.5-1.5B that's ~4.7M FLOPs per token instead of ~467M, so the previous accounting was overcharging each LenVM forward by ~15%. ModelConfig.head_type is now "lm_head" / "value_head", and ModelConfig.load(...) autodetects by checking for value_head.safetensors or LengthValueModel-style architectures in config.json.
2. lvm_extra_flops gains a candidate_cost_multiplier=1.0 knob. The default matches the current sglang-LenVM in-proc path where each candidate is a separate single-token forward sharing only the extended KV cache. A future implementation that batches k candidates into one forward and amortizes some compute can pass a value < 1.0.

Also fix _find_config_json to look under ./models/<name>/ so the analyzer can resolve checkpoints downloaded with `hf download --local-dir`.

Re-running on the same 50-q dataset: theoretical LenVM extra drops from 4.48 PFLOPs to 3.82 PFLOPs, ratio drops from 2.30x to 2.10x. Wall-clock ratio is unchanged at 4.55x, so the GPU-utilization gap widens slightly. README gains a TL;DR Results section + caveats covering both knobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
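To make the candidate_cost_multiplier knob concrete, the sketch below applies it to the component totals reported in the PR description. The function name and signature are hypothetical and do not mirror lvm_extra_flops in flops.py, and the 0.5 multiplier is an arbitrary what-if value, not a measured property of any batched implementation.

```python
def lvm_extra_pflops(extend, candidates, prefill, candidate_cost_multiplier=1.0):
    """LenVM-only PFLOPs, scaling only the candidate-forward component."""
    if candidate_cost_multiplier < 0:
        raise ValueError("candidate_cost_multiplier must be non-negative")
    return prefill + extend + candidate_cost_multiplier * candidates


# Component totals (PFLOPs) from the decomposition table, k=5.
BASE = 3.17 + 0.02 + 0.26            # base.linear + base.attention + base.lm_head
EXTEND, CANDIDATES, PREFILL = 0.63, 3.17, 0.01

ratios = {}
for multiplier in (1.0, 0.5):
    extra = lvm_extra_pflops(EXTEND, CANDIDATES, PREFILL, multiplier)
    ratios[multiplier] = (BASE + extra) / BASE
    print(f"multiplier={multiplier}: extra {extra:.2f} PFLOPs, "
          f"ratio {ratios[multiplier]:.2f}x")
```

At the default of 1.0 this reproduces the ~2.10× theoretical ratio. Halving the candidate cost would bring the theoretical ratio down to roughly 1.65×, although the sweep results suggest wall-clock would move much less because the per-step overhead is not compute-bound.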

inference/timing/sweep_analyze.py reads multiple lenvm_timing.sh result directories (one per top-k setting) and emits:

- sweep_summary.csv / sweep_summary.json: per-k row with baseline + LenVM wall clock, achieved TFLOPs/s, theoretical FLOPs ratio, utilization gap, and LenVM apply / forward latency means.
- topk_sweep.png: theoretical FLOPs ratio vs measured wall-clock ratio vs achieved TFLOPs/s ratio, plotted against k.

Run on k ∈ {1,2,3,4,5} with the existing 50 q × 16 sample setup:

k | base_s | lvm_s | wall_ratio | flops_ratio | apply_ms
--+--------+-------+------------+-------------+---------
1 | 24.19  | 23.19 | 0.96       | 1.30        | n/a (greedy fast path, LenVM skipped)
2 | 23.22  | 80.37 | 3.46       | 1.55        | 45.93
3 | 27.21  | 81.30 | 2.99       | 1.74        | 51.15
4 | 22.30  | 83.53 | 3.75       | 1.91        | 46.23
5 | 18.19  | 80.50 | 4.43       | 2.08        | 48.37

Two takeaways from the sweep:

- top-k=1 with temperature=1.0 hits SGLang's is_all_greedy branch in Sampler.forward and the LenVM apply() hook is never invoked. The paper config really starts at k=2.
- From k=2 to k=5 the LenVM apply latency is flat (~46-51 ms/step) and the LenVM forward latency is flat (~32-34 ms/step), so the measured wall-clock is essentially constant in k while theoretical FLOPs grows linearly. The bottleneck is the per-step sync/CPU-prep cost, not the candidate-set matmul, so increasing k inside this range is roughly free.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ChangyiYang and others added 3 commits May 22, 2026 20:52

ChangyiYang and others added 2 commits May 22, 2026 21:57

ChangyiYang wants to merge 5 commits into UCSB-AI:main from ChangyiYang:lenvm-timing-analysis