Add per-step routing overhead timing to output JSON by lmcafee-nvidia · Pull Request #12 · lmcafee-nvidia/Megatron-LM

lmcafee-nvidia · 2026-04-07T15:48:12Z

Summary

Adds temporary per-step timing instrumentation for MoE routing replay operations
Times 4 operations: routing_gather (_router_record_bookkeeping), routing_store (store_routing_per_block), routing_reconstruct (reconstruct_routing_from_blocks), routing_finalize (finalize_routing_chunks)
Outputs a "step_details" array in the JSON with per-step prefill/decode request counts, step time, and all 4 routing timings
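
A minimal sketch of how per-step timing of this kind could be collected and serialized, using only the standard library. The `StepTimer` class, the `make_step_record` helper, and every JSON field name other than "step_details" are hypothetical illustrations; the PR's actual schema and code are not shown here.

```python
import json
import time
from contextlib import contextmanager

# Timing keys named after the four operations listed above.
ROUTING_OPS = (
    "routing_gather",
    "routing_store",
    "routing_reconstruct",
    "routing_finalize",
)


class StepTimer:
    """Accumulates wall-clock seconds per named operation within one step."""

    def __init__(self):
        self.timings = {op: 0.0 for op in ROUTING_OPS}

    @contextmanager
    def measure(self, op):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Accumulate, since an operation may run more than once per step.
            self.timings[op] += time.perf_counter() - start


def make_step_record(num_prefill, num_decode, step_time_s, timer):
    # Field names are assumptions, not the PR's actual schema.
    record = {
        "prefill_requests": num_prefill,
        "decode_requests": num_decode,
        "step_time_s": step_time_s,
    }
    record.update({f"{op}_s": t for op, t in timer.timings.items()})
    return record


step_details = []
timer = StepTimer()
step_start = time.perf_counter()
with timer.measure("routing_gather"):
    sum(range(1000))  # stand-in for the real gather work
with timer.measure("routing_store"):
    sum(range(1000))  # stand-in for the real store work
step_time = time.perf_counter() - step_start
step_details.append(make_step_record(0, 4, step_time, timer))

output_json = json.dumps({"step_details": step_details}, indent=2)
```

Operations that do not run in a given step (here `routing_reconstruct` and `routing_finalize`) simply report 0.0, which keeps every record the same shape.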

Key finding

With CUDA graphs enabled on nano-v3, routing_gather dominates decode overhead at ~22ms/step (74% of decode step time). The bottleneck is the .cpu().numpy() call in _router_record_bookkeeping, which forces a CUDA synchronization: the host blocks until the asynchronous CUDA graph replay completes.
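
The effect can be illustrated with a CPU-only analogy, in which a worker thread stands in for the asynchronous graph replay and `future.result()` stands in for the blocking device-to-host copy. This is a sketch of the timing behavior only, not the PR's code; `fake_graph_replay` and its 50 ms duration are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_graph_replay(duration_s):
    """Stand-in for asynchronous GPU work; returns fake routing indices."""
    time.sleep(duration_s)
    return [0, 1, 2]


with ThreadPoolExecutor(max_workers=1) as pool:
    # The "launch" returns almost immediately, like an async graph replay.
    t0 = time.perf_counter()
    future = pool.submit(fake_graph_replay, 0.05)
    launch_s = time.perf_counter() - t0

    # The "gather" blocks until the launched work finishes, so a wall-clock
    # timer around it absorbs the remaining duration of that work.
    t1 = time.perf_counter()
    routing = future.result()
    gather_s = time.perf_counter() - t1
```

In this analogy nearly all of the elapsed time is charged to the blocking call, even though the work itself was launched earlier.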

Test plan

Tested with hybrid-100m-moe-tp2 (no CUDA graphs): routing overhead 1.52% of decode
Tested with nano-v3 (no CUDA graphs): routing overhead 0.44% of decode
Tested with nano-v3 (CUDA graphs): routing overhead 74.17% of decode
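
The PR does not state how these percentages are derived from the per-step data. One plausible reading, sketched below, sums the four routing timings over decode-only steps and divides by the summed step time. The field names and the sample numbers are hypothetical and are not measurements from the PR.

```python
# Hypothetical field names for the four routing timings.
ROUTING_KEYS = (
    "routing_gather_s",
    "routing_store_s",
    "routing_reconstruct_s",
    "routing_finalize_s",
)


def routing_overhead_pct(step_details):
    """Routing time as a percentage of step time, over decode-only steps."""
    decode_steps = [
        s for s in step_details
        if s["prefill_requests"] == 0 and s["decode_requests"] > 0
    ]
    total_step = sum(s["step_time_s"] for s in decode_steps)
    if total_step == 0:
        return 0.0
    total_routing = sum(s[k] for s in decode_steps for k in ROUTING_KEYS)
    return 100.0 * total_routing / total_step


# Illustrative data only: one prefill step (ignored) and two decode steps.
sample = [
    {"prefill_requests": 2, "decode_requests": 0, "step_time_s": 0.10,
     "routing_gather_s": 0.01, "routing_store_s": 0.0,
     "routing_reconstruct_s": 0.0, "routing_finalize_s": 0.0},
    {"prefill_requests": 0, "decode_requests": 4, "step_time_s": 0.04,
     "routing_gather_s": 0.03, "routing_store_s": 0.0,
     "routing_reconstruct_s": 0.0, "routing_finalize_s": 0.0},
    {"prefill_requests": 0, "decode_requests": 4, "step_time_s": 0.04,
     "routing_gather_s": 0.03, "routing_store_s": 0.0,
     "routing_reconstruct_s": 0.0, "routing_finalize_s": 0.0},
]
overhead = routing_overhead_pct(sample)
```

Restricting the ratio to decode-only steps is itself an assumption; mixed prefill/decode steps would need a separate attribution rule.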

🤖 Generated with Claude Code

Temporary instrumentation to measure MoE routing replay overhead. Times four operations per step: routing_gather (bookkeeping + GPU-to-CPU), routing_store (per-block distribution), routing_reconstruct (block reassembly on completion), and routing_finalize (chunk concatenation). Output is a "step_details" array in the JSON with per-step prefill/decode counts, step time, and all four routing timings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

lmcafee-nvidia wants to merge 1 commit into prefix-caching-router-record from prefix-caching-router-record-timing
