
Add per-step routing overhead timing to output JSON #12

Open
lmcafee-nvidia wants to merge 1 commit into prefix-caching-router-record from prefix-caching-router-record-timing


@lmcafee-nvidia
Owner

Summary

  • Adds temporary per-step timing instrumentation for MoE routing replay operations
  • Times 4 operations: routing_gather (_router_record_bookkeeping), routing_store (store_routing_per_block), routing_reconstruct (reconstruct_routing_from_blocks), routing_finalize (finalize_routing_chunks)
  • Outputs a "step_details" array in the JSON with per-step prefill/decode request counts, step time, and all 4 routing timings
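The instrumentation described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `StepTimer`, `run_step`, and the field names `num_prefill`, `num_decode`, and `step_time` are hypothetical, and the four routing calls are replaced by placeholders. Only the four timing labels and the "step_details" key come from the description.

```python
import json
import time
from collections import defaultdict
from contextlib import contextmanager

# Timing labels named in the PR description.
ROUTING_OPS = (
    "routing_gather",
    "routing_store",
    "routing_reconstruct",
    "routing_finalize",
)


class StepTimer:
    """Accumulates wall-clock seconds per named operation within one step."""

    def __init__(self):
        self.timings = defaultdict(float)

    @contextmanager
    def measure(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Add to the running total so repeated calls in a step are summed.
            self.timings[name] += time.perf_counter() - start


def run_step(num_prefill, num_decode):
    """Run one placeholder step and return its step_details entry.

    The field names in the returned dict are assumptions for illustration.
    """
    timer = StepTimer()
    step_start = time.perf_counter()
    for op in ROUTING_OPS:
        with timer.measure(op):
            pass  # the real routing call would go here
    step_time = time.perf_counter() - step_start

    entry = {
        "num_prefill": num_prefill,
        "num_decode": num_decode,
        "step_time": step_time,
    }
    entry.update({op: timer.timings[op] for op in ROUTING_OPS})
    return entry


output = {"step_details": [run_step(1, 0), run_step(0, 4)]}
print(json.dumps(output, indent=2))
```

A context manager keeps each timed region to one extra line at the call site, which makes temporary instrumentation like this easy to remove later.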

Key finding

With CUDA graphs enabled on nano-v3, routing_gather dominates decode overhead at ~22 ms/step (74% of decode step time). The bottleneck is the .cpu().numpy() call in _router_record_bookkeeping, which forces a full CUDA sync and blocks until the asynchronous CUDA graph replay completes.
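The effect described here can be illustrated with a CPU-only analogy, assuming nothing about the actual code: a background thread stands in for the asynchronously launched graph replay, and `join` stands in for the synchronizing `.cpu().numpy()` call. The names `timed` and `simulate_step` are hypothetical. A wall-clock timer charges the whole wait to whichever call blocks first, even if that call does little work itself.

```python
import threading
import time


def timed(fn):
    """Return the wall-clock seconds spent inside fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def simulate_step(work_seconds=0.05):
    # Stand-in for asynchronously launched device work (e.g. a graph replay).
    worker = threading.Thread(target=time.sleep, args=(work_seconds,))
    launch = timed(worker.start)  # returns right away, like an async launch
    gather = timed(worker.join)   # blocks until the work is done, like a sync
    return launch, gather


launch, gather = simulate_step()
print(f"launch: {launch * 1e3:.2f} ms, gather: {gather * 1e3:.2f} ms")
```

In this sketch nearly all of the elapsed time is attributed to the blocking call, which is the pattern the routing_gather timing shows when CUDA graphs are enabled.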

Test plan

  • Tested with hybrid-100m-moe-tp2 (no CUDA graphs): routing overhead 1.52% of decode
  • Tested with nano-v3 (no CUDA graphs): routing overhead 0.44% of decode
  • Tested with nano-v3 (CUDA graphs): routing overhead 74.17% of decode
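A percentage like the ones above could be derived from the "step_details" array along these lines. This is a sketch under assumptions: the field names `num_prefill`, `num_decode`, and `step_time` are hypothetical, "decode" is taken to mean steps with no prefill requests, and the sample numbers are made up for illustration rather than taken from the runs in this PR.

```python
ROUTING_KEYS = (
    "routing_gather",
    "routing_store",
    "routing_reconstruct",
    "routing_finalize",
)


def routing_overhead_pct(step_details):
    """Routing time as a percentage of total step time over decode-only steps."""
    # Assumption: a decode step is one with decode requests and no prefill.
    decode_steps = [
        s for s in step_details
        if s["num_prefill"] == 0 and s["num_decode"] > 0
    ]
    total = sum(s["step_time"] for s in decode_steps)
    if total == 0:
        return 0.0
    routing = sum(s[key] for s in decode_steps for key in ROUTING_KEYS)
    return 100.0 * routing / total


# Illustrative values only (seconds), not measurements from this PR.
sample_steps = [
    {"num_prefill": 1, "num_decode": 0, "step_time": 0.10,
     "routing_gather": 0.01, "routing_store": 0.0,
     "routing_reconstruct": 0.0, "routing_finalize": 0.0},
    {"num_prefill": 0, "num_decode": 4, "step_time": 0.04,
     "routing_gather": 0.02, "routing_store": 0.005,
     "routing_reconstruct": 0.0, "routing_finalize": 0.005},
]

pct = routing_overhead_pct(sample_steps)
print(f"routing overhead: {pct:.2f}% of decode")
```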

🤖 Generated with Claude Code

Temporary instrumentation to measure MoE routing replay overhead.
Times four operations per step: routing_gather (bookkeeping + GPU-to-CPU),
routing_store (per-block distribution), routing_reconstruct (block
reassembly on completion), and routing_finalize (chunk concatenation).

Output is a "step_details" array in the JSON with per-step prefill/decode
counts, step time, and all four routing timings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>