Store MoE routing indices per-block in KVBlockAllocator for prefix caching sharing by lmcafee-nvidia · Pull Request #2 · sidsingh-nvidia/Megatron-LM

lmcafee-nvidia · 2026-03-19T19:20:53Z

Summary

Moves MoE routing index storage from per-request (request.routing_indices) to per-block in KVBlockAllocator.block_routing, so routing data is shared alongside KV cache blocks when requests share a prefix.
After each step, _store_routing_per_block scatters routing indices into the allocator using the context's token-to-block mapping. At request completion, _reconstruct_routing_from_blocks concatenates block routing data back into a contiguous array.
Routing data persists through block release and is only cleared on block re-allocation or allocator reset.
Converts per-block routing storage from torch.Tensor to np.ndarray, aligning with the base branch's conversion of routing_indices to np.ndarray.
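To make the scatter and reconstruct steps concrete, here is a minimal sketch of per-block routing storage. It is not the PR's code: the class name `BlockRoutingStore`, the method names, the `-1` fill value, and the assumed routing shape `[num_tokens, num_layers, top_k]` are all hypothetical, chosen only to illustrate the behavior described above.

```python
import numpy as np


class BlockRoutingStore:
    """Hypothetical stand-in for the routing portion of KVBlockAllocator."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.block_routing = {}  # block_id -> ndarray [block_size, ...]

    def store(self, routing, token_block_ids, token_offsets):
        # Scatter one step's routing rows into their blocks, using the
        # token-to-block mapping (block id and in-block offset per token).
        assert routing.shape[0] == len(token_block_ids) == len(token_offsets), \
            "routing/token count mismatch"
        for block_id in np.unique(token_block_ids):
            mask = token_block_ids == block_id
            key = int(block_id)
            if key not in self.block_routing:
                # Assumption: unfilled slots are marked with -1.
                self.block_routing[key] = np.full(
                    (self.block_size, *routing.shape[1:]), -1, dtype=routing.dtype
                )
            self.block_routing[key][token_offsets[mask]] = routing[mask]

    def reconstruct(self, block_ids, num_tokens):
        # Concatenate per-block routing in block order, then trim padding.
        parts = [self.block_routing[b] for b in block_ids]
        return np.concatenate(parts, axis=0)[:num_tokens]

    def on_reallocate(self, block_id):
        # Routing survives release; it is dropped only on re-allocation.
        self.block_routing.pop(block_id, None)

    def reset(self):
        self.block_routing.clear()


store = BlockRoutingStore(block_size=2)

# Request A: 3 tokens spanning blocks 0 and 1.
routing_a = np.arange(6, dtype=np.int32).reshape(3, 1, 2)
store.store(routing_a, np.array([0, 0, 1]), np.array([0, 1, 0]))
rec_a = store.reconstruct([0, 1], num_tokens=3)

# Request B reuses block 0 (shared prefix) and writes only its own token to block 2.
routing_b_new = np.array([[[9, 8]]], dtype=np.int32)
store.store(routing_b_new, np.array([2]), np.array([0]))
rec_b = store.reconstruct([0, 2], num_tokens=3)
```

In this sketch request B never stores routing for the shared prefix, yet its reconstructed array still covers all three tokens because block 0 retains the rows written by request A.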

Test plan

All 7 TestPerBlockRouting tests pass (pytest ... ::TestPerBlockRouting -xvs)

🤖 Generated with Claude Code

…ility. Move routing indices from per-request step-by-step accumulation to per-block storage on KVBlockAllocator. At request completion, routing is reconstructed by concatenating per-block routing in block order. Matched (prefix-cached) blocks retain routing from the original request, so reconstruction naturally covers all tokens including skipped prefixes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Aligns with PR NVIDIA#3925's conversion of routing_indices to np.ndarray. Changes block_routing dict, store/get/reconstruct methods, and _store_routing_per_block to use numpy throughout.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ter-record

store_routing_per_block and reconstruct_routing_from_blocks operate on allocator state and belong alongside the existing store/get primitives. Controller and engine call sites now delegate to the allocator.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

store_routing_per_block expects a dict mapping request_id to a routing ndarray, but _router_record_bookkeeping was returning a plain list. Checking membership against that list raised "ambiguous truth value" errors.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
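The failure mode can be reproduced in isolation. The snippet below is an illustrative sketch, not the PR's code; the request id `7` and the variable names are made up. A membership test on a list compares the key against each element with `==`, which for an ndarray yields an array whose truth value is ambiguous, whereas a dict checks keys by hash.

```python
import numpy as np

routing = np.array([[0, 1], [2, 3]])
request_id = 7  # hypothetical request id

as_list = [routing]              # what the bookkeeping returned
as_dict = {request_id: routing}  # what the store step expects

try:
    request_id in as_list        # evaluates bool(request_id == routing)
    list_check_raised = False
except ValueError:
    list_check_raised = True     # "truth value of an array ... is ambiguous"

dict_check = request_id in as_dict  # key lookup only, no array comparison
```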

lmcafee-nvidia and others added 3 commits March 19, 2026 12:24

Assert on routing/token count mismatch instead of silent return

306e8d8

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

lmcafee-nvidia force-pushed the prefix-caching-router-record branch from 69a0403 to 2d90296 Compare March 19, 2026 19:26

Merge siddharth/support-nemo-rl-router-replay into prefix-caching-rou…

386f7c5

…ter-record

lmcafee-nvidia changed the title ~~Convert per-block routing storage from Tensor to ndarray~~ Store MoE routing indices per-block in KVBlockAllocator for prefix caching sharing Apr 6, 2026

lmcafee-nvidia and others added 2 commits April 6, 2026 15:00

lmcafee-nvidia wants to merge 6 commits into sidsingh-nvidia:siddharth/support-nemo-rl-router-replay from lmcafee-nvidia:prefix-caching-router-record
