Store MoE routing indices per-block in KVBlockAllocator for prefix caching sharing #2

Open
lmcafee-nvidia wants to merge 6 commits into sidsingh-nvidia:siddharth/support-nemo-rl-router-replay from lmcafee-nvidia:prefix-caching-router-record

Conversation

@lmcafee-nvidia lmcafee-nvidia commented Mar 19, 2026

Summary

  • Moves MoE routing index storage from per-request (request.routing_indices) to per-block in KVBlockAllocator.block_routing, so routing data is shared alongside KV cache blocks when requests share a prefix.
  • After each step, _store_routing_per_block scatters routing indices into the allocator using the context's token-to-block mapping. At request completion, _reconstruct_routing_from_blocks concatenates block routing data back into a contiguous array.
  • Routing data persists through block release and is only cleared on block re-allocation or allocator reset.
  • Converts per-block routing storage from torch.Tensor to np.ndarray, aligning with the base branch's conversion of routing_indices to np.ndarray.
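
The scatter and reconstruct steps described above can be sketched roughly as follows. This is a toy illustration, not the code in this PR: `ToyBlockAllocator`, the method signatures, the array shapes, and the assumption that a block's tokens arrive in position order across steps are all hypothetical.

```python
import numpy as np


class ToyBlockAllocator:
    """Toy stand-in for KVBlockAllocator; only routing storage is modelled."""

    def __init__(self):
        # block_id -> ndarray of shape [tokens_in_block, top_k] (assumed layout)
        self.block_routing = {}

    def store_routing_per_block(self, routing, token_block_ids):
        """Scatter one step's routing rows into the blocks their tokens live in."""
        # dict.fromkeys keeps first-seen order while removing duplicates.
        for block_id in dict.fromkeys(token_block_ids.tolist()):
            rows = routing[token_block_ids == block_id]
            prev = self.block_routing.get(block_id)
            # Append to whatever earlier steps already wrote for this block.
            self.block_routing[block_id] = (
                rows if prev is None else np.concatenate([prev, rows], axis=0)
            )

    def reconstruct_routing_from_blocks(self, block_ids):
        """Concatenate per-block routing, in block order, into one array."""
        parts = [self.block_routing[b] for b in block_ids if b in self.block_routing]
        return np.concatenate(parts, axis=0) if parts else None


alloc = ToyBlockAllocator()

# Prefill step: six tokens, four in block 7 and two in block 9.
prefill = np.arange(12).reshape(6, 2)
alloc.store_routing_per_block(prefill, np.array([7, 7, 7, 7, 9, 9]))

# Decode step: one new token, which also lands in block 9.
alloc.store_routing_per_block(np.array([[100, 101]]), np.array([9]))

# At request completion, rebuild the contiguous per-token array.
full = alloc.reconstruct_routing_from_blocks([7, 9])
```

Here `full` has one row per token (seven in total), in the same order the tokens were produced, even though the rows were written across two steps and two blocks.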

Test plan

  • All 7 TestPerBlockRouting tests pass (pytest ... ::TestPerBlockRouting -xvs)

🤖 Generated with Claude Code

lmcafee-nvidia and others added 3 commits March 19, 2026 12:24
…ility.

Move routing indices from per-request step-by-step accumulation to
per-block storage on KVBlockAllocator. At request completion, routing
is reconstructed by concatenating per-block routing in block order.
Matched (prefix-cached) blocks retain routing from the original request,
so reconstruction naturally covers all tokens including skipped prefixes.
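
A small sketch of why reconstruction covers a skipped prefix, assuming a plain dict in place of the allocator and a hypothetical layout of two tokens per block with top-1 routing. The `reconstruct` helper and the block ids are illustrative only.

```python
import numpy as np


def reconstruct(block_routing, block_ids):
    """Concatenate per-block routing in the request's block order."""
    return np.concatenate([block_routing[b] for b in block_ids], axis=0)


block_routing = {}

# Request A computes all six of its tokens and fills blocks 0, 1 and 2.
block_routing[0] = np.array([[3], [1]])
block_routing[1] = np.array([[0], [2]])
block_routing[2] = np.array([[1], [1]])

# Request B shares A's first four tokens, so blocks 0 and 1 are matched.
# B skips computing them and only writes routing for its own block 5.
block_routing[5] = np.array([[2], [0]])

routing_a = reconstruct(block_routing, [0, 1, 2])
routing_b = reconstruct(block_routing, [0, 1, 5])
```

`routing_b` still has six rows: the first four come from the blocks request A wrote, and the last two from the block request B computed itself.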

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Aligns with PR NVIDIA#3925's conversion of routing_indices to np.ndarray.
Changes block_routing dict, store/get/reconstruct methods, and
_store_routing_per_block to use numpy throughout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@lmcafee-nvidia lmcafee-nvidia force-pushed the prefix-caching-router-record branch from 69a0403 to 2d90296 Compare March 19, 2026 19:26
@lmcafee-nvidia lmcafee-nvidia changed the title from "Convert per-block routing storage from Tensor to ndarray" to "Store MoE routing indices per-block in KVBlockAllocator for prefix caching sharing" Apr 6, 2026
lmcafee-nvidia and others added 2 commits April 6, 2026 15:00
store_routing_per_block and reconstruct_routing_from_blocks operate on
allocator state and belong alongside the existing store/get primitives.
Controller and engine call sites now delegate to the allocator.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
store_routing_per_block expects a dict mapping request_id to routing
ndarray, but _router_record_bookkeeping was returning a plain list.
This caused "ambiguous truth value" errors when checking membership.
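
One plausible way this error arises, shown with plain NumPy: a membership test against a list of arrays compares the key to each array with `==`, and the resulting boolean array cannot be reduced to a single truth value. The request ids and arrays below are made up for illustration, and the exact failing expression in the PR may differ.

```python
import numpy as np

routing_a = np.array([[0, 1], [2, 3]])
routing_b = np.array([[4, 5], [6, 7]])

as_list = [routing_a, routing_b]             # shape of the old return value
as_dict = {101: routing_a, 102: routing_b}   # request_id -> routing ndarray

try:
    # list.__contains__ evaluates `101 == routing_a`, which yields an array
    # of booleans; converting that array to a single bool raises ValueError.
    101 in as_list
    list_error = None
except ValueError as exc:
    list_error = str(exc)

# Dict membership hashes the key and never compares against the arrays.
found = 101 in as_dict
```

With the dict, the membership check only looks at the request id keys, so the array values are never compared.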

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>