feat: add hybridep #1333

Open
hemildesai wants to merge 4 commits into main from hemil/hybrid-ep-2

hemildesai commented Feb 19, 2026

W&B: https://wandb.ai/Nemo-automodel/automodel-moe-dispatcher

This PR adds HybridEP support for MoE token dispatch and updates dependency/docker wiring needed to run it reliably.
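
As a rough illustration of the backend selection this PR describes, the sketch below validates a dispatcher choice together with an optional SM count. The class name `SketchBackendConfig`, its defaults, and the positivity check on the SM count are assumptions made for illustration. Only the values "deepep", "hybridep", "te", and "gmm" and the field names `dispatcher` and `dispatcher_num_sms` come from this description. It is not the actual BackendConfig implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Values taken from the PR description; everything else here is hypothetical.
FLEX_DISPATCHERS = {"deepep", "hybridep"}
FLEX_EXPERTS = {"te", "gmm"}


@dataclass
class SketchBackendConfig:
    """Hypothetical stand-in for a backend config, for illustration only."""

    experts: str = "gmm"                      # assumed default
    dispatcher: str = "deepep"                # assumed default
    dispatcher_num_sms: Optional[int] = None  # None means "use the backend default" (assumption)

    def __post_init__(self) -> None:
        if self.dispatcher not in FLEX_DISPATCHERS:
            raise ValueError(f"unsupported dispatcher: {self.dispatcher!r}")
        if self.experts not in FLEX_EXPERTS:
            raise ValueError(
                f"dispatcher {self.dispatcher!r} expects experts in {sorted(FLEX_EXPERTS)}"
            )
        # Assumed sanity check: an explicit SM count must be positive.
        if self.dispatcher_num_sms is not None and self.dispatcher_num_sms <= 0:
            raise ValueError("dispatcher_num_sms must be a positive integer")


cfg = SketchBackendConfig(experts="te", dispatcher="hybridep", dispatcher_num_sms=16)
```

Validating at construction time keeps an unsupported dispatcher/experts pairing from reaching the MoE layers, though the real code may handle such cases differently (for example, by falling back to another dispatcher).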

Changelog

  • Add HybridEP backend support to MoE flex token dispatch:
    • Introduce _HybridEPManager in nemo_automodel/components/moe/megatron/token_dispatcher.py.
    • Add moe_flex_dispatcher_backend ("deepep" or "hybridep"), plus backend-specific SM settings.
    • Add HybridEP preprocessing path that converts top-k indices to multihot routing metadata.
  • Extend fused all-to-all utilities in nemo_automodel/components/moe/megatron/fused_a2a.py:
    • Add set_deepep_num_sms.
    • Add HybridEP dispatch/combine autograd wrappers and buffer initialization/reset helpers.
  • Extend backend config and MoE wiring:
    • BackendConfig.dispatcher now accepts "hybridep".
    • Add dispatcher_num_sms.
    • Treat "hybridep" as valid with te/gmm experts.
    • Pass dispatcher backend + SM settings through MoE, GroupedExpertsDeepEP, and GroupedExpertsTE.
  • Add unit tests for HybridEP paths:
    • tests/unit_tests/moe/test_backend_config.py
    • tests/unit_tests/moe/test_experts.py
    • tests/unit_tests/moe/test_layers.py
  • Update dependencies:
    • Bump deep_ep to 7febc6e25660af0f54d95dd781ecdcd62265ecca (v1.2.1+7febc6e metadata).
    • Add Linux-only override: nvidia-cudnn-cu12==9.19.0.56; sys_platform == 'linux'.
    • Update uv.lock accordingly.
  • Fix Docker update behavior:
    • docker/common/update_pyproject_pytorch.sh now removes existing override-dependencies and reinserts docker/common/uv-pytorch.toml under
      [tool.uv] to avoid duplicate/incorrect override blocks.
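
The HybridEP preprocessing item above mentions converting top-k indices to multihot routing metadata. A minimal, framework-free sketch of that idea follows. The function name `topk_to_multihot`, the use of `-1` as a padding index, and the plain-list representation are assumptions for illustration; the PR's actual implementation operates on tensors and may differ in layout and details.

```python
from typing import List, Tuple


def topk_to_multihot(
    topk_indices: List[List[int]],
    topk_probs: List[List[float]],
    num_experts: int,
) -> Tuple[List[List[bool]], List[List[float]]]:
    """Expand per-token top-k expert choices into dense per-expert rows.

    Returns (routing_map, dense_probs), each shaped [num_tokens][num_experts].
    A negative index is treated as padding and skipped (assumed convention).
    """
    routing_map: List[List[bool]] = []
    dense_probs: List[List[float]] = []
    for idx_row, prob_row in zip(topk_indices, topk_probs):
        map_row = [False] * num_experts
        prob_dense = [0.0] * num_experts
        for expert, prob in zip(idx_row, prob_row):
            if expert < 0:
                continue  # padding slot, no expert selected
            if expert >= num_experts:
                raise ValueError(f"expert index {expert} out of range")
            map_row[expert] = True
            prob_dense[expert] = prob
        routing_map.append(map_row)
        dense_probs.append(prob_dense)
    return routing_map, dense_probs


# Two tokens, top-2 routing over 4 experts; the second token uses only one expert.
routing_map, dense_probs = topk_to_multihot(
    [[0, 2], [1, -1]],
    [[0.75, 0.25], [1.0, 0.0]],
    num_experts=4,
)
```

Here `routing_map` marks which experts each token is sent to, and `dense_probs` carries the matching routing weights at the same positions, with zeros elsewhere.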

Additional Information

Signed-off-by: Hemil Desai <hemild@nvidia.com>

copy-pr-bot bot commented Feb 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

hemildesai (Contributor, Author) commented

/ok to test 8d0be12

hemildesai and others added 3 commits February 19, 2026 07:15
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
hemildesai (Contributor, Author) commented

/ok to test 6233747

Development

Successfully merging this pull request may close these issues.

Support HybridEP
