
@kip-cxj kip-cxj commented Dec 16, 2025

Motivation

Add a stateless communication group, to enable more flexible creation of communication groups and to resolve compatibility issues with other programs that also use `torch.distributed`.
Currently vLLM is supported, while sglang does not yet support pyhccl; this feature depends on adding pyhccl support to sglang.
If the current approach is acceptable, we will provide the sglang version soon.
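To illustrate the idea behind a "stateless" group: ranks rendezvous through a shared key-value store instead of joining `torch.distributed`'s single global default group, so many independent groups can coexist in one process. The names below (`Store`, `StatelessGroup`) are a hypothetical minimal sketch of the concept, not this PR's or vLLM's actual code:

```python
import threading

class Store:
    """Toy in-memory rendezvous store (stands in for a TCP-based store)."""
    def __init__(self):
        self._data, self._cond = {}, threading.Condition()

    def set(self, key, value):
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()

    def get(self, key):
        # Block until some rank has published the key.
        with self._cond:
            self._cond.wait_for(lambda: key in self._data)
            return self._data[key]

class StatelessGroup:
    """Each instance is independent; creating one does not touch any
    process-global state, unlike torch.distributed's default group."""
    def __init__(self, store, rank, world_size):
        self.store, self.rank, self.world_size = store, rank, world_size

    def broadcast(self, obj=None, src=0):
        if self.rank == src:
            self.store.set(("bcast", src), obj)
            return obj
        return self.store.get(("bcast", src))
```

Because group membership lives entirely in the store and the group object, a trainer and an inference engine can form ad-hoc groups for weight updates without fighting over the one default process group.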

@kip-cxj kip-cxj force-pushed the main branch 2 times, most recently from 1b27b3f to f989a80 Compare December 17, 2025 07:12
@kip-cxj kip-cxj changed the title draft: add collective communication for npu draft: add stateless communication for npu Dec 30, 2025

x1314aq commented Jan 7, 2026

@weixiao-huang @HubertZhang pls review this PR

Tested on both NPU and CUDA:

| Model | Device | Info | device_type | GatherMetas | Update (Broadcast) | Update (P2P) |
|---|---|---|---|---|---|---|
| Qwen3-8b | 8xNvidia-A100 | TP4 | cuda | 0.01s | 1.28s (1.46GiB) | 7.81s (1.72GiB) |
| Qwen3-8b | 8xAscend-A3 | TP4 | npu | 0.02s | 1.37s (1.59GiB) | 2.02s (1.47GiB) |

Tested the same model using the default `torch.distributed` module:

| Model | Device | Info | device_type | GatherMetas | Update (Broadcast) | Update (P2P) |
|---|---|---|---|---|---|---|
| Qwen3-8b | 8xNvidia-A100 | TP4 | torch | 0.01s | 1.15s (1.46GiB) | 7.68s (1.71GiB) |
| Qwen3-8b | 8xAscend-A3 | TP4 | torch | 0.03s | 1.44s (1.59GiB) | 3.83s (1.46GiB) |

@kip-cxj kip-cxj changed the title draft: add stateless communication for npu feat: Replace torch.distributed with StatelessProcessGroup Jan 8, 2026
@kip-cxj kip-cxj changed the title feat: Replace torch.distributed with StatelessProcessGroup feat: add StatelessProcessGroup to extend collective library Jan 8, 2026
@weixiao-huang
Collaborator

It seems this PR would have to depend on vLLM, which is heavy and not an elegant way to do this. I think ps.py should be a lightweight component that does not depend on other heavy frameworks.

@hanhan-networking

> It seems this PR would have to depend on vLLM, which is heavy and not an elegant way to do this. I think ps.py should be a lightweight component that does not depend on other heavy frameworks.

The default communication method is still `torch.distributed`; StatelessProcessGroup is only needed when communicating across resource pools. If this isn't supported, it can't be merged into verl 😆, since it wouldn't support the disaggregated training/inference architecture.

@HubertZhang
Collaborator

Would it be better to design a protocol `DistributedLib` and pass a `dist: DistributedLib` into ps? The current import-based approach doesn't feel isolated enough.
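The suggestion above could look something like the following hypothetical sketch (the `DistributedLib` protocol and `FakeDist` backend are illustrative names, not code from this PR): ps.py would accept any object implementing the protocol, so the backend (`torch.distributed`, StatelessProcessGroup, ...) is injected rather than imported.

```python
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class DistributedLib(Protocol):
    """Minimal interface ps.py would depend on, instead of a concrete library."""
    def broadcast(self, obj: Any, src: int) -> Any: ...
    def barrier(self) -> None: ...

class FakeDist:
    """Trivial single-process backend, here only to demonstrate the injection."""
    def broadcast(self, obj, src=0):
        return obj
    def barrier(self):
        pass

def gather_metas(dist: DistributedLib, metas):
    # ps.py-style code calls only protocol methods; which framework backs
    # `dist` is the caller's choice, keeping ps.py free of heavy imports.
    dist.barrier()
    return dist.broadcast(metas, src=0)
```

With this shape, adding a new backend (e.g. a stateless group) means writing one adapter class, and ps.py itself never grows a framework dependency.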
