
@kip-cxj kip-cxj commented Dec 16, 2025

Motivation

Add a stateless communication group, to enable more flexible creation of communication groups and to resolve compatibility issues with other programs that also use `torch.distributed`.
Currently vLLM is supported, while sglang does not yet support pyhccl; this feature depends on adding pyhccl support to sglang.
If the current approach is acceptable, we will provide the sglang version soon.
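To illustrate the idea behind a "stateless" group: ranks rendezvous through a shared key-value store instead of joining `torch.distributed`'s single global default group, so many independent groups can coexist in one process. The names below (`Store`, `StatelessGroup`) are a hypothetical minimal sketch of the concept, not this PR's or vLLM's actual code:

```python
import threading

class Store:
    """Toy in-memory rendezvous store (stands in for a TCP-based store)."""
    def __init__(self):
        self._data, self._cond = {}, threading.Condition()

    def set(self, key, value):
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()

    def get(self, key):
        # Block until some rank has published the key.
        with self._cond:
            self._cond.wait_for(lambda: key in self._data)
            return self._data[key]

class StatelessGroup:
    """Each instance is independent; creating one does not touch any
    process-global state, unlike torch.distributed's default group."""
    def __init__(self, store, rank, world_size):
        self.store, self.rank, self.world_size = store, rank, world_size

    def broadcast(self, obj=None, src=0):
        if self.rank == src:
            self.store.set(("bcast", src), obj)
            return obj
        return self.store.get(("bcast", src))
```

Because group membership lives entirely in the store and the group object, a trainer and an inference engine can form ad-hoc groups for weight updates without fighting over the one default process group.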

@kip-cxj kip-cxj force-pushed the main branch 2 times, most recently from 1b27b3f to f989a80 Compare December 17, 2025 07:12
@kip-cxj kip-cxj changed the title draft: add collective communication for npu draft: add stateless communication for npu Dec 30, 2025

x1314aq commented Jan 7, 2026

@weixiao-huang @HubertZhang pls review this PR

Tested on both NPU and CUDA:

| Model | Device | Info | device_type | GatherMetas | Update (Broadcast) | Update (P2P) |
|---|---|---|---|---|---|---|
| Qwen3-8b | 8xNvidia-A100 | TP4 | cuda | 0.01s | 1.28s (1.46GiB) | 7.81s (1.72GiB) |
| Qwen3-8b | 8xAscend-A3 | TP4 | npu | 0.02s | 1.37s (1.59GiB) | 2.02s (1.47GiB) |

Tested the same model using the default `torch.distributed` module:

| Model | Device | Info | device_type | GatherMetas | Update (Broadcast) | Update (P2P) |
|---|---|---|---|---|---|---|
| Qwen3-8b | 8xNvidia-A100 | TP4 | torch | 0.01s | 1.15s (1.46GiB) | 7.68s (1.71GiB) |
| Qwen3-8b | 8xAscend-A3 | TP4 | torch | 0.03s | 1.44s (1.59GiB) | 3.83s (1.46GiB) |

@kip-cxj kip-cxj changed the title draft: add stateless communication for npu feat: Replace torch.distributed with StatelessProcessGroup Jan 8, 2026
@kip-cxj kip-cxj changed the title feat: Replace torch.distributed with StatelessProcessGroup feat: add StatelessProcessGroup to extend collective library Jan 8, 2026
@weixiao-huang
Collaborator

It seems this PR would have to depend on vLLM, which is heavy and not an elegant way to do this. I think ps.py should be a lightweight component that does not depend on other heavy frameworks.

@hanhan-networking

> It seems this PR would have to depend on vLLM, which is heavy and not an elegant way to do this. I think ps.py should be a lightweight component that does not depend on other heavy frameworks.

The default communication method is still `torch.distributed`; StatelessProcessGroup is only needed when communicating across resource pools. If this isn't supported, it can't be merged into verl 😆, since it wouldn't support the disaggregated training/inference architecture.

@HubertZhang
Collaborator

Would it be better to design a protocol `DistributedLib` and pass a `dist: DistributedLib` into ps? The current import-based approach doesn't feel isolated enough.
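The suggestion above could look something like the following hypothetical sketch (the `DistributedLib` protocol and `FakeDist` backend are illustrative names, not code from this PR): ps.py would accept any object implementing the protocol, so the backend (`torch.distributed`, StatelessProcessGroup, ...) is injected rather than imported.

```python
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class DistributedLib(Protocol):
    """Minimal interface ps.py would depend on, instead of a concrete library."""
    def broadcast(self, obj: Any, src: int) -> Any: ...
    def barrier(self) -> None: ...

class FakeDist:
    """Trivial single-process backend, here only to demonstrate the injection."""
    def broadcast(self, obj, src=0):
        return obj
    def barrier(self):
        pass

def gather_metas(dist: DistributedLib, metas):
    # ps.py-style code calls only protocol methods; which framework backs
    # `dist` is the caller's choice, keeping ps.py free of heavy imports.
    dist.barrier()
    return dist.broadcast(metas, src=0)
```

With this shape, adding a new backend (e.g. a stateless group) means writing one adapter class, and ps.py itself never grows a framework dependency.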
