feat: add StatelessProcessGroup to extend collective library #66
base: main
Conversation
Force-pushed from 1b27b3f to f989a80
2. cache uuid in inference engine
@weixiao-huang @HubertZhang please review this PR. Tested on both NPU and CUDA, and tested the same model with the default torch.distributed module.
It seems this PR has to depend on vLLM, which is heavy and not an elegant approach. I think
The default communication method is still torch.distributed; StatelessProcessGroup is only needed when communicating across resource pools. If that case isn't supported, this can't be merged into verl 😆, since it wouldn't support the disaggregated training/inference architecture.
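
For illustration, a minimal sketch of that dispatch (assuming vLLM's StatelessProcessGroup API; `init_sync_group` and its arguments are hypothetical names, not part of this PR):

```python
# Hypothetical dispatch: torch.distributed stays the default, and a
# stateless group is created only for cross-resource weight sync.
import torch.distributed as dist
from vllm.distributed.utils import StatelessProcessGroup

def init_sync_group(cross_resource: bool, host: str, port: int,
                    rank: int, world_size: int):
    if not cross_resource:
        # Co-located processes: the regular torch.distributed world is enough.
        if not dist.is_initialized():
            dist.init_process_group(
                backend="nccl",
                init_method=f"tcp://{host}:{port}",
                rank=rank,
                world_size=world_size,
            )
        return dist.group.WORLD
    # Disaggregated training/inference: a stateless group never touches the
    # global torch.distributed state, so it cannot clash with process groups
    # the trainer or the inference engine already created.
    return StatelessProcessGroup.create(
        host=host, port=port, rank=rank, world_size=world_size)
```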
Would it be better to design a protocol DistributedLib and pass a dist: DistributedLib into the ps? The current import-based approach doesn't feel sufficiently isolated.
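
A minimal sketch of that suggestion (all names here are hypothetical, not from this PR):

```python
# Inject the communication backend behind a protocol, so the parameter
# server is isolated from any particular library (torch.distributed,
# StatelessProcessGroup, ...) instead of hard-wiring it via imports.
from typing import Protocol
import torch

class DistributedLib(Protocol):
    def broadcast(self, tensor: torch.Tensor, src: int) -> None: ...
    def send(self, tensor: torch.Tensor, dst: int) -> None: ...
    def recv(self, tensor: torch.Tensor, src: int) -> None: ...

class ParameterServer:
    def __init__(self, dist: DistributedLib):
        # The backend is chosen by the caller and passed in.
        self.dist = dist

    def broadcast_weights(self, tensor: torch.Tensor, src: int = 0) -> None:
        self.dist.broadcast(tensor, src=src)
```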
Motivation
Add a stateless communication group to enable more flexible creation of communication groups and to resolve compatibility issues with other programs that also use torch.distributed. vLLM is currently supported, while SGLang is not yet, since this feature depends on adding pyhccl support to SGLang. If the current approach is acceptable, we will provide the SGLang version soon.
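
For reference, a minimal sketch of the underlying pattern using vLLM's public StatelessProcessGroup together with PyNcclCommunicator on CUDA (an HCCL backend would be wired up analogously via pyhccl once supported; the host, port, and ranks are illustrative values):

```python
import torch
from vllm.distributed.utils import StatelessProcessGroup
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator

def build_weight_sync_group(host: str, port: int, rank: int, world_size: int):
    # The stateless group coordinates over TCP and never calls
    # torch.distributed.init_process_group, so it cannot interfere with a
    # process group the host program has already set up.
    pg = StatelessProcessGroup.create(
        host=host, port=port, rank=rank, world_size=world_size)
    return PyNcclCommunicator(pg, device=torch.cuda.current_device())

# Example: rank 0 (trainer) broadcasts updated weights to inference ranks.
# comm = build_weight_sync_group("10.0.0.1", 51216, rank, world_size)
# comm.broadcast(weight_tensor, src=0, stream=torch.cuda.current_stream())
```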