@mitu626 mitu626 commented Jan 15, 2026

Motivation

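  1. Support in-place weight updates over RDMA for asynchronous, disaggregated RL scenarios.
  2. Within FastDeploy, support a control-signal communication mechanism across api-server <--> engine <--> worker.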

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

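  1. Control-signal communication:
    • api-server <--> engine: reuses the existing zmq communication mechanism and adds the ControlRequest and ControlResponse message types.
    • engine ---> worker: reuses the existing communication mechanism (engine_worker_queue) and inserts ControlRequest requests into it.
    • engine <--- worker: adds an fmq.Queue-based communication queue between each TP rank and the engine, so that each TP rank can return its own execution result as a ControlResponse.
  2. New endpoints:
    • /v1/pause: the inference engine stops responding to inference requests; requests that are currently executing are interrupted.
    • /v1/update_weights: synchronizes weights from the remote side over RDMA and updates them in place.
    • /v1/resume: resumes the inference engine so that it responds to inference requests again.
    • /v1/is_paused: queries the current state of the inference engine.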
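
The engine <--> worker control path described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: Python's standard queue.Queue stands in for fmq.Queue, the per-rank inbox is a simplification of engine_worker_queue, and the dataclass fields, worker_loop, broadcast_and_gather and run_demo are hypothetical names chosen for the example.

```python
import queue
import threading
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ControlRequest:
    # Field names are illustrative assumptions, not FastDeploy's actual schema.
    request_id: str
    method: str  # e.g. "pause" or "update_weights"
    args: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ControlResponse:
    # One response per TP rank, tagged with the rank that produced it.
    request_id: str
    rank: int
    ok: bool
    message: str = ""


def worker_loop(rank: int, inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Hypothetical TP-rank worker: answer control requests until None arrives."""
    while True:
        req: Optional[ControlRequest] = inbox.get()
        if req is None:  # shutdown sentinel used only by this sketch
            break
        outbox.put(ControlResponse(req.request_id, rank, True, f"{req.method} done"))


def broadcast_and_gather(
    req: ControlRequest,
    inboxes: List[queue.Queue],
    outboxes: List[queue.Queue],
    timeout: float = 5.0,
) -> List[ControlResponse]:
    """Engine side: send one request to every rank, then collect each rank's reply."""
    for inbox in inboxes:
        inbox.put(req)
    # Reading the per-rank return queues in order yields responses in rank order.
    return [outbox.get(timeout=timeout) for outbox in outboxes]


def run_demo(tp_size: int = 2) -> List[ControlResponse]:
    inboxes = [queue.Queue() for _ in range(tp_size)]
    outboxes = [queue.Queue() for _ in range(tp_size)]
    threads = [
        threading.Thread(target=worker_loop, args=(rank, inboxes[rank], outboxes[rank]))
        for rank in range(tp_size)
    ]
    for thread in threads:
        thread.start()
    responses = broadcast_and_gather(ControlRequest("req-1", "pause"), inboxes, outboxes)
    for inbox in inboxes:
        inbox.put(None)  # stop the workers
    for thread in threads:
        thread.join()
    return responses
```

Keeping one return queue per TP rank lets the engine tell which rank succeeded or failed, which matches the PR's stated goal that each rank reports its own execution result.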

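Other notes:

  1. Executing a control signal currently blocks the processing of inference requests. Weight updates require a pause anyway, so this has no impact in that scenario; a later upgrade should add optional asynchronous execution that does not block inference requests.
  2. The worker ---> engine return channel for control signals is currently separate from the return channel for inference outputs; a follow-up should consider whether the two can be merged to simplify the system.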

Usage or Command

  1. At service startup, pass the new arguments --load-strategy "rsync" --rsync-config '{"xxxx":"xxxx"}'
  2. API calls:

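Notes:

  • The model version that rsync needs is read by default from the file {model}/version.txt under the model path; it can also be specified through a parameter when calling the endpoint.
  • The config that rsync depends on can be passed as default parameters at startup, and can also be specified through a parameter when calling the endpoint.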
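
A client-side weight-update cycle built on the endpoints named in this PR might look like the sketch below. The PR text does not show the HTTP method or the request and response bodies, so the use of POST with a JSON body, the helper names http_post_json and update_weights_cycle, and the update_args payload are all assumptions made for illustration; only the endpoint paths come from the PR.

```python
import json
import urllib.request
from typing import Any, Callable, Dict, List, Optional, Tuple

# A transport takes (url, payload) and returns the decoded reply.
Transport = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def http_post_json(url: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """POST a JSON body and decode a JSON reply (POST + JSON is an assumption)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def update_weights_cycle(
    base_url: str,
    send: Transport = http_post_json,
    update_args: Optional[Dict[str, Any]] = None,
) -> List[Tuple[str, Dict[str, Any]]]:
    """Call pause, update_weights, then resume, in that order.

    If any call raises, the exception propagates and later steps are skipped,
    so a failed update leaves the engine paused for the caller to handle.
    """
    steps = [
        ("/v1/pause", {}),
        # Payload keys (e.g. a model version or rsync config) are not given
        # in the PR text, so they are left entirely to the caller.
        ("/v1/update_weights", update_args or {}),
        ("/v1/resume", {}),
    ]
    return [(path, send(base_url + path, payload)) for path, payload in steps]
```

Passing the transport as a parameter keeps the ordering logic testable without a running server; a caller would supply the real base URL of the api-server.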

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@CLAassistant
CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

paddle-bot bot commented Jan 15, 2026

Thanks for your contribution!
