fix: Skip cudagraph capture at prefillWarmup stage #805
bppps wants to merge 1 commit into alibaba:main
Conversation
🤖 AI Code Review
Overview: Fixes models such as Qwen35 during …
Strengths: …
Suggested improvements:
P1 - Important: …
P2 - Suggestion: …
Summary: The crash fix is correct and necessary. Recommend fixing the invariant issue (set …
Force-pushed from 4a70e5d to c3343ee
…is passed nullptr but strongly required by linear attention in qwen35 based models
Signed-off-by: bppps <bpppsaka@gmail.com>
Force-pushed from c3343ee to 071aae0
Fixed per the review comments.
🤖 AI Code Review (v2 — PR #805 fix: Skip cudagraph capture at prefillWarmup stage, incremental review)
Verdict: Approve — v2 changes address both P1 findings from the initial review.
Delta summary (3d5afb7 → 071aae0): changed files …
Status of v1 findings: …
New issues in v2: none.
Correctness verification: …
Suggested follow-ups: …
🤖 AI Code Review — PR #805
Summary: Fixes a crash/incorrect behavior during the prefill warmup phase where CUDA graph capture was attempted without a valid …
Findings: No significant issues found. The change is well-scoped: …
Clean fix, no concerns.
🤖 AI Code Review — PR #805
PR overview: Title: …
Core goal: Fix …
Review comments: Issues: …
Overall assessment: A clean, focused bug-fix PR. The root-cause analysis is clear, the fix is sound, and the WARNING log adequately documents the residual risk. The PR description is high quality.
LGTM, ready to CI — this review found no blocking or major issues, so the PR can proceed to CI validation and merge; the P2 suggestion is a non-blocking follow-up improvement.
AI Code Review — PR #805
Summary: P0/0 · P1/0 · P2/0 · P3/0
Review status: LGTM, ready to CI
Strengths: …
Problems:
NormalEngine initialization goes through the warmup -> initCacheManager -> initExecutor sequence, shown below.
rtp-llm/rtp_llm/cpp/normal_engine/NormalEngine.cc, lines 69–82 (1ae04a3)
At the warmup stage, KVCacheManager is set to nullptr during NormalExecutor initialization, which results in PyModelWrapper being created with no KV cache enabled. If rtp_llm is launched with cudagraph enabled, it will try to capture the decode graph during prefill warmup.
rtp-llm/rtp_llm/cpp/normal_engine/NormalEngine.cc, line 208 (1ae04a3)
rtp-llm/rtp_llm/cpp/normal_engine/NormalExecutor.cc, lines 66–69 (1ae04a3)
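For clarity, here is a minimal, self-contained C++ sketch of that initialization order. Only the warmup -> initCacheManager -> initExecutor sequence and the nullptr KVCacheManager during warmup come from the referenced code; the class shapes, member names, and output below are hypothetical stand-ins, not the actual NormalEngine.cc implementation.

```cpp
#include <iostream>
#include <memory>

// Hypothetical stand-ins: only the call order (warmup -> initCacheManager
// -> initExecutor) and the nullptr cache during warmup come from the PR.
struct KVCacheManager {};

struct NormalExecutor {
    explicit NormalExecutor(KVCacheManager* cache) : cache_(cache) {}
    void run(bool cudagraph_enabled) {
        if (cudagraph_enabled && cache_ == nullptr) {
            // Buggy path: decode-graph capture is attempted during prefill
            // warmup even though no KV cache exists yet.
            std::cout << "capturing decode graph with null kv cache -> crash\n";
        }
    }
    KVCacheManager* cache_;
};

class NormalEngine {
public:
    void init(bool cudagraph_enabled) {
        warmup(cudagraph_enabled);  // executor built with a nullptr cache
        initCacheManager();         // real KVCacheManager sized and created
        initExecutor();             // executor rebuilt with a valid cache
    }

private:
    void warmup(bool cudagraph_enabled) {
        executor_ = std::make_unique<NormalExecutor>(nullptr);
        executor_->run(cudagraph_enabled);
    }
    void initCacheManager() { cache_ = std::make_unique<KVCacheManager>(); }
    void initExecutor() { executor_ = std::make_unique<NormalExecutor>(cache_.get()); }

    std::unique_ptr<KVCacheManager> cache_;
    std::unique_ptr<NormalExecutor> executor_;
};

int main() { NormalEngine{}.init(/*cudagraph_enabled=*/true); }
```

The point of the ordering is that the executor constructed for warmup can never see a valid cache manager, so any capture logic that assumes one must be guarded.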
Moreover, for hybrid-attention models such as Qwen35, the KV cache is now strictly required by default.
rtp-llm/rtp_llm/models_py/model_desc/qwen3_next.py, line 399 (1ae04a3)
Launch command: …
This results in the stack trace below:
what(): AssertionError: kv_cache is required for decode
At:
/home/admin/rtp-llm/rtp_llm/models_py/model_desc/qwen3_next.py(399): forward
/opt/conda310/lib/python3.10/site-packages/torch/nn/modules/module.py(1750): _call_impl
/opt/conda310/lib/python3.10/site-packages/torch/nn/modules/module.py(1739): _wrapped_call_impl
/home/admin/rtp-llm/rtp_llm/models_py/model_desc/qwen3_next.py(572): forward
/opt/conda310/lib/python3.10/site-packages/torch/nn/modules/module.py(1750): _call_impl
/opt/conda310/lib/python3.10/site-packages/torch/nn/modules/module.py(1739): _wrapped_call_impl
/home/admin/rtp-llm/rtp_llm/models_py/model_desc/qwen3_next.py(654): forward
/opt/conda310/lib/python3.10/site-packages/torch/nn/modules/module.py(1750): _call_impl
/opt/conda310/lib/python3.10/site-packages/torch/nn/modules/module.py(1739): _wrapped_call_impl
/home/admin/rtp-llm/rtp_llm/models_py/model_desc/qwen3_next.py(749): forward
Modifications:
This change skips cudagraph capture by checking whether kv_cache_layer is nullptr, which is the default during warmup. One consideration is that this may introduce inaccuracies in the warmup VRAM measurement; these should be within acceptable tolerance, and any shortfall will be reported during engine setup if KVCacheManager requires more space than measured.
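As a hedged illustration (not the actual diff), the guard could look like the following sketch. kv_cache_layer is the name used in this PR, while the surrounding function, type, and log wording are assumptions for the example:

```cpp
#include <iostream>

// Hypothetical sketch of the guard; "kv_cache_layer" is the name used in
// this PR, but the function shape and log wording here are illustrative.
struct KVCacheLayer {};

void maybeCaptureDecodeGraph(const KVCacheLayer* kv_cache_layer) {
    if (kv_cache_layer == nullptr) {
        // During prefill warmup the KV cache does not exist yet, so decode
        // graph capture would pass nullptr into kernels (e.g. Qwen35 linear
        // attention) that strictly require it. Skip capture and warn that
        // the warmup VRAM measurement will not include capture memory.
        std::cerr << "[WARNING] kv cache layer is null (warmup); skipping "
                     "cudagraph capture, VRAM estimate may be slightly low\n";
        return;
    }
    // ... normal decode-graph capture path ...
}

int main() {
    maybeCaptureDecodeGraph(nullptr);  // warmup: capture skipped with warning
}
```

Capture then happens only once the engine is re-initialized with a real cache manager, which is exactly the post-warmup state described in the Problems section above.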