Conversation
Force-pushed from a88dd6f to f0f385a.
🤖 AI Code Review — PR #778: feat: support py flashinfer decode cuda graph and remove c++ flashinfer op

Overview: This PR enables CUDA Graph support for the Python FlashInfer decode path and removes the now-redundant C++ FlashInfer decode/prefill ops and their pybind registrations. It also fixes a crash in the C++ CUDA Graph runner where …

Strengths

Suggested improvements

P1 - Important

P2 - Suggested

P3 - Nit

Summary: Well-structured PR that follows established patterns. The C++ op removal is clean with no dangling references. Main ask: add test coverage for the new decode CUDA graph path before merging.
🤖 AI Code Review — PR #778: feat: support py flashinfer decode cuda graph and remove c++ flashinfer op

Overview: The PR adds CUDA Graph support to the Python FlashInfer decode path and removes the C++ FlashInfer ops that have been fully superseded by the Python implementation. It also fixes … in the C++ CUDA Graph runner.

Strengths

Suggested improvements

P1 - Important

P2 - Suggested

Summary: Clear structure, consistent with existing patterns. The main risk is that the new decode CUDA graph path lacks tests. Recommend adding tests before merging.
🤖 AI Code Review — PR #778

Summary: Major refactor: removes the C++ FlashInfer decode/prefill ops (-468 lines), promotes Python FlashInfer decode to support CUDA graph capture/replay, and fixes the CUDA graph decode crash when …

Findings

Excellent cleanup overall. The CUDA graph support with …
Force-pushed from f0f385a to b84c82f.
Pull request overview
This PR migrates FlashInfer attention to the Python (flashinfer-python) path for decode CUDA graph support, while removing the legacy C++ FlashInfer op/bindings from the codebase.
Changes:
- Removed C++ FlashInfer op implementation and pybind registrations, and updated Bazel targets accordingly.
- Updated PyFlashinfer attention implementations to support decode CUDA graph replay by avoiding plan() during replay and reusing preallocated page-table buffers.
- Adjusted CUDA graph runner handling of prefix_lengths to treat decode CUDA graph mode as “defined but empty”.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| rtp_llm/ops/rtp_llm_ops.pyi | Removes FlashInfer C++ op stubs from the public typing surface. |
| rtp_llm/models_py/modules/factory/attention/cuda_impl/py_flashinfer_mha.py | Adds/rewires PyFlashinfer decode CUDA graph replay support and shifts params ownership to caller-managed objects. |
| rtp_llm/models_py/modules/factory/attention/cuda_impl/flash_infer.py | Deletes legacy Python wrapper selecting the C++ FlashInfer ops. |
| rtp_llm/models_py/modules/factory/attention/__init__.py | Stops registering removed C++ FlashInfer impls and adjusts PyFlashinfer impl registration. |
| rtp_llm/models_py/bindings/cuda/ops/CudaFlashInfer.h | Removes C++ FlashInfer paged-attn params/impl header. |
| rtp_llm/models_py/bindings/cuda/ops/CudaFlashInfer.cc | Removes C++ FlashInfer paged-attn params/impl source. |
| rtp_llm/models_py/bindings/cuda/ops/BUILD | Updates CUDA bindings build graph to drop removed sources and relocate flashinfer deps. |
| rtp_llm/models_py/bindings/cuda/RegisterBaseBindings.hpp | Removes FlashInfer op binding include. |
| rtp_llm/models_py/bindings/cuda/RegisterAttnOpBindings.hpp | Removes FlashInfer op registration call. |
| rtp_llm/models_py/bindings/cuda/FlashInferOp.h | Deletes C++ FlashInfer op wrapper header. |
| rtp_llm/models_py/bindings/cuda/FlashInferOp.cc | Deletes C++ FlashInfer op wrapper source. |
| rtp_llm/models_py/bindings/core/ExecOps.cc | Removes now-dead include of CudaFlashInfer. |
| rtp_llm/cpp/cuda_graph/cuda_graph_runner.cc | Treats decode CUDA graph prefix lengths as empty and guards size-based logic. |
| rtp_llm/cpp/cuda_graph/cuda_graph_decode.cc | Guards prefix-length max computation when prefix_lengths is empty. |
Comments suppressed due to low confidence (1)
rtp_llm/ops/rtp_llm_ops.pyi:37
rtp_llm_ops.pyi references typing.Any in signatures, but the module typing isn't imported (only Any is imported). This breaks stub parsing/type-checking. Import typing or switch these annotations to use the already-imported Any directly.
class SelectTopkOp:
def __init__(
self, attn_configs: typing.Any
) -> None: ...
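One minimal way to realize the second option from the comment above, keeping the existing Any import (a sketch of the stub entry only, not the full file):

```python
from typing import Any

class SelectTopkOp:
    def __init__(self, attn_configs: Any) -> None: ...
```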
@@ -98,7 +98,7 @@ def __init__(
    def __del__(self):
        release_py_flashinfer_workspace_buffer(self.g_workspace_buffer)

-   def set_params(self, params: Any):
+   def set_params(self, params: rtp_llm_ops.FlashInferMlaAttnParams):
        """Set the params object to be used by this op."""
        self.fmha_params = params
PyFlashinferPrefillPagedAttnOp now initializes fmha_params to None, but prepare() calls self.fmha_params.fill_params(...) without guarding against None. This will raise an AttributeError if set_params() wasn't called (and also violates the prepare() return type). Add an explicit assert self.fmha_params is not None (or initialize it in __init__) similar to the other ops in this file.
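A minimal hedged sketch of the suggested guard at the top of prepare(); the attribute and method names follow this diff, while the error message and the elided body are assumptions:

```python
def prepare(self, attn_inputs: PyAttentionInputs):
    # Fail fast with a clear error instead of a later AttributeError on None.
    if self.fmha_params is None:
        raise RuntimeError(
            "set_params() must be called before prepare() on "
            "PyFlashinferPrefillPagedAttnOp"
        )
    ...  # existing fill_params(...) logic continues unchanged
```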
        self.seq_size_per_block,
        forbid_realloc=True,
    )
CUDA-graph decode replay behavior is now specialized via prepare_for_cuda_graph_replay() / prepare_cuda_graph() to avoid calling plan() during replay. There are CUDA graph tests in rtp_llm/cpp/cuda_graph/tests/, but none appear to explicitly exercise PyFlashinferDecodeImpl (it can be preempted by higher-priority decode impls). Add/adjust a test to force selection of PyFlashinferDecodeImpl in CUDA graph decode mode and validate replay works (no plan during replay, correct outputs).
def prepare_cuda_graph(self, attn_inputs: PyAttentionInputs) -> None:
    """CUDA-graph replay entry point.
    Keep decode replay on the specialized no-plan path by delegating to
    prepare_for_cuda_graph_replay(). This avoids any fallback to the
    generic prepare() flow during replay.
    """
    self.prepare_for_cuda_graph_replay(attn_inputs)
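Building on the comment above, a minimal hedged test sketch. The op attribute and method names come from this PR; the fixtures that construct a PyFlashinferDecodeImpl instance and captured decode inputs are assumptions about the test harness:

```python
from unittest import mock

def test_decode_replay_does_not_plan(decode_op, replay_inputs):
    # decode_op / replay_inputs are assumed pytest fixtures: a configured
    # PyFlashinferDecodeImpl and decode attention inputs captured for replay.
    with mock.patch.object(decode_op.decode_wrapper, "plan") as planned:
        decode_op.prepare_for_cuda_graph_replay(replay_inputs)
    # Replay must stay on the specialized no-plan path.
    planned.assert_not_called()
```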
AI Code Review — PR #778: feat: support py flashinfer decode cuda graph and remove c++ flashinfer op

Overview: Migrates decode-phase FlashInfer attention from the C++ implementation to Python (py_flashinfer) and adds decode CUDA graph support. Removes ~1100 lines of C++ code and adds …

Issues

Conclusion: Blocking/important issues remain, so merging is not recommended yet. The P0 (…) must be fixed first.
CI dispatcher could not find a native … This can happen if the PR was opened before the CI architecture change, or if the original run was deleted. To fix: push any commit (even empty: …).
Force-pushed from b84c82f to ca732cf.
🤖 AI Code Review — PR #778: feat: support py flashinfer decode cuda graph and remove c++ flashinfer op

Core changes

P2: GatherBatchScheduler behavior change

P3: fmha_params initialization

Conclusion: ✅ LGTM ready to ci — no blocking issues. The code reduction is substantial and the CUDA graph support is well designed.
internal source has been updated, please review the changes!
Force-pushed from ca732cf to e4464d0.
Pull request overview
Copilot reviewed 26 out of 26 changed files in this pull request and generated 6 comments.
self.fmha_params: Optional[rtp_llm_ops.FlashInferMlaAttnParams] = None
self.enable_cuda_graph = attn_inputs.is_cuda_graph
PyFlashinferPrefillPagedAttnOp now initializes fmha_params to None, but prepare() still dereferences self.fmha_params without any guard in this class. To avoid runtime AttributeError when set_params() is missed, add an explicit assertion/check before using fmha_params (as done in other ops in this file).
// Defer the gather until running streams drain so the next batch is pure prefill.
const bool python_model_busy = !running_streams_.empty();
This change makes python_model_busy true whenever running_streams_ is non-empty, regardless of model_specific_config_.load_python_model. That alters GatherBatchScheduler behavior for non-Python models (it will never gather new streams while any stream is running), which can reduce batching/throughput. If the intent is to guard only the PyWrappedModel path, restore the load_python_model condition or rename the variable/comment to reflect the new global behavior.
Suggested change:
- // Defer the gather until running streams drain so the next batch is pure prefill.
- const bool python_model_busy = !running_streams_.empty();
+ // Defer gather only for python models until running streams drain so the next batch is pure prefill.
+ const bool python_model_busy =
+     model_specific_config_.load_python_model && !running_streams_.empty();
  cu_seqlens.slice(0, 1, max_bs_ + 1) = input_lengths.cumsum(0);
- if (prefix_lengths.defined()) {
+ if (prefix_lengths.defined() && prefix_lengths.size(0) > 0) {
      cu_kv_seqlens.slice(0, 1, max_bs_ + 1) = input_lengths.add(prefix_lengths).cumsum(0);
When prefix_lengths is an empty tensor (decode CUDA graph mode), cu_kv_seqlens is left as all zeros because it’s only computed in the prefix_lengths.size(0) > 0 branch. This likely produces incorrect cu_kv_seqlens for decode (many kernels/tests assume cu_kv_seqlens at least matches cu_seqlens when no prefix lengths are present). Consider adding an else that sets cu_kv_seqlens from input_lengths.cumsum(0) (or copies from cu_seqlens) when prefix lengths are empty.
Suggested change:
- cu_kv_seqlens.slice(0, 1, max_bs_ + 1) = input_lengths.add(prefix_lengths).cumsum(0);
+ cu_kv_seqlens.slice(0, 1, max_bs_ + 1) = input_lengths.add(prefix_lengths).cumsum(0);
+ } else {
+     cu_kv_seqlens.slice(0, 1, max_bs_ + 1) = input_lengths.cumsum(0);
    inputs.attention_inputs.prefix_lengths_d = inputs.attention_inputs.prefix_lengths.cuda();
} else {
    // Decode CUDA graph mode: prefix_lengths should be empty tensor
    inputs.attention_inputs.prefix_lengths = torch::empty({0}, options_cpu_int32_).pin_memory();
In decode CUDA graph mode this sets prefix_lengths to an empty CPU tensor but does not initialize the corresponding device mirror prefix_lengths_d. Downstream Python code frequently consumes attn_inputs.prefix_lengths_d; leaving it undefined can cause runtime errors. Consider initializing prefix_lengths_d to an empty CUDA int32 tensor in this branch to keep host/device mirrors consistent.
Suggested change:
- inputs.attention_inputs.prefix_lengths = torch::empty({0}, options_cpu_int32_).pin_memory();
+ inputs.attention_inputs.prefix_lengths = torch::empty({0}, options_cpu_int32_).pin_memory();
+ inputs.attention_inputs.prefix_lengths_d = torch::empty({0}, options_cuda_int32_);
    capture_mem_hold_.py_model_inputs_.attention_inputs.prefix_lengths_d.slice(0, 0, batch_size);
} else {
    // For decode CUDA graph mode: prefix_lengths is empty tensor
    inputs.attention_inputs.prefix_lengths = capture_mem_hold_.py_model_inputs_.attention_inputs.prefix_lengths;
In decode CUDA graph mode (prefix_lengths is empty), this branch assigns inputs.attention_inputs.prefix_lengths but leaves inputs.attention_inputs.prefix_lengths_d unset, while other code paths assume the device mirror exists. If prefix_lengths_d is initialized as an empty CUDA tensor during capture, it should also be threaded through here (e.g., assign/slice it alongside prefix_lengths).
Suggested change:
- inputs.attention_inputs.prefix_lengths = capture_mem_hold_.py_model_inputs_.attention_inputs.prefix_lengths;
+ inputs.attention_inputs.prefix_lengths = capture_mem_hold_.py_model_inputs_.attention_inputs.prefix_lengths;
+ inputs.attention_inputs.prefix_lengths_d =
+     capture_mem_hold_.py_model_inputs_.attention_inputs.prefix_lengths_d;
from rtp_llm.models_py.modules.factory.attention.cuda_impl.py_flashinfer_mha import (
    PyFlashinferDecodeImpl,
    PyFlashinferPagedPrefillImpl,
    PyFlashinferPrefillImpl,
)

PREFILL_MHA_IMPS.append(PyFlashinferPrefillImpl)
PREFILL_MHA_IMPS.append(PyFlashinferPagedPrefillImpl)
DECODE_MHA_IMPS.append(PyFlashinferDecodeImpl)
In the CUDA device path, PyFlashinferPrefillImpl/PyFlashinferDecodeImpl are already imported and registered in the if device_type == DeviceType.Cuda: block above. This second unconditional import/append will register duplicates (and also runs on non-CUDA device types), which can affect impl priority/selection. Consider removing this block for CUDA, or gating it to non-CUDA devices only.
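A minimal sketch of the second option mentioned above (gating the fallback registration to non-CUDA devices); the list and enum names follow the snippet in this diff:

```python
# Register the PyFlashinfer impls here only if the CUDA branch above did not
# already do so, to avoid duplicate entries shifting impl priority/selection.
if device_type != DeviceType.Cuda:
    PREFILL_MHA_IMPS.append(PyFlashinferPrefillImpl)
    PREFILL_MHA_IMPS.append(PyFlashinferPagedPrefillImpl)
    DECODE_MHA_IMPS.append(PyFlashinferDecodeImpl)
```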
Force-pushed from e4464d0 to fe2075c.
/retest |
Pull request overview
Copilot reviewed 26 out of 26 changed files in this pull request and generated 5 comments.
assert (
    self.fmha_params is not None
), "fmha_params must be set before calling prepare"
Using assert for required runtime checks is unsafe because asserts can be stripped with Python -O, potentially leading to a later None dereference. Replace this with an explicit exception (e.g., RuntimeError/ValueError) to guarantee the check is enforced in production.
Suggested change:
- assert (
-     self.fmha_params is not None
- ), "fmha_params must be set before calling prepare"
+ if self.fmha_params is None:
+     raise RuntimeError("fmha_params must be set before calling prepare")
assert (
    self.fmha_params is not None
), "fmha_params must be set before calling prepare"
Same issue as above: assert may be optimized away. Use an explicit exception to enforce that set_params() was called before prepare().
Suggested change:
- assert (
-     self.fmha_params is not None
- ), "fmha_params must be set before calling prepare"
+ if self.fmha_params is None:
+     raise RuntimeError("fmha_params must be set before calling prepare")
if self.enable_cuda_graph and self.decode_wrapper._fixed_batch_size is None:
    self.decode_wrapper._use_cuda_graph = True
    self.decode_wrapper._paged_kv_indptr_buf = (
        self.fmha_params.decode_page_indptr_d
    )
    self.decode_wrapper._paged_kv_last_page_len_buf = (
        self.fmha_params.paged_kv_last_page_len_d
    )
    self.decode_wrapper._paged_kv_indices_buf = self.fmha_params.page_indice_d
    self.decode_wrapper._fixed_batch_size = attn_inputs.input_lengths.size(0)
This directly mutates private/internal fields on decode_wrapper (e.g., _paged_kv_indptr_buf, _fixed_batch_size). That is brittle: changes in FlashInfer wrapper internals can silently break CUDA graph support. Prefer adding/using a dedicated public method on the wrapper (e.g., enable_cuda_graph_buffers(indptr, indices, last_page_len, fixed_bs)) or a small adapter layer in your codebase that centralizes this wiring in one place.
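A hedged sketch of the adapter idea from the comment above. The private buffer names mirror the fields touched in this diff; enable_cuda_graph_buffers is a hypothetical helper in this codebase, not a FlashInfer API:

```python
def enable_cuda_graph_buffers(wrapper, indptr, indices, last_page_len, fixed_bs):
    # Centralize the wiring of FlashInfer's internal CUDA-graph buffers so a
    # wrapper-internals change only breaks (and gets fixed) in one place.
    wrapper._use_cuda_graph = True
    wrapper._paged_kv_indptr_buf = indptr
    wrapper._paged_kv_indices_buf = indices
    wrapper._paged_kv_last_page_len_buf = last_page_len
    wrapper._fixed_batch_size = fixed_bs
```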
| """ | ||
| Prepare the decode wrapper with paged KV cache parameters. | ||
|
|
||
| forbid_realloc: True only when called from prepare_cuda_graph (replay); forbids buffer realloc. |
The docstring says forbid_realloc=True is for CUDA graph replay, but replay is now handled by prepare_for_cuda_graph_replay() (which also enforces no re-plan). Either update this docstring to reflect current call paths, or remove/privatize forbid_realloc if it’s no longer part of the intended public behavior.
Suggested change:
- forbid_realloc: True only when called from prepare_cuda_graph (replay); forbids buffer realloc.
+ forbid_realloc: When True, prevents buffer reallocation while filling
+     params. CUDA graph replay is handled via
+     ``prepare_for_cuda_graph_replay()``.
// Defer the gather until running streams drain so the next batch is pure prefill.
const bool python_model_busy = !running_streams_.empty();
if (waiting_streams_.size() >= static_cast<size_t>(gather_batch_size_) && !python_model_busy) {
This changes behavior for all models (not just PyWrappedModel): gathering is now deferred whenever running_streams_ is non-empty, which can reduce throughput and increase tail latency by preventing batching while decode is in progress. If the constraint is specific to Python-wrapped execution (mixed prefill+decode shape mismatch), consider gating this behind a config flag (e.g., load_python_model or a dedicated disallow_mixed_prefill_decode_batches) so non-Python/C++ paths can retain previous scheduling behavior.
Pull request overview
Copilot reviewed 26 out of 26 changed files in this pull request and generated 1 comment.
  self.max_seq_len = attn_configs.max_seq_len
- self.fmha_params = rtp_llm_ops.FlashInferMlaAttnParams()
+ self.fmha_params: Optional[rtp_llm_ops.FlashInferMlaAttnParams] = None
  self.enable_cuda_graph = attn_inputs.is_cuda_graph
PyFlashinferPrefillPagedAttnOp.fmha_params is changed to be optional/initialized to None, but prepare() later dereferences it (calls self.fmha_params.fill_params(...)) without a local guard. This makes the class fragile: any caller that invokes prepare() before set_params() will hit an AttributeError. Consider either initializing fmha_params in __init__ (as before) or adding an explicit precondition check in prepare() similar to the other ops in this module.
Force-pushed from 9316f6f to 48a958d.
Pull request overview
Copilot reviewed 34 out of 34 changed files in this pull request and generated 1 comment.
// prefix_lengths [batch_size, int32] (for attention `prepare`)
if (num_tokens_per_bs_ > 1 && !is_prefill_cuda_graph_mode_) {
    inputs.attention_inputs.prefix_lengths =
        torch::full({int(max_bs_)}, max_seq_len_ - num_tokens_per_bs_, options_cpu_int32_).pin_memory();
    inputs.attention_inputs.prefix_lengths_d = inputs.attention_inputs.prefix_lengths.cuda();
} else if (is_prefill_cuda_graph_mode_) {
    inputs.attention_inputs.prefix_lengths = torch::zeros({int(max_bs_)}, options_cpu_int32_).pin_memory();
    inputs.attention_inputs.prefix_lengths_d = inputs.attention_inputs.prefix_lengths.cuda();
} else {
    // Decode CUDA graph mode: prefix_lengths should be empty tensor
    inputs.attention_inputs.prefix_lengths = torch::empty({0}, options_cpu_int32_).pin_memory();
}
In decode CUDA-graph mode you now set prefix_lengths to an empty host tensor, but you don't initialize the corresponding device mirror attention_inputs.prefix_lengths_d. PyAttentionInputs has a concrete prefix_lengths_d field and some Python model code accesses attention_inputs.prefix_lengths_d; leaving it undefined can cause runtime failures. Consider setting prefix_lengths_d to an empty CUDA tensor here (to match non-graph decode behavior), and update the CUDA-graph replay copy logic to skip/zero-copy prefix_lengths_d when it is empty so it doesn't memcpy batch_size*sizeof(int) into a 0-sized buffer.
Pull request overview
Copilot reviewed 34 out of 34 changed files in this pull request and generated no new comments.
Pull request overview
Copilot reviewed 35 out of 35 changed files in this pull request and generated no new comments.
… graph

- Remove C++ FlashInfer ops (CudaFlashInfer, FlashInferOp) and old Python wrapper (flash_infer.py)
- Trim 3rdparty/flashinfer BUILD to sampling/norm/activation kernels only
- Add decode CUDA graph support to PyFlashinferDecodeAttnOp with buffer management
- Add comprehensive decode UT covering correctness and CUDA graph buffer validation
- Remove C++ FlashInfer decode graph path from cuda_graph_runner
- Fix fmha_params auto-creation in constructors (restore original author design)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PyWrappedModel is now the default model path, so the prefill+decode mixing guard should always apply. Make python_model_busy depend only on running_streams_ being non-empty, and update GatherBatchSchedulerTest to drop the now-dead load_python_model parameter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ugins, fa_deps, deep_ep_deps)

- Delete 3rdparty/aiter/ directory and all related BUILD/config references
- Remove flash_attention, aiter_src from open_source/deps/git.bzl
- Remove fa_deps(), deep_ep_deps(), kernel_so_deps(), arpc_deps(), trt_plugins() from arch_select.bzl
- Delete cufmha/ directory and nv_trt_plugins/BUILD stub
- Remove aiter link dep from rocm BUILD and decode_aiter_attn from OpData.h
- Clean up aiter genrules and copy targets from rtp_llm/libs/BUILD

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FlashInfer's BatchDecodeWithPagedKVCacheWrapper requires _qo_indptr_buf when both use_tensor_cores and _use_cuda_graph are True. We were setting _use_cuda_graph manually after construction without creating this buffer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
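A hedged sketch of what the fix described in this commit message could look like. The private buffer names follow the FlashInfer wrapper internals referenced elsewhere in this PR; the arange-based sizing assumes decode issues exactly one query token per sequence:

```python
import torch

def init_decode_graph_qo_indptr(wrapper, fixed_batch_size: int) -> None:
    # Tensor-core decode with CUDA graphs needs a persistent qo_indptr buffer;
    # with one query token per sequence it is simply [0, 1, ..., batch_size].
    wrapper._qo_indptr_buf = torch.arange(
        fixed_batch_size + 1, dtype=torch.int32, device="cuda"
    )
    wrapper._use_cuda_graph = True
```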
Force-pushed from 064182b to e04b793.
Pull request overview
Copilot reviewed 35 out of 35 changed files in this pull request and generated no new comments.
🤖 AI Code Review — PR #778

PR overview — Title: …

Core goals

Review comments — Issues

Highlights

Overall assessment: Code quality is good and all previously blocking issues have been fixed. P0=0, P1=0, P2=1, P3=2. LGTM ready to ci — this review found no blocking or important issues; the PR can proceed to CI validation and merge. lgtm ready to ci
internal source has been updated, please review the changes!
No description provided.