Conversation
Force-pushed from a88dd6f to f0f385a.
🤖 AI Code Review — PR #778: feat: support py flashinfer decode cuda graph and remove c++ flashinfer op

Overview: This PR enables CUDA Graph support for the Python FlashInfer decode path and removes the now-redundant C++ FlashInfer decode/prefill ops and their pybind registrations. It also fixes a crash in the C++ CUDA Graph runner where …

Strengths

Suggested improvements

P1 - Important

P2 - Suggested

P3 - Nit

Summary: Well-structured PR that follows established patterns. The C++ op removal is clean with no dangling references. Main ask: add test coverage for the new decode CUDA graph path before merging.
🤖 AI Code Review — PR #778: feat: support py flashinfer decode cuda graph and remove c++ flashinfer op

Overview: The PR adds CUDA Graph support to the Python FlashInfer decode path and removes the C++ FlashInfer ops that have been fully superseded by the Python implementation. It also fixes … in the C++ CUDA Graph runner.

Strengths

Suggested improvements

P1 - Important

P2 - Suggested

Summary: Clear structure, consistent with existing patterns. The main risk is that the new decode CUDA graph path lacks tests. Recommend adding tests before merging.
🤖 AI Code Review — PR #778

Summary: Major refactor: removes the C++ FlashInfer decode/prefill ops (-468 lines), promotes Python FlashInfer decode to support CUDA graph capture/replay, and fixes the CUDA graph decode crash when …

Findings

Excellent cleanup overall. The CUDA graph support with …
Force-pushed from f0f385a to b84c82f.
Pull request overview
This PR migrates FlashInfer attention to the Python (flashinfer-python) path for decode CUDA graph support, while removing the legacy C++ FlashInfer op/bindings from the codebase.
Changes:
- Removed C++ FlashInfer op implementation and pybind registrations, and updated Bazel targets accordingly.
- Updated PyFlashinfer attention implementations to support decode CUDA graph replay by avoiding plan() during replay and reusing preallocated page-table buffers.
- Adjusted CUDA graph runner handling of prefix_lengths to treat decode CUDA graph mode as “defined but empty”.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| rtp_llm/ops/rtp_llm_ops.pyi | Removes FlashInfer C++ op stubs from the public typing surface. |
| rtp_llm/models_py/modules/factory/attention/cuda_impl/py_flashinfer_mha.py | Adds/rewires PyFlashinfer decode CUDA graph replay support and shifts params ownership to caller-managed objects. |
| rtp_llm/models_py/modules/factory/attention/cuda_impl/flash_infer.py | Deletes legacy Python wrapper selecting the C++ FlashInfer ops. |
| rtp_llm/models_py/modules/factory/attention/__init__.py | Stops registering removed C++ FlashInfer impls and adjusts PyFlashinfer impl registration. |
| rtp_llm/models_py/bindings/cuda/ops/CudaFlashInfer.h | Removes C++ FlashInfer paged-attn params/impl header. |
| rtp_llm/models_py/bindings/cuda/ops/CudaFlashInfer.cc | Removes C++ FlashInfer paged-attn params/impl source. |
| rtp_llm/models_py/bindings/cuda/ops/BUILD | Updates CUDA bindings build graph to drop removed sources and relocate flashinfer deps. |
| rtp_llm/models_py/bindings/cuda/RegisterBaseBindings.hpp | Removes FlashInfer op binding include. |
| rtp_llm/models_py/bindings/cuda/RegisterAttnOpBindings.hpp | Removes FlashInfer op registration call. |
| rtp_llm/models_py/bindings/cuda/FlashInferOp.h | Deletes C++ FlashInfer op wrapper header. |
| rtp_llm/models_py/bindings/cuda/FlashInferOp.cc | Deletes C++ FlashInfer op wrapper source. |
| rtp_llm/models_py/bindings/core/ExecOps.cc | Removes now-dead include of CudaFlashInfer. |
| rtp_llm/cpp/cuda_graph/cuda_graph_runner.cc | Treats decode CUDA graph prefix lengths as empty and guards size-based logic. |
| rtp_llm/cpp/cuda_graph/cuda_graph_decode.cc | Guards prefix-length max computation when prefix_lengths is empty. |
Comments suppressed due to low confidence (1)
rtp_llm/ops/rtp_llm_ops.pyi:37
rtp_llm_ops.pyi references typing.Any in signatures, but the module typing isn't imported (only Any is imported). This breaks stub parsing/type-checking. Import typing or switch these annotations to use the already-imported Any directly.
class SelectTopkOp:
def __init__(
self, attn_configs: typing.Any
) -> None: ...
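One minimal way to realize the second option from the comment above, keeping the existing Any import (a sketch of the stub entry only, not the full file):

```python
from typing import Any

class SelectTopkOp:
    def __init__(self, attn_configs: Any) -> None: ...
```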
@@ -98,7 +98,7 @@ def __init__(
    def __del__(self):
        release_py_flashinfer_workspace_buffer(self.g_workspace_buffer)

-   def set_params(self, params: Any):
+   def set_params(self, params: rtp_llm_ops.FlashInferMlaAttnParams):
        """Set the params object to be used by this op."""
        self.fmha_params = params
PyFlashinferPrefillPagedAttnOp now initializes fmha_params to None, but prepare() calls self.fmha_params.fill_params(...) without guarding against None. This will raise an AttributeError if set_params() wasn't called (and also violates the prepare() return type). Add an explicit assert self.fmha_params is not None (or initialize it in __init__) similar to the other ops in this file.
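A minimal hedged sketch of the suggested guard at the top of prepare(); the attribute and method names follow this diff, while the error message and the elided body are assumptions:

```python
def prepare(self, attn_inputs: PyAttentionInputs):
    # Fail fast with a clear error instead of a later AttributeError on None.
    if self.fmha_params is None:
        raise RuntimeError(
            "set_params() must be called before prepare() on "
            "PyFlashinferPrefillPagedAttnOp"
        )
    ...  # existing fill_params(...) logic continues unchanged
```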
        self.seq_size_per_block,
        forbid_realloc=True,
    )
CUDA-graph decode replay behavior is now specialized via prepare_for_cuda_graph_replay() / prepare_cuda_graph() to avoid calling plan() during replay. There are CUDA graph tests in rtp_llm/cpp/cuda_graph/tests/, but none appear to explicitly exercise PyFlashinferDecodeImpl (it can be preempted by higher-priority decode impls). Add/adjust a test to force selection of PyFlashinferDecodeImpl in CUDA graph decode mode and validate replay works (no plan during replay, correct outputs).
def prepare_cuda_graph(self, attn_inputs: PyAttentionInputs) -> None:
    """CUDA-graph replay entry point.
    Keep decode replay on the specialized no-plan path by delegating to
    prepare_for_cuda_graph_replay(). This avoids any fallback to the
    generic prepare() flow during replay.
    """
    self.prepare_for_cuda_graph_replay(attn_inputs)
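Building on the comment above, a minimal hedged test sketch. The op attribute and method names come from this PR; the fixtures that construct a PyFlashinferDecodeImpl instance and captured decode inputs are assumptions about the test harness:

```python
from unittest import mock

def test_decode_replay_does_not_plan(decode_op, replay_inputs):
    # decode_op / replay_inputs are assumed pytest fixtures: a configured
    # PyFlashinferDecodeImpl and decode attention inputs captured for replay.
    with mock.patch.object(decode_op.decode_wrapper, "plan") as planned:
        decode_op.prepare_for_cuda_graph_replay(replay_inputs)
    # Replay must stay on the specialized no-plan path.
    planned.assert_not_called()
```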
AI Code Review — PR #778: feat: support py flashinfer decode cuda graph and remove c++ flashinfer op

Overview: Migrates decode-phase FlashInfer attention from the C++ implementation to Python (py_flashinfer) and adds decode CUDA graph support. Removes ~1100 lines of C++ code and adds …

Issues

Conclusion: Blocking/important issues remain, so merging is not recommended yet. The P0 (…) must be fixed first.
CI dispatcher could not find a native … This can happen if the PR was opened before the CI architecture change, or if the original run was deleted. To fix: push any commit (even empty: …).
Force-pushed from b84c82f to ca732cf.
🤖 AI Code Review — PR #778: feat: support py flashinfer decode cuda graph and remove c++ flashinfer op

Core changes

P2: GatherBatchScheduler behavior change

P3: fmha_params initialization

Conclusion: ✅ LGTM ready to ci — no blocking issues. The code reduction is substantial and the CUDA graph support is well designed.
internal source has been updated, please review the changes!
Force-pushed from ca732cf to e4464d0.
Pull request overview
Copilot reviewed 26 out of 26 changed files in this pull request and generated 6 comments.
self.fmha_params: Optional[rtp_llm_ops.FlashInferMlaAttnParams] = None
self.enable_cuda_graph = attn_inputs.is_cuda_graph
PyFlashinferPrefillPagedAttnOp now initializes fmha_params to None, but prepare() still dereferences self.fmha_params without any guard in this class. To avoid runtime AttributeError when set_params() is missed, add an explicit assertion/check before using fmha_params (as done in other ops in this file).
// Defer the gather until running streams drain so the next batch is pure prefill.
const bool python_model_busy = !running_streams_.empty();
This change makes python_model_busy true whenever running_streams_ is non-empty, regardless of model_specific_config_.load_python_model. That alters GatherBatchScheduler behavior for non-Python models (it will never gather new streams while any stream is running), which can reduce batching/throughput. If the intent is to guard only the PyWrappedModel path, restore the load_python_model condition or rename the variable/comment to reflect the new global behavior.
Suggested change:
- // Defer the gather until running streams drain so the next batch is pure prefill.
- const bool python_model_busy = !running_streams_.empty();
+ // Defer gather only for python models until running streams drain so the next batch is pure prefill.
+ const bool python_model_busy =
+     model_specific_config_.load_python_model && !running_streams_.empty();
  cu_seqlens.slice(0, 1, max_bs_ + 1) = input_lengths.cumsum(0);
- if (prefix_lengths.defined()) {
+ if (prefix_lengths.defined() && prefix_lengths.size(0) > 0) {
      cu_kv_seqlens.slice(0, 1, max_bs_ + 1) = input_lengths.add(prefix_lengths).cumsum(0);
When prefix_lengths is an empty tensor (decode CUDA graph mode), cu_kv_seqlens is left as all zeros because it’s only computed in the prefix_lengths.size(0) > 0 branch. This likely produces incorrect cu_kv_seqlens for decode (many kernels/tests assume cu_kv_seqlens at least matches cu_seqlens when no prefix lengths are present). Consider adding an else that sets cu_kv_seqlens from input_lengths.cumsum(0) (or copies from cu_seqlens) when prefix lengths are empty.
Suggested change:
- cu_kv_seqlens.slice(0, 1, max_bs_ + 1) = input_lengths.add(prefix_lengths).cumsum(0);
+ cu_kv_seqlens.slice(0, 1, max_bs_ + 1) = input_lengths.add(prefix_lengths).cumsum(0);
+ } else {
+     cu_kv_seqlens.slice(0, 1, max_bs_ + 1) = input_lengths.cumsum(0);
    inputs.attention_inputs.prefix_lengths_d = inputs.attention_inputs.prefix_lengths.cuda();
} else {
    // Decode CUDA graph mode: prefix_lengths should be empty tensor
    inputs.attention_inputs.prefix_lengths = torch::empty({0}, options_cpu_int32_).pin_memory();
In decode CUDA graph mode this sets prefix_lengths to an empty CPU tensor but does not initialize the corresponding device mirror prefix_lengths_d. Downstream Python code frequently consumes attn_inputs.prefix_lengths_d; leaving it undefined can cause runtime errors. Consider initializing prefix_lengths_d to an empty CUDA int32 tensor in this branch to keep host/device mirrors consistent.
Suggested change:
- inputs.attention_inputs.prefix_lengths = torch::empty({0}, options_cpu_int32_).pin_memory();
+ inputs.attention_inputs.prefix_lengths = torch::empty({0}, options_cpu_int32_).pin_memory();
+ inputs.attention_inputs.prefix_lengths_d = torch::empty({0}, options_cuda_int32_);
    capture_mem_hold_.py_model_inputs_.attention_inputs.prefix_lengths_d.slice(0, 0, batch_size);
} else {
    // For decode CUDA graph mode: prefix_lengths is empty tensor
    inputs.attention_inputs.prefix_lengths = capture_mem_hold_.py_model_inputs_.attention_inputs.prefix_lengths;
In decode CUDA graph mode (prefix_lengths is empty), this branch assigns inputs.attention_inputs.prefix_lengths but leaves inputs.attention_inputs.prefix_lengths_d unset, while other code paths assume the device mirror exists. If prefix_lengths_d is initialized as an empty CUDA tensor during capture, it should also be threaded through here (e.g., assign/slice it alongside prefix_lengths).
Suggested change:
- inputs.attention_inputs.prefix_lengths = capture_mem_hold_.py_model_inputs_.attention_inputs.prefix_lengths;
+ inputs.attention_inputs.prefix_lengths = capture_mem_hold_.py_model_inputs_.attention_inputs.prefix_lengths;
+ inputs.attention_inputs.prefix_lengths_d =
+     capture_mem_hold_.py_model_inputs_.attention_inputs.prefix_lengths_d;
from rtp_llm.models_py.modules.factory.attention.cuda_impl.py_flashinfer_mha import (
    PyFlashinferDecodeImpl,
    PyFlashinferPagedPrefillImpl,
    PyFlashinferPrefillImpl,
)

PREFILL_MHA_IMPS.append(PyFlashinferPrefillImpl)
PREFILL_MHA_IMPS.append(PyFlashinferPagedPrefillImpl)
DECODE_MHA_IMPS.append(PyFlashinferDecodeImpl)
In the CUDA device path, PyFlashinferPrefillImpl/PyFlashinferDecodeImpl are already imported and registered in the if device_type == DeviceType.Cuda: block above. This second unconditional import/append will register duplicates (and also runs on non-CUDA device types), which can affect impl priority/selection. Consider removing this block for CUDA, or gating it to non-CUDA devices only.
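A minimal sketch of the second option mentioned above (gating the fallback registration to non-CUDA devices); the list and enum names follow the snippet in this diff:

```python
# Register the PyFlashinfer impls here only if the CUDA branch above did not
# already do so, to avoid duplicate entries shifting impl priority/selection.
if device_type != DeviceType.Cuda:
    PREFILL_MHA_IMPS.append(PyFlashinferPrefillImpl)
    PREFILL_MHA_IMPS.append(PyFlashinferPagedPrefillImpl)
    DECODE_MHA_IMPS.append(PyFlashinferDecodeImpl)
```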
Force-pushed from e4464d0 to fe2075c.
/retest |
Pull request overview
Copilot reviewed 26 out of 26 changed files in this pull request and generated 5 comments.
assert (
    self.fmha_params is not None
), "fmha_params must be set before calling prepare"
Using assert for required runtime checks is unsafe because asserts can be stripped with Python -O, potentially leading to a later None dereference. Replace this with an explicit exception (e.g., RuntimeError/ValueError) to guarantee the check is enforced in production.
Suggested change:
- assert (
-     self.fmha_params is not None
- ), "fmha_params must be set before calling prepare"
+ if self.fmha_params is None:
+     raise RuntimeError("fmha_params must be set before calling prepare")
assert (
    self.fmha_params is not None
), "fmha_params must be set before calling prepare"
Same issue as above: assert may be optimized away. Use an explicit exception to enforce that set_params() was called before prepare().
Suggested change:
- assert (
-     self.fmha_params is not None
- ), "fmha_params must be set before calling prepare"
+ if self.fmha_params is None:
+     raise RuntimeError("fmha_params must be set before calling prepare")
if self.enable_cuda_graph and self.decode_wrapper._fixed_batch_size is None:
    self.decode_wrapper._use_cuda_graph = True
    self.decode_wrapper._paged_kv_indptr_buf = (
        self.fmha_params.decode_page_indptr_d
    )
    self.decode_wrapper._paged_kv_last_page_len_buf = (
        self.fmha_params.paged_kv_last_page_len_d
    )
    self.decode_wrapper._paged_kv_indices_buf = self.fmha_params.page_indice_d
    self.decode_wrapper._fixed_batch_size = attn_inputs.input_lengths.size(0)
This directly mutates private/internal fields on decode_wrapper (e.g., _paged_kv_indptr_buf, _fixed_batch_size). That is brittle: changes in FlashInfer wrapper internals can silently break CUDA graph support. Prefer adding/using a dedicated public method on the wrapper (e.g., enable_cuda_graph_buffers(indptr, indices, last_page_len, fixed_bs)) or a small adapter layer in your codebase that centralizes this wiring in one place.
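A hedged sketch of the adapter idea from the comment above. The private buffer names mirror the fields touched in this diff; enable_cuda_graph_buffers is a hypothetical helper in this codebase, not a FlashInfer API:

```python
def enable_cuda_graph_buffers(wrapper, indptr, indices, last_page_len, fixed_bs):
    # Centralize the wiring of FlashInfer's internal CUDA-graph buffers so a
    # wrapper-internals change only breaks (and gets fixed) in one place.
    wrapper._use_cuda_graph = True
    wrapper._paged_kv_indptr_buf = indptr
    wrapper._paged_kv_indices_buf = indices
    wrapper._paged_kv_last_page_len_buf = last_page_len
    wrapper._fixed_batch_size = fixed_bs
```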
| """ | ||
| Prepare the decode wrapper with paged KV cache parameters. | ||
|
|
||
| forbid_realloc: True only when called from prepare_cuda_graph (replay); forbids buffer realloc. |
The docstring says forbid_realloc=True is for CUDA graph replay, but replay is now handled by prepare_for_cuda_graph_replay() (which also enforces no re-plan). Either update this docstring to reflect current call paths, or remove/privatize forbid_realloc if it’s no longer part of the intended public behavior.
Suggested change:
- forbid_realloc: True only when called from prepare_cuda_graph (replay); forbids buffer realloc.
+ forbid_realloc: When True, prevents buffer reallocation while filling
+     params. CUDA graph replay is handled via
+     ``prepare_for_cuda_graph_replay()``.
// Defer the gather until running streams drain so the next batch is pure prefill.
const bool python_model_busy = !running_streams_.empty();
if (waiting_streams_.size() >= static_cast<size_t>(gather_batch_size_) && !python_model_busy) {
This changes behavior for all models (not just PyWrappedModel): gathering is now deferred whenever running_streams_ is non-empty, which can reduce throughput and increase tail latency by preventing batching while decode is in progress. If the constraint is specific to Python-wrapped execution (mixed prefill+decode shape mismatch), consider gating this behind a config flag (e.g., load_python_model or a dedicated disallow_mixed_prefill_decode_batches) so non-Python/C++ paths can retain previous scheduling behavior.
Pull request overview
Copilot reviewed 26 out of 26 changed files in this pull request and generated 1 comment.
  self.max_seq_len = attn_configs.max_seq_len
- self.fmha_params = rtp_llm_ops.FlashInferMlaAttnParams()
+ self.fmha_params: Optional[rtp_llm_ops.FlashInferMlaAttnParams] = None
  self.enable_cuda_graph = attn_inputs.is_cuda_graph
PyFlashinferPrefillPagedAttnOp.fmha_params is changed to be optional/initialized to None, but prepare() later dereferences it (calls self.fmha_params.fill_params(...)) without a local guard. This makes the class fragile: any caller that invokes prepare() before set_params() will hit an AttributeError. Consider either initializing fmha_params in __init__ (as before) or adding an explicit precondition check in prepare() similar to the other ops in this module.
Force-pushed from 9316f6f to 48a958d.
Pull request overview
Copilot reviewed 34 out of 34 changed files in this pull request and generated 1 comment.
// prefix_lengths [batch_size, int32] (for attention `prepare`)
if (num_tokens_per_bs_ > 1 && !is_prefill_cuda_graph_mode_) {
    inputs.attention_inputs.prefix_lengths =
        torch::full({int(max_bs_)}, max_seq_len_ - num_tokens_per_bs_, options_cpu_int32_).pin_memory();
    inputs.attention_inputs.prefix_lengths_d = inputs.attention_inputs.prefix_lengths.cuda();
} else if (is_prefill_cuda_graph_mode_) {
    inputs.attention_inputs.prefix_lengths = torch::zeros({int(max_bs_)}, options_cpu_int32_).pin_memory();
    inputs.attention_inputs.prefix_lengths_d = inputs.attention_inputs.prefix_lengths.cuda();
} else {
    // Decode CUDA graph mode: prefix_lengths should be empty tensor
    inputs.attention_inputs.prefix_lengths = torch::empty({0}, options_cpu_int32_).pin_memory();
}
In decode CUDA-graph mode you now set prefix_lengths to an empty host tensor, but you don't initialize the corresponding device mirror attention_inputs.prefix_lengths_d. PyAttentionInputs has a concrete prefix_lengths_d field and some Python model code accesses attention_inputs.prefix_lengths_d; leaving it undefined can cause runtime failures. Consider setting prefix_lengths_d to an empty CUDA tensor here (to match non-graph decode behavior), and update the CUDA-graph replay copy logic to skip/zero-copy prefix_lengths_d when it is empty so it doesn't memcpy batch_size*sizeof(int) into a 0-sized buffer.
Pull request overview
Copilot reviewed 34 out of 34 changed files in this pull request and generated no new comments.
Pull request overview
Copilot reviewed 35 out of 35 changed files in this pull request and generated no new comments.
… graph

- Remove C++ FlashInfer ops (CudaFlashInfer, FlashInferOp) and old Python wrapper (flash_infer.py)
- Trim 3rdparty/flashinfer BUILD to sampling/norm/activation kernels only
- Add decode CUDA graph support to PyFlashinferDecodeAttnOp with buffer management
- Add comprehensive decode UT covering correctness and CUDA graph buffer validation
- Remove C++ FlashInfer decode graph path from cuda_graph_runner
- Fix fmha_params auto-creation in constructors (restore original author design)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PyWrappedModel is now the default model path, so the prefill+decode mixing guard should always apply. Make python_model_busy depend only on running_streams_ being non-empty, and update GatherBatchSchedulerTest to drop the now-dead load_python_model parameter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ugins, fa_deps, deep_ep_deps)

- Delete 3rdparty/aiter/ directory and all related BUILD/config references
- Remove flash_attention, aiter_src from open_source/deps/git.bzl
- Remove fa_deps(), deep_ep_deps(), kernel_so_deps(), arpc_deps(), trt_plugins() from arch_select.bzl
- Delete cufmha/ directory and nv_trt_plugins/BUILD stub
- Remove aiter link dep from rocm BUILD and decode_aiter_attn from OpData.h
- Clean up aiter genrules and copy targets from rtp_llm/libs/BUILD

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FlashInfer's BatchDecodeWithPagedKVCacheWrapper requires _qo_indptr_buf when both use_tensor_cores and _use_cuda_graph are True. We were setting _use_cuda_graph manually after construction without creating this buffer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
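A hedged sketch of what the fix described in this commit message could look like. The private buffer names follow the FlashInfer wrapper internals referenced elsewhere in this PR; the arange-based sizing assumes decode issues exactly one query token per sequence:

```python
import torch

def init_decode_graph_qo_indptr(wrapper, fixed_batch_size: int) -> None:
    # Tensor-core decode with CUDA graphs needs a persistent qo_indptr buffer;
    # with one query token per sequence it is simply [0, 1, ..., batch_size].
    wrapper._qo_indptr_buf = torch.arange(
        fixed_batch_size + 1, dtype=torch.int32, device="cuda"
    )
    wrapper._use_cuda_graph = True
```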
Force-pushed from 064182b to e04b793.
Pull request overview
Copilot reviewed 35 out of 35 changed files in this pull request and generated no new comments.
🤖 AI Code Review — PR #778

PR overview — Title: …

Core goals

Review comments — Issues

Highlights

Overall assessment: Code quality is good and all previously blocking issues have been fixed. P0=0, P1=0, P2=1, P3=2. LGTM ready to ci — this review found no blocking or important issues; the PR can proceed to CI validation and merge. lgtm ready to ci
internal source has been updated, please review the changes!
No description provided.