
feat: support py flashinfer decode cuda graph and remove c++ flashinfer op#778

Merged
LLLLKKKK merged 4 commits into main from feat/support_py_flashinfer_decode_graph on Apr 26, 2026

Conversation

@JackTan25
Collaborator

@JackTan25 JackTan25 commented Mar 13, 2026

No description provided.

@JackTan25 JackTan25 requested a review from LLLLKKKK as a code owner March 13, 2026 08:47
@JackTan25 JackTan25 changed the title feat: support py flashinfer decode cuda graph and remove c++ flashinf… feat: support py flashinfer decode cuda graph and remove c++ flashinfer op Mar 13, 2026
@JackTan25 JackTan25 force-pushed the feat/support_py_flashinfer_decode_graph branch 2 times, most recently from a88dd6f to f0f385a Compare March 23, 2026 14:23
@LLLLKKKK
Collaborator

🤖 AI Code Review — PR #778: feat: support py flashinfer decode cuda graph and remove c++ flashinfer op

Overview

This PR enables CUDA Graph support for the Python FlashInfer decode path and removes the now-redundant C++ FlashInfer decode/prefill ops and their pybind registrations. It also fixes a crash in the C++ CUDA Graph runner where prefix_lengths.max() was called on a defined-but-empty tensor in decode mode.

Strengths

  • Follows the established CUDA graph pattern from the prefill paged path (PyFlashinferPagedPrefillImpl) consistently
  • Clean removal of C++ ops — internal C++ FlashInfer infrastructure (CudaFlashInfer.h/cc) is correctly preserved for the C++ attention path
  • Good defensive fix for prefix_lengths empty tensor in CudaGraphDecode.cc and CudaGraphRunner.cc
  • Type annotations improved (set_params uses concrete type instead of Any/ParamsBase)
  • Net code reduction (~150 lines removed)

Suggested improvements

P1 - Important

  1. No test coverage for the new decode CUDA graph path
    This PR adds support_cuda_graph() and prepare_cuda_graph() to PyFlashinferDecodeImpl, enabling CUDA graph capture/replay for FlashInfer decode. There are no new tests exercising this path. Given the known fragility of CUDA graph paths, at minimum a smoke test that exercises capture → replay with varying batch sizes would be valuable (a minimal sketch follows this list).

  2. __init__.py registration order change — confirm no regression
    The C++ FlashInferDecodeImpl/FlashInferPrefillImpl were registered at a specific position in the priority list (after TRT/TRTLLM but before the Py FlashInfer impls). Removing them changes the effective priority. The Py impls were already lower priority than TRT/TRTLLM/XQA, so this is likely fine, but please confirm no model configuration relied on the C++ FlashInfer impls being selected over the Py ones.
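A minimal smoke-test sketch for point 1 above. The impl method names (prepare_cuda_graph, forward) and the make_decode_inputs fixture are assumptions based on the names discussed in this review; only the torch.cuda graph capture/replay calls are standard PyTorch API.

import torch

def decode_cuda_graph_smoke_test(impl, make_decode_inputs, batch_sizes=(1, 4, 8)):
    """Capture the decode attention once per batch size, replay it, and compare
    against an eager run on identical inputs (sketch only, not the repo's test API)."""
    for bs in batch_sizes:
        attn_inputs = make_decode_inputs(bs)      # hypothetical fixture building decode-mode inputs
        ref = impl.forward(attn_inputs).clone()   # eager reference (forward signature assumed)

        impl.prepare_cuda_graph(attn_inputs)      # capture-time setup: bind buffers, plan() once
        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph):
            out = impl.forward(attn_inputs)       # captured decode step

        impl.prepare_cuda_graph(attn_inputs)      # replay-time refresh: fill params, no re-plan
        graph.replay()
        torch.testing.assert_close(out, ref, rtol=1e-2, atol=1e-2)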

P2 - Suggestions

  1. fill_mla_params import is now unused in py_flashinfer_mha.py
    After switching PyFlashinferDecodeAttnOp.prepare() to use self.fmha_params.fill_params(), the fill_mla_params import (line 34) is no longer used anywhere in this file. Should be removed.

  2. prefix_lengths empty tensor propagation to Python fill_params
    The C++ side now initializes prefix_lengths = torch::empty({0}) for decode CUDA graph mode. Verify that FlashInferMlaAttnParams::fill_params handles a size-0 prefix_lengths tensor correctly (decode mode likely ignores it, but worth a quick check).

P3 - Nit

  1. PyFlashinferDecodeAttnOp.__init__ signature change
    Constructor now requires attn_inputs as a second parameter. Only known caller (PyFlashinferDecodeImpl) is updated. Low risk since it's internal, but note for any downstream consumers.

Summary

Well-structured PR that follows established patterns. The C++ op removal is clean with no dangling references. Main ask: add test coverage for the new decode CUDA graph path before merging.

@LLLLKKKK
Collaborator

🤖 AI Code Review — PR #778 feat: support py flashinfer decode cuda graph and remove c++ flashinfer op

Overview

This PR adds CUDA Graph support to the Python FlashInfer decode path and removes the C++ FlashInfer ops that the Python implementation has fully superseded. It also fixes a crash in the C++ CUDA Graph runner where .max() was called on an empty prefix_lengths tensor.

Strengths

  • Follows the existing prefill paged CUDA graph pattern consistently
  • Clean removal of the C++ ops; the internal C++ FlashInfer infrastructure is preserved (only the Python-facing bindings are removed)
  • Correct defensive handling of the empty prefix_lengths tensor

Suggested improvements

P1 - Important

  1. Missing end-to-end tests for decode CUDA graph — the new support_cuda_graph() / prepare_cuda_graph() paths have no test coverage. CUDA graph paths are known to be fragile (stale capture/replay parameters, etc.); add at least a smoke test covering the capture → replay flow.

  2. __init__.py registration order change — with the C++ FlashInfer impls removed, the effective priority of the Py FlashInfer impls changes. Please confirm that no model configuration relied on a C++ FlashInfer impl being selected first.

P2 - Suggestions

  1. fill_mla_params import may be unused — after PyFlashinferDecodeAttnOp.prepare() switched to self.fmha_params.fill_params(), fill_mla_params may no longer be called; consider cleaning it up.

  2. Empty prefix_lengths tensor propagated to the Python side — the C++ side now initializes it as torch::empty({0}); confirm that FlashInferMlaAttnParams::fill_params handles a size-0 tensor correctly.

Summary

Well structured and consistent with existing patterns. The main risk is the untested decode CUDA graph path; add tests before merging.

@LLLLKKKK
Collaborator

LLLLKKKK commented Apr 9, 2026

🤖 AI Code Review — PR #778
Head SHA: f0f385a3372326120152bab167842cdd096e2037 | Verdict: P2

Summary

Major refactor: removes C++ FlashInfer decode/prefill ops (-468 lines), promotes Python FlashInfer decode to support CUDA graph capture/replay, and fixes CUDA graph decode crash when prefix_lengths is empty. The Python FlashInfer path becomes the sole FlashInfer implementation.

Findings

  1. [P2] Reliance on flashinfer internal attributes for CUDA graph init
    PyFlashinferDecodeAttnOp.prepare() accesses decode_wrapper._fixed_batch_size, _use_cuda_graph, _paged_kv_indptr_buf, etc. These are internal implementation details of BatchDecodeWithPagedKVCacheWrapper. A flashinfer library update could break this silently.

  2. [P2] Two-phase initialization pattern (None + set_params)
    Multiple classes initialize fmha_params = None and require set_params() before prepare(). The assert guards are good, but the pattern is error-prone; constructor injection or a factory method would be safer (a sketch follows this list).

  3. [P2] prefix_lengths.size(0) > 0 guard may be needed in more places
    The fix adds this check in CudaGraphDecode.cc and CudaGraphRunner.cc, but other code paths may also call .max() on prefix_lengths without this guard. A codebase-wide audit would be prudent.

  4. [P2] Removed C++ ops may have external references
    FlashInferDecodeOp, FlashInferPrefillOp, FlashInferAttnParams are removed from the .pyi stubs and registration. Any downstream code referencing these by name will break.
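A minimal sketch of the constructor-injection alternative from finding 2; the class and method names below are illustrative placeholders, not the PR's actual rtp_llm_ops types.

from dataclasses import dataclass, field

class FmhaParams:
    """Stand-in for rtp_llm_ops.FlashInferMlaAttnParams (illustrative only)."""
    def fill_params(self, attn_inputs):
        pass

@dataclass
class DecodeAttnOp:
    # Constructor injection: the params object exists from the moment the op does,
    # so prepare() never observes None and the set_params() handshake disappears.
    fmha_params: FmhaParams = field(default_factory=FmhaParams)

    def prepare(self, attn_inputs):
        return self.fmha_params.fill_params(attn_inputs)

Call sites that previously did op.set_params(params) would instead pass params to the constructor (or to a small create() factory).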

Excellent cleanup overall. The CUDA graph support with forbid_realloc is well-designed, and the prefix_lengths empty tensor fix addresses a real crash.

@JackTan25 JackTan25 force-pushed the feat/support_py_flashinfer_decode_graph branch from f0f385a to b84c82f Compare April 24, 2026 17:16
Copilot AI review requested due to automatic review settings April 24, 2026 17:16
Contributor

Copilot AI left a comment


Pull request overview

This PR migrates FlashInfer attention to the Python (flashinfer-python) path for decode CUDA graph support, while removing the legacy C++ FlashInfer op/bindings from the codebase.

Changes:

  • Removed C++ FlashInfer op implementation and pybind registrations, and updated Bazel targets accordingly.
  • Updated PyFlashinfer attention implementations to support decode CUDA graph replay by avoiding plan() during replay and reusing preallocated page-table buffers.
  • Adjusted CUDA graph runner handling of prefix_lengths to treat decode CUDA graph mode as “defined but empty”.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
rtp_llm/ops/rtp_llm_ops.pyi Removes FlashInfer C++ op stubs from the public typing surface.
rtp_llm/models_py/modules/factory/attention/cuda_impl/py_flashinfer_mha.py Adds/rewires PyFlashinfer decode CUDA graph replay support and shifts params ownership to caller-managed objects.
rtp_llm/models_py/modules/factory/attention/cuda_impl/flash_infer.py Deletes legacy Python wrapper selecting the C++ FlashInfer ops.
rtp_llm/models_py/modules/factory/attention/__init__.py Stops registering removed C++ FlashInfer impls and adjusts PyFlashinfer impl registration.
rtp_llm/models_py/bindings/cuda/ops/CudaFlashInfer.h Removes C++ FlashInfer paged-attn params/impl header.
rtp_llm/models_py/bindings/cuda/ops/CudaFlashInfer.cc Removes C++ FlashInfer paged-attn params/impl source.
rtp_llm/models_py/bindings/cuda/ops/BUILD Updates CUDA bindings build graph to drop removed sources and relocate flashinfer deps.
rtp_llm/models_py/bindings/cuda/RegisterBaseBindings.hpp Removes FlashInfer op binding include.
rtp_llm/models_py/bindings/cuda/RegisterAttnOpBindings.hpp Removes FlashInfer op registration call.
rtp_llm/models_py/bindings/cuda/FlashInferOp.h Deletes C++ FlashInfer op wrapper header.
rtp_llm/models_py/bindings/cuda/FlashInferOp.cc Deletes C++ FlashInfer op wrapper source.
rtp_llm/models_py/bindings/core/ExecOps.cc Removes now-dead include of CudaFlashInfer.
rtp_llm/cpp/cuda_graph/cuda_graph_runner.cc Treats decode CUDA graph prefix lengths as empty and guards size-based logic.
rtp_llm/cpp/cuda_graph/cuda_graph_decode.cc Guards prefix-length max computation when prefix_lengths is empty.
Comments suppressed due to low confidence (1)

rtp_llm/ops/rtp_llm_ops.pyi:37

  • rtp_llm_ops.pyi references typing.Any in signatures, but the module typing isn't imported (only Any is imported). This breaks stub parsing/type-checking. Import typing or switch these annotations to use the already-imported Any directly.
class SelectTopkOp:
    def __init__(
        self, attn_configs: typing.Any
    ) -> None: ...


Comment on lines 85 to 103
@@ -98,7 +98,7 @@ def __init__(
def __del__(self):
release_py_flashinfer_workspace_buffer(self.g_workspace_buffer)

def set_params(self, params: Any):
def set_params(self, params: rtp_llm_ops.FlashInferMlaAttnParams):
"""Set the params object to be used by this op."""
self.fmha_params = params

Copilot AI Apr 24, 2026


PyFlashinferPrefillPagedAttnOp now initializes fmha_params to None, but prepare() calls self.fmha_params.fill_params(...) without guarding against None. This will raise an AttributeError if set_params() wasn't called (and also violates the prepare() return type). Add an explicit assert self.fmha_params is not None (or initialize it in __init__) similar to the other ops in this file.

self.seq_size_per_block,
forbid_realloc=True,
)


Copilot AI Apr 24, 2026


CUDA-graph decode replay behavior is now specialized via prepare_for_cuda_graph_replay() / prepare_cuda_graph() to avoid calling plan() during replay. There are CUDA graph tests in rtp_llm/cpp/cuda_graph/tests/, but none appear to explicitly exercise PyFlashinferDecodeImpl (it can be preempted by higher-priority decode impls). Add/adjust a test to force selection of PyFlashinferDecodeImpl in CUDA graph decode mode and validate replay works (no plan during replay, correct outputs).

Suggested change
def prepare_cuda_graph(self, attn_inputs: PyAttentionInputs) -> None:
"""CUDA-graph replay entry point.
Keep decode replay on the specialized no-plan path by delegating to
prepare_for_cuda_graph_replay(). This avoids any fallback to the
generic prepare() flow during replay.
"""
self.prepare_for_cuda_graph_replay(attn_inputs)

@LLLLKKKK
Collaborator

AI Code Review — PR #778

feat: support py flashinfer decode cuda graph and remove c++ flashinfer op

Summary

Migrates decode-phase FlashInfer attention from the C++ implementation to Python (py_flashinfer) and adds decode CUDA graph support. Removes ~1100 lines of C++ code and introduces prepare_for_cuda_graph_replay() to resolve the race between plan() and the capture stream.

Issues

  1. [P0] __init__.py still imports a deleted module

    Lines 118-124 of rtp_llm/models_py/modules/factory/attention/__init__.py still import the deleted flash_infer.py module:

    from rtp_llm.models_py.modules.factory.attention.cuda_impl.flash_infer import (
        FlashInferDecodeImpl,
        FlashInferPrefillImpl,
    )

    This raises an ImportError at runtime and breaks attention factory initialization.

    Recommendation: delete lines 118-124.

  2. [P1] Dangling include on the PPU platform

    [SENSITIVE] still includes the deleted CudaFlashInfer.h, so the PPU platform build will fail to compile.

    Recommendation: check whether this include is still needed and remove it if not.

  3. [P1] The new decode CUDA graph feature lacks tests

    The new prepare_cuda_graph, support_cuda_graph, and prepare_for_cuda_graph_replay are the core new functionality, but there are no corresponding unit or smoke tests.

    Recommendation: add at least one smoke test covering the decode CUDA graph + py_flashinfer path.

  4. [P2] Direct manipulation of decode_wrapper private attributes

    prepare() directly sets private attributes such as decode_wrapper._fixed_batch_size, which may change across flashinfer upgrades. Recommendation: add a comment stating which flashinfer version this depends on.

  5. [P2] Empty PR description

    A PR that removes ~1200 lines of C++ code and adds CUDA graph support should explain the migration motivation and design decisions.

Conclusion

Blocking/important issues remain; not ready to merge. Fix the P0 issue (leftover __init__.py import) and the P1 issues (PPU include, test coverage) first.

LLLLKKKK previously approved these changes Apr 26, 2026
@github-actions

CI dispatcher could not find a native build run for HEAD SHA b84c82f2.

This can happen if the PR was opened before the CI architecture change, or if the original run was deleted.

To fix: push any commit (even empty: git commit --allow-empty -m "trigger CI" && git push) to create a native build run, then re-approve or post lgtm ready to ci.

@LLLLKKKK LLLLKKKK force-pushed the feat/support_py_flashinfer_decode_graph branch from b84c82f to ca732cf Compare April 26, 2026 08:52
@LLLLKKKK
Collaborator

🤖 AI Code Review — PR #778

feat: support py flashinfer decode cuda graph and remove c++ flashinfer op

Core changes

  • Remove the C++ FlashInfer ops (FlashInferOp, CudaFlashInfer) and the cufmha implementation (-2288 lines)
  • Migrate to Python FlashInfer and add CUDA graph decode support
  • Fix prefix_lengths handling for decode mode in cuda_graph_runner.cc
  • Simplify the python_model_busy check in GatherBatchScheduler

P2: GatherBatchScheduler behavior change

In GatherBatchScheduler.h, python_model_busy changed from model_specific_config_.load_python_model && !running_streams_.empty() to !running_streams_.empty(). All models (not just Python models) now wait for running streams to drain before gathering, which may affect throughput for non-Python models. Please confirm this is intentional.

P3: fmha_params initialization

In py_flashinfer_mha.py, self.fmha_params is initialized to None while set_params is annotated as FlashInferMlaAttnParams. If forward runs before set_params is called, it raises an AttributeError.

Conclusion

LGTM ready to ci — no blocking issues. The code reduction is substantial, and the CUDA graph support is soundly designed.

@wht21
Collaborator

wht21 commented Apr 26, 2026

internal source has been updated, please review the changes!

Copilot AI review requested due to automatic review settings April 26, 2026 09:03
@LLLLKKKK LLLLKKKK force-pushed the feat/support_py_flashinfer_decode_graph branch from ca732cf to e4464d0 Compare April 26, 2026 09:03
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 6 comments.



Comment on lines 85 to 86
self.fmha_params: Optional[rtp_llm_ops.FlashInferMlaAttnParams] = None
self.enable_cuda_graph = attn_inputs.is_cuda_graph

Copilot AI Apr 26, 2026


PyFlashinferPrefillPagedAttnOp now initializes fmha_params to None, but prepare() still dereferences self.fmha_params without any guard in this class. To avoid runtime AttributeError when set_params() is missed, add an explicit assertion/check before using fmha_params (as done in other ops in this file).

Comment on lines +60 to +61
// Defer the gather until running streams drain so the next batch is pure prefill.
const bool python_model_busy = !running_streams_.empty();

Copilot AI Apr 26, 2026


This change makes python_model_busy true whenever running_streams_ is non-empty, regardless of model_specific_config_.load_python_model. That alters GatherBatchScheduler behavior for non-Python models (it will never gather new streams while any stream is running), which can reduce batching/throughput. If the intent is to guard only the PyWrappedModel path, restore the load_python_model condition or rename the variable/comment to reflect the new global behavior.

Suggested change
// Defer the gather until running streams drain so the next batch is pure prefill.
const bool python_model_busy = !running_streams_.empty();
// Defer gather only for python models until running streams drain so the next batch is pure prefill.
const bool python_model_busy =
model_specific_config_.load_python_model && !running_streams_.empty();

cu_seqlens.slice(0, 1, max_bs_ + 1) = input_lengths.cumsum(0);
if (prefix_lengths.defined()) {
if (prefix_lengths.defined() && prefix_lengths.size(0) > 0) {
cu_kv_seqlens.slice(0, 1, max_bs_ + 1) = input_lengths.add(prefix_lengths).cumsum(0);

Copilot AI Apr 26, 2026


When prefix_lengths is an empty tensor (decode CUDA graph mode), cu_kv_seqlens is left as all zeros because it’s only computed in the prefix_lengths.size(0) > 0 branch. This likely produces incorrect cu_kv_seqlens for decode (many kernels/tests assume cu_kv_seqlens at least matches cu_seqlens when no prefix lengths are present). Consider adding an else that sets cu_kv_seqlens from input_lengths.cumsum(0) (or copies from cu_seqlens) when prefix lengths are empty.

Suggested change
cu_kv_seqlens.slice(0, 1, max_bs_ + 1) = input_lengths.add(prefix_lengths).cumsum(0);
cu_kv_seqlens.slice(0, 1, max_bs_ + 1) = input_lengths.add(prefix_lengths).cumsum(0);
} else {
cu_kv_seqlens.slice(0, 1, max_bs_ + 1) = input_lengths.cumsum(0);

inputs.attention_inputs.prefix_lengths_d = inputs.attention_inputs.prefix_lengths.cuda();
} else {
// Decode CUDA graph mode: prefix_lengths should be empty tensor
inputs.attention_inputs.prefix_lengths = torch::empty({0}, options_cpu_int32_).pin_memory();

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In decode CUDA graph mode this sets prefix_lengths to an empty CPU tensor but does not initialize the corresponding device mirror prefix_lengths_d. Downstream Python code frequently consumes attn_inputs.prefix_lengths_d; leaving it undefined can cause runtime errors. Consider initializing prefix_lengths_d to an empty CUDA int32 tensor in this branch to keep host/device mirrors consistent.

Suggested change
inputs.attention_inputs.prefix_lengths = torch::empty({0}, options_cpu_int32_).pin_memory();
inputs.attention_inputs.prefix_lengths = torch::empty({0}, options_cpu_int32_).pin_memory();
inputs.attention_inputs.prefix_lengths_d = torch::empty({0}, options_cuda_int32_);

capture_mem_hold_.py_model_inputs_.attention_inputs.prefix_lengths_d.slice(0, 0, batch_size);
} else {
// For decode CUDA graph mode: prefix_lengths is empty tensor
inputs.attention_inputs.prefix_lengths = capture_mem_hold_.py_model_inputs_.attention_inputs.prefix_lengths;

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In decode CUDA graph mode (prefix_lengths is empty), this branch assigns inputs.attention_inputs.prefix_lengths but leaves inputs.attention_inputs.prefix_lengths_d unset, while other code paths assume the device mirror exists. If prefix_lengths_d is initialized as an empty CUDA tensor during capture, it should also be threaded through here (e.g., assign/slice it alongside prefix_lengths).

Suggested change
inputs.attention_inputs.prefix_lengths = capture_mem_hold_.py_model_inputs_.attention_inputs.prefix_lengths;
inputs.attention_inputs.prefix_lengths = capture_mem_hold_.py_model_inputs_.attention_inputs.prefix_lengths;
inputs.attention_inputs.prefix_lengths_d =
capture_mem_hold_.py_model_inputs_.attention_inputs.prefix_lengths_d;

Comment on lines 118 to 124
from rtp_llm.models_py.modules.factory.attention.cuda_impl.py_flashinfer_mha import (
PyFlashinferDecodeImpl,
PyFlashinferPagedPrefillImpl,
PyFlashinferPrefillImpl,
)

PREFILL_MHA_IMPS.append(PyFlashinferPrefillImpl)
PREFILL_MHA_IMPS.append(PyFlashinferPagedPrefillImpl)
DECODE_MHA_IMPS.append(PyFlashinferDecodeImpl)

Copilot AI Apr 26, 2026


In the CUDA device path, PyFlashinferPrefillImpl/PyFlashinferDecodeImpl are already imported and registered in the if device_type == DeviceType.Cuda: block above. This second unconditional import/append will register duplicates (and also runs on non-CUDA device types), which can affect impl priority/selection. Consider removing this block for CUDA, or gating it to non-CUDA devices only.

@LLLLKKKK LLLLKKKK force-pushed the feat/support_py_flashinfer_decode_graph branch from e4464d0 to fe2075c Compare April 26, 2026 09:15
@LLLLKKKK
Collaborator

/retest

Copilot AI review requested due to automatic review settings April 26, 2026 09:38
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 5 comments.



Comment on lines +357 to +359
assert (
self.fmha_params is not None
), "fmha_params must be set before calling prepare"

Copilot AI Apr 26, 2026


Using assert for required runtime checks is unsafe because asserts can be stripped with Python -O, potentially leading to a later None dereference. Replace this with an explicit exception (e.g., RuntimeError/ValueError) to guarantee the check is enforced in production.

Suggested change
assert (
self.fmha_params is not None
), "fmha_params must be set before calling prepare"
if self.fmha_params is None:
raise RuntimeError("fmha_params must be set before calling prepare")

Comment on lines +682 to +684
assert (
self.fmha_params is not None
), "fmha_params must be set before calling prepare"

Copilot AI Apr 26, 2026


Same issue as above: assert may be optimized away. Use an explicit exception to enforce that set_params() was called before prepare().

Suggested change
assert (
self.fmha_params is not None
), "fmha_params must be set before calling prepare"
if self.fmha_params is None:
raise RuntimeError("fmha_params must be set before calling prepare")

Comment on lines +694 to +703
if self.enable_cuda_graph and self.decode_wrapper._fixed_batch_size is None:
self.decode_wrapper._use_cuda_graph = True
self.decode_wrapper._paged_kv_indptr_buf = (
self.fmha_params.decode_page_indptr_d
)
self.decode_wrapper._paged_kv_last_page_len_buf = (
self.fmha_params.paged_kv_last_page_len_d
)
self.decode_wrapper._paged_kv_indices_buf = self.fmha_params.page_indice_d
self.decode_wrapper._fixed_batch_size = attn_inputs.input_lengths.size(0)

Copilot AI Apr 26, 2026


This directly mutates private/internal fields on decode_wrapper (e.g., _paged_kv_indptr_buf, _fixed_batch_size). That is brittle: changes in FlashInfer wrapper internals can silently break CUDA graph support. Prefer adding/using a dedicated public method on the wrapper (e.g., enable_cuda_graph_buffers(indptr, indices, last_page_len, fixed_bs)) or a small adapter layer in your codebase that centralizes this wiring in one place.
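A possible way to centralize this wiring is sketched below: it constructs the wrapper through the CUDA-graph constructor arguments documented for flashinfer's BatchDecodeWithPagedKVCacheWrapper instead of patching private fields after construction. The buffer attribute names on params mirror the ones used in this PR, and the kwarg names are assumptions to verify against the pinned flashinfer version.

import torch
import flashinfer

def make_cuda_graph_decode_wrapper(workspace_buffer: torch.Tensor, params, use_tensor_cores: bool):
    # Passing use_cuda_graph=True plus the preallocated page-table buffers at
    # construction time replaces the manual _use_cuda_graph / _paged_kv_*_buf
    # assignments, and leaves the wrapper to manage any internal qo-indptr
    # buffer it needs for the tensor-core path.
    return flashinfer.BatchDecodeWithPagedKVCacheWrapper(
        workspace_buffer,
        kv_layout="NHD",
        use_cuda_graph=True,
        use_tensor_cores=use_tensor_cores,
        paged_kv_indptr_buffer=params.decode_page_indptr_d,              # was _paged_kv_indptr_buf
        paged_kv_indices_buffer=params.page_indice_d,                    # was _paged_kv_indices_buf
        paged_kv_last_page_len_buffer=params.paged_kv_last_page_len_d,   # was _paged_kv_last_page_len_buf
    )

This keeps all knowledge of the wrapper's internals in one function, so a flashinfer upgrade only needs to be reconciled in one place.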

"""
Prepare the decode wrapper with paged KV cache parameters.

forbid_realloc: True only when called from prepare_cuda_graph (replay); forbids buffer realloc.

Copilot AI Apr 26, 2026


The docstring says forbid_realloc=True is for CUDA graph replay, but replay is now handled by prepare_for_cuda_graph_replay() (which also enforces no re-plan). Either update this docstring to reflect current call paths, or remove/privatize forbid_realloc if it’s no longer part of the intended public behavior.

Suggested change
forbid_realloc: True only when called from prepare_cuda_graph (replay); forbids buffer realloc.
forbid_realloc: When True, prevents buffer reallocation while filling
params. CUDA graph replay is handled via
``prepare_for_cuda_graph_replay()``.

Comment on lines +60 to 62
// Defer the gather until running streams drain so the next batch is pure prefill.
const bool python_model_busy = !running_streams_.empty();
if (waiting_streams_.size() >= static_cast<size_t>(gather_batch_size_) && !python_model_busy) {

Copilot AI Apr 26, 2026


This changes behavior for all models (not just PyWrappedModel): gathering is now deferred whenever running_streams_ is non-empty, which can reduce throughput and increase tail latency by preventing batching while decode is in progress. If the constraint is specific to Python-wrapped execution (mixed prefill+decode shape mismatch), consider gating this behind a config flag (e.g., load_python_model or a dedicated disallow_mixed_prefill_decode_batches) so non-Python/C++ paths can retain previous scheduling behavior.

Copilot AI review requested due to automatic review settings April 26, 2026 10:10
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 1 comment.



Comment on lines 84 to 86
self.max_seq_len = attn_configs.max_seq_len
self.fmha_params = rtp_llm_ops.FlashInferMlaAttnParams()
self.fmha_params: Optional[rtp_llm_ops.FlashInferMlaAttnParams] = None
self.enable_cuda_graph = attn_inputs.is_cuda_graph

Copilot AI Apr 26, 2026


PyFlashinferPrefillPagedAttnOp.fmha_params is changed to be optional/initialized to None, but prepare() later dereferences it (calls self.fmha_params.fill_params(...)) without a local guard. This makes the class fragile: any caller that invokes prepare() before set_params() will hit an AttributeError. Consider either initializing fmha_params in __init__ (as before) or adding an explicit precondition check in prepare() similar to the other ops in this module.

@LLLLKKKK LLLLKKKK force-pushed the feat/support_py_flashinfer_decode_graph branch from 9316f6f to 48a958d Compare April 26, 2026 11:09
Copilot AI review requested due to automatic review settings April 26, 2026 11:40
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 34 out of 34 changed files in this pull request and generated 1 comment.



Comment on lines 471 to 482
// prefix_lengths [batch_size, int32] (for attention `prepare`)
if (num_tokens_per_bs_ > 1 && !is_prefill_cuda_graph_mode_) {
inputs.attention_inputs.prefix_lengths =
torch::full({int(max_bs_)}, max_seq_len_ - num_tokens_per_bs_, options_cpu_int32_).pin_memory();
inputs.attention_inputs.prefix_lengths_d = inputs.attention_inputs.prefix_lengths.cuda();
} else if (is_prefill_cuda_graph_mode_) {
inputs.attention_inputs.prefix_lengths = torch::zeros({int(max_bs_)}, options_cpu_int32_).pin_memory();
inputs.attention_inputs.prefix_lengths_d = inputs.attention_inputs.prefix_lengths.cuda();
} else {
// Decode CUDA graph mode: prefix_lengths should be empty tensor
inputs.attention_inputs.prefix_lengths = torch::empty({0}, options_cpu_int32_).pin_memory();
}

Copilot AI Apr 26, 2026


In decode CUDA-graph mode you now set prefix_lengths to an empty host tensor, but you don't initialize the corresponding device mirror attention_inputs.prefix_lengths_d. PyAttentionInputs has a concrete prefix_lengths_d field and some Python model code accesses attention_inputs.prefix_lengths_d; leaving it undefined can cause runtime failures. Consider setting prefix_lengths_d to an empty CUDA tensor here (to match non-graph decode behavior), and update the CUDA-graph replay copy logic to skip/zero-copy prefix_lengths_d when it is empty so it doesn't memcpy batch_size*sizeof(int) into a 0-sized buffer.

Copilot AI review requested due to automatic review settings April 26, 2026 12:25
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 34 out of 34 changed files in this pull request and generated no new comments.



Copilot AI review requested due to automatic review settings April 26, 2026 14:22
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 35 out of 35 changed files in this pull request and generated no new comments.



LLLLKKKK and others added 4 commits April 26, 2026 23:02
… graph

- Remove C++ FlashInfer ops (CudaFlashInfer, FlashInferOp) and old Python wrapper (flash_infer.py)
- Trim 3rdparty/flashinfer BUILD to sampling/norm/activation kernels only
- Add decode CUDA graph support to PyFlashinferDecodeAttnOp with buffer management
- Add comprehensive decode UT covering correctness and CUDA graph buffer validation
- Remove C++ FlashInfer decode graph path from cuda_graph_runner
- Fix fmha_params auto-creation in constructors (restore original author design)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PyWrappedModel is now the default model path so the prefill+decode mixing
guard should always apply. Make python_model_busy depend only on
running_streams_ being non-empty, and update GatherBatchSchedulerTest to
drop the now-dead load_python_model parameter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ugins, fa_deps, deep_ep_deps)

- Delete 3rdparty/aiter/ directory and all related BUILD/config references
- Remove flash_attention, aiter_src from open_source/deps/git.bzl
- Remove fa_deps(), deep_ep_deps(), kernel_so_deps(), arpc_deps(), trt_plugins() from arch_select.bzl
- Delete cufmha/ directory and nv_trt_plugins/BUILD stub
- Remove aiter link dep from rocm BUILD and decode_aiter_attn from OpData.h
- Clean up aiter genrules and copy targets from rtp_llm/libs/BUILD

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FlashInfer's BatchDecodeWithPagedKVCacheWrapper requires _qo_indptr_buf
when both use_tensor_cores and _use_cuda_graph are True. We were setting
_use_cuda_graph manually after construction without creating this buffer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 26, 2026 15:38
@LLLLKKKK LLLLKKKK force-pushed the feat/support_py_flashinfer_decode_graph branch from 064182b to e04b793 Compare April 26, 2026 15:38
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 35 out of 35 changed files in this pull request and generated no new comments.



@LLLLKKKK
Collaborator

🤖 AI Code Review — PR #778

PR overview

Title: feat: support py flashinfer decode cuda graph and remove c++ flashinfer op
Review mode: full review (force push/rebase detected)
Scope: 35 (GitHub) + 55 (internal-source) files, +288/-3806

Core goals

  1. Migrate FlashInfer decode attention from C++ to Python (PyFlashinferDecodeAttnOp) and add decode CUDA graph support
  2. Remove the C++ FlashInfer ops and their pybind registrations
  3. Clean up dead dependencies (aiter, flash_attention, cufmha, trt_plugins, etc.)
  4. Fix the scheduler's python_model_busy guard

Review comments

Issues

  1. The PR is large; consider splitting in the future [P2]
    The build dependency cleanup (aiter/fa/deep_ep/trt_plugins) is largely independent of the feature migration and could have been a separate PR. It is already cleanly split by commit, so the impact is manageable.

  2. Empty PR description [P3]
    Add a PR body explaining the motivation and design.

  3. Leftover fill_mla_params import [P3]
    test_py_flashinfer_mha_decode.py still imports fill_mla_params, which the decode path no longer uses; clean up in a follow-up.

Highlights

  • The P0 (leftover import causing ImportError) and P1 issues (PPU header reference, missing tests) from the previous review are all fixed
  • The CUDA graph capture/replay split is designed correctly: capture binds buffers and calls plan(); replay only updates contents via fill_params
  • The C++ side's defensive checks for an empty prefix_lengths tensor are in place
  • The 4 commits are cleanly separated by functionality and nicely atomic
  • 3 new decode CUDA graph test cases are added

Overall assessment

Good code quality; all previously blocking issues are fixed. P0=0, P1=0, P2=1, P3=2.

LGTM ready to ci — this review found no blocking or important issues; the PR can proceed to CI validation and merge.

lgtm ready to ci

@wht21
Collaborator

wht21 commented Apr 26, 2026

internal source has been updated, please review the changes!

@LLLLKKKK LLLLKKKK merged commit f0132ee into main Apr 26, 2026
10 of 12 checks passed