
feat: use rtp-kernel fp8 group gemm #807

Merged
LLLLKKKK merged 2 commits into main from feat/replace_fp8_group_gemm on Mar 23, 2026
Conversation

@moui0 moui0 (Collaborator) commented Mar 20, 2026

No description provided.

@moui0 moui0 requested a review from LLLLKKKK as a code owner March 20, 2026 03:37
@moui0 moui0 requested review from Copilot and removed request for Copilot March 20, 2026 03:37
@LLLLKKKK (Collaborator)

🤖 AI Code Review — PR #807 — feat: use rtp-kernel fp8 group gemm

Stats: +84 / -2901, 21 files changed


P1 (Important)

P1-1: fp8_grouped_gemm_ptpc — 3 undocumented positional args hardcoded to 0
rtp_llm/models_py/kernels/cuda/fp8_kernel/fp8_kernel.py

The new call inserts three positional arguments hardcoded to 0 between cluster_k and swap_ab. No comments explain what they control or whether 0 is a safe default. With 23 positional arguments in total, the call is fragile and hard to verify.

Recommendation: Add inline comments documenting these 3 parameters. Confirm with rtp-kernel maintainer that 0 is correct.

P1-2: Test coverage gaps for full argument path
rtp_llm/models_py/kernels/cuda/test/cutlass_fp8_grouped_gemm_test.py

The test only passes 12 positional args (up to per_out_ch), relying on defaults for everything else. The profile=True path (with tile/cluster selection), the 3 new params with non-zero values, and swap_ab=True are all untested.

Recommendation: Add at least one test case exercising the full argument list, especially the profile path.
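
A skeleton along these lines would at least register the missing paths (a sketch only; the input construction and reference check are left to the test author and are not part of the current test file):

import pytest
import torch


def test_fp8_grouped_gemm_ptpc_full_arg_path():
    # Hypothetical sketch: exercise the full positional argument list,
    # including tile/cluster sizes, non-zero values for the three new
    # parameters, and profile=True, instead of stopping at per_out_ch.
    if not torch.cuda.is_available():
        pytest.skip("requires a CUDA device")
    ...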

P1-3: cutlass_moe_mm_fp8_scaled silently drops return value
rtp_llm/models_py/kernels/cuda/fp8_kernel/fp8_kernel.py

The old code had return cutlass_moe_mm(...). The new code calls fp8_grouped_gemm_ptpc(...) without returning anything. Current callers rely on the in-place output, but this is still an API behavior change.

Recommendation: Either consistently return the result, or add a comment confirming the function is purely in-place.
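
For example, the first option could look roughly like this (a sketch with an abbreviated signature, assuming the rtp_kernel wheel is installed; not the actual wrapper code):

import torch
from rtp_kernel.fp8_group_gemm import fp8_grouped_gemm_ptpc


def cutlass_moe_mm_fp8_scaled(output: torch.Tensor, *args, **kwargs) -> torch.Tensor:
    # The kernel writes into `output` in place; returning it preserves the old
    # `return cutlass_moe_mm(...)` contract for any caller that still uses the
    # return value.
    fp8_grouped_gemm_ptpc(output, *args, **kwargs)
    return output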


P2 (Suggestion)

P2-1: Consider keyword arguments for fp8_grouped_gemm_ptpc

A call with 23 positional arguments is extremely error-prone. If the binding supports keyword arguments, use them for at least tile/cluster/profile/swap_ab:

fp8_grouped_gemm_ptpc(
    output, aq, w, aq_scale, w_scale,
    expert_offsets, problem_sizes,
    a_strides, b_strides, c_strides,
    per_act_token, per_out_ch,
    tile_m, tile_n, tile_k,
    cluster_m, cluster_n, cluster_k,
    0, 0, 0,  # TODO: document these params
    swap_ab=swap_ab,
    profile=True,
)

P2-2: Open-source wheel URL path inconsistency
open_source/deps/requirements_torch_gpu_cuda12_9.txt — the open-source URL previously had no date-stamped subdirectory but now adds rtp_kernel_260317/. Minor inconsistency with internal URL structure.


Verdict

Approve with comments. The migration is structurally sound — all references are consistently updated and old C++ code is cleanly removed. Main asks: (1) document the 3 new positional params, (2) add test coverage for the full argument path, (3) clarify the dropped return value.

Copilot AI (Contributor) left a comment

Pull request overview

This PR switches FP8 grouped GEMM for MoE from in-repo CUTLASS/pybind implementations to the rtp_kernel packaged implementation, removing the old bindings/kernel sources and updating call sites + deps accordingly.

Changes:

  • Replace compute_ops/pybind usage of FP8 grouped GEMM + MoE metadata helpers with rtp_kernel.fp8_group_gemm equivalents in MoE executors and FP8 kernel wrapper.
  • Remove CUTLASS FP8 grouped GEMM C++/CUDA sources, Bazel targets, and Python bindings/stubs for the old ops.
  • Update the GPU CUDA 12.9 requirements to pull the rtp-kernel wheel from its new location; adjust the FP8 grouped GEMM test to call rtp_kernel directly.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| rtp_llm/ops/librtp_compute_ops/rtp_llm_ops.pyi | Removes cutlass_moe_mm + related helper symbols from the Python stub export surface. |
| rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/cutlass_w4a8_moe.py | Switches MoE offset/problem-size helpers to rtp_kernel.fp8_group_gemm and drops compute_ops dependency. |
| rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/cutlass_moe.py | Switches MoE offset/problem-size helpers to rtp_kernel.fp8_group_gemm and drops compute_ops dependency. |
| rtp_llm/models_py/kernels/cuda/test/cutlass_fp8_grouped_gemm_test.py | Updates test to call rtp_kernel.fp8_group_gemm.fp8_grouped_gemm_ptpc directly. |
| rtp_llm/models_py/kernels/cuda/fp8_kernel/fp8_kernel.py | Replaces compute_ops.cutlass_moe_mm calls with rtp_kernel grouped GEMM, adding new scheduling/profile args. |
| rtp_llm/models_py/bindings/cuda/RegisterCudaOps.cc | Removes pybind registrations for cutlass_moe_mm and MoE metadata helper functions. |
| rtp_llm/models_py/bindings/cuda/BUILD | Drops dependency on the removed fp8_group_cu CUTLASS target. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/moe_data.cu | Deletes in-repo MoE metadata CUDA implementation (moved to rtp_kernel). |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/include/scaled_mm_epilogues_c3x.hpp | Deletes in-repo CUTLASS epilogue helpers for FP8 grouped GEMM. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/include/scalar_type.hpp | Deletes scalar-type support code used by the old CUTLASS FP8 grouped GEMM path. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/include/heuristics_search.hpp | Deletes in-repo heuristic config selection code for FP8 grouped GEMM. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/include/broadcast_load_epilogue_c3x.hpp | Deletes in-repo broadcast epilogue implementation used by the old path. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/include/broadcast_load_epilogue_array_c3x.hpp | Deletes in-repo array-broadcast epilogue implementation used by the old path. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/grouped_mm_c3x_sm90.cu | Deletes old CUTLASS SM90 grouped GEMM kernel implementation. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/grouped_mm_c3x.cuh | Deletes old CUTLASS grouped GEMM kernel header/launcher utilities. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/fp8_group_gemm.h | Deletes old public header for the removed grouped GEMM API. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/fp8_group_gemm.cu | Deletes old CUDA utility implementation used by the removed bindings. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/README.md | Removes documentation for the deleted in-repo grouped GEMM. |
| rtp_llm/cpp/cuda/cutlass/BUILD | Removes the fp8_group_cu Bazel library target. |
| open_source/deps/requirements_torch_gpu_cuda12_9.txt | Updates rtp-kernel wheel URL to a new path. |
| open_source/deps/requirements_lock_torch_gpu_cuda12_9.txt | Updates locked rtp-kernel wheel URL + hash to match the new artifact. |


Comment thread: rtp_llm/models_py/kernels/cuda/fp8_kernel/fp8_kernel.py
@LLLLKKKK (Collaborator)

PR #807 — feat: use rtp-kernel fp8 group gemm | Code Review v2 (incremental)

Head SHA: b988126f (prev: af7c5a29)
Stats: +82 / -2910, 21 files (GitHub) + 5 internal files


v1 Finding Status

| # | Finding | Severity | Status |
| --- | --- | --- | --- |
| P0-1 | get_cutlass_moe_mm_without_permute_info / get_cutlass_batched_moe_mm_data may not exist in new wheel | P0 | ⚠️ Still needs runtime verification |
| P1-1 | Undocumented positional args (0, 0, 0) | P1 | ✅ Fixed — now uses keyword args (stage_count=0, mainloop_sched=0, epilogue_sched=0) |
| P1-3 | Test coverage gaps (profile/swap_ab/helper functions) | P1 | ❌ Not fixed — no new test cases added |
| P1-4 | Dropped return in cutlass_moe_mm_fp8_scaled | P1 | ✅ Fixed — added -> None return type annotation |
| P2-1 | Use keyword arguments | P2 | ✅ Fixed (same as P1-1) |
| P2-2 | Wheel version still 0.1.0 | P2 | Acceptable — URL path + hash changed, lock files updated |
| P2-3 | Only CUDA 12.9 requirements updated | P2 | ✅ Fixed — ARM and x86_64 variants all updated in internal source |

What changed in this push

  1. fp8_kernel.py: the profile=True branch now uses keyword args for stage_count, mainloop_sched, epilogue_sched, swap_ab, and profile (a rough sketch of the call follows this list). The profile=False branch is simplified to 12 positional args only (dead-code tile values 128,128,128,1,1,1 removed). The return type is annotated as -> None.

  2. Internal deps: arch_select.bzl, ARM requirements, and lock files all updated to rtp_kernel_260317 with correct SHA256 hashes.

  3. No changes to the test file beyond the original cutlass_moe_mm -> fp8_grouped_gemm_ptpc swap.
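
For reference, the profile-branch call now looks roughly like this (a sketch assembled from the finding table above and the v1 snippet, not copied verbatim from fp8_kernel.py):

fp8_grouped_gemm_ptpc(
    output, aq, w, aq_scale, w_scale,
    expert_offsets, problem_sizes,
    a_strides, b_strides, c_strides,
    per_act_token, per_out_ch,
    tile_m, tile_n, tile_k,
    cluster_m, cluster_n, cluster_k,
    stage_count=0,     # formerly an undocumented positional 0 (P1-1)
    mainloop_sched=0,  # formerly an undocumented positional 0 (P1-1)
    epilogue_sched=0,  # formerly an undocumented positional 0 (P1-1)
    swap_ab=swap_ab,
    profile=True,
)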


CI Status

  • GitHub: build check in progress; all other checks (CodeQL, Agent, etc.) passed.
  • Internal: Code quality scan (COMPLETE) and auto-CR (SUCCESS) passed. No runtime test results yet.

Verdict

Conditional approve. Code quality issues from v1 are addressed. The migration is clean and consistent.

Must-verify before merge:

  1. Confirm rtp_kernel_260317 exports get_cutlass_moe_mm_without_permute_info and get_cutlass_batched_moe_mm_data from rtp_kernel.fp8_group_gemm (P0-1); a quick import check is sketched below.
  2. Wait for GitHub build check to pass.

Follow-up:
3. Add test coverage for profile=True, swap_ab=True, and the two helper functions (P1-3).
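
A quick runtime check for item 1 above (assuming the rtp_kernel_260317 wheel is installed in the target environment):

from rtp_kernel import fp8_group_gemm

# Fails fast if the new wheel does not export the two MoE metadata helpers (P0-1).
assert hasattr(fp8_group_gemm, "get_cutlass_moe_mm_without_permute_info")
assert hasattr(fp8_group_gemm, "get_cutlass_batched_moe_mm_data")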


🤖 Automated review by Claude | v1 review | v2 incremental

@moui0 moui0 force-pushed the feat/replace_fp8_group_gemm branch from b988126 to 5b1a62a on March 23, 2026 01:52
@moui0 moui0 enabled auto-merge (rebase) March 23, 2026 05:46
@moui0 moui0 disabled auto-merge March 23, 2026 05:47
@LLLLKKKK (Collaborator)

🤖 AI Code Review — PR #807 — feat: use rtp-kernel fp8 group gemm

Overview

Migrates FP8 grouped GEMM from the in-repo rtp-llm C++/CUDA implementation to the external rtp_kernel package (rtp_kernel.fp8_group_gemm). About 2800 lines of C++/CUDA code are deleted and replaced with calls through the Python interface. The rtp-kernel version is bumped to 260317 at the same time.

Strengths

  • Greatly reduces the code maintenance burden, with a net removal of ~2800 lines of C++/CUDA template code
  • Consolidates the kernel implementation in the rtp_kernel package for unified maintenance, avoiding duplicated code
  • pybind registrations, BUILD dependencies, and type stubs are cleaned up thoroughly, with no leftover references
  • The test file is updated in step

Suggested improvements

P1 - Important

1. profile=True on the else branch in fp8_kernel.py may cause a performance regression

In cutlass_moe_mm_fp8_scaled, when best_config is None (the else branch, which uses the default 128x128x128 tile), the new code passes profile=True, whereas the old code passed False on this branch. This could trigger profiling behavior on the non-profiling path and hurt inference performance.

Recommendation: the else branch should pass profile=False to keep the behavior consistent with the old code.
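
A minimal sketch of the suggested change, assuming the branch structure described above (argument names follow the v1 snippet earlier in this thread, not the file itself):

if best_config is not None:
    ...  # tuned tile/cluster path with profile=True, unchanged from this PR
else:
    fp8_grouped_gemm_ptpc(
        output, aq, w, aq_scale, w_scale,
        expert_offsets, problem_sizes,
        a_strides, b_strides, c_strides,
        per_act_token, per_out_ch,
        128, 128, 128,  # default tile
        1, 1, 1,        # default cluster
        swap_ab=swap_ab,
        profile=False,  # match the old code: no profiling on the default path
    )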

2. The PR description is empty and does not document the API change

The new fp8_grouped_gemm_ptpc signature differs substantially from the old cutlass_moe_mm (new stage_count/mainloop_sched/epilogue_sched parameters, renames such as m_tile to tile_m). Please state the minimum required rtp-kernel version and the API changes in the PR description so that future compatibility issues are easier to trace.

P2 - Suggestion

3. The swap_ab logic in the test changes the test coverage

cutlass_fp8_grouped_gemm_test.py adds a dynamic swap_ab decision:

swap_ab = expert_offsets[num_expert].item() <= 64 * num_expert

The old code defaulted to swap_ab=False. It would be better to test both the swap_ab=True and swap_ab=False paths explicitly to ensure full coverage.
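
One possible shape for that (a sketch; the test body is elided and the test name is illustrative):

import pytest


@pytest.mark.parametrize("swap_ab", [False, True])
def test_fp8_grouped_gemm_ptpc_swap_ab(swap_ab):
    # Hypothetical sketch: pass swap_ab explicitly instead of deriving it from
    # expert_offsets, so both kernel layouts are exercised deterministically.
    ...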

4. The ARM lock file loses an rtp-kernel entry

requirements_lock_cuda12_arm.txt deletes 3 lines but only adds 2; the old file appears to have contained two rtp-kernel entries. Please confirm whether this is intentional deduplication or an omission.

Summary

A clean refactoring PR with thorough code removal. The main risk is the potential performance regression from profile=True on the else branch; recommend fixing that before merging.

@LLLLKKKK LLLLKKKK merged commit b036adc into main Mar 23, 2026
7 of 9 checks passed
@moui0 moui0 mentioned this pull request Mar 23, 2026