
feat: use rtp-kernel fp8 group gemm #807

Merged
LLLLKKKK merged 2 commits into main from feat/replace_fp8_group_gemm on Mar 23, 2026
Conversation

@moui0 moui0 (Collaborator) commented Mar 20, 2026

No description provided.

@moui0 moui0 requested a review from LLLLKKKK as a code owner March 20, 2026 03:37
@moui0 moui0 requested review from Copilot and removed request for Copilot March 20, 2026 03:37
@LLLLKKKK (Collaborator)

🤖 AI Code Review — PR #807 — feat: use rtp-kernel fp8 group gemm

Stats: +84 / -2901, 21 files changed


P1 (Important)

P1-1: fp8_grouped_gemm_ptpc — 3 undocumented positional args hardcoded to 0
rtp_llm/models_py/kernels/cuda/fp8_kernel/fp8_kernel.py

The new call inserts three positional arguments hardcoded to 0 between cluster_k and swap_ab. No comments explain what they control or whether 0 is a safe default. With 23 positional arguments in total, the call is fragile and hard to verify.

Recommendation: Add inline comments documenting these 3 parameters. Confirm with rtp-kernel maintainer that 0 is correct.

P1-2: Test coverage gaps for full argument path
rtp_llm/models_py/kernels/cuda/test/cutlass_fp8_grouped_gemm_test.py

The test only passes 12 positional args (up to per_out_ch), relying on defaults for everything else. The profile=True path (with tile/cluster selection), the 3 new params with non-zero values, and swap_ab=True are all untested.

Recommendation: Add at least one test case exercising the full argument list, especially the profile path.
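
A skeleton along these lines would at least register the missing paths (a sketch only; the input construction and reference check are left to the test author and are not part of the current test file):

import pytest
import torch


def test_fp8_grouped_gemm_ptpc_full_arg_path():
    # Hypothetical sketch: exercise the full positional argument list,
    # including tile/cluster sizes, non-zero values for the three new
    # parameters, and profile=True, instead of stopping at per_out_ch.
    if not torch.cuda.is_available():
        pytest.skip("requires a CUDA device")
    ...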

P1-3: cutlass_moe_mm_fp8_scaled silently drops return value
rtp_llm/models_py/kernels/cuda/fp8_kernel/fp8_kernel.py

The old code had return cutlass_moe_mm(...). The new code calls fp8_grouped_gemm_ptpc(...) without returning anything. Current callers rely on the in-place output, but this is still an API behavior change.

Recommendation: Either consistently return the result, or add a comment confirming the function is purely in-place.
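
For example, the first option could look roughly like this (a sketch with an abbreviated signature, assuming the rtp_kernel wheel is installed; not the actual wrapper code):

import torch
from rtp_kernel.fp8_group_gemm import fp8_grouped_gemm_ptpc


def cutlass_moe_mm_fp8_scaled(output: torch.Tensor, *args, **kwargs) -> torch.Tensor:
    # The kernel writes into `output` in place; returning it preserves the old
    # `return cutlass_moe_mm(...)` contract for any caller that still uses the
    # return value.
    fp8_grouped_gemm_ptpc(output, *args, **kwargs)
    return output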


P2 (Suggestion)

P2-1: Consider keyword arguments for fp8_grouped_gemm_ptpc

A call with 23 positional arguments is extremely error-prone. If the binding supports keyword arguments, use them for at least tile/cluster/profile/swap_ab:

fp8_grouped_gemm_ptpc(
    output, aq, w, aq_scale, w_scale,
    expert_offsets, problem_sizes,
    a_strides, b_strides, c_strides,
    per_act_token, per_out_ch,
    tile_m, tile_n, tile_k,
    cluster_m, cluster_n, cluster_k,
    0, 0, 0,  # TODO: document these params
    swap_ab=swap_ab,
    profile=True,
)

P2-2: Open-source wheel URL path inconsistency
open_source/deps/requirements_torch_gpu_cuda12_9.txt — the open-source URL previously had no date-stamped subdirectory but now adds rtp_kernel_260317/. Minor inconsistency with internal URL structure.


Verdict

Approve with comments. The migration is structurally sound — all references are consistently updated and old C++ code is cleanly removed. Main asks: (1) document the 3 new positional params, (2) add test coverage for the full argument path, (3) clarify the dropped return value.

Copilot AI (Contributor) left a comment

Pull request overview

This PR switches FP8 grouped GEMM for MoE from in-repo CUTLASS/pybind implementations to the rtp_kernel packaged implementation, removing the old bindings/kernel sources and updating call sites + deps accordingly.

Changes:

  • Replace compute_ops/pybind usage of FP8 grouped GEMM + MoE metadata helpers with rtp_kernel.fp8_group_gemm equivalents in MoE executors and FP8 kernel wrapper.
  • Remove CUTLASS FP8 grouped GEMM C++/CUDA sources, Bazel targets, and Python bindings/stubs for the old ops.
  • Update the GPU CUDA 12.9 requirements to pull the rtp-kernel wheel from its new location; adjust the FP8 grouped GEMM test to call rtp_kernel directly.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| rtp_llm/ops/librtp_compute_ops/rtp_llm_ops.pyi | Removes cutlass_moe_mm + related helper symbols from the Python stub export surface. |
| rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/cutlass_w4a8_moe.py | Switches MoE offset/problem-size helpers to rtp_kernel.fp8_group_gemm and drops compute_ops dependency. |
| rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/cutlass_moe.py | Switches MoE offset/problem-size helpers to rtp_kernel.fp8_group_gemm and drops compute_ops dependency. |
| rtp_llm/models_py/kernels/cuda/test/cutlass_fp8_grouped_gemm_test.py | Updates test to call rtp_kernel.fp8_group_gemm.fp8_grouped_gemm_ptpc directly. |
| rtp_llm/models_py/kernels/cuda/fp8_kernel/fp8_kernel.py | Replaces compute_ops.cutlass_moe_mm calls with rtp_kernel grouped GEMM, adding new scheduling/profile args. |
| rtp_llm/models_py/bindings/cuda/RegisterCudaOps.cc | Removes pybind registrations for cutlass_moe_mm and MoE metadata helper functions. |
| rtp_llm/models_py/bindings/cuda/BUILD | Drops dependency on the removed fp8_group_cu CUTLASS target. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/moe_data.cu | Deletes in-repo MoE metadata CUDA implementation (moved to rtp_kernel). |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/include/scaled_mm_epilogues_c3x.hpp | Deletes in-repo CUTLASS epilogue helpers for FP8 grouped GEMM. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/include/scalar_type.hpp | Deletes scalar-type support code used by the old CUTLASS FP8 grouped GEMM path. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/include/heuristics_search.hpp | Deletes in-repo heuristic config selection code for FP8 grouped GEMM. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/include/broadcast_load_epilogue_c3x.hpp | Deletes in-repo broadcast epilogue implementation used by the old path. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/include/broadcast_load_epilogue_array_c3x.hpp | Deletes in-repo array-broadcast epilogue implementation used by the old path. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/grouped_mm_c3x_sm90.cu | Deletes old CUTLASS SM90 grouped GEMM kernel implementation. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/grouped_mm_c3x.cuh | Deletes old CUTLASS grouped GEMM kernel header/launcher utilities. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/fp8_group_gemm.h | Deletes old public header for the removed grouped GEMM API. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/fp8_group_gemm.cu | Deletes old CUDA utility implementation used by the removed bindings. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/README.md | Removes documentation for the deleted in-repo grouped GEMM. |
| rtp_llm/cpp/cuda/cutlass/BUILD | Removes the fp8_group_cu Bazel library target. |
| open_source/deps/requirements_torch_gpu_cuda12_9.txt | Updates rtp-kernel wheel URL to a new path. |
| open_source/deps/requirements_lock_torch_gpu_cuda12_9.txt | Updates locked rtp-kernel wheel URL + hash to match the new artifact. |


Comment thread: rtp_llm/models_py/kernels/cuda/fp8_kernel/fp8_kernel.py
@LLLLKKKK (Collaborator)

PR #807 — feat: use rtp-kernel fp8 group gemm | Code Review v2 (incremental)

Head SHA: b988126f (prev: af7c5a29)
Stats: +82 / -2910, 21 files (GitHub) + 5 internal files


v1 Finding Status

| # | Finding | Severity | Status |
| --- | --- | --- | --- |
| P0-1 | get_cutlass_moe_mm_without_permute_info / get_cutlass_batched_moe_mm_data may not exist in new wheel | P0 | ⚠️ Still needs runtime verification |
| P1-1 | Undocumented positional args (0, 0, 0) | P1 | ✅ Fixed — now uses keyword args (stage_count=0, mainloop_sched=0, epilogue_sched=0) |
| P1-3 | Test coverage gaps (profile/swap_ab/helper functions) | P1 | ❌ Not fixed — no new test cases added |
| P1-4 | Dropped return in cutlass_moe_mm_fp8_scaled | P1 | ✅ Fixed — added -> None return type annotation |
| P2-1 | Use keyword arguments | P2 | ✅ Fixed (same as P1-1) |
| P2-2 | Wheel version still 0.1.0 | P2 | Acceptable — URL path + hash changed, lock files updated |
| P2-3 | Only CUDA 12.9 requirements updated | P2 | ✅ Fixed — ARM and x86_64 variants all updated in internal source |

What changed in this push

  1. fp8_kernel.py: the profile=True branch now uses keyword args for stage_count, mainloop_sched, epilogue_sched, swap_ab, and profile (a rough sketch of the call follows this list). The profile=False branch is simplified to 12 positional args only (dead-code tile values 128,128,128,1,1,1 removed). The return type is annotated as -> None.

  2. Internal deps: arch_select.bzl, ARM requirements, and lock files all updated to rtp_kernel_260317 with correct SHA256 hashes.

  3. No changes to the test file beyond the original cutlass_moe_mm -> fp8_grouped_gemm_ptpc swap.
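
For reference, the profile-branch call now looks roughly like this (a sketch assembled from the finding table above and the v1 snippet, not copied verbatim from fp8_kernel.py):

fp8_grouped_gemm_ptpc(
    output, aq, w, aq_scale, w_scale,
    expert_offsets, problem_sizes,
    a_strides, b_strides, c_strides,
    per_act_token, per_out_ch,
    tile_m, tile_n, tile_k,
    cluster_m, cluster_n, cluster_k,
    stage_count=0,     # formerly an undocumented positional 0 (P1-1)
    mainloop_sched=0,  # formerly an undocumented positional 0 (P1-1)
    epilogue_sched=0,  # formerly an undocumented positional 0 (P1-1)
    swap_ab=swap_ab,
    profile=True,
)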


CI Status

  • GitHub: build check in progress; all other checks (CodeQL, Agent, etc.) passed.
  • Internal: Code quality scan (COMPLETE) and auto-CR (SUCCESS) passed. No runtime test results yet.

Verdict

Conditional approve. Code quality issues from v1 are addressed. The migration is clean and consistent.

Must-verify before merge:

  1. Confirm rtp_kernel_260317 exports get_cutlass_moe_mm_without_permute_info and get_cutlass_batched_moe_mm_data from rtp_kernel.fp8_group_gemm (P0-1); a quick import check is sketched below.
  2. Wait for GitHub build check to pass.

Follow-up:
3. Add test coverage for profile=True, swap_ab=True, and the two helper functions (P1-3).
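
A quick runtime check for item 1 above (assuming the rtp_kernel_260317 wheel is installed in the target environment):

from rtp_kernel import fp8_group_gemm

# Fails fast if the new wheel does not export the two MoE metadata helpers (P0-1).
assert hasattr(fp8_group_gemm, "get_cutlass_moe_mm_without_permute_info")
assert hasattr(fp8_group_gemm, "get_cutlass_batched_moe_mm_data")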


🤖 Automated review by Claude | v1 review | v2 incremental

@moui0 moui0 force-pushed the feat/replace_fp8_group_gemm branch from b988126 to 5b1a62a on March 23, 2026 01:52
@moui0 moui0 enabled auto-merge (rebase) March 23, 2026 05:46
@moui0 moui0 disabled auto-merge March 23, 2026 05:47
@LLLLKKKK (Collaborator)

🤖 AI Code Review — PR #807 — feat: use rtp-kernel fp8 group gemm

Overview

Migrates FP8 grouped GEMM from the in-repo rtp-llm C++/CUDA implementation to the external rtp_kernel package (rtp_kernel.fp8_group_gemm). About 2800 lines of C++/CUDA code are deleted and replaced with calls through the Python interface. The rtp-kernel version is bumped to 260317 at the same time.

Strengths

  • Greatly reduces the code maintenance burden, with a net removal of ~2800 lines of C++/CUDA template code
  • Consolidates the kernel implementation in the rtp_kernel package for unified maintenance, avoiding duplicated code
  • pybind registrations, BUILD dependencies, and type stubs are cleaned up thoroughly, with no leftover references
  • The test file is updated in step

Suggested improvements

P1 - Important

1. profile=True on the else branch in fp8_kernel.py may cause a performance regression

In cutlass_moe_mm_fp8_scaled, when best_config is None (the else branch, which uses the default 128x128x128 tile), the new code passes profile=True, whereas the old code passed False on this branch. This could trigger profiling behavior on the non-profiling path and hurt inference performance.

Recommendation: the else branch should pass profile=False to keep the behavior consistent with the old code.
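
A minimal sketch of the suggested change, assuming the branch structure described above (argument names follow the v1 snippet earlier in this thread, not the file itself):

if best_config is not None:
    ...  # tuned tile/cluster path with profile=True, unchanged from this PR
else:
    fp8_grouped_gemm_ptpc(
        output, aq, w, aq_scale, w_scale,
        expert_offsets, problem_sizes,
        a_strides, b_strides, c_strides,
        per_act_token, per_out_ch,
        128, 128, 128,  # default tile
        1, 1, 1,        # default cluster
        swap_ab=swap_ab,
        profile=False,  # match the old code: no profiling on the default path
    )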

2. The PR description is empty and does not document the API change

The new fp8_grouped_gemm_ptpc signature differs substantially from the old cutlass_moe_mm (new stage_count/mainloop_sched/epilogue_sched parameters, renames such as m_tile to tile_m). Please state the minimum required rtp-kernel version and the API changes in the PR description so that future compatibility issues are easier to trace.

P2 - Suggestion

3. The swap_ab logic in the test changes the test coverage

cutlass_fp8_grouped_gemm_test.py adds a dynamic swap_ab decision:

swap_ab = expert_offsets[num_expert].item() <= 64 * num_expert

The old code defaulted to swap_ab=False. It would be better to test both the swap_ab=True and swap_ab=False paths explicitly to ensure full coverage.
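
One possible shape for that (a sketch; the test body is elided and the test name is illustrative):

import pytest


@pytest.mark.parametrize("swap_ab", [False, True])
def test_fp8_grouped_gemm_ptpc_swap_ab(swap_ab):
    # Hypothetical sketch: pass swap_ab explicitly instead of deriving it from
    # expert_offsets, so both kernel layouts are exercised deterministically.
    ...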

4. The ARM lock file loses an rtp-kernel entry

requirements_lock_cuda12_arm.txt deletes 3 lines but only adds 2; the old file appears to have contained two rtp-kernel entries. Please confirm whether this is intentional deduplication or an omission.

Summary

A clean refactoring PR with thorough code removal. The main risk is the potential performance regression from profile=True on the else branch; recommend fixing that before merging.

@LLLLKKKK LLLLKKKK merged commit b036adc into main Mar 23, 2026
7 of 9 checks passed
@moui0 moui0 mentioned this pull request Mar 23, 2026