🤖 AI Code Review — PR #807 feat: use rtp-kernel fp8 group gemm
Stats: +84 / -2901, 21 files changed

P1 (Important)
P1-1: Undocumented positional arguments — the new call inserts 3 new positional arguments (the literal `0, 0, 0` below) whose meaning is not documented.
P1-2: Test coverage gaps for full argument path — the test only passes 12 positional args (up to …), so the remainder of the 23-argument call path is never exercised.
P1-3: Dropped return value — the old code had a return value that the new call path discards; clarify whether callers still need it.
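One way to close the P1-2 gap is a test that asserts the wrapper forwards every argument to the kernel, not just a prefix. A minimal sketch using `unittest.mock`; the wrapper `run_grouped_gemm` here is a stand-in for the real call site in `fp8_kernel.py`, and all names are illustrative:

```python
from unittest import mock

def run_grouped_gemm(kernel_fn, *args):
    # Stand-in for the real call site: forwards all 23 positional args
    # to the packaged kernel entry point.
    assert len(args) == 23, f"expected 23 positional args, got {len(args)}"
    return kernel_fn(*args)

def test_full_argument_path():
    kernel = mock.Mock(name="fp8_grouped_gemm_ptpc")
    args = tuple(range(23))  # placeholders for the real tensors/ints
    run_grouped_gemm(kernel, *args)
    # The kernel must receive the full 23-argument path, not a 12-arg prefix.
    kernel.assert_called_once_with(*args)

test_full_argument_path()
```

A test shaped like this fails loudly if a future refactor drops or reorders trailing arguments.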
P2 (Suggestion)
P2-1: Consider keyword arguments — 23 positional arguments is extremely error-prone. If the binding supports kwargs, use them for at least tile/cluster/profile/swap_ab:

```python
fp8_grouped_gemm_ptpc(
    output, aq, w, aq_scale, w_scale,
    expert_offsets, problem_sizes,
    a_strides, b_strides, c_strides,
    per_act_token, per_out_ch,
    tile_m, tile_n, tile_k,
    cluster_m, cluster_n, cluster_k,
    0, 0, 0,  # TODO: document these params
    swap_ab=swap_ab,
    profile=True,
)
```

P2-2: Open-source wheel URL path inconsistency

Verdict
Approve with comments. The migration is structurally sound — all references are consistently updated and old C++ code is cleanly removed. Main asks: (1) document the 3 new positional params, (2) add test coverage for the full argument path, (3) clarify the dropped return value.
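Even if the binding itself only accepts positional arguments, a thin wrapper on the caller side can contain the risk. A sketch under that assumption — the parameter names mirror the 23-argument call quoted above, but the real binding's names are not known from this review:

```python
from dataclasses import dataclass

@dataclass
class GroupedGemmConfig:
    """Illustrative: groups the error-prone positional tail under names."""
    tile_m: int
    tile_n: int
    tile_k: int
    cluster_m: int
    cluster_n: int
    cluster_k: int
    swap_ab: bool = False
    profile: bool = True

def call_grouped_gemm(kernel_fn, output, aq, w, aq_scale, w_scale,
                      expert_offsets, problem_sizes,
                      a_strides, b_strides, c_strides,
                      per_act_token, per_out_ch,
                      cfg: GroupedGemmConfig):
    # Expand the named config back into the positional order the binding
    # expects, so tile/cluster values cannot be transposed at call sites.
    return kernel_fn(
        output, aq, w, aq_scale, w_scale,
        expert_offsets, problem_sizes,
        a_strides, b_strides, c_strides,
        per_act_token, per_out_ch,
        cfg.tile_m, cfg.tile_n, cfg.tile_k,
        cfg.cluster_m, cfg.cluster_n, cfg.cluster_k,
        0, 0, 0,  # TODO: document these params (kept from the call above)
        cfg.swap_ab, cfg.profile,
    )
```

This keeps the positional ordering in exactly one place instead of at every call site.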
Pull request overview
This PR switches FP8 grouped GEMM for MoE from in-repo CUTLASS/pybind implementations to the rtp_kernel packaged implementation, removing the old bindings/kernel sources and updating call sites + deps accordingly.
Changes:
- Replace `compute_ops`/pybind usage of FP8 grouped GEMM + MoE metadata helpers with `rtp_kernel.fp8_group_gemm` equivalents in MoE executors and the FP8 kernel wrapper.
- Remove CUTLASS FP8 grouped GEMM C++/CUDA sources, Bazel targets, and Python bindings/stubs for the old ops.
- Update GPU CUDA 12.9 requirements to pull the new `rtp-kernel` wheel location; adjust the FP8 grouped GEMM test to call `rtp_kernel` directly.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| rtp_llm/ops/librtp_compute_ops/rtp_llm_ops.pyi | Removes cutlass_moe_mm + related helper symbols from the Python stub export surface. |
| rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/cutlass_w4a8_moe.py | Switches MoE offset/problem-size helpers to rtp_kernel.fp8_group_gemm and drops compute_ops dependency. |
| rtp_llm/models_py/modules/factory/fused_moe/impl/cuda/executors/cutlass_moe.py | Switches MoE offset/problem-size helpers to rtp_kernel.fp8_group_gemm and drops compute_ops dependency. |
| rtp_llm/models_py/kernels/cuda/test/cutlass_fp8_grouped_gemm_test.py | Updates test to call rtp_kernel.fp8_group_gemm.fp8_grouped_gemm_ptpc directly. |
| rtp_llm/models_py/kernels/cuda/fp8_kernel/fp8_kernel.py | Replaces compute_ops.cutlass_moe_mm calls with rtp_kernel grouped GEMM, adding new scheduling/profile args. |
| rtp_llm/models_py/bindings/cuda/RegisterCudaOps.cc | Removes pybind registrations for cutlass_moe_mm and MoE metadata helper functions. |
| rtp_llm/models_py/bindings/cuda/BUILD | Drops dependency on the removed fp8_group_cu CUTLASS target. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/moe_data.cu | Deletes in-repo MoE metadata CUDA implementation (moved to rtp_kernel). |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/include/scaled_mm_epilogues_c3x.hpp | Deletes in-repo CUTLASS epilogue helpers for FP8 grouped GEMM. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/include/scalar_type.hpp | Deletes scalar-type support code used by the old CUTLASS FP8 grouped GEMM path. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/include/heuristics_search.hpp | Deletes in-repo heuristic config selection code for FP8 grouped GEMM. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/include/broadcast_load_epilogue_c3x.hpp | Deletes in-repo broadcast epilogue implementation used by the old path. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/include/broadcast_load_epilogue_array_c3x.hpp | Deletes in-repo array-broadcast epilogue implementation used by the old path. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/grouped_mm_c3x_sm90.cu | Deletes old CUTLASS SM90 grouped GEMM kernel implementation. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/grouped_mm_c3x.cuh | Deletes old CUTLASS grouped GEMM kernel header/launcher utilities. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/fp8_group_gemm.h | Deletes old public header for the removed grouped GEMM API. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/fp8_group_gemm.cu | Deletes old CUDA utility implementation used by the removed bindings. |
| rtp_llm/cpp/cuda/cutlass/cutlass_kernels/fp8_group_gemm/README.md | Removes documentation for the deleted in-repo grouped GEMM. |
| rtp_llm/cpp/cuda/cutlass/BUILD | Removes the fp8_group_cu Bazel library target. |
| open_source/deps/requirements_torch_gpu_cuda12_9.txt | Updates rtp-kernel wheel URL to a new path. |
| open_source/deps/requirements_lock_torch_gpu_cuda12_9.txt | Updates locked rtp-kernel wheel URL + hash to match the new artifact. |
🤖 AI Code Review — PR #807 feat: use rtp-kernel fp8 group gemm | Code Review v2 (incremental)
Head SHA:
v1 Finding Status
What changed in this push
CI Status
Verdict
Conditional approve. Code quality issues from v1 are addressed. The migration is clean and consistent. Must-verify before merge:
Follow-up: 🤖 Automated review by Claude | v1 review | v2 incremental
Force-pushed from b988126 to 5b1a62a
🤖 AI Code Review — PR #807 — feat: use rtp-kernel fp8 group gemm

Overview
Migrates FP8 grouped GEMM from rtp-llm's in-repo C++/CUDA implementation to the external rtp-kernel packaged implementation.

Strengths

Suggested improvements
P1 - Important
1. In … Suggestion: the else branch should pass …
2. The PR description is empty and lacks notes on the API change; the new …
P2 - Suggestion
3. In the test:

```python
swap_ab = expert_offsets[num_expert].item() <= 64 * num_expert
```

The old code defaulted to …
4. The rtp-kernel entries in the ARM lock file decreased …

Summary
A clear refactoring PR with clean code removal. The main risk is the else branch …
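The swap_ab condition quoted in the review above can be restated as a tiny pure-Python heuristic. This is a sketch of a reading of that one line, not the actual rtp-kernel logic; the assumption (which the review does not confirm) is that `expert_offsets[num_experts]` holds the total token count, so the check means "at most 64 tokens per expert on average":

```python
def choose_swap_ab(expert_offsets, num_experts, tokens_per_expert_threshold=64):
    # Mirrors: swap_ab = expert_offsets[num_expert].item() <= 64 * num_expert
    # Assumed interpretation: when the average per-expert M dimension is
    # small (skinny GEMMs), swapping the A/B operands is expected to be
    # the faster kernel configuration.
    total_tokens = expert_offsets[num_experts]
    return total_tokens <= tokens_per_expert_threshold * num_experts
```

For example, 4 experts with 40 total tokens (10 each) would select swap_ab, while 4 experts with 400 total tokens would not.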
No description provided.