fix(v4): triton-3.6+ MoE SMEM OOM + empty-batch indexer prefill#710
Merged
Conversation
Two V4-Pro CI failures surfaced after PR #703 unblocked the loader path:

1. **Triton MoE OOM on AMD CDNA4 with triton 3.6+/3.7+.** `triton_kernels.matmul_ogs_details.opt_flags_amd` has a CDNA4 special case `if cdna4 and block_m == 128: block_n = 512`, giving BLOCK_M*BLOCK_N = 64K FP32 acc entries. triton 3.6+ spills the accumulator to LDS more aggressively than 3.5, exceeding the MI355X 160 KiB LDS budget (observed 269 KiB).
   - Fix: wrap matmul_ogs calls with a CDNA4-only context manager that pins block_m=64 / block_n=256 (BLOCK_M*BLOCK_N = 16K, fits in registers). Tunable via `ATOM_TRITON_MOE_BLOCK_{M,N}` env vars.
   - Other GPU families and triton ≤3.5 paths are unaffected.
2. **`cp_gather_indexer_k_quant_cache` HIP "invalid configuration argument"** when `cu_committed_cpu[-1] == 0` (fresh prefill with a prompt shorter than the CSA `ratio`). The kernel grid is computed as `(num_tokens + BLOCK_Y_SIZE - 1) / BLOCK_Y_SIZE`, so `num_tokens=0` makes `grid.x = 0` and the HIP launcher rejects it.
   - Fix: clamp `cu_committed_cpu[-1]` to ≥1 in the indexer-meta builder. The dummy +1 row is gathered from the last seq's first cache block but never read downstream, because `fp8_mqa_logits` and `top_k_per_row_prefill` honor per-token `cu_starts`/`cu_ends` derived from `cu_committed_gpu[:-1]` and `n_committed_per_seq`, both of which remain 0. Output stays all -1 sentinels, matching the all-empty semantics.
   - Pure host-side scalar arithmetic on a value already host-synced; no CG/torch.compile graph branch added.

Verified locally with triton 3.7.0+amd.rocm7.1.0:
- DeepSeek-V4-Pro server starts (no OOM)
- 1-token "Hi" curl returns successfully (was crashing pre-fix)
- GSM8K-50 fewshot=5 = 0.94 (matches pre-PR-703 baseline)
Contributor
Pull request overview
This PR addresses two ROCm/CDNA4 regressions surfaced in CI for DeepSeek-V4: (1) Triton 3.6+ MoE matmul tile choices causing LDS/SMEM OOM on gfx950, and (2) an empty-committed prefill case producing a zero-sized HIP grid for the indexer FP8 gather path.
Changes:
- Add a CDNA4 (gfx950)-scoped context manager to constrain `triton_kernels.matmul_ogs` tiling during MoE matmuls to avoid LDS overflow on Triton 3.6+.
- Clamp the V4 indexer prefill metadata's `cu_committed_cpu[-1]` to at least 1 to prevent `cp_gather_indexer_k_quant_cache` from launching with `grid.x = 0`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `atom/model_ops/fused_moe_triton.py` | Adds a gfx950-only context manager wrapping matmul_ogs calls to constrain MoE tiling via env-tunable parameters. |
| `atom/model_ops/attentions/deepseek_v4_attn.py` | Adds an empty-committed guard by bumping the final cu_committed cumsum to avoid a zero-grid kernel launch. |
Comment on lines +57 to +59:

> Pin block_n ≤ ATOM_TRITON_MOE_MAX_BLOCK_N (default 256) so BLOCK_M*BLOCK_N
> stays at 32K. Default block_n in compute_block_nk is already capped at
> 256 except for that single cdna4 branch, so this only sidesteps the bad
Comment on lines +74 to +78:

```python
# acc), comfortably fitting MI355X's register file. Override via env if
# a future compiler/kernel update relaxes the budget.
block_m = int(os.getenv("ATOM_TRITON_MOE_BLOCK_M", "64"))
block_n = int(os.getenv("ATOM_TRITON_MOE_BLOCK_N", "256"))
update_opt_flags_constraints({"block_m": block_m, "block_n": block_n})
```
Summary
Two CI failures surfaced after #703 unblocked the V4-Pro accuracy step.
1. Triton MoE SMEM OOM on AMD CDNA4 (triton 3.6+)
`triton_kernels.matmul_ogs_details.opt_flags_amd` has a CDNA4 special case:
```python
if get_cdna_version() == 4 and block_m == 128:
block_n = 512
```
This produces BLOCK_M*BLOCK_N = 64K FP32 accumulator entries. triton 3.6+/3.7+ spills the accumulator to LDS more aggressively than 3.5, exceeding MI355X's 160 KiB LDS budget (observed 269 KiB required vs 163 KiB hardware limit on V4-Pro FP8 MoE).
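Quick arithmetic on the accumulator alone shows the overflow (a lower bound only: the 269 KiB observed includes other LDS buffers on top of the accumulator):

```python
# FP32 accumulator footprint if fully spilled to LDS, for the two tile
# choices discussed here; other LDS buffers add on top of this.
def acc_bytes(block_m: int, block_n: int) -> int:
    return block_m * block_n * 4  # 4 bytes per FP32 entry

print(acc_bytes(128, 512) // 1024)  # 256 KiB -- already over the 160 KiB budget
print(acc_bytes(64, 256) // 1024)   # 64 KiB -- fits
```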
Fix: wrap matmul_ogs calls with a CDNA4-only context manager that pins `block_m=64` / `block_n=256` (BLOCK_M*BLOCK_N = 16K, fits comfortably). Tunable via `ATOM_TRITON_MOE_BLOCK_{M,N}` env vars. Other GPU families and triton ≤3.5 paths are unaffected.
2. `cp_gather_indexer_k_quant_cache` HIP "invalid configuration argument"
When the indexer prefill builder produces `cu_committed_cpu[-1] == 0` (fresh prefill with prompt shorter than the CSA `ratio`), the kernel grid:
```cpp
dim3((num_tokens + BLOCK_Y_SIZE - 1) / BLOCK_Y_SIZE, ...)
```
becomes `grid.x = 0` and the HIP launcher rejects it.
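The ceil-division behind that launch makes the failure mode easy to see (sketch; the `block_y_size` value is assumed for illustration):

```python
def grid_x(num_tokens: int, block_y_size: int = 32) -> int:
    # Same ceil-division as the kernel launch above.
    return (num_tokens + block_y_size - 1) // block_y_size

print(grid_x(0))  # 0 -> HIP rejects the launch
print(grid_x(1))  # 1 -> a valid grid once at least one token is gathered
```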
Fix: clamp `cu_committed_cpu[-1]` to ≥1 in `_build_v4_indexer_meta`. The dummy +1 row gets gathered from the last seq's first cache block but is never read downstream — `fp8_mqa_logits` and `top_k_per_row_prefill` honor per-token `cu_starts`/`cu_ends` derived from `cu_committed_gpu[:-1]` and `n_committed_per_seq`, which both remain 0. Output stays all -1 sentinels, matching the all-empty semantics.
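The clamp itself is a one-line host-side guard; a sketch on a plain Python list standing in for the host-synced cumsum tensor (helper name hypothetical):

```python
def clamp_committed(cu_committed_cpu: list) -> list:
    """Bump the final cumsum entry to 1 when nothing is committed, so the
    gather kernel launches with a nonzero grid; downstream consumers still
    see zero committed tokens per seq and ignore the dummy row."""
    if cu_committed_cpu[-1] == 0:
        cu_committed_cpu = cu_committed_cpu[:]
        cu_committed_cpu[-1] = 1
    return cu_committed_cpu

print(clamp_committed([0, 0, 0]))  # [0, 0, 1]
print(clamp_committed([0, 3, 5]))  # unchanged: [0, 3, 5]
```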
This is pure host-side scalar arithmetic on a value already host-synced (`int(cu_committed_cpu[-1])`); no new CG/torch.compile graph branch is introduced.
Local Verification
triton 3.7.0+amd.rocm7.1.0 (closest to CI's 3.6.0+rocm7.2.3, both reproduce the OOM):
- DeepSeek-V4-Pro server starts (no OOM)
- 1-token "Hi" curl returns successfully (was crashing pre-fix)
- GSM8K-50 fewshot=5 = 0.94 (matches pre-PR-703 baseline)
Test plan