feat(amdgpu): wave-scope subgroupOr / subgroupAnd / subgroupBarrier #26
lohiaj wants to merge 4 commits into amd-integration
Conversation
Adds wave-scope subgroup boolean reductions and barrier to the AMDGPU LLVM
codegen path. `qd.simt.subgroup.reduce_or_i32(value)` and `reduce_and_i32(value)`
now produce single-instruction-equivalent wave reductions on AMDGPU
(amdgcn.icmp + ballot + s_or_b64 / s_and_b64 with EXEC after LLVM lowering),
matching what the SPIRV codegen emits via OpGroupNonUniformBitwiseOr / And.
`qd.simt.subgroup.barrier()` is also wired up on AMDGPU (was failing at
codegen with `LLVMRuntime function subgroupBarrier not found`).
Implementation follows the existing AMDGPU patch_intrinsic pattern in
llvm_context.cpp (e.g. block_barrier -> amdgcn_s_barrier,
cuda_shfl_down_sync_i32 -> amdgcn_ds_bpermute custom impl):
1. C++ stubs in runtime.cpp so the LLVM linker resolves the symbols
2. AMDGPU patch_intrinsic rewrites the stub bodies at codegen time using
amdgcn.icmp.i32 (ballot of arg != 0), amdgcn.ballot(true) (EXEC mask),
and icmp against 0 (OR) or against EXEC (AND)
3. Functions marked always_inline so the body inlines at the call site
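The ballot arithmetic in step 2 can be modelled in plain Python (illustrative helpers only, not the actual LLVM lowering; lane maps stand in for the hardware's EXEC and ballot masks):

```python
# Semantic model of the step-2 lowering. `lane_values` maps each ACTIVE
# lane index to its i32 argument; inactive lanes are simply absent.

def ballot(lane_values):
    """Model of amdgcn.icmp(arg, 0, NE): bit i set iff lane i's arg != 0."""
    mask = 0
    for lane, value in lane_values.items():
        if value != 0:
            mask |= 1 << lane
    return mask

def exec_mask(lane_values):
    """Model of amdgcn.ballot(true): bit i set iff lane i is active."""
    mask = 0
    for lane in lane_values:
        mask |= 1 << lane
    return mask

def subgroup_or_i32(lane_values):
    # OR: some active lane is non-zero  <=>  ballot != 0
    return 1 if ballot(lane_values) != 0 else 0

def subgroup_and_i32(lane_values):
    # AND: every active lane is non-zero  <=>  ballot == EXEC
    return 1 if ballot(lane_values) == exec_mask(lane_values) else 0

# Wave with lanes 0..3 active; lane 2 holds a zero:
wave = {0: 7, 1: 1, 2: 0, 3: 5}
print(subgroup_or_i32(wave))   # 1 (some lane non-zero)
print(subgroup_and_i32(wave))  # 0 (lane 2 is zero)
```

The AND-vs-EXEC comparison is the key detail: comparing the ballot against a constant all-ones mask would give wrong answers under non-uniform control flow, where only a subset of lanes is active.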
Generated AMDGCN at the call site is the expected 4-instruction sequence:
```
v_cmp_ne_u32_e32 vcc, 0, v4
s_cmp_lg_u64 vcc, 0
s_cselect_b64 s[0:1], -1, 0
v_cndmask_b32_e64 v4, 0, 1, s[0:1]
```
v1 limitations (acknowledged):
- Boolean-only (i32 input treated as 0 vs non-zero). Generalising to
bitwise OR/AND of arbitrary i32 values needs DPP-based tree reduction
(~50 lines, clean follow-up).
- AMDGPU-only. CUDA path falls back to no-op stubs (CUDA users already
have cuda_ballot_i32 for this purpose).
- OR / AND only. Add / Mul / Min / Max / Xor variants are straightforward
extensions but kept out of v1 to keep PR scope small.
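For reference, the DPP-based tree reduction mentioned as a follow-up can be sketched in plain Python (`shfl_down` is a stand-in for a DPP/permute lane shift; everything here is illustrative, not the PR's code):

```python
# Illustrative model of a shuffle-based tree reduction for bitwise OR --
# the generalisation needed to go beyond boolean semantics.
WAVE_SIZE = 64

def shfl_down(vals, delta):
    # Lane i reads lane i + delta; out-of-range lanes keep their own value.
    return [vals[i + delta] if i + delta < WAVE_SIZE else vals[i]
            for i in range(WAVE_SIZE)]

def wave_reduce_or(vals):
    assert len(vals) == WAVE_SIZE
    # log2(64) = 6 halving steps; afterwards lane 0 holds the OR of all lanes.
    for delta in (32, 16, 8, 4, 2, 1):
        shifted = shfl_down(vals, delta)
        vals = [a | b for a, b in zip(vals, shifted)]
    return vals[0]

print(wave_reduce_or([1 << (i % 8) for i in range(64)]))  # 255
```

Six shift-and-combine steps replace the single ballot compare, which is why the boolean-only ballot path is worth keeping as the fast case.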
Tests: 3 new AMDGPU-gated tests in tests/python/test_simt.py:
- test_subgroup_reduce_or_i32_single_wave (block_dim=64, 1 wave)
- test_subgroup_reduce_and_i32_single_wave (block_dim=64, 1 wave)
- test_subgroup_reduce_or_i32_multi_wave (block_dim=128, 2 waves)
explicitly verifies WAVE-SCOPE (not block-scope) semantics: a non-zero
lane in wave 1 must NOT propagate to wave 0's outputs.
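The wave-scope property the multi-wave test checks can be modelled in plain Python (the helper below is illustrative, not the PR's test code):

```python
# Plain-Python model of the property under test: reductions are per-wave,
# so a non-zero lane in wave 1 must not leak into wave 0's result.
WAVE_SIZE = 64

def per_wave_reduce_or(block_values):
    """Wave-scope reduce_or across a whole workgroup (model only)."""
    out = []
    for base in range(0, len(block_values), WAVE_SIZE):
        wave = block_values[base:base + WAVE_SIZE]
        r = 1 if any(v != 0 for v in wave) else 0
        out.extend([r] * len(wave))
    return out

# block_dim=128, single non-zero lane at index 70 (i.e. in wave 1):
block = [0] * 128
block[70] = 1
res = per_wave_reduce_or(block)
print(res[0], res[64])  # 0 1 -- wave 0 unaffected, wave 1 sees the bit
```

A block-scope reduction would instead return 1 for every lane, which is exactly the failure mode the multi-wave test is designed to catch.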
All 3 pass. Targeted regression sweep (test_simt + test_basics +
test_internal_func + test_shared_array_codegen_bug): 39 passed,
21 skipped, 0 failed.
Zero behavior change for code that doesn't use the new primitives.
The 4 failing CI checks are pre-existing. Verified by checking PR #15 (the most recently merged PR into amd-integration).
The functional GPU tests for this PR were validated locally on an MI300X / gfx942 box: all 3 new tests in `tests/python/test_simt.py` pass. Happy to fix the pre-existing failures as well.
/run-ci

/run-ci
LGTM overall. AMDGPU intrinsic patching looks correct and the multi-wave test gives confidence in wave-scope semantics.
Minor nits: please add reduce_or_i32 / reduce_and_i32 to __all__ in subgroup.py, and add a short note that non-AMDGPU stubs currently return 0 (backend-limited behavior).
/run-ci
s_barrier is the workgroup-scope barrier and would deadlock when subgroupBarrier() is called from non-uniform control flow with block_dim > wave_size -- non-participating waves never arrive at the barrier (per Dipto's review). subgroupBarrier is meant to match SPIR-V OpControlBarrier with Subgroup execution scope.

Switch the lowering to llvm.amdgcn.wave.barrier: a discardable compiler + memory-ordering barrier scoped to the current wave. It emits no hardware sync (the 64 lanes of an AMDGPU wave already execute in lockstep) and carries IntrConvergent + IntrHasSideEffects to prevent illegal CFG transforms and memory-op reordering.

Add two AMDGPU regression tests in tests/python/test_simt.py:
- test_subgroup_barrier_partial_participation: exercises the exact deadlock pattern from the review thread (block_dim=128, only lanes with `if i < 64:` call the barrier). With s_barrier this hung; with wave.barrier it completes.
- test_subgroup_barrier_in_uniform_path: read-after-write across the barrier, verifies compiler ordering is preserved.

Also addresses vecheruk-amd's review nits:
- Add reduce_or_i32 / reduce_and_i32 to subgroup.py __all__.
- Document that non-AMDGPU LLVM-runtime backends (CPU, CUDA today) fall through to the runtime stubs (return 0 / no-op) and that Vulkan/Metal users should keep using the polymorphic reduce_or / reduce_and.

Tests: AMDGPU subgroup tests 5/5 pass. Targeted regression sweep (test_simt + test_basics + test_internal_func + test_shared_array_codegen_bug) clean: 41 passed, 21 skipped, 0 failed.
Pushed Dipto's requested fix: subgroupBarrier now lowers to llvm.amdgcn.wave.barrier instead of s_barrier. The old lowering deadlocked on this pattern:
```
@qd.kernel
def k():
    qd.loop_config(block_dim=128)  # 2 waves
    for i in range(128):
        if i < 64:
            qd.simt.subgroup.barrier()  # only wave 0 arrives -> wave 0 hangs forever
```
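Why only the workgroup-scope barrier hangs on this pattern can be modelled in a few lines of plain Python (illustrative only; `WAVE_SIZE` and the helper names are assumptions of this sketch, not runtime code):

```python
# Toy model: which waves reach the barrier under a lane predicate.
WAVE_SIZE = 64

def waves_arriving(block_dim, predicate):
    """Set of wave indices in which at least one lane calls the barrier."""
    return {i // WAVE_SIZE for i in range(block_dim) if predicate(i)}

def workgroup_barrier_deadlocks(block_dim, predicate):
    # s_barrier semantics: ALL waves in the workgroup must rendezvous.
    arriving = waves_arriving(block_dim, predicate)
    all_waves = set(range(block_dim // WAVE_SIZE))
    return len(arriving) > 0 and arriving != all_waves

# block_dim=128, only lanes i < 64 call the barrier (wave 0 only):
print(workgroup_barrier_deadlocks(128, lambda i: i < 64))  # True  -> hang
print(workgroup_barrier_deadlocks(128, lambda i: True))    # False -> ok
```

wave.barrier has no cross-wave rendezvous at all, so the first case simply cannot deadlock under the new lowering.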
Added two regression tests to `tests/python/test_simt.py` (details in the commit message above) and addressed vecheruk-amd's review nits. Ready for re-review @vecheruk-amd @dipto-amd.
Pre-commit `clang-format` v19.1.7 wants
```
patch_intrinsic("subgroupBarrier", llvm::Intrinsic::amdgcn_wave_barrier,
                false);
```
instead of the wrap I had. Pure whitespace; semantically identical.
Pushed the formatting fix:
```
- patch_intrinsic("subgroupBarrier",
-                 llvm::Intrinsic::amdgcn_wave_barrier, false);
+ patch_intrinsic("subgroupBarrier", llvm::Intrinsic::amdgcn_wave_barrier,
+                 false);
```
Verified locally with the project's pinned clang-format (v19.1.7). The remaining CI failures are the pre-existing ones discussed above.
deepsek left a comment:

lgtm after the change. Let's get a sign-off from @diptorupd too.
/run-ci
diptorupd left a comment:

LGTM aside from the nits about wavesize.
/run-ci

/run-ci
Per Dipto's review on PR #26: keep the implementation flexible enough to be predicated on wave size so it is upstream-friendly (the community cares about RDNA, all wave32) and forward-compatible with MI450 (expected wave32 per AMD).

The wave-vote ballot intrinsics (`llvm.amdgcn.icmp` and `llvm.amdgcn.ballot`) are overloaded by their return-type argument: i64 on wave64 targets (CDNA1/2/3, gfx9xx) and i32 on wave32 targets (RDNA, expected MI450). The previous lowering hard-coded i64, which would either fail to lower or produce wrong code on a wave32 target.

Resolution at codegen time, via `unsigned get_amdgpu_wave_size(llvm::Function *F)`:
1. If F has a `+wavefrontsize32` / `+wavefrontsize64` feature in its target-features attribute, honor that.
2. Else fall back to the mcpu-based default: gfx9xx -> 64, gfx10xx+ -> 32.

Quadrants's AMDGPU codegen path currently sets target-features to "" (see llvm_context_pass.h:88), so step 2 is what's actually used today. The fallback list assumes gfx9xx is always wave64 -- true for every shipped MI300/MI325/MI355 part. If MI450 ships under a `gfx9...` mcpu prefix yet defaults to wave32, Quadrants's AMDGPU codegen path should also be updated to set "+wavefrontsize32" explicitly so step 1 picks it up; that's outside this PR's scope.

The mask type is then used uniformly:
- `llvm.amdgcn.icmp.{mask_ty}.i32(arg, 0, ICMP_NE)` -> ballot
- For OR: ballot != 0
- For AND: ballot == `amdgcn.ballot.{mask_ty}(true)` (active-lane mask)

No semantic change on wave64 hardware -- the i64 path is identical to what landed in eaba867. The new branch is purely additive for wave32.

Tests:
- All 5 AMDGPU subgroup tests pass on MI300X/wave64: test_subgroup_reduce_or_i32_single_wave, test_subgroup_reduce_and_i32_single_wave, test_subgroup_reduce_or_i32_multi_wave, test_subgroup_barrier_partial_participation, test_subgroup_barrier_in_uniform_path.
- All 5 Vulkan subgroup tests pass.
- Targeted regression sweep (test_simt + test_basics + test_internal_func + test_shared_array_codegen_bug): 41 passed, 21 skipped, 0 failed.
- Wave32 path is correct-by-construction (symmetric to wave64) but untested locally -- no RDNA hardware in the bench setup.
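The two-step resolution order can be modelled in plain Python (the real helper is C++ and inspects `llvm::Function` attributes; this sketch only mirrors the logic described above):

```python
# Plain-Python model of get_amdgpu_wave_size's resolution order.

def get_amdgpu_wave_size(target_features: str, mcpu: str) -> int:
    # Step 1: an explicit wavefront-size target feature wins.
    if "+wavefrontsize32" in target_features:
        return 32
    if "+wavefrontsize64" in target_features:
        return 64
    # Step 2: mcpu-based default: gfx9xx -> wave64, gfx10xx+ -> wave32.
    if mcpu.startswith("gfx9"):
        return 64
    return 32

print(get_amdgpu_wave_size("", "gfx942"))                  # 64 (MI300X)
print(get_amdgpu_wave_size("", "gfx1100"))                 # 32 (RDNA3)
print(get_amdgpu_wave_size("+wavefrontsize32", "gfx942"))  # 32 (explicit override)
```

The last call illustrates the MI450 escape hatch discussed above: an explicit "+wavefrontsize32" in target-features overrides the gfx9-prefix default.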
@dipto-amd thanks again for the wavesize-flexibility ask. Pushed the wave-size-aware lowering described in the commit message above.

Forward-looking note: Quadrants's AMDGPU codegen path currently sets target-features to "", so the mcpu-based fallback (step 2) is what actually runs today.

Validation details are in the commit message; ready for another look whenever you have a moment.
/run-ci
Summary
Adds wave-scope subgroup boolean reductions and barrier to the AMDGPU LLVM
codegen path.
`qd.simt.subgroup.reduce_or_i32(value)` and `reduce_and_i32(value)` now produce single-instruction-equivalent wave reductions on AMDGPU (`v_cmp_ne_u32` + `s_cmp_lg_u64` / `s_cmp_eq_u64` after LLVM lowering), matching what the SPIRV codegen already emits via `OpGroupNonUniformBitwiseOr` / `OpGroupNonUniformBitwiseAnd`.

`qd.simt.subgroup.barrier()` is also wired up on AMDGPU (was failing at codegen with `LLVMRuntime function subgroupBarrier not found`).

Motivation
The Python frontend `qd.simt.subgroup` exposes wave-scope primitives (`reduce_or`, `reduce_and`, `barrier`, etc.) that compile cleanly on Vulkan/Metal via the SPIRV codegen but error at codegen time on the LLVM-runtime archs (CPU/CUDA/AMDGPU) with `LLVMRuntime function subgroupOr not found`.

The error message is misleading -- users assume the function doesn't exist anywhere, not that it just isn't patched on their backend.
Without these primitives, AMDGPU users have to write LDS-based fallbacks
that cost ~75-100 cycles per iteration (1 LDS write + barrier + 64 LDS
reads with sequential OR-reduction) for what should be a 4-instruction
ballot. This PR closes the gap on AMDGPU for the boolean case (sufficient
for wave-vote convergence checks, the most common use case).
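For concreteness, the LDS fallback pattern being replaced looks roughly like this plain-Python model (LDS and the block barrier are simulated; helper names are illustrative, not project code):

```python
# Model of the LDS-based OR fallback: each lane writes its flag to shared
# memory, a block barrier orders the writes, then every lane re-reads all
# 64 slots and ORs them sequentially.
WAVE_SIZE = 64

def lds_fallback_or(lane_values):
    lds = [0] * WAVE_SIZE
    for lane, v in enumerate(lane_values):   # 1 LDS write per lane
        lds[lane] = v
    # (block barrier here in the real kernel)
    result = 0
    for slot in lds:                         # 64 LDS reads, sequential OR
        result |= 1 if slot != 0 else 0
    return result

print(lds_fallback_or([0] * 63 + [9]))  # 1
```

The ballot lowering collapses the write/barrier/64-read sequence into a single wave-wide compare, which is where the cycle savings quoted above come from.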
Implementation
Follows the existing AMDGPU intrinsic patches in this file (e.g. `block_barrier` -> `amdgcn_s_barrier`, `cuda_shfl_down_sync_i32` -> `amdgcn_ds_bpermute`-based custom impl):
1. C++ stubs in `runtime.cpp` (no-op bodies) so the LLVM linker resolves the symbol.
2. `patch_intrinsic` in `llvm_context.cpp` (AMDGPU section) rewrites the stub body at codegen time to use: `amdgcn.icmp.i32` for the wave-wide ballot of `arg != 0`; `amdgcn.ballot(true)` to read the EXEC mask (for the AND case); `icmp` against zero (OR) or against EXEC (AND) for the boolean result.
3. Functions are marked `always_inline` so the patched body becomes part of the calling kernel after linking.
Generated AMDGCN at the call site
```
v_cmp_ne_u32_e32 vcc, 0, v4 ; wave-wide ballot of (arg != 0)
s_cmp_lg_u64 vcc, 0 ; ballot != 0?
s_cselect_b64 s[0:1], -1, 0 ; predicate
v_cndmask_b32_e64 v4, 0, 1, s[0:1] ; zero-extend i1 -> i32
```
Files changed (204 lines, no deletions)
```
python/quadrants/lang/simt/subgroup.py | +22 Python frontend
quadrants/inc/internal_ops.inc.h | +7 PER_INTERNAL_OP registrations
quadrants/ir/type_system.cpp | +6 PLAIN_OP type signatures
quadrants/runtime/llvm/llvm_context.cpp | +58 AMDGPU patch_intrinsic
quadrants/runtime/llvm/runtime_module/runtime.cpp | +22 stub functions for linker
tests/python/test_simt.py | +89 3 new tests (single + multi-wave)
```
Test plan
`tests/python/test_simt.py` adds 3 AMDGPU-gated tests:
- `test_subgroup_reduce_or_i32_single_wave` (block_dim=64, 1 wave)
- `test_subgroup_reduce_and_i32_single_wave` (block_dim=64, 1 wave)
- `test_subgroup_reduce_or_i32_multi_wave` (block_dim=128, 2 waves): verifies wave-scope, not block-scope, semantics, and that a non-zero lane correctly propagates within wave 1.

All 3 tests `PASSED` on gfx942.
Targeted regression sweep on the test files most likely to catch
codegen / IR regressions (`test_simt.py` + `test_basics.py` +
`test_internal_func.py` + `test_shared_array_codegen_bug.py`):
39 passed, 21 skipped, 0 failed.
Compatibility / risk
- Zero behavior change for code that doesn't call `subgroupOr_i32`, `subgroupAnd_i32`, or `subgroupBarrier`. Stubs are added but unused; the patcher only runs when the function is referenced.
- The existing naming used by the SPIRV path is untouched. New names are `subgroupOr_i32` / `subgroupAnd_i32` (type-suffixed, matching the `cuda_shfl_down_sync_i32` / `cuda_shfl_down_sync_f32` convention).
Limitations / follow-ups (acknowledged scope cuts)
- Boolean-only (i32 input treated as 0 vs non-zero). Generalising to true bitwise OR/AND of arbitrary i32 values needs DPP-based tree reduction (~50 lines of additional codegen). This is sufficient for the most common use case (wave-vote convergence checks); generalising is a clean follow-up.
- AMDGPU-only. The CUDA path falls back to no-op stubs; CUDA users already have `cuda_ballot_i32` for this purpose.
- OR / AND only. Add / Mul / Min / Max / Xor variants are straightforward extensions but not implemented here to keep PR scope small.
Related
- Prior AMDGPU intrinsic work (e.g. `block_barrier`, `cuda_shfl_down_sync_i32`) landed via the same `patch_intrinsic` mechanism. Same pattern, same file organisation.
- This PR adds AMDGPU features without touching CUDA.