
feat(amdgpu): wave-scope subgroupOr / subgroupAnd / subgroupBarrier#26

Open

lohiaj wants to merge 4 commits into amd-integration from feat/jlohia/amdgpu-subgroup-primitives
Conversation


@lohiaj lohiaj commented May 4, 2026

Summary

Adds wave-scope subgroup boolean reductions and barrier to the AMDGPU LLVM
codegen path. qd.simt.subgroup.reduce_or_i32(value) and reduce_and_i32(value)
now produce single-instruction-equivalent wave reductions on AMDGPU
(v_cmp_ne_u32 + s_cmp_lg_u64 / s_cmp_eq_u64 after LLVM lowering),
matching what the SPIRV codegen already emits via
OpGroupNonUniformBitwiseOr / OpGroupNonUniformBitwiseAnd.

qd.simt.subgroup.barrier() is also wired up on AMDGPU (was failing at
codegen with LLVMRuntime function subgroupBarrier not found).

Motivation

The Python frontend qd.simt.subgroup exposes wave-scope primitives
(reduce_or, reduce_and, barrier, etc.) that compile cleanly on Vulkan/
Metal via the SPIRV codegen but error at codegen time on the LLVM-runtime
archs (CPU/CUDA/AMDGPU) with LLVMRuntime function subgroupOr not found.
The error message is misleading — users assume the function doesn't exist
anywhere, not that it just isn't patched on their backend.

Without these primitives, AMDGPU users have to write LDS-based fallbacks
that cost ~75-100 cycles per iteration (1 LDS write + barrier + 64 LDS
reads with sequential OR-reduction) for what should be a 4-instruction
ballot. This PR closes the gap on AMDGPU for the boolean case (sufficient
for wave-vote convergence checks, the most common use case).

Implementation

Follows the existing AMDGPU intrinsic patches in this file (e.g.
block_barrier -> amdgcn_s_barrier, cuda_shfl_down_sync_i32 ->
amdgcn_ds_bpermute-based custom impl):

  1. C++ stubs in runtime.cpp (no-op bodies) so the LLVM linker resolves
    the symbol.
  2. patch_intrinsic in llvm_context.cpp (AMDGPU section) rewrites the
    stub body at codegen time to use:
    • amdgcn.icmp.i32 for the wave-wide ballot of arg != 0
    • amdgcn.ballot(true) to read the EXEC mask (for AND case)
    • icmp against zero (OR) or against EXEC (AND) for the boolean result
  3. Function is marked always_inline so the patcher's body becomes part
    of the calling kernel after linking.
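
The ballot arithmetic in step 2 can be modeled in plain Python (a semantic sketch only, not the actual LLVM codegen; `wave_reduce_or_i32` / `wave_reduce_and_i32` and the explicit `exec_mask` parameter are illustrative names):

```python
# Host-side model of the ballot-based wave reductions. The real
# implementation emits amdgcn.icmp / amdgcn.ballot calls in LLVM IR.

def wave_reduce_or_i32(lane_values, exec_mask, wave_size=64):
    """OR-reduce booleanised i32 values across the active lanes of one wave."""
    # amdgcn.icmp.i32(arg, 0, NE): bitmask of active lanes where arg != 0
    ballot = 0
    for lane in range(wave_size):
        if (exec_mask >> lane) & 1 and lane_values[lane] != 0:
            ballot |= 1 << lane
    # OR result: some lane voted true  <=>  ballot != 0
    return 1 if ballot != 0 else 0

def wave_reduce_and_i32(lane_values, exec_mask, wave_size=64):
    """AND-reduce booleanised i32 values across the active lanes of one wave."""
    ballot = 0
    for lane in range(wave_size):
        if (exec_mask >> lane) & 1 and lane_values[lane] != 0:
            ballot |= 1 << lane
    # AND result: every active lane voted true  <=>  ballot == EXEC
    # (amdgcn.ballot(true) reads the EXEC mask of currently active lanes)
    return 1 if ballot == exec_mask else 0
```

Comparing the ballot against the live EXEC mask, rather than an all-ones constant, is what keeps the AND case correct under partial-wave participation.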

Generated AMDGCN at the call site

```
v_cmp_ne_u32_e32 vcc, 0, v4 ; wave-wide ballot of (arg != 0)
s_cmp_lg_u64 vcc, 0 ; ballot != 0?
s_cselect_b64 s[0:1], -1, 0 ; predicate
v_cndmask_b32_e64 v4, 0, 1, s[0:1] ; zero-extend i1 -> i32
```

Files changed (204 lines, no deletions)

```
python/quadrants/lang/simt/subgroup.py | +22 Python frontend
quadrants/inc/internal_ops.inc.h | +7 PER_INTERNAL_OP registrations
quadrants/ir/type_system.cpp | +6 PLAIN_OP type signatures
quadrants/runtime/llvm/llvm_context.cpp | +58 AMDGPU patch_intrinsic
quadrants/runtime/llvm/runtime_module/runtime.cpp | +22 stub functions for linker
tests/python/test_simt.py | +89 3 new tests (single + multi-wave)
```

Test plan

`tests/python/test_simt.py` adds 3 AMDGPU-gated tests:

  1. `test_subgroup_reduce_or_i32_single_wave` — `block_dim=64`, one wave.
    • All zeros input → all outputs 0.
    • One non-zero lane → all outputs 1 (correct broadcast of OR result).
  2. `test_subgroup_reduce_and_i32_single_wave` — same setup.
    • All ones input → all outputs 1.
    • One zero lane → all outputs 0 (proves AND fires on a single dissent).
  3. `test_subgroup_reduce_or_i32_multi_wave` — `block_dim=128`, two waves.
    • Lane 64 (wave 1) is the only non-zero lane.
    • Wave 0 (lanes 0-63) outputs all zeros — proves the reduction is
      wave-scope, not block-scope.
    • Wave 1 (lanes 64-127) outputs all ones — proves the non-zero
      correctly propagates within wave 1.
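
The wave-scope expectation checked by test 3 can be sketched as a host-side oracle (illustrative helper, not the actual test code; assumes `wave_size=64` lanes per wave):

```python
def expected_outputs_reduce_or(block_values, wave_size=64):
    """Per-lane expected output of a wave-scope OR reduction:
    each contiguous group of wave_size lanes reduces independently,
    so a vote in one wave must not leak into a neighbouring wave."""
    out = []
    for wave_start in range(0, len(block_values), wave_size):
        wave = block_values[wave_start:wave_start + wave_size]
        wave_or = 1 if any(v != 0 for v in wave) else 0
        out.extend([wave_or] * len(wave))
    return out

# block_dim=128: lane 64 (first lane of wave 1) is the only non-zero lane
block = [0] * 128
block[64] = 1
expected = expected_outputs_reduce_or(block)
assert expected[:64] == [0] * 64    # wave 0 unaffected (wave-scope)
assert expected[64:] == [1] * 64    # wave 1 sees the vote
```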

All 3 tests `PASSED` on gfx942.

Targeted regression sweep on the test files most likely to catch
codegen / IR regressions (`test_simt.py` + `test_basics.py` +
`test_internal_func.py` + `test_shared_array_codegen_bug.py`):
39 passed, 21 skipped, 0 failed.

Compatibility / risk

  • Zero behavior change for any code that doesn't use `subgroupOr_i32`,
    `subgroupAnd_i32`, or `subgroupBarrier`. Stubs are added but unused;
    the patcher only runs when the function is referenced.
  • No deletions of existing code paths.
  • No SPIRV changes — the polymorphic `subgroupOr` / `subgroupAnd`
    used by the SPIRV path is untouched. New names are
    `subgroupOr_i32` / `subgroupAnd_i32` (type-suffixed, matching the
    `cuda_shfl_down_sync_i32` / `cuda_shfl_down_sync_f32` convention).

Limitations / follow-ups (acknowledged scope cuts)

  1. Boolean-only. i32 input is treated as 0 vs non-zero. Generalising
    to true bitwise OR/AND of arbitrary i32 values needs DPP-based tree
    reduction (~50 lines of additional codegen). This is sufficient for
    the most common use case (wave-vote convergence checks); generalising
    is a clean follow-up.
  2. AMDGPU-only. CUDA path falls back to no-op stubs. CUDA users
    already have `cuda_ballot_i32` for this purpose.
  3. OR / AND only. Add / Mul / Min / Max / Xor variants are
    straightforward extensions but not implemented here to keep PR scope
    small.
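
For reference, the DPP-based tree reduction mentioned in item 1 would follow the standard log2(wave_size) butterfly pattern; a host-side sketch of the data movement (Python model of the lane exchanges, not DPP codegen):

```python
def wave_bitwise_or_tree(lane_values):
    """Full bitwise OR of arbitrary i32 lane values via a log2-step
    butterfly: at each step every lane ORs in the value held by
    lane ^ offset -- the exchange pattern DPP row/bank swizzles
    implement in hardware."""
    n = len(lane_values)              # wave size, assumed a power of two
    vals = list(lane_values)
    offset = 1
    while offset < n:
        vals = [vals[i] | vals[i ^ offset] for i in range(n)]
        offset *= 2
    return vals                       # every lane holds the full reduction
```

Six exchange steps cover a 64-lane wave, which is where the "~50 lines of additional codegen" estimate comes from.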

Related

Adds wave-scope subgroup boolean reductions and barrier to the AMDGPU LLVM
codegen path. `qd.simt.subgroup.reduce_or_i32(value)` and `reduce_and_i32(value)`
now produce single-instruction-equivalent wave reductions on AMDGPU
(amdgcn.icmp + ballot + s_or_b64 / s_and_b64 with EXEC after LLVM lowering),
matching what the SPIRV codegen emits via OpGroupNonUniformBitwiseOr / And.

`qd.simt.subgroup.barrier()` is also wired up on AMDGPU (was failing at
codegen with `LLVMRuntime function subgroupBarrier not found`).

Implementation follows the existing AMDGPU patch_intrinsic pattern in
llvm_context.cpp (e.g. block_barrier -> amdgcn_s_barrier,
cuda_shfl_down_sync_i32 -> amdgcn_ds_bpermute custom impl):

  1. C++ stubs in runtime.cpp so the LLVM linker resolves the symbols
  2. AMDGPU patch_intrinsic rewrites the stub bodies at codegen time using
     amdgcn.icmp.i32 (ballot of arg != 0), amdgcn.ballot(true) (EXEC mask),
     and icmp against 0 (OR) or against EXEC (AND)
  3. Functions marked always_inline so the body inlines at the call site

Generated AMDGCN at the call site is the expected 4-instruction sequence:
  v_cmp_ne_u32_e32 vcc, 0, v4
  s_cmp_lg_u64 vcc, 0
  s_cselect_b64 s[0:1], -1, 0
  v_cndmask_b32_e64 v4, 0, 1, s[0:1]

v1 limitations (acknowledged):
  - Boolean-only (i32 input treated as 0 vs non-zero). Generalising to
    bitwise OR/AND of arbitrary i32 values needs DPP-based tree reduction
    (~50 lines, clean follow-up).
  - AMDGPU-only. CUDA path falls back to no-op stubs (CUDA users already
    have cuda_ballot_i32 for this purpose).
  - OR / AND only. Add / Mul / Min / Max / Xor variants are straightforward
    extensions but kept out of v1 to keep PR scope small.

Tests: 3 new AMDGPU-gated tests in tests/python/test_simt.py:
  - test_subgroup_reduce_or_i32_single_wave  (block_dim=64, 1 wave)
  - test_subgroup_reduce_and_i32_single_wave (block_dim=64, 1 wave)
  - test_subgroup_reduce_or_i32_multi_wave   (block_dim=128, 2 waves)
    explicitly verifies WAVE-SCOPE (not block-scope) semantics: a non-zero
    lane in wave 1 must NOT propagate to wave 0's outputs.

All 3 pass. Targeted regression sweep (test_simt + test_basics +
test_internal_func + test_shared_array_codegen_bug): 39 passed,
21 skipped, 0 failed.

Zero behavior change for code that doesn't use the new primitives.

lohiaj commented May 4, 2026

The 4 failing CI checks (Linters, Linux ubuntu-22.04 Build, pyright_linter / Pyright Linter, test_gpu / Start AMD GPU Runner) are pre-existing on amd-integration and unrelated to this PR.

Verified by checking PR #15 (the most recently merged PR into amd-integration, force_inline by @deepsek): the exact same 4 checks failed there too, and the PR was merged anyway.

Per-failure breakdown:

  1. Linters (black + clang-format) — black flagged tests/python/test_fn_attrs.py only ("1 file reformatted, 438 files left unchanged"); clang-format flagged quadrants/runtime/llvm/llvm_runtime_executor.cpp:723. Neither file is touched by this PR. My modified files (subgroup.py, test_simt.py, llvm_context.cpp, runtime.cpp, internal_ops.inc.h, type_system.cpp) all pass black 25.1.0 (-l 120), ruff 0.4.8, and pylint 3.3.4 with the project's exact .pre-commit-config.yaml settings.

  2. Linux ubuntu-22.04 Build — fails with RHI Error: Can not create Vulkan instance. Vulkan-on-CI infra issue, no code in this PR touches the RHI/Vulkan path.

  3. pyright_linter — flags torch private import errors in _func_base.py, _py_tensor.py, field.py, util.py. None of those files are in this PR's diff.

  4. test_gpu / Start AMD GPU Runner — failed in 3 seconds (runner didn't start). Infrastructure.

The functional GPU tests for this PR were validated locally on an MI300X / gfx942 box: all 3 new tests in test_simt.py PASS, and the targeted regression sweep (test_simt.py + test_basics.py + test_internal_func.py + test_shared_array_codegen_bug.py) reports 39 passed, 21 skipped, 0 failed.

Happy to fix the pre-existing test_fn_attrs.py black drift in a follow-up if it's helpful, but keeping this PR focused on the one feature.


yaoliu13 commented May 5, 2026

/run-ci


yaoliu13 commented May 5, 2026

/run-ci


yaoliu13 commented May 5, 2026

1405358.1 and 4746

@vecheruk-amd vecheruk-amd self-requested a review May 6, 2026 03:25

@vecheruk-amd vecheruk-amd left a comment


LGTM overall. AMDGPU intrinsic patching looks correct and the multi-wave test gives confidence in wave-scope semantics.
Minor nits: please add reduce_or_i32 / reduce_and_i32 to __all__ in subgroup.py, and add a short note that non-AMDGPU stubs currently return 0 (backend-limited behavior).


yaoliu13 commented May 6, 2026

/run-ci

s_barrier is the workgroup-scope barrier and would deadlock when
subgroupBarrier() is called from non-uniform control flow with
block_dim > wave_size -- non-participating waves never arrive at the
barrier (per Dipto's review). subgroupBarrier is meant to match
SPIR-V OpControlBarrier with Subgroup execution scope.

Switch the lowering to llvm.amdgcn.wave.barrier: a discardable
compiler + memory-ordering barrier scoped to the current wave. It
emits no hardware sync (the 64 lanes of an AMDGPU wave already
execute in lockstep) and carries IntrConvergent + IntrHasSideEffects
to prevent illegal CFG transforms and memory-op reordering.

Add two AMDGPU regression tests in tests/python/test_simt.py:
- test_subgroup_barrier_partial_participation: exercises the exact
  deadlock pattern from the review thread (block_dim=128, only
  lanes with `if i < 64:` call the barrier). With s_barrier this
  hung; with wave.barrier it completes.
- test_subgroup_barrier_in_uniform_path: read-after-write across
  the barrier, verifies compiler ordering is preserved.

Also addresses vecheruk-amd's review nits:
- Add reduce_or_i32 / reduce_and_i32 to subgroup.py __all__.
- Document that non-AMDGPU LLVM-runtime backends (CPU, CUDA today)
  fall through to the runtime stubs (return 0 / no-op) and that
  Vulkan/Metal users should keep using the polymorphic reduce_or
  / reduce_and.

Tests: AMDGPU subgroup tests 5/5 pass. Targeted regression sweep
(test_simt + test_basics + test_internal_func +
test_shared_array_codegen_bug) clean: 41 passed, 21 skipped,
0 failed.

lohiaj commented May 6, 2026

Pushed e46931bd7 addressing both rounds of review:

Dipto's s_barrier -> wave.barrier fix (the key one)

subgroupBarrier now lowers to llvm.amdgcn.wave.barrier instead of llvm.amdgcn.s.barrier in llvm_context.cpp. As Dipto called out in the internal thread, s_barrier is the workgroup-scope barrier and would have deadlocked on partial-wave participation - the canonical foot-gun being:

```
@qd.kernel
def k():
    qd.loop_config(block_dim=128)  # 2 waves
    for i in range(128):
        if i < 64:
            qd.simt.subgroup.barrier()  # only wave 0 arrives → wave 0 hangs forever
```

wave.barrier is the right call: per LLVM IntrinsicsAMDGPU.td, it's a discardable barrier carrying IntrConvergent + IntrHasSideEffects — compiler + memory-ordering only, emits no hardware sync, scoped to the current wave (the 64 lanes of an AMDGPU wave already execute in lockstep so there's nothing for the hardware to wait on). No partial-participation hazard.

Added two regression tests to tests/python/test_simt.py to prevent recurrence:

  • test_subgroup_barrier_partial_participation - runs the exact block_dim=128 / if i < 64: pattern from the thread. With s_barrier this hung; with wave.barrier it completes.
  • test_subgroup_barrier_in_uniform_path - read-after-write across the barrier in a uniform path, sanity-checks compiler ordering is preserved.

vecheruk-amd's review nits

  1. reduce_or_i32, reduce_and_i32 added to __all__ in subgroup.py.
  2. Module docstring + barrier() docstring + a corresponding comment block in runtime.cpp explaining backend availability - AMDGPU has the full patcher; CPU/CUDA LLVM-runtime backends currently fall through to the runtime stubs (return 0 / no-op) and that's intentional placeholder behaviour for follow-up; Vulkan/Metal users should keep using the polymorphic reduce_or / reduce_and.

Tests

  • AMDGPU subgroup tests: 5/5 PASS (test_subgroup_reduce_or_i32_single_wave, test_subgroup_reduce_and_i32_single_wave, test_subgroup_reduce_or_i32_multi_wave, test_subgroup_barrier_partial_participation, test_subgroup_barrier_in_uniform_path).
  • Targeted regression sweep (test_simt + test_basics + test_internal_func + test_shared_array_codegen_bug): 41 passed, 21 skipped, 0 failed (up from 39 in the prior round because of the 2 new barrier tests).

Diff summary: +134 / -16 across 4 files. Net PR is now +338 / -16 across 6 files.

Ready for re-review @vecheruk-amd @dipto-amd.

Pre-commit `clang-format` v19.1.7 wants
  patch_intrinsic("subgroupBarrier", llvm::Intrinsic::amdgcn_wave_barrier,
                  false);
instead of the wrap I had. Pure whitespace; semantically identical.

lohiaj commented May 6, 2026

Pushed 2c2300259: clang-format fix for the new patch_intrinsic("subgroupBarrier", …) call I added in e46931bd7. Two-line whitespace-only diff:

```
-      patch_intrinsic("subgroupBarrier",
-                      llvm::Intrinsic::amdgcn_wave_barrier, false);
+      patch_intrinsic("subgroupBarrier", llvm::Intrinsic::amdgcn_wave_barrier,
+                      false);
```

Verified locally with the project's pinned clang-format v19.1.7: --dry-run --Werror passes on lines 583–610 (the range I touched). Subgroup tests still 5/5 PASS. Broader regression sweep still 41 passed / 21 skipped / 0 failed.

About the remaining Linters failures on this CI run

The job will still flag clang-format / black violations on three files that this PR doesn't touch:

Same situation as the prior CI run on eaba867ac, and the same pattern that PR #15 was merged with (CI status failure, but the failing files are unrelated drift). Confirmed by diffing the file lists between the two Linters runs — my push has REDUCED the flagged-file count (the prior run on eaba867ac flagged 12 files; this run flags 3 — my clang-format fix cleared the rest).

Happy to push a chore: format pre-existing drift commit on top if it's preferred to keep the PR's CI fully green, but it would touch four files outside this PR's feature scope. Otherwise the same merge-despite-pre-existing-Linters-failure precedent from PR #15 should apply.

@deepsek deepsek self-requested a review May 6, 2026 17:22

@deepsek deepsek left a comment


lgtm after the change. Let's get a sign off from @diptorupd too.

@deepsek deepsek requested a review from diptorupd May 6, 2026 17:25

yaoliu13 commented May 6, 2026

/run-ci


@diptorupd diptorupd left a comment


LGTM aside from the nits about wavesize.


yaoliu13 commented May 7, 2026

/run-ci


yaoliu13 commented May 7, 2026

/run-ci

Per Dipto's review on PR #26: keep the implementation flexible enough
to be predicated on wave size so it is upstream-friendly (the
community cares about RDNA, all wave32) and forward-compatible with
MI450 (expected wave32 per AMD).

The wave-vote ballot intrinsics (`llvm.amdgcn.icmp` and
`llvm.amdgcn.ballot`) are overloaded by their return-type argument:
i64 on wave64 targets (CDNA1/2/3, gfx9xx) and i32 on wave32 targets
(RDNA, expected MI450). The previous lowering hard-coded i64, which
would either fail to lower or produce wrong code on a wave32 target.

Resolution at codegen time:

  unsigned get_amdgpu_wave_size(llvm::Function *F):
    1. If F has a `+wavefrontsize32` / `+wavefrontsize64` feature in
       its target-features attribute, honor that.
    2. Else fall back to the mcpu-based default: gfx9xx -> 64,
       gfx10xx+ -> 32.

Quadrants's AMDGPU codegen path currently sets target-features to ""
(see llvm_context_pass.h:88), so step 2 is what's actually used today.
The fallback list assumes `gfx9xx` is always wave64 -- true for every
shipped MI300/MI325/MI355 part. If MI450 ships under a `gfx9...` mcpu
prefix yet defaults to wave32, Quadrants's AMDGPU codegen path should
also be updated to set "+wavefrontsize32" explicitly so step 1 picks
it up; that's outside this PR's scope.

The mask type is then used uniformly:
  - llvm.amdgcn.icmp.{mask_ty}.i32(arg, 0, ICMP_NE)  ->  ballot
  - For OR: ballot != 0
  - For AND: ballot == amdgcn.ballot.{mask_ty}(true)  (active-lane mask)

No semantic change on wave64 hardware -- the i64 path is identical to
what landed in eaba867. The new branch is purely additive for
wave32.
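
The two-step resolution above can be modeled as plain decision logic (Python sketch; the real helper inspects the llvm::Function's target-features attribute and the subtarget mcpu):

```python
def get_amdgpu_wave_size(target_features: str, mcpu: str) -> int:
    """Model of the wave-size resolution: an explicit
    +wavefrontsize{32,64} feature wins; otherwise fall back
    to the mcpu-based default."""
    # Step 1: honor an explicit feature in target-features
    if "+wavefrontsize32" in target_features:
        return 32
    if "+wavefrontsize64" in target_features:
        return 64
    # Step 2: mcpu fallback -- gfx9xx defaults to wave64 (CDNA),
    # gfx10xx and later default to wave32 (RDNA)
    return 64 if mcpu.startswith("gfx9") else 32
```

With Quadrants currently passing an empty target-features string, every call today resolves through the mcpu fallback, which is why the gfx9xx assumption is called out above.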

Tests
  - All 5 AMDGPU subgroup tests pass on MI300X/wave64:
      test_subgroup_reduce_or_i32_single_wave
      test_subgroup_reduce_and_i32_single_wave
      test_subgroup_reduce_or_i32_multi_wave
      test_subgroup_barrier_partial_participation
      test_subgroup_barrier_in_uniform_path
  - All 5 Vulkan subgroup tests pass.
  - Targeted regression sweep (test_simt + test_basics +
    test_internal_func + test_shared_array_codegen_bug):
    41 passed, 21 skipped, 0 failed.
  - Wave32 path is correct-by-construction (symmetric to wave64) but
    untested locally -- no RDNA hardware in the bench setup.

lohiaj commented May 7, 2026

@dipto-amd thanks again for the wavesize-flexibility ask. Pushed 427919ea3 on top of the wave-barrier commit:

  • New get_amdgpu_wave_size(llvm::Function *F) helper: prefers the function's +wavefrontsize{32,64} feature in its target-features attribute, falls back to the mcpu-based default (gfx9xx -> 64, gfx10xx+ -> 32) when the attribute is empty.
  • subgroupOr_i32 / subgroupAnd_i32 lowering now picks the ballot mask type from that helper (i64Ty on wave64, i32Ty on wave32) and threads it through llvm.amdgcn.icmp.{mask_ty}.i32 and llvm.amdgcn.ballot.{mask_ty}. No behavior change on wave64.
  • subgroupBarrier doesn't need wavesize predication. llvm.amdgcn.wave.barrier is wave-scope by definition (no return type, discarded at codegen on every subtarget).

Forward-looking note: Quadrants's AMDGPU codegen path currently sets target-features="" (llvm_context_pass.h:88), so the mcpu fallback is what actually fires today. If MI450 ships under a gfx9... mcpu prefix but defaults to wave32, Quadrants's codegen path will need to set "+wavefrontsize32" explicitly so the feature-attribute branch picks it up. That'd be a separate small follow-up PR, out of scope here.

Validation

  • All 5 AMDGPU subgroup tests pass on MI300X (wave64): 3 reduce + 2 barrier (incl. test_subgroup_barrier_partial_participation which is the exact deadlock pattern from your review).
  • All 5 Vulkan subgroup tests pass.
  • Targeted regression sweep (test_simt + test_basics + test_internal_func + test_shared_array_codegen_bug): 41 passed, 21 skipped, 0 failed.
  • Wave32 path is correct-by-construction (symmetric to wave64) but I have no RDNA hardware locally to exercise it.

Ready for another look whenever you have a moment.


yaoliu13 commented May 7, 2026

/run-ci
