Perf/mselehov/dpp subwave reduce v2#27
michaelselehov wants to merge 7 commits into amd-integration
Expose the AMDGPU v_mov_b32_dpp quad_perm:[1,0,3,2] instruction as a new subgroup intrinsic. This enables sub-wave parallelism patterns where adjacent lanes exchange and reduce values without going through LDS, cutting the reduction to a single VALU-pipe cycle.
- Register the op in internal_ops.inc.h and type_system.cpp
- AMDGPU codegen: emit llvm.amdgcn.update.dpp with ctrl=0xB1 (32-bit native, 64-bit via lo/hi split)
- CPU/base codegen: guard with QD_ERROR + actionable message
- Python binding: qd.simt.subgroup.dpp_swap_pairs(value)
Assisted-by: Claude Opus
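To make the lane exchange concrete, here is a minimal pure-Python emulation of what the quad permutation does to per-lane values, and of how the selector [1,0,3,2] packs into the control byte 0xB1 (two selector bits per lane of the quad). The helper names `quad_perm_ctrl`, `swap_pairs`, `swap_pairs_64`, and `pair_sum` are hypothetical and exist only for this illustration; this is not the Quadrants API or the codegen in this PR, just a sketch of the semantics described above.

```python
def quad_perm_ctrl(perm):
    """Pack a 4-entry quad permutation into an 8-bit DPP control value."""
    assert len(perm) == 4 and all(0 <= p < 4 for p in perm)
    ctrl = 0
    for lane, src in enumerate(perm):
        ctrl |= src << (2 * lane)  # two selector bits per lane in the quad
    return ctrl


def swap_pairs(lanes, perm=(1, 0, 3, 2)):
    """Emulate the permutation: lane i of each quad reads lane perm[i]."""
    assert len(lanes) % 4 == 0
    out = []
    for base in range(0, len(lanes), 4):
        out.extend(lanes[base + src] for src in perm)
    return out


def swap_pairs_64(lanes):
    """Emulate a lo/hi split: permute each 32-bit half, then rejoin."""
    mask = 0xFFFFFFFF
    lo = swap_pairs([v & mask for v in lanes])
    hi = swap_pairs([(v >> 32) & mask for v in lanes])
    return [(h << 32) | l for h, l in zip(hi, lo)]


def pair_sum(lanes):
    """One reduction step: each lane adds its neighbour's value to its own."""
    return [own + other for own, other in zip(lanes, swap_pairs(lanes))]


print(hex(quad_perm_ctrl((1, 0, 3, 2))))
print(swap_pairs(list(range(8))))
print(pair_sum([1, 2, 3, 4]))
```

After `pair_sum`, both lanes of each adjacent pair hold the pair's total, which is the exchange-and-reduce pattern the commit message refers to.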
Desugar 3-argument range() into a while-loop at the AST level.
The Quadrants IR does not natively support a step parameter in
range-for, so range(start, stop, step) is lowered to:

    i = start
    while i < stop:
        <body>
        i += step

This eliminates the need for manual while-loop workarounds when
writing strided iteration patterns (e.g. sub-wave parallelism).
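The lowering above can be checked against Python's own range() with a small emulation. The function `strided_indices` is a hypothetical name used only here; it mirrors the while-loop shape from the commit message in plain Python and is not the actual AST transform.

```python
def strided_indices(start, stop, step):
    """Collect the loop indices produced by the while-loop lowering."""
    indices = []
    i = start
    while i < stop:        # same condition as the lowered loop
        indices.append(i)  # stands in for <body>
        i += step
    return indices


# For a positive step the lowering visits the same indices as range().
print(strided_indices(0, 10, 3))
print(list(range(0, 10, 3)))
```

Note that this emulation only matches range() for a positive step: because the condition is `i < stop`, a negative step yields no iterations here, and a zero step would never terminate. Whether the real implementation handles those cases differently is not stated in the commit message.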
test_range_for_three_arguments now verifies correct strided iteration instead of expecting QuadrantsCompilationError. test_exception_in_node_with_body uses range() (0 args) as the invalid construct instead of range(1, 2, 3) which is now valid.
This patch reduces the maximum number of thread blocks launched per CU from 32 to 8. The result is fewer thread blocks that have no work to do. In testing, I see a 7% improvement on average on my local system.
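As a rough illustration of why a lower cap can leave fewer idle thread blocks, the sketch below assumes the grid is sized as the CU count times the per-CU cap, and that each block covers a fixed number of work items. The function name `idle_blocks`, the CU count, the block size, and the workload are all made-up values for the example, not measurements or code from this patch.

```python
def idle_blocks(work_items, block_size, num_cus, max_blocks_per_cu):
    """Blocks launched beyond what the workload needs (assumed grid sizing)."""
    launched = num_cus * max_blocks_per_cu   # assumption: grid = CUs * cap
    needed = -(-work_items // block_size)    # integer ceiling division
    return max(0, launched - needed)


# Hypothetical device with 100 CUs, 256 threads per block, 256000 work items.
print(idle_blocks(256_000, 256, num_cus=100, max_blocks_per_cu=32))
print(idle_blocks(256_000, 256, num_cus=100, max_blocks_per_cu=8))
```

Under these assumptions the cap of 32 launches far more blocks than the workload needs, while the cap of 8 launches none in excess; real behaviour depends on the kernel and the launch logic.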
lfmeadow
left a comment
I see several whitespace-only changes, are they necessary?
Why the build_strided_range_for addition?
Why the removal of the import?
In general I think it would be better to not modify the existing code even if you don't like some of it.
I've double-checked with Cursor, all the whitespace-only changes were to satisfy the linter.
It's implementation of
Also linter. It was complaining that the import is never used.
See above, we were fixing linter complaints in our changed files.
/run-ci
All the issues are pre-existing (coming from amd-integration), bar the Windows one, which is due to the broken toolset.
@michaelselehov can you please rebase on top of I have also disabled the GitHub workflows. The CI run @yaoliu13 triggered is our internal AMDGPU CI.