Perf/mselehov/dpp subwave reduce v2 by michaelselehov · Pull Request #27 · ROCm/quadrants

michaelselehov · 2026-05-06T09:45:42Z

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

Expose the AMDGPU v_mov_b32_dpp quad_perm:[1,0,3,2] instruction as a new subgroup intrinsic. This enables sub-wave parallelism patterns where adjacent lanes exchange and reduce values without going through LDS, cutting the reduction to a single VALU-pipe cycle.

- Register the op in internal_ops.inc.h and type_system.cpp
- AMDGPU codegen: emit llvm.amdgcn.update.dpp with ctrl=0xB1 (32-bit native, 64-bit via lo/hi split)
- CPU/base codegen: guard with QD_ERROR + actionable message
- Python binding: qd.simt.subgroup.dpp_swap_pairs(value)

Assisted-by: Claude Opus
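For readers unfamiliar with DPP, here is a minimal host-side sketch in plain Python of what the quad_perm:[1,0,3,2] selector does to a wavefront's lanes. It is an emulation for illustration only, not the codegen in this PR: the helper names (quad_perm_ctrl, dpp_quad_perm, dpp_swap_pairs_u64, pair_sums) are hypothetical, and the lane count is assumed to be a multiple of four. It also shows why the control value is 0xB1 (two bits per selector entry) and how a 64-bit value can be exchanged as two 32-bit halves.

```python
# Hypothetical emulation of quad_perm:[1,0,3,2]; none of these names come from the PR.
QUAD_PERM = (1, 0, 3, 2)


def quad_perm_ctrl(perm):
    """Pack a 4-entry quad_perm selector into the 8-bit DPP control value.

    Each entry takes 2 bits, with entry 0 in the lowest bits.
    """
    ctrl = 0
    for slot, src in enumerate(perm):
        ctrl |= (src & 0x3) << (2 * slot)
    return ctrl


def dpp_quad_perm(lanes, perm=QUAD_PERM):
    """Lane i reads the value held by lane perm[i % 4] within its own group of four.

    Assumes len(lanes) is a multiple of 4.
    """
    return [lanes[(i & ~3) | perm[i & 3]] for i in range(len(lanes))]


def dpp_swap_pairs_u64(lanes):
    """Exchange 64-bit values by permuting the low and high 32-bit halves separately."""
    lo = dpp_quad_perm([v & 0xFFFFFFFF for v in lanes])
    hi = dpp_quad_perm([(v >> 32) & 0xFFFFFFFF for v in lanes])
    return [(h << 32) | l for l, h in zip(lo, hi)]


def pair_sums(lanes):
    """One exchange plus one add leaves each lane holding the sum of its pair."""
    swapped = dpp_quad_perm(lanes)
    return [own + other for own, other in zip(lanes, swapped)]


if __name__ == "__main__":
    print(hex(quad_perm_ctrl(QUAD_PERM)))
    print(dpp_quad_perm(list(range(8))))
    print(pair_sums([1, 2, 3, 4]))
```

With this selector, each lane reads from its neighbour (lane index XOR 1), which is why a single exchange is enough for a two-lane reduction.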

Desugar 3-argument range() into a while-loop at the AST level. The Quadrants IR does not natively support a step parameter in range-for, so range(start, stop, step) is lowered to:

    i = start
    while i < stop:
        <body>
        i += step

This eliminates the need for manual while-loop workarounds when writing strided iteration patterns (e.g. sub-wave parallelism).
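The lowering described above can be sketched with Python's standard ast module. This is an illustrative stand-in, not the PR's build_strided_range_for: the class name StridedRangeDesugar and the helper run_desugared are hypothetical, and the sketch assumes a positive step, a simple loop variable name, and a body without continue (which would skip the increment in this naive form).

```python
import ast


class StridedRangeDesugar(ast.NodeTransformer):
    """Rewrite `for i in range(start, stop, step)` into an assignment plus a while-loop."""

    def visit_For(self, node):
        self.generic_visit(node)  # handle nested loops first
        it = node.iter
        is_range3 = (
            isinstance(it, ast.Call)
            and isinstance(it.func, ast.Name)
            and it.func.id == "range"
            and len(it.args) == 3
            and not it.keywords
        )
        if not is_range3 or not isinstance(node.target, ast.Name):
            return node  # leave every other for-loop untouched
        start, stop, step = it.args
        name = node.target.id
        # i = start
        init = ast.Assign(targets=[ast.Name(id=name, ctx=ast.Store())], value=start)
        # while i < stop:
        test = ast.Compare(
            left=ast.Name(id=name, ctx=ast.Load()), ops=[ast.Lt()], comparators=[stop]
        )
        # i += step, appended after the original body
        incr = ast.AugAssign(
            target=ast.Name(id=name, ctx=ast.Store()), op=ast.Add(), value=step
        )
        loop = ast.While(test=test, body=node.body + [incr], orelse=[])
        return [init, loop]


def run_desugared(source):
    """Desugar, compile, and execute `source`; return the tree and resulting namespace."""
    tree = StridedRangeDesugar().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    namespace = {}
    exec(compile(tree, "<desugared>", "exec"), namespace)
    return tree, namespace


if __name__ == "__main__":
    src = "out = []\nfor i in range(0, 10, 3):\n    out.append(i)\n"
    _, ns = run_desugared(src)
    print(ns["out"])
```

Because the loop condition is i < stop, the rewritten loop matches range() only for positive steps; a negative step would need the comparison flipped.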

test_range_for_three_arguments now verifies correct strided iteration instead of expecting QuadrantsCompilationError. test_exception_in_node_with_body uses range() (0 args) as the invalid construct instead of range(1, 2, 3) which is now valid.

This patch reduces the maximum number of threadblocks launched per CU from 32 to 8. The result is fewer threadblocks that have no work to do, on average. In testing, I see a 7% improvement on my local system.
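To make the oversubscription argument concrete, here is a back-of-the-envelope sketch. It assumes the runtime launches a fixed grid of num_cus * max_blocks_per_cu threadblocks regardless of problem size; that launch policy, the function name launch_stats, and the example numbers (304 CUs, 256 threads per block, one million work items) are all illustrative assumptions, not values taken from the patch.

```python
def launch_stats(num_items, block_size, num_cus, max_blocks_per_cu):
    """Return (blocks needed, blocks launched, blocks with no work).

    Assumes a fixed launch of num_cus * max_blocks_per_cu threadblocks,
    which is a modelling assumption rather than the runtime's documented behaviour.
    """
    needed = (num_items + block_size - 1) // block_size  # ceiling division
    launched = num_cus * max_blocks_per_cu
    idle = max(0, launched - needed)
    return needed, launched, idle


if __name__ == "__main__":
    for cap in (32, 8):
        needed, launched, idle = launch_stats(1_000_000, 256, 304, cap)
        print(f"cap={cap}: needed={needed} launched={launched} idle={idle}")
```

Under these assumptions, the lower cap launches fewer blocks than the work requires, so every launched block has something to do and the remainder is covered by each block looping over more items.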

lfmeadow

I see several whitespace-only changes, are they necessary?
Why the build_strided_range_for addition?
removal of import
In general I think it would be better to not modify the existing code even if you don't like some of it.

michaelselehov · 2026-05-06T16:05:01Z

I see several whitespace-only changes, are they necessary?

I've double-checked with Cursor: all the whitespace-only changes were to satisfy the linter.

Why the build_strided_range_for addition?

It's the implementation of range(start, stop, step), which is needed for sub-wave parallelism.

removal of import

Also the linter. It was complaining that the import is never used.

In general I think it would be better to not modify the existing code even if you don't like some of it.

See above: we were fixing linter complaints in our changed files.

yaoliu13 · 2026-05-06T17:29:05Z

/run-ci

yaoliu13 · 2026-05-07T04:35:02Z

/run-ci

yaoliu13 · 2026-05-07T05:41:03Z

/run-ci

yaoliu13 · 2026-05-07T11:46:20Z

/run-ci

michaelselehov · 2026-05-07T14:13:11Z

/run-ci

All the issues are pre-existing (coming from amd-integration), bar the Windows one, which is due to the broken toolset.

diptorupd · 2026-05-08T19:13:36Z

@michaelselehov can you please rebase on top of amd-integration? I just merged #29, #30, and #32, which clean up all the linter issues and will reduce the noise in your PR.

I have also disabled the GitHub workflows. The CI run @yaoliu13 triggered is our internal AMDGPU CI.

michaelselehov and others added 6 commits May 6, 2026 04:35

chore: fix linter warnings (black, clang-format, ruff, trailing-ws)

9cc89ff

test: fix caret count and line offset in test_exception.py

63c9f68

Reduce GPU oversubscription.

1fbd844

This patch reduces the maximum number of threadblocks launched per CU from 32 to 8. The result is fewer threadblocks that have no work to do, on average. In testing, I see a 7% improvement on my local system.

michaelselehov requested review from carlobertolli and doru1004 May 6, 2026 09:46

chore: fix clang-format violations in our changed files

673d7bd

lfmeadow reviewed May 6, 2026

yaoliu13 mentioned this pull request May 6, 2026

perf(amdgpu): add subgroupDppSwapPairs for intra-wavefront pair exchange #14

Open

michaelselehov wants to merge 7 commits into amd-integration from perf/mselehov/dpp-subwave-reduce-v2

5 participants
