Skip to content

Perf/mselehov/dpp subwave reduce v2#27

Open
michaelselehov wants to merge 7 commits intoamd-integrationfrom
perf/mselehov/dpp-subwave-reduce-v2
Open

Perf/mselehov/dpp subwave reduce v2#27
michaelselehov wants to merge 7 commits intoamd-integrationfrom
perf/mselehov/dpp-subwave-reduce-v2

Conversation

@michaelselehov
Copy link
Copy Markdown

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

michaelselehov and others added 6 commits May 6, 2026 04:35
Expose the AMDGPU v_mov_b32_dpp quad_perm:[1,0,3,2] instruction as a
new subgroup intrinsic.  This enables sub-wave parallelism patterns
where adjacent lanes exchange and reduce values without going through
LDS, cutting the reduction to a single VALU-pipe cycle.

- Register the op in internal_ops.inc.h and type_system.cpp
- AMDGPU codegen: emit llvm.amdgcn.update.dpp with ctrl=0xB1
  (32-bit native, 64-bit via lo/hi split)
- CPU/base codegen: guard with QD_ERROR + actionable message
- Python binding: qd.simt.subgroup.dpp_swap_pairs(value)

Assisted-by: Claude Opus
Desugar 3-argument range() into a while-loop at the AST level.
The Quadrants IR does not natively support a step parameter in
range-for, so range(start, stop, step) is lowered to:

    i = start
    while i < stop:
        <body>
        i += step

This eliminates the need for manual while-loop workarounds when
writing strided iteration patterns (e.g. sub-wave parallelism).
test_range_for_three_arguments now verifies correct strided iteration
instead of expecting QuadrantsCompilationError.

test_exception_in_node_with_body uses range() (0 args) as the invalid
construct instead of range(1, 2, 3) which is now valid.
This patch reduces the maximum amount of threablocks launched per CU to 8, instead of 32.
The result is a smaller number of threadblocks that have no work to do, on average testing.
I see a 7% improvement in my local system.
Copy link
Copy Markdown

@lfmeadow lfmeadow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see several whitespace-only changes, are they necessary?
Why the build_strided_range_for addition?
removal of import
In general I think it would be better to not modify the existing code even if you don't like some of it.

@michaelselehov
Copy link
Copy Markdown
Author

I see several whitespace-only changes, are they necessary?

I've double-checked with Cursor, all the whitespace-only changes were to satisfy the linter.

Why the build_strided_range_for addition?

It's implementation of range(start, stop, step) - needed for subwave-parallelism.

removal of import

Also linter. It was complaining that the import is never used.

In general I think it would be better to not modify the existing code even if you don't like some of it.

See above, we were fixing linter complains in our changed files.

@yaoliu13
Copy link
Copy Markdown
Collaborator

yaoliu13 commented May 6, 2026

/run-ci

@yaoliu13
Copy link
Copy Markdown
Collaborator

yaoliu13 commented May 7, 2026

/run-ci

2 similar comments
@yaoliu13
Copy link
Copy Markdown
Collaborator

yaoliu13 commented May 7, 2026

/run-ci

@yaoliu13
Copy link
Copy Markdown
Collaborator

yaoliu13 commented May 7, 2026

/run-ci

@michaelselehov
Copy link
Copy Markdown
Author

/run-ci

All the issues are pre-existing (coming from amd-integration). Bar the Windows one which is due to the broken toolset.

@diptorupd
Copy link
Copy Markdown
Collaborator

diptorupd commented May 8, 2026

@michaelselehov can you please rebase on top of amd-integration I just merged #29 #30 #32 that cleans up all the linter issues and will reduce the noise in your PR.

I have also disabled the github workflows. The CI run @yaoliu13 triggered is our internal AMDGPU CI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants