Improve int4 AWQ GEMV performance on Radeon 8060S and similar GPUs #886

Draft
eble-amd wants to merge 3 commits into ROCm:gfx11 from eble-amd:skinny-int4-perf

eble-amd commented Apr 17, 2026

Purpose

Improve int4 AWQ GEMV performance on Radeon 8060S (gfx1151) and similar GPUs.

Test Plan

  • vLLM benchmark with Gemma 2B AWQ
  • New cases in the benchmark script (benchmark_hybrid_w4a16_gemm.py)
  • TODO: add an automated test comparing results to last-known-good numbers
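
One possible shape for the TODO item above, as a minimal sketch only: compare measured bandwidth against stored last-known-good values with a tolerance. The name `check_regressions`, the 5% tolerance, and the dictionary layout are hypothetical and not part of the benchmark script; the baseline values are the final numbers reported in this PR.

```python
# Hypothetical regression check: names, tolerance, and data layout are
# illustrative assumptions, not part of benchmark_hybrid_w4a16_gemm.py.

# Last-known-good bandwidth in GiB/s, keyed by shape string.
# Values are the final numbers reported in this PR for Radeon 8060S.
LAST_KNOWN_GOOD = {
    "1x2048x16384": 184.0,
    "1x32768x2048": 199.0,
}


def check_regressions(measured, baseline=LAST_KNOWN_GOOD, tolerance=0.05):
    """Return the shapes whose measured bandwidth fell more than
    `tolerance` (a fraction) below the baseline."""
    regressions = []
    for shape, good in baseline.items():
        value = measured.get(shape)
        if value is None:
            continue  # shape was not measured in this run
        if value < good * (1.0 - tolerance):
            regressions.append(shape)
    return regressions
```

With these baselines, a run that reports 156 GiB/s for 1x2048x16384 would be flagged, while a run that matches the baselines would return an empty list.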

Test Results

Copied from commit messages:

For the L2-size change:

Measured on Radeon 8060S (gfx1151, 2 MiB L2):
- 1x2048x16384: 142 -> 156 GiB/s (+10%)
- 1x32768x2048: 199 -> 199 GiB/s (no change)
- Gemma-2B AWQ decode TPOT: 10.86 -> 10.63 ms (-2%)

For the staggering change:

Measured on Radeon 8060S (gfx1151):
- 1x2048x16384: 156 -> 184 GiB/s (+18%)
- 1x32768x2048: 199 -> 199 GiB/s (no change)
- Gemma-2B AWQ decode TPOT: 10.63 -> 10.34 ms (-3%)


Signed-off-by: Dan Eble <Dan.Eble@amd.com>
When the int4 weight matrix exceeds L2 cache, wider memory loads
(ACHUNK=32 vs 16) improve bandwidth by up to 10% on the wvSplitK_int4_g
kernel.  The L2 size is queried at runtime via hipDeviceProp, so the
threshold adapts to different GPUs.
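
The selection rule described above can be sketched as follows. This is an illustration in Python, not the HIP host code from the commit; `int4_weight_bytes` and `select_achunk` are hypothetical names, and the size estimate assumes two int4 weights per byte and ignores scales and zero points.

```python
def int4_weight_bytes(n, k):
    """Bytes occupied by an n x k matrix of packed int4 weights.

    Assumption: two weights per byte; scales and zero points are ignored.
    """
    return (n * k) // 2


def select_achunk(n, k, l2_bytes):
    """Pick the wider load (32) only when the weights exceed the L2 size."""
    return 32 if int4_weight_bytes(n, k) > l2_bytes else 16


# L2 size reported in this commit for Radeon 8060S (gfx1151).
L2_8060S = 2 * 1024 * 1024
```

Under these assumptions, a 2048 x 16384 weight matrix occupies 16 MiB, so `select_achunk(2048, 16384, L2_8060S)` picks 32, while a matrix that fits in 2 MiB keeps 16.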

Measured on Radeon 8060S (gfx1151, 2 MiB L2):
- 1x2048x16384: 142 -> 156 GiB/s (+10%)
- 1x32768x2048: 199 -> 199 GiB/s (no change)
- Gemma-2B AWQ decode TPOT: 10.86 -> 10.63 ms (-2%)

Add Gemma-2B AWQ and W4-L2-cache-boundary shapes to
benchmark_hybrid_w4a16_gemm.py.

Signed-off-by: Dan Eble <Dan.Eble@amd.com>
Experiments showed a performance hit when the weight row stride is a
multiple of 4096 bytes.
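
To make the condition concrete, here is a small sketch. It assumes the benchmark shapes are written as MxNxK and that each weight row holds K packed int4 values (K/2 bytes); neither assumption is stated in the commit message, and the helper names are hypothetical.

```python
def row_stride_bytes(k):
    """Row stride of a packed int4 weight matrix with k values per row.

    Assumption: two int4 values per byte and no padding between rows.
    """
    return k // 2


def stride_is_multiple_of(k, alignment=4096):
    """True when the row stride is a whole multiple of `alignment` bytes."""
    return row_stride_bytes(k) % alignment == 0
```

Under these assumptions, K=16384 gives a stride of 8192 bytes (a multiple of 4096) and K=2048 gives 1024 bytes (not a multiple), which would be consistent with only the 1x2048x16384 shape speeding up.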

Offset each workgroup's K-loop start by one iteration stride and wrap
around.  A small residual s_sleep stagger covers cases where multiple
workgroups still share an offset.
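
A minimal sketch of the staggered start, simulated in Python rather than HIP. `staggered_k_order` is a hypothetical helper, and it assumes that "one iteration stride" means workgroup w begins w iterations into the K loop, modulo the iteration count. The s_sleep stagger is not modeled.

```python
def staggered_k_order(workgroup_id, num_iters):
    """K-loop iteration indices visited by one workgroup.

    The workgroup starts at an offset derived from its id, then wraps
    around so that every iteration is still visited exactly once.
    """
    start = workgroup_id % num_iters
    return [(start + i) % num_iters for i in range(num_iters)]
```

For example, with four iterations workgroup 0 visits `[0, 1, 2, 3]` and workgroup 1 visits `[1, 2, 3, 0]`. Workgroups whose ids differ by a multiple of the iteration count get the same order in this model, which corresponds to the shared-offset case the residual stagger is described as covering.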

Measured on Radeon 8060S (gfx1151):
- 1x2048x16384: 156 -> 184 GiB/s (+18%)
- 1x32768x2048: 199 -> 199 GiB/s (no change)
- Gemma-2B AWQ decode TPOT: 10.63 -> 10.34 ms (-3%)

Signed-off-by: Dan Eble <Dan.Eble@amd.com>
eble-amd changed the title from "Skinny int4 perf" to "Improve int4 AWQ GEMV performance on Radeon 8060S and similar GPUs" on Apr 20, 2026