Improve int4 AWQ GEMV performance on Radeon 8060S and similar GPUs by eble-amd · Pull Request #886 · ROCm/vllm

eble-amd · 2026-04-17T18:45:22Z

Purpose

Improve GEMV performance on Radeon 8060S and similar GPUs.

Test Plan

- vllm benchmark with Gemma 2B AWQ
- New cases in the benchmark script (benchmark_hybrid_w4a16_gemm.py)
- TODO: add an automated test comparing results to last-known-good numbers
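
One possible shape for the automated check named in the TODO, as a minimal sketch. The baseline values are the post-change numbers reported in this PR; the function name, the dictionary layout, and the 5% tolerance are assumptions for illustration and are not part of the existing benchmark script.

```python
# Hypothetical regression check: compare measured bandwidth against
# last-known-good numbers and flag any shape that falls too far below them.

# Baselines in GiB/s, taken from the results reported in this PR (gfx1151).
LAST_KNOWN_GOOD_GIBPS = {
    "1x2048x16384": 184.0,
    "1x32768x2048": 199.0,
}


def find_regressions(measured, baseline=LAST_KNOWN_GOOD_GIBPS, tolerance=0.05):
    """Return {shape: (measured, floor)} for shapes below baseline * (1 - tolerance)."""
    regressions = {}
    for shape, good in baseline.items():
        floor = good * (1.0 - tolerance)
        value = measured.get(shape)
        # A missing measurement counts as a failure rather than being skipped.
        if value is None or value < floor:
            regressions[shape] = (value, floor)
    return regressions
```

A test would then assert that `find_regressions(...)` returns an empty dictionary for the current run. The tolerance would need tuning against run-to-run noise on the target GPU.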

Test Results

Copied from commit messages:

For the L2-size change:

Measured on Radeon 8060S (gfx1151, 2 MiB L2):
- 1x2048x16384: 142 -> 156 GiB/s (+10%)
- 1x32768x2048: 199 -> 199 GiB/s (no change)
- Gemma-2B AWQ decode TPOT: 10.86 -> 10.63 ms (-2%)

For the staggering change:

Measured on Radeon 8060S (gfx1151):
- 1x2048x16384: 156 -> 184 GiB/s (+18%)
- 1x32768x2048: 199 -> 199 GiB/s (no change)
- Gemma-2B AWQ decode TPOT: 10.63 -> 10.34 ms (-3%)
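
The two changes compound. A quick arithmetic check on the figures above follows; nothing here is newly measured, and the helper name is illustrative.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# 1x2048x16384 bandwidth in GiB/s, as reported per commit.
l2_gain = pct_change(142, 156)        # about +9.9%
stagger_gain = pct_change(156, 184)   # about +17.9%
combined_gain = pct_change(142, 184)  # about +29.6%

# Gemma-2B AWQ decode TPOT in ms; lower is better.
combined_tpot = pct_change(10.86, 10.34)  # about -4.8%
```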


Signed-off-by: Dan Eble <Dan.Eble@amd.com>

When the int4 weight matrix exceeds L2 cache, wider memory loads (ACHUNK=32 vs 16) improve bandwidth by up to 10% on the wvSplitK_int4_g kernel. The L2 size is queried at runtime via hipDeviceProp, so the threshold adapts to different GPUs.

Measured on Radeon 8060S (gfx1151, 2 MiB L2):
- 1x2048x16384: 142 -> 156 GiB/s (+10%)
- 1x32768x2048: 199 -> 199 GiB/s (no change)
- Gemma-2B AWQ decode TPOT: 10.86 -> 10.63 ms (-2%)

Add Gemma-2B AWQ and W4-L2-cache-boundary shapes to benchmark_hybrid_w4a16_gemm.py.

Signed-off-by: Dan Eble <Dan.Eble@amd.com>
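
The selection rule described in this commit message can be modeled in a few lines. This is a sketch of the idea only, not the kernel's code. The function names are hypothetical, the weight size ignores scales and zero-points, and the exact comparison used in the kernel is an assumption. In the real code the L2 size comes from hipDeviceProp (presumably its `l2CacheSize` field) rather than a constant.

```python
def int4_weight_bytes(n, k):
    """Bytes occupied by an n x k int4 weight matrix, two values per byte."""
    return n * k // 2


def choose_achunk(n, k, l2_bytes):
    """Pick the wider load (32) only when the weights cannot fit in L2."""
    return 32 if int4_weight_bytes(n, k) > l2_bytes else 16


# 2 MiB, as reported for Radeon 8060S (gfx1151).
L2_BYTES = 2 * 1024 * 1024

# 2048 x 16384 int4 weights occupy 16 MiB, far beyond L2, so the wide path applies.
large = choose_achunk(2048, 16384, L2_BYTES)

# A 256 x 2048 matrix (256 KiB) fits in L2, so the narrower load is kept.
small = choose_achunk(256, 2048, L2_BYTES)
```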

Experiments showed a performance hit when the weight row stride is a multiple of 4096 bytes. Offset each workgroup's K-loop start by one iteration stride and wrap around. A small residual s_sleep stagger covers cases where multiple workgroups still share an offset.

Measured on Radeon 8060S (gfx1151):
- 1x2048x16384: 156 -> 184 GiB/s (+18%)
- 1x32768x2048: 199 -> 199 GiB/s (no change)
- Gemma-2B AWQ decode TPOT: 10.63 -> 10.34 ms (-3%)

Signed-off-by: Dan Eble <Dan.Eble@amd.com>
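
The staggering scheme in this commit message can be illustrated with a small model. It is a sketch under stated assumptions: the function names are hypothetical, rows are assumed to hold int4 values packed two per byte along K, and the s_sleep stagger is not modeled. Which dimension of the benchmark shapes is K is not stated here, so the mapping to the reported shapes is left open.

```python
def row_stride_bytes(k):
    """Bytes per weight row for int4 values packed two per byte along K (assumed layout)."""
    return k // 2


def staggered_k_order(workgroup_id, num_iters):
    """K-loop iteration indices for one workgroup: start offset by its id, then wrap."""
    start = workgroup_id % num_iters
    return [(start + i) % num_iters for i in range(num_iters)]


# K = 16384 gives an 8192-byte row stride, a multiple of 4096 (the slow case).
slow_case = row_stride_bytes(16384) % 4096 == 0

# K = 2048 gives a 1024-byte row stride, which is not a multiple of 4096.
fast_case = row_stride_bytes(2048) % 4096 != 0

# With 4 iterations, workgroups 0..3 each start at a different iteration,
# while workgroup 4 wraps and shares workgroup 0's starting offset.
orders = [staggered_k_order(wg, 4) for wg in range(5)]
```

The wrap-around in the last line is the situation the residual s_sleep stagger is described as covering: once there are more workgroups than distinct offsets, some of them necessarily start at the same point.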

eble-amd added 3 commits April 17, 2026 11:20

wvSplitK_int4: add benchmark test cases

d174b0e

Signed-off-by: Dan Eble <Dan.Eble@amd.com>

eble-amd requested review from mgehre-amd and roberteg16 April 17, 2026 18:45

eble-amd changed the title ~~Skinny int4 perf~~ Improve int4 AWQ GEMV performance on Radeon 8060S and similar GPUs Apr 20, 2026

eble-amd wants to merge 3 commits into ROCm:gfx11 from eble-amd:skinny-int4-perf
