
Hybrid W4A16 quant kernel perf regression tests#898

Open
eble-amd wants to merge 3 commits into ROCm:gfx11 from eble-amd:quant-kernel-perf-regression

Conversation

@eble-amd

@eble-amd eble-amd commented Apr 24, 2026

Purpose

Add automated PR verification tests covering the performance of the hybrid W4A16 quant kernels (HIP + Triton).

Test Plan

  1. Generate initial golden values on a developer system.
    The update workflow is documented in the new file ./tests/kernels/quantization/golden/README.md.
  2. Rerun a few times locally to check for intermittency.
  3. Rerun a few times in CI to check for intermittency.
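At its core, each test case boils down to a golden-value comparison with a tolerance band. A minimal sketch of that check (the helper name, default tolerance, and message format are illustrative, not the actual code in test_hybrid_w4a16_perf.py):

```python
def check_against_golden(measured_tflops: float, golden_tflops: float,
                         tolerance_pct: float = 15.0) -> None:
    """Fail if measured throughput strays outside the golden band."""
    delta_pct = (measured_tflops - golden_tflops) / golden_tflops * 100.0
    if abs(delta_pct) > tolerance_pct:
        kind = "improvement" if delta_pct > 0 else "regression"
        raise AssertionError(
            f"{measured_tflops:.2f} TFLOP/s (expected {golden_tflops:.2f} "
            f"+ [-{tolerance_pct:.0f}, {tolerance_pct:.0f}]%) "
            f"FAIL ({kind}, {delta_pct:+.1f}%)")

check_against_golden(20.5, 20.04)    # within ±15%: passes quietly
```

Note that the check is symmetric: a large unexpected improvement fails too, which is the signal to regenerate golden values per the README.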

Test Result

See CI job logs.

Click here for an excerpt of a passing run
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i2048-o2048-g128-hybrid-w4a16] PASSED [  2%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i2048-o2048-g128-hybrid-w4a16-zp] PASSED [  5%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i2048-o2560-g128-hybrid-w4a16] PASSED [  8%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i2048-o2560-g128-hybrid-w4a16-zp] PASSED [ 11%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i2048-o32768-g128-hybrid-w4a16] PASSED [ 14%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i2048-o32768-g128-hybrid-w4a16-zp] PASSED [ 17%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i2560-o2560-g128-hybrid-w4a16] PASSED [ 20%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i2560-o2560-g128-hybrid-w4a16-zp] PASSED [ 23%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i2560-o3840-g128-hybrid-w4a16] PASSED [ 26%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i2560-o3840-g128-hybrid-w4a16-zp] PASSED [ 29%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i2560-o19456-g128-hybrid-w4a16] PASSED [ 32%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i2560-o19456-g128-hybrid-w4a16-zp] PASSED [ 35%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i3584-o3584-g128-hybrid-w4a16] PASSED [ 38%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i3584-o3584-g128-hybrid-w4a16-zp] PASSED [ 41%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i3584-o4608-g128-hybrid-w4a16] PASSED [ 44%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i3584-o4608-g128-hybrid-w4a16-zp] PASSED [ 47%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i3584-o37888-g128-hybrid-w4a16] PASSED [ 50%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i3584-o37888-g128-hybrid-w4a16-zp] PASSED [ 52%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i8192-o512-g128-hybrid-w4a16] PASSED [ 55%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i8192-o512-g128-hybrid-w4a16-zp] PASSED [ 58%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i8320-o512-g128-hybrid-w4a16] PASSED [ 61%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i8320-o512-g128-hybrid-w4a16-zp] PASSED [ 64%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i9728-o2560-g128-hybrid-w4a16] PASSED [ 67%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i9728-o2560-g128-hybrid-w4a16-zp] PASSED [ 70%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i16384-o2048-g128-hybrid-w4a16] PASSED [ 73%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i16384-o2048-g128-hybrid-w4a16-zp] PASSED [ 76%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i18944-o3584-g128-hybrid-w4a16] PASSED [ 79%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i18944-o3584-g128-hybrid-w4a16-zp] PASSED [ 82%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i38912-o2048-g128-hybrid-w4a16] PASSED [ 85%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i38912-o2048-g128-hybrid-w4a16-zp] PASSED [ 88%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i49152-o2048-g128-hybrid-w4a16] PASSED [ 91%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i49152-o2048-g128-hybrid-w4a16-zp] PASSED [ 94%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i49152-o4096-g128-hybrid-w4a16] PASSED [ 97%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i49152-o4096-g128-hybrid-w4a16-zp] PASSED [100%]

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@eble-amd eble-amd changed the base branch from main to gfx11 April 24, 2026 20:27
@eble-amd eble-amd force-pushed the quant-kernel-perf-regression branch 4 times, most recently from b38c2db to 467e8c2 Compare April 29, 2026 16:40
@eble-amd eble-amd force-pushed the quant-kernel-perf-regression branch 7 times, most recently from c8b2d6d to 0217c18 Compare May 5, 2026 17:11
@eble-amd eble-amd marked this pull request as ready for review May 5, 2026 17:41
@eble-amd eble-amd changed the title Draft: hybrid_w4a16 quant kernel perf regression tests Hybrid W4Aa16 quant kernel perf regression tests May 5, 2026
@eble-amd eble-amd changed the title Hybrid W4Aa16 quant kernel perf regression tests Hybrid W4A16 quant kernel perf regression tests May 5, 2026
Comment thread tests/kernels/quantization/test_hybrid_w4a16_perf.py Outdated
from vllm.utils.platform_utils import num_compute_units

device = "cuda"
dtype = torch.float16

We see different performance for bfloat16, which is needed for some models. Can we store the dtype in the SHAPES array and the golden refs? This also affects the dtype of w_s_skinny.
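One way to carry the dtype through — a hypothetical sketch; the tuple layout and key format are illustrative, not the PR's actual SHAPES definition:

```python
# Each entry gets a dtype name so float16 and bfloat16 keep separate
# golden references (and w_s_skinny can be built with the right dtype).
SHAPES = [
    # (in_features, out_features, group_size, dtype_name)
    (2048, 2048, 128, "float16"),
    (2048, 2048, 128, "bfloat16"),
]

def golden_key(in_features: int, out_features: int,
               group_size: int, dtype_name: str) -> str:
    # Key used to index the golden file, mirroring the test-id style
    # seen in the CI log, e.g. "i2048-o2048-g128-bfloat16".
    return f"i{in_features}-o{out_features}-g{group_size}-{dtype_name}"
```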

Comment thread tests/kernels/quantization/test_hybrid_w4a16_perf.py Outdated
@mgehre-amd

On my machine, some tests fail:

E             batch_size=128: 23.05 TFLOP/s (expected 20.04 + [-15, 15]%) FAIL (improvement, +15.1%) [hybrid_triton_w4a16] [72°C]
E             batch_size=64: 8.82 TFLOP/s (expected 7.46 + [-15, 15]%) FAIL (improvement, +18.3%) [hybrid_triton_w4a16] [73°C]
E             batch_size=64: 18.95 TFLOP/s (expected 16.46 + [-15, 15]%) FAIL (improvement, +15.1%) [hybrid_triton_w4a16] [71°C]
E             batch_size=64: 18.84 TFLOP/s (expected 15.92 + [-15, 15]%) FAIL (improvement, +18.4%) [hybrid_triton_w4a16] [68°C]
E             batch_size=64: 21.28 TFLOP/s (expected 17.19 + [-15, 15]%) FAIL (improvement, +23.8%) [hybrid_triton_w4a16] [75°C]
E             batch_size=64: 20.80 TFLOP/s (expected 17.11 + [-15, 15]%) FAIL (improvement, +21.6%) [hybrid_triton_w4a16] [74°C]
E             batch_size=64: 20.41 TFLOP/s (expected 17.08 + [-15, 15]%) FAIL (improvement, +19.5%) [hybrid_triton_w4a16] [75°C]
E             batch_size=64: 19.89 TFLOP/s (expected 16.92 + [-15, 15]%) FAIL (improvement, +17.5%) [hybrid_triton_w4a16] [74°C]
E             batch_size=64: 20.78 TFLOP/s (expected 17.76 + [-15, 15]%) FAIL (improvement, +17.0%) [hybrid_triton_w4a16] [82°C]
E             batch_size=128: 10.92 TFLOP/s (expected 9.40 + [-15, 15]%) FAIL (improvement, +16.2%) [hybrid_triton_w4a16] [73°C]
E             batch_size=512: 21.10 TFLOP/s (expected 18.18 + [-15, 15]%) FAIL (improvement, +16.1%) [hybrid_triton_w4a16] [80°C]
E             batch_size=64: 18.81 TFLOP/s (expected 15.45 + [-15, 15]%) FAIL (improvement, +21.8%) [hybrid_triton_w4a16] [80°C]
E             batch_size=64: 19.85 TFLOP/s (expected 16.95 + [-15, 15]%) FAIL (improvement, +17.1%) [hybrid_triton_w4a16] [86°C]
E             batch_size=32: 8.74 TFLOP/s (expected 7.58 + [-15, 15]%) FAIL (improvement, +15.2%) [hybrid_triton_w4a16] [78°C]
E             batch_size=64: 7.08 TFLOP/s (expected 6.04 + [-15, 15]%) FAIL (improvement, +17.2%) [hybrid_triton_w4a16] [74°C]
E             batch_size=128: 6.45 TFLOP/s (expected 9.14 + [-15, 15]%) FAIL (regression, -29.4%) [hybrid_triton_w4a16] [74°C]
E             batch_size=256: 13.64 TFLOP/s (expected 11.82 + [-15, 15]%) FAIL (improvement, +15.4%) [hybrid_triton_w4a16] [81°C]

Maybe we can exclude hybrid_triton_w4a16 for now and only keep the HIP kernels, which look pretty stable?

@mgehre-amd

For the N=1,2,4 cases, i.e. wvsplitk_int4, can we show GiB/s instead of TFLOP/s in the benchmark output? Those kernels are memory-bound, and we want them to be close to 230 GiB/s; TFLOP/s is meaningless for them.
Maybe that means we should store the golden values as milliseconds and just convert on display?
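A minimal sketch of that idea: store milliseconds in the golden file and derive the display unit per kernel. Both helpers are illustrative and assume a GEMM of shape M×N×K with a known byte count:

```python
def tflops(ms: float, m: int, n: int, k: int) -> float:
    # A dense GEMM performs 2*M*N*K floating-point operations.
    return 2.0 * m * n * k / (ms * 1e-3) / 1e12

def gib_per_s(ms: float, bytes_moved: int) -> float:
    # For memory-bound kernels, compare against the ~230 GiB/s target.
    return bytes_moved / (ms * 1e-3) / 2**30
```

Storing ms keeps the golden file unit-agnostic; each kernel's reporting code picks the meaningful conversion.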

@eble-amd eble-amd force-pushed the quant-kernel-perf-regression branch from 0217c18 to 8395001 Compare May 6, 2026 15:30
@eble-amd
Author

eble-amd commented May 6, 2026

The last push was just a rebase.

@eble-amd eble-amd force-pushed the quant-kernel-perf-regression branch from 8395001 to 5d648ea Compare May 6, 2026 17:02
@eble-amd
Author

eble-amd commented May 6, 2026

The last push changes the golden-generation workflow to use the option --write-golden and write directly to the golden file instead of a file on the side.

It also updates golden values after rebasing and fetching the latest nightly ROCm.

@eble-amd
Author

eble-amd commented May 6, 2026

Maybe we can exclude hybrid_triton_w4a16 for now and only keep the HIP kernels, which look pretty stable?

Yes, I think we will have to do that, because the Triton tests are failing similarly on the CI runner.

@eble-amd eble-amd force-pushed the quant-kernel-perf-regression branch from 5d648ea to bf1427c Compare May 6, 2026 18:42
@eble-amd
Author

eble-amd commented May 6, 2026

The last push sets the tolerance on the Triton kernel to ±80%. The intent is to continue checking that the Triton kernel is selected in the right cases, but not to worry about its performance unless the change is egregious.

The last push also cuts the inter-test cool-down delay short if the temperature is already below 60 °C. In my testing, this saved about 30 s when starting cold, but offered no benefit when already warm:

run 1: 139.95s
run 2: 173.97s
run 3: 176.90s

The last push also narrows the tolerance on the HIP kernel to ±8%.
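The shortened cool-down could look roughly like this; get_gpu_temp_c and all the thresholds are stand-ins for whatever the test actually uses:

```python
import time

def cool_down(get_gpu_temp_c, max_wait_s: float = 30.0,
              threshold_c: float = 60.0, poll_interval_s: float = 1.0,
              sleep=time.sleep) -> float:
    """Wait between tests, but return early once the GPU is cool enough.

    Returns the number of seconds actually waited.
    """
    waited = 0.0
    while waited < max_wait_s:
        if get_gpu_temp_c() < threshold_c:
            return waited  # already below threshold: skip the rest
        sleep(poll_interval_s)
        waited += poll_interval_s
    return waited
```

When starting cold the first poll succeeds immediately, which matches the ~30 s saved; when warm, the loop runs to max_wait_s as before.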

@eble-amd eble-amd force-pushed the quant-kernel-perf-regression branch 2 times, most recently from 0c4e098 to bf1427c Compare May 6, 2026 22:22
@eble-amd
Author

eble-amd commented May 6, 2026

Reviewers, I'm sorry for such a noisy PR. I'm converting it back to a draft.

Two kinds of runners are currently grouped under one label, and performance seems to depend on which kind gets the job. DevOps will relabel the runners so that we can constrain where these tests are run.

@eble-amd eble-amd marked this pull request as draft May 6, 2026 22:37
Before running tests, log additional information that might help explain
changes in execution time.

Signed-off-by: Dan Eble <Dan.Eble@amd.com>
@eble-amd eble-amd force-pushed the quant-kernel-perf-regression branch from bf1427c to cbba91b Compare May 7, 2026 13:29
@eble-amd
Author

eble-amd commented May 7, 2026

The last push rebases and updates the workflow to route performance-test jobs to newly labeled runners.

My current focus is on the GitHub workflow. I didn't check whether any performance changes are expected due to the rebasing or changes in the latest nightly ROCm; the tests might fail even if the workflow is correct.

@eble-amd eble-amd force-pushed the quant-kernel-perf-regression branch from cbba91b to 555489d Compare May 7, 2026 15:23
@eble-amd
Author

eble-amd commented May 7, 2026

Changes in the last push:

  • try to cope with performance differences from runner to runner by treating failures as warnings in CI
  • revise a note in the README to reflect the earlier change that rewrites golden files in place
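The "failures as warnings" behavior can be sketched as follows; report_perf_result and the in_ci flag are hypothetical names, not the PR's actual helper:

```python
import warnings

def report_perf_result(passed: bool, message: str, in_ci: bool) -> bool:
    """Strict locally; downgrade to a warning on shared CI runners."""
    if passed:
        return True
    if in_ci:
        # Surfaced in the pytest warnings summary instead of failing.
        warnings.warn(f"perf check failed (non-blocking in CI): {message}")
        return True
    raise AssertionError(message)
```

The trade-off, noted later in the thread, is that warnings are easy to miss on the job detail page, which motivated reworking the tests as separate jobs.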

eble-amd added 2 commits May 7, 2026 12:11
Add performance regression tests in test_hybrid_w4a16_perf.py. See
README.md for information on developer workflow.

The CI runner pool has two kinds of gfx1151 runners.  Run the
performance and correctness tests separately so that runners labeled for
performance testing do not necessarily spend time testing correctness.
GitHub might still route a correctness job to a performance runner; the
point is that we're not forcing it.

Even with the current labels, we still see inconsistent performance from
one runner to the next.  This should be investigated, but for now,
failures of particular performance test cases are ignored except for a
warning hidden on the job detail page.

Signed-off-by: Dan Eble <Dan.Eble@amd.com>
Rework kernel correctness and performance tests as separate jobs rather
than parts of a matrix job.  Surface failures of performance tests
(which are currently intermittent depending on runner) without blocking
wheel upload.

Signed-off-by: Dan Eble <Dan.Eble@amd.com>
@eble-amd eble-amd force-pushed the quant-kernel-perf-regression branch from be8d153 to 71c2833 Compare May 7, 2026 18:13
@eble-amd eble-amd marked this pull request as ready for review May 7, 2026 18:54
@eble-amd
Author

eble-amd commented May 7, 2026

I chose to implement the test workflow two different ways in two commits. The first change looks simpler, but I didn't like that one had to click through to details to discover that performance tests failed.

I added a commit on top reworking the kernel correctness and performance tests as separate jobs rather than parts of a matrix job. Failures in performance tests are visible on the main page of a PR, but do not gate the wheel upload. The purpose of leaving this in a separate commit is to make it easier to revert to the first approach if we discover something we don't like. (I'm a GitHub workflow newbie and this arrangement was heavily assisted by AI.)

I intend for performance test failures not to block merging. It remains to be proven, but I think it will work.

I have seen performance tests assigned to four runners. They fail on the runners with 6 and 7 in their names, and they pass on the runners with 8 and 9 in their names. Since the performance tests run in their own job, it is easy to rerun them if you really, really want results from one of the capable runners.
