
Hybrid W4A16 quant kernel perf regression tests#898

Open
eble-amd wants to merge 3 commits into ROCm:gfx11 from eble-amd:quant-kernel-perf-regression

Conversation

@eble-amd

@eble-amd eble-amd commented Apr 24, 2026

Purpose

Add automated PR verification tests covering the performance of the hybrid W4A16 quant kernels (HIP + Triton).

Test Plan

  1. Generate initial golden values on a developer system.
    The update workflow is documented in the new file ./tests/kernels/quantization/golden/README.md.
  2. Rerun a few times locally to check for intermittency.
  3. Rerun a few times in CI to check for intermittency.
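At its core, each test case boils down to a golden-value comparison with a tolerance band. A minimal sketch of that check (the helper name, default tolerance, and message format are illustrative, not the actual code in test_hybrid_w4a16_perf.py):

```python
def check_against_golden(measured_tflops: float, golden_tflops: float,
                         tolerance_pct: float = 15.0) -> None:
    """Fail if measured throughput strays outside the golden band."""
    delta_pct = (measured_tflops - golden_tflops) / golden_tflops * 100.0
    if abs(delta_pct) > tolerance_pct:
        kind = "improvement" if delta_pct > 0 else "regression"
        raise AssertionError(
            f"{measured_tflops:.2f} TFLOP/s (expected {golden_tflops:.2f} "
            f"+ [-{tolerance_pct:.0f}, {tolerance_pct:.0f}]%) "
            f"FAIL ({kind}, {delta_pct:+.1f}%)")

check_against_golden(20.5, 20.04)    # within ±15%: passes quietly
```

Note that the check is symmetric: a large unexpected improvement fails too, which is the signal to regenerate golden values per the README.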

Test Result

See CI job logs.

Click here for an excerpt of a passing run
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i2048-o2048-g128-hybrid-w4a16] PASSED [  2%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i2048-o2048-g128-hybrid-w4a16-zp] PASSED [  5%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i2048-o2560-g128-hybrid-w4a16] PASSED [  8%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i2048-o2560-g128-hybrid-w4a16-zp] PASSED [ 11%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i2048-o32768-g128-hybrid-w4a16] PASSED [ 14%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i2048-o32768-g128-hybrid-w4a16-zp] PASSED [ 17%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i2560-o2560-g128-hybrid-w4a16] PASSED [ 20%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i2560-o2560-g128-hybrid-w4a16-zp] PASSED [ 23%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i2560-o3840-g128-hybrid-w4a16] PASSED [ 26%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i2560-o3840-g128-hybrid-w4a16-zp] PASSED [ 29%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i2560-o19456-g128-hybrid-w4a16] PASSED [ 32%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i2560-o19456-g128-hybrid-w4a16-zp] PASSED [ 35%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i3584-o3584-g128-hybrid-w4a16] PASSED [ 38%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i3584-o3584-g128-hybrid-w4a16-zp] PASSED [ 41%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i3584-o4608-g128-hybrid-w4a16] PASSED [ 44%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i3584-o4608-g128-hybrid-w4a16-zp] PASSED [ 47%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i3584-o37888-g128-hybrid-w4a16] PASSED [ 50%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i3584-o37888-g128-hybrid-w4a16-zp] PASSED [ 52%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i8192-o512-g128-hybrid-w4a16] PASSED [ 55%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i8192-o512-g128-hybrid-w4a16-zp] PASSED [ 58%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i8320-o512-g128-hybrid-w4a16] PASSED [ 61%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i8320-o512-g128-hybrid-w4a16-zp] PASSED [ 64%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i9728-o2560-g128-hybrid-w4a16] PASSED [ 67%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i9728-o2560-g128-hybrid-w4a16-zp] PASSED [ 70%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i16384-o2048-g128-hybrid-w4a16] PASSED [ 73%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i16384-o2048-g128-hybrid-w4a16-zp] PASSED [ 76%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i18944-o3584-g128-hybrid-w4a16] PASSED [ 79%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i18944-o3584-g128-hybrid-w4a16-zp] PASSED [ 82%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i38912-o2048-g128-hybrid-w4a16] PASSED [ 85%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i38912-o2048-g128-hybrid-w4a16-zp] PASSED [ 88%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i49152-o2048-g128-hybrid-w4a16] PASSED [ 91%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i49152-o2048-g128-hybrid-w4a16-zp] PASSED [ 94%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i49152-o4096-g128-hybrid-w4a16] PASSED [ 97%]
tests/kernels/quantization/test_hybrid_w4a16_perf.py::test_hybrid_w4a16_perf[i49152-o4096-g128-hybrid-w4a16-zp] PASSED [100%]

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@eble-amd eble-amd changed the base branch from main to gfx11 April 24, 2026 20:27
@eble-amd eble-amd force-pushed the quant-kernel-perf-regression branch 4 times, most recently from b38c2db to 467e8c2 Compare April 29, 2026 16:40
@eble-amd eble-amd force-pushed the quant-kernel-perf-regression branch 7 times, most recently from c8b2d6d to 0217c18 Compare May 5, 2026 17:11
@eble-amd eble-amd marked this pull request as ready for review May 5, 2026 17:41
@eble-amd eble-amd changed the title Draft: hybrid_w4a16 quant kernel perf regression tests Hybrid W4Aa16 quant kernel perf regression tests May 5, 2026
@eble-amd eble-amd changed the title Hybrid W4Aa16 quant kernel perf regression tests Hybrid W4A16 quant kernel perf regression tests May 5, 2026
Comment thread tests/kernels/quantization/test_hybrid_w4a16_perf.py Outdated
from vllm.utils.platform_utils import num_compute_units

device = "cuda"
dtype = torch.float16

We see different performance for bfloat16, which is needed for some models. Can we store the dtype in the SHAPES array and the golden refs? This also affects the dtype of w_s_skinny.
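One way to carry the dtype through — a hypothetical sketch; the tuple layout and key format are illustrative, not the PR's actual SHAPES definition:

```python
# Each entry gets a dtype name so float16 and bfloat16 keep separate
# golden references (and w_s_skinny can be built with the right dtype).
SHAPES = [
    # (in_features, out_features, group_size, dtype_name)
    (2048, 2048, 128, "float16"),
    (2048, 2048, 128, "bfloat16"),
]

def golden_key(in_features: int, out_features: int,
               group_size: int, dtype_name: str) -> str:
    # Key used to index the golden file, mirroring the test-id style
    # seen in the CI log, e.g. "i2048-o2048-g128-bfloat16".
    return f"i{in_features}-o{out_features}-g{group_size}-{dtype_name}"
```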

Comment thread tests/kernels/quantization/test_hybrid_w4a16_perf.py Outdated
@mgehre-amd

On my machine, some tests fail:

E             batch_size=128: 23.05 TFLOP/s (expected 20.04 + [-15, 15]%) FAIL (improvement, +15.1%) [hybrid_triton_w4a16] [72°C]
E             batch_size=64: 8.82 TFLOP/s (expected 7.46 + [-15, 15]%) FAIL (improvement, +18.3%) [hybrid_triton_w4a16] [73°C]
E             batch_size=64: 18.95 TFLOP/s (expected 16.46 + [-15, 15]%) FAIL (improvement, +15.1%) [hybrid_triton_w4a16] [71°C]
E             batch_size=64: 18.84 TFLOP/s (expected 15.92 + [-15, 15]%) FAIL (improvement, +18.4%) [hybrid_triton_w4a16] [68°C]
E             batch_size=64: 21.28 TFLOP/s (expected 17.19 + [-15, 15]%) FAIL (improvement, +23.8%) [hybrid_triton_w4a16] [75°C]
E             batch_size=64: 20.80 TFLOP/s (expected 17.11 + [-15, 15]%) FAIL (improvement, +21.6%) [hybrid_triton_w4a16] [74°C]
E             batch_size=64: 20.41 TFLOP/s (expected 17.08 + [-15, 15]%) FAIL (improvement, +19.5%) [hybrid_triton_w4a16] [75°C]
E             batch_size=64: 19.89 TFLOP/s (expected 16.92 + [-15, 15]%) FAIL (improvement, +17.5%) [hybrid_triton_w4a16] [74°C]
E             batch_size=64: 20.78 TFLOP/s (expected 17.76 + [-15, 15]%) FAIL (improvement, +17.0%) [hybrid_triton_w4a16] [82°C]
E             batch_size=128: 10.92 TFLOP/s (expected 9.40 + [-15, 15]%) FAIL (improvement, +16.2%) [hybrid_triton_w4a16] [73°C]
E             batch_size=512: 21.10 TFLOP/s (expected 18.18 + [-15, 15]%) FAIL (improvement, +16.1%) [hybrid_triton_w4a16] [80°C]
E             batch_size=64: 18.81 TFLOP/s (expected 15.45 + [-15, 15]%) FAIL (improvement, +21.8%) [hybrid_triton_w4a16] [80°C]
E             batch_size=64: 19.85 TFLOP/s (expected 16.95 + [-15, 15]%) FAIL (improvement, +17.1%) [hybrid_triton_w4a16] [86°C]
E             batch_size=32: 8.74 TFLOP/s (expected 7.58 + [-15, 15]%) FAIL (improvement, +15.2%) [hybrid_triton_w4a16] [78°C]
E             batch_size=64: 7.08 TFLOP/s (expected 6.04 + [-15, 15]%) FAIL (improvement, +17.2%) [hybrid_triton_w4a16] [74°C]
E             batch_size=128: 6.45 TFLOP/s (expected 9.14 + [-15, 15]%) FAIL (regression, -29.4%) [hybrid_triton_w4a16] [74°C]
E             batch_size=256: 13.64 TFLOP/s (expected 11.82 + [-15, 15]%) FAIL (improvement, +15.4%) [hybrid_triton_w4a16] [81°C]

Maybe we can exclude hybrid_triton_w4a16 for now and only keep the HIP kernels, which look pretty stable?

@mgehre-amd

For the N=1,2,4 cases, i.e. wvsplitk_int4, can we show GiB/s instead of TFLOP/s in the benchmark output? Those kernels are memory-bound, and we want them to be close to 230 GiB/s; TFLOP/s is meaningless for them.
Maybe that means we should store the golden values as milliseconds and just convert on display?
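A minimal sketch of that idea: store milliseconds in the golden file and derive the display unit per kernel. Both helpers are illustrative and assume a GEMM of shape M×N×K with a known byte count:

```python
def tflops(ms: float, m: int, n: int, k: int) -> float:
    # A dense GEMM performs 2*M*N*K floating-point operations.
    return 2.0 * m * n * k / (ms * 1e-3) / 1e12

def gib_per_s(ms: float, bytes_moved: int) -> float:
    # For memory-bound kernels, compare against the ~230 GiB/s target.
    return bytes_moved / (ms * 1e-3) / 2**30
```

Storing ms keeps the golden file unit-agnostic; each kernel's reporting code picks the meaningful conversion.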

@eble-amd eble-amd force-pushed the quant-kernel-perf-regression branch from 0217c18 to 8395001 Compare May 6, 2026 15:30
@eble-amd
Author

eble-amd commented May 6, 2026

The last push was just a rebase.

@eble-amd eble-amd force-pushed the quant-kernel-perf-regression branch from 8395001 to 5d648ea Compare May 6, 2026 17:02
@eble-amd
Author

eble-amd commented May 6, 2026

The last push changes the golden-generation workflow to use the option --write-golden and write directly to the golden file instead of a file on the side.

It also updates golden values after rebasing and fetching the latest nightly ROCm.

@eble-amd
Author

eble-amd commented May 6, 2026

Maybe we can exclude hybrid_triton_w4a16 for now and only keep the HIP kernels, which look pretty stable?

Yes, I think we will have to do that, because the Triton tests are failing similarly on the CI runner.

@eble-amd eble-amd force-pushed the quant-kernel-perf-regression branch from 5d648ea to bf1427c Compare May 6, 2026 18:42
@eble-amd
Author

eble-amd commented May 6, 2026

The last push sets the tolerance on the Triton kernel to ±80%. The intent is to continue checking that the Triton kernel is selected in the right cases, but not to worry about its performance unless the change is egregious.

The last push also cuts the inter-test cool-down delay short if the temperature is already below 60 °C. In my testing, this saved about 30 s when starting cold, but offered no benefit when already warm:

run 1: 139.95s
run 2: 173.97s
run 3: 176.90s

The last push also narrows the tolerance on the HIP kernel to ±8%.
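The shortened cool-down could look roughly like this; get_gpu_temp_c and all the thresholds are stand-ins for whatever the test actually uses:

```python
import time

def cool_down(get_gpu_temp_c, max_wait_s: float = 30.0,
              threshold_c: float = 60.0, poll_interval_s: float = 1.0,
              sleep=time.sleep) -> float:
    """Wait between tests, but return early once the GPU is cool enough.

    Returns the number of seconds actually waited.
    """
    waited = 0.0
    while waited < max_wait_s:
        if get_gpu_temp_c() < threshold_c:
            return waited  # already below threshold: skip the rest
        sleep(poll_interval_s)
        waited += poll_interval_s
    return waited
```

When starting cold the first poll succeeds immediately, which matches the ~30 s saved; when warm, the loop runs to max_wait_s as before.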

@eble-amd eble-amd force-pushed the quant-kernel-perf-regression branch 2 times, most recently from 0c4e098 to bf1427c Compare May 6, 2026 22:22
@eble-amd
Author

eble-amd commented May 6, 2026

Reviewers, I'm sorry for such a noisy PR. I'm converting it back to a draft.

Two kinds of runners are currently grouped under one label, and performance seems to depend on which kind gets the job. DevOps will relabel the runners so that we can constrain where these tests are run.

@eble-amd eble-amd marked this pull request as draft May 6, 2026 22:37
Before running tests, log additional information that might help explain
changes in execution time.

Signed-off-by: Dan Eble <Dan.Eble@amd.com>
@eble-amd eble-amd force-pushed the quant-kernel-perf-regression branch from bf1427c to cbba91b Compare May 7, 2026 13:29
@eble-amd
Author

eble-amd commented May 7, 2026

The last push rebases and updates the workflow to route performance-test jobs to newly labeled runners.

My current focus is on the GitHub workflow. I didn't check whether any performance changes are expected due to the rebasing or changes in the latest nightly ROCm; the tests might fail even if the workflow is correct.

@eble-amd eble-amd force-pushed the quant-kernel-perf-regression branch from cbba91b to 555489d Compare May 7, 2026 15:23
@eble-amd
Author

eble-amd commented May 7, 2026

Changes in the last push:

  • try to cope with performance differences from runner to runner by treating failures as warnings in CI
  • revise a note in the README to reflect the earlier change that rewrites golden files in place
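The "failures as warnings" behavior can be sketched as follows; report_perf_result and the in_ci flag are hypothetical names, not the PR's actual helper:

```python
import warnings

def report_perf_result(passed: bool, message: str, in_ci: bool) -> bool:
    """Strict locally; downgrade to a warning on shared CI runners."""
    if passed:
        return True
    if in_ci:
        # Surfaced in the pytest warnings summary instead of failing.
        warnings.warn(f"perf check failed (non-blocking in CI): {message}")
        return True
    raise AssertionError(message)
```

The trade-off, noted later in the thread, is that warnings are easy to miss on the job detail page, which motivated reworking the tests as separate jobs.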

eble-amd added 2 commits May 7, 2026 12:11
Add performance regression tests in test_hybrid_w4a16_perf.py. See
README.md for information on developer workflow.

The CI runner pool has two kinds of gfx1151 runners.  Run the
performance and correctness tests separately so that runners labeled for
performance testing do not necessarily spend time testing correctness.
GitHub might still route a correctness job to a performance runner; the
point is that we're not forcing it.

Even with the current labels, we still see inconsistent performance from
one runner to the next.  This should be investigated, but for now,
failures of particular performance test cases are ignored except for a
warning hidden on the job detail page.

Signed-off-by: Dan Eble <Dan.Eble@amd.com>
Rework kernel correctness and performance tests as separate jobs rather
than parts of a matrix job.  Surface failures of performance tests
(which are currently intermittent depending on runner) without blocking
wheel upload.

Signed-off-by: Dan Eble <Dan.Eble@amd.com>
@eble-amd eble-amd force-pushed the quant-kernel-perf-regression branch from be8d153 to 71c2833 Compare May 7, 2026 18:13
@eble-amd eble-amd marked this pull request as ready for review May 7, 2026 18:54
@eble-amd
Author

eble-amd commented May 7, 2026

I chose to implement the test workflow two different ways in two commits. The first change looks simpler, but I didn't like that one had to click through to details to discover that performance tests failed.

I added a commit on top reworking the kernel correctness and performance tests as separate jobs rather than parts of a matrix job. Failures in performance tests are visible on the main page of a PR, but do not gate the wheel upload. The purpose of leaving this in a separate commit is to make it easier to revert to the first approach if we discover something we don't like. (I'm a GitHub workflow newbie and this arrangement was heavily assisted by AI.)

I intend for performance test failures not to block merging. It remains to be proven, but I think it will work.

I have seen performance tests assigned to four runners. They fail on the runners with 6 and 7 in their names, and they pass on the runners with 8 and 9 in their names. Since the performance tests run in their own job, it is easy to rerun them if you really, really want results from one of the capable runners.
