Hybrid W4A16 quant kernel perf regression tests #898
eble-amd wants to merge 3 commits into ROCm:gfx11
Conversation
(force-pushed from b38c2db to 467e8c2)
(force-pushed from c8b2d6d to 0217c18)
```python
from vllm.utils.platform_utils import num_compute_units

device = "cuda"
dtype = torch.float16
```
We see different performance for bfloat16, which is needed for some models. Can we store the dtype in the SHAPES array and golden refs? This also affects the dtype of w_s_skinny.
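One way to carry the dtype per test case could look like the sketch below. The entries and the golden-reference key format are hypothetical, not taken from the PR; the point is that float16 and bfloat16 get separate golden values, and the same dtype would be used when building `w_s_skinny`.

```python
# Sketch: each SHAPES entry pins its own dtype (names are illustrative).
# Real code would map these names to torch dtypes (torch.float16, etc.).
SHAPES = [
    # (M, N, K, dtype_name)
    (1, 4096, 4096, "float16"),
    (1, 4096, 4096, "bfloat16"),
    (4, 8192, 4096, "float16"),
]

def golden_key(m, n, k, dtype_name):
    """Hypothetical key used to look up the golden perf value for one case."""
    return f"{m}x{n}x{k}-{dtype_name}"

keys = [golden_key(*case) for case in SHAPES]
```

This keeps the golden-reference table flat while still distinguishing dtypes, so adding bfloat16 coverage is just more rows in `SHAPES`.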
On my machine, some tests fail. Maybe we can exclude hybrid_triton_w4a16 for now and only keep the HIP kernels, which look pretty stable?
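Excluding the Triton variant while keeping the HIP cases could be sketched with a skip mark, roughly as below. The kernel identifiers and test body here are illustrative, not the PR's actual parametrization.

```python
import pytest

# Sketch: mark only the Triton variant as skipped; the HIP kernels keep
# running. Names and the test body are illustrative.
KERNELS = [
    pytest.param("wvsplitk_int4", id="hip"),
    pytest.param(
        "hybrid_triton_w4a16",
        id="triton",
        marks=pytest.mark.skip(reason="perf unstable across CI runners"),
    ),
]

@pytest.mark.parametrize("kernel", KERNELS)
def test_hybrid_w4a16_perf(kernel):
    ...  # run the benchmark and compare against the golden value
```

Using a mark rather than deleting the case keeps the skip visible in the test report and makes re-enabling it a one-line change.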
For the N=1,2,4 cases (i.e. wvsplitk_int4), can we show GiB/s instead of TFLOP/s in the benchmark output? Those kernels are memory-bound and we want them close to 230 GiB/s; TFLOP/s is meaningless for them.
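The conversion from elapsed time to achieved bandwidth could look like the following sketch. It assumes traffic is dominated by reading the int4 weight matrix once, which is a reasonable approximation for skinny N; the helper name is hypothetical.

```python
def achieved_gib_s(weight_elems, seconds, bits_per_elem=4):
    """Achieved read bandwidth in GiB/s, assuming the int4 weight matrix
    dominates memory traffic (a good approximation when N is 1, 2, or 4)."""
    bytes_moved = weight_elems * bits_per_elem / 8  # int4: half a byte/elem
    return bytes_moved / seconds / 2**30

# e.g. an 8192x8192 int4 weight read in 2**-10 s gives 32.0 GiB/s,
# to be compared against the ~230 GiB/s target mentioned above
bw = achieved_gib_s(8192 * 8192, 2**-10)
```

Reporting this number makes it obvious how far a run is from the memory-bandwidth ceiling, which TFLOP/s hides for these shapes.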
(force-pushed from 0217c18 to 8395001)
The last push was just a rebase.
(force-pushed from 8395001 to 5d648ea)
The last push changes the golden-generation workflow to use the option. It also updates golden values after rebasing and fetching the latest nightly ROCm.
Yes, I think we will have to do that, because the Triton tests are failing similarly on the CI runner.
(force-pushed from 5d648ea to bf1427c)
The last push sets the tolerance on the Triton kernel to ±80%. The intent is to continue checking that the Triton kernel is selected in the right cases, but not to worry about its performance unless the change is egregious.

The last push also cuts the inter-test cool-down delay short if the temperature is already below 60 °C. In my testing, this saved about 30 s when starting cold, but offered no benefit when already warm.

The last push also narrows the tolerance on the HIP kernel to ±8%.
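The shortened cool-down might look like the sketch below. The temperature reader, threshold constant, and timing parameters are hypothetical; real code would presumably query something like rocm-smi for the GPU temperature.

```python
import time

COOL_THRESHOLD_C = 60.0  # skip the remaining delay below this temperature

def cool_down(read_temp_c, max_wait_s=60.0, poll_s=5.0):
    """Wait between perf tests, returning early once the GPU is cool.

    read_temp_c is a callable returning the current GPU temperature in
    degrees Celsius. Returns the number of seconds actually waited.
    """
    waited = 0.0
    while waited < max_wait_s:
        if read_temp_c() < COOL_THRESHOLD_C:
            break  # already cool enough; no need to finish the delay
        time.sleep(poll_s)
        waited += poll_s
    return waited
```

A warm-start run exits the loop immediately, which matches the observed ~30 s saving when starting cold versus no benefit when already warm.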
(force-pushed from 0c4e098 to bf1427c)
Reviewers, I'm sorry for such a noisy PR. I'm converting it back to a draft. Two kinds of runners are currently grouped under one label, and performance seems to depend on which kind gets the job. DevOps will relabel the runners so that we can constrain where these tests are run.
Before running tests, log additional information that might help explain changes in execution time.

Signed-off-by: Dan Eble <Dan.Eble@amd.com>
(force-pushed from bf1427c to cbba91b)
The last push rebases and updates the workflow to route performance-test jobs to the newly labeled runners. My current focus is the GitHub workflow; I didn't check whether any performance changes are expected from the rebase or from changes in the latest nightly ROCm, so the tests might fail even if the workflow is correct.
(force-pushed from cbba91b to 555489d)
Changes in the last push:

- Add performance regression tests in test_hybrid_w4a16_perf.py. See README.md for information on the developer workflow.
- The CI runner pool has two kinds of gfx1151 runners. Run the performance and correctness tests separately so that runners labeled for performance testing do not necessarily spend time testing correctness. GitHub might still route a correctness job to a performance runner; the point is that we're not forcing it.
- Even with the current labels, we still see inconsistent performance from one runner to the next. This should be investigated, but for now, failures of particular performance test cases are ignored except for a warning hidden on the job detail page.

Signed-off-by: Dan Eble <Dan.Eble@amd.com>
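The "warning instead of failure" behavior described above could be sketched like this. The helper name, tolerance handling, and message format are hypothetical, not the PR's actual implementation.

```python
import warnings

def check_perf(name, measured, golden, rel_tol):
    """Compare a measured perf number to its golden value; emit a warning
    rather than an assertion failure when the relative deviation exceeds
    the tolerance, so a flaky runner does not fail the job."""
    deviation = (measured - golden) / golden
    if abs(deviation) > rel_tol:
        warnings.warn(
            f"{name}: {measured:.2f} deviates {deviation:+.1%} from "
            f"golden {golden:.2f} (tolerance ±{rel_tol:.0%})"
        )
    return deviation
```

With the tolerances mentioned earlier (±8% for the HIP kernel, ±80% for Triton), only egregious Triton regressions would produce a warning, while HIP regressions are caught much more tightly.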
Rework kernel correctness and performance tests as separate jobs rather than parts of a matrix job. Surface failures of performance tests (which are currently intermittent depending on runner) without blocking wheel upload.

Signed-off-by: Dan Eble <Dan.Eble@amd.com>
(force-pushed from be8d153 to 71c2833)
I chose to implement the test workflow two different ways in two commits. The first change looks simpler, but I didn't like that one had to click through to the details to discover that performance tests had failed. I added a commit on top that reworks the kernel correctness and performance tests as separate jobs rather than parts of a matrix job: failures in performance tests are visible on the main page of a PR, but do not gate the wheel upload. Leaving this in a separate commit makes it easier to revert to the first approach if we discover something we don't like. (I'm a GitHub workflow newbie, and this arrangement was heavily assisted by AI.)

I intend for performance test failures not to block merging. It remains to be proven, but I think it will work.

I have seen performance tests assigned to four runners. They fail on the runners with 6 and 7 in their names, and they pass on the runners with 8 and 9 in their names. Since the performance tests run in their own job, it is easy to rerun them if you really, really want results from one of the capable runners.
Purpose
Add automated PR verification tests covering the performance of the hybrid W4A16 quant kernels (HIP + Triton).
Test Plan
The update workflow is documented in the new file ./tests/kernels/quantization/golden/README.md.

Test Result

See CI job logs.