CI test for automated attention benchmarking suite #897
Conversation
force-pushed from ab95a95 to 6bc7e34
eble-amd left a comment
This mostly looks good. Three things I recommend changing:
- rename the test function
- rename pct_change
- clarify config validation

The rest can be ignored.
    )
    def test_benchmark_regression(

    # Subprocess timeout (seconds)
    BENCHMARK_TIMEOUT = 900
This is not a blocking issue (because you're not running these tests in CI), but this is longer than the --timeout 300 in build-rocm-wheels.yml.
    block_size: 16

    batch_specs:
      - "q8ks128"
It looks like it is impossible to have 8k tokens to process but only 128 entries in the KV cache. Please double check.
For real inference with input tokens = 128/256/4096/8192, output_len=1, and optional spec decode with up to 4 draft tokens, the correct specs are:
| Phase | Input=128 | Input=4096 | Input=8192 |
|---|---|---|---|
| Prefill | q128 | q4k | q8k |
| Decode | q1s128 | q1s4k | q1s8k |
| Spec decode verify (4 tokens) | q4s128 | q4s4k | q4s8k |
Interestingly, this was also the only configuration that showed a 6% difference from the golden values on my machine, whereas all other configurations were within 0.8%.
Right, good catch. The intention was to run a mixed prefill/decode case with 8k input tokens and 128 output ones, matching a use case we're interested in. I had mistakenly thought that s128 referred to output token count, not KV cache size. I'll update all the configs to reflect my actual intentions (and regenerate the relevant goldens).
This case and the q8ks1 case, both with TRITON_ATTN, showed the most variation for me, also around 6%. This probably isn't a coincidence.
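For reference, the batch-spec naming as described here (qN = query/input tokens per request, sN = KV-cache entries, trailing k = ×1024) could be decoded roughly like the sketch below; `parse_batch_spec` is an illustrative helper, not something that exists in benchmark.py.

```python
import re

def parse_batch_spec(spec: str) -> dict:
    """Illustrative decoder for batch-spec strings such as "q8k", "q1s128",
    or "q4s8k": qN = query/input tokens per request, sN = KV-cache length
    (omitted for prefill-only cases), and a trailing "k" means x1024."""
    match = re.fullmatch(r"q(\d+k?)(?:s(\d+k?))?", spec)
    if match is None:
        raise ValueError(f"unrecognized batch spec: {spec}")

    def to_int(token: str) -> int:
        return int(token[:-1]) * 1024 if token.endswith("k") else int(token)

    query_tokens = to_int(match.group(1))
    kv_len = to_int(match.group(2)) if match.group(2) else None
    return {"query_tokens": query_tokens, "kv_cache_len": kv_len}

# Cases from the table above:
print(parse_batch_spec("q8k"))      # prefill: 8192 query tokens, no prior KV
print(parse_batch_spec("q1s8k"))    # decode: 1 query token, 8192-entry KV cache
print(parse_batch_spec("q4s128"))   # spec-decode verify: 4 query tokens, 128 KV entries
```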
    @pytest.mark.parametrize(
        "config_name,target_platforms",
        [
            ("gemma_2b_awq", ["gfx1151"]),
This list nicely describes the test cases that we want to run. Why do we need to track "skip" and "intermittent" on top of that?
This parameterized list works at YAML-file granularity, but each YAML can define a list of batch specs and backends to benchmark, while the "skip" and "intermittent" flags apply to individual config + batch + backend combinations or patterns. For example, each new config currently has 3 batch specs * 2 backends = 6 benchmarks per parameterized test call. To make this list more granular, we'd need to define a separate YAML for every single case, or pass the full configuration as command-line parameters to benchmark.py rather than as a combined YAML, and include the config parameters in this list.
The idea for the skip and intermittent flags came from planning discussions between @eble and me. On prior projects with a similar golden structure (specifically, comparing an encoder's video stream output to a reference to ensure correctness), sometimes only specific cases would break, or would hang/crash on some or all runs. It often made the most sense to keep such a test case defined but temporarily skip it in CI runs while creating a ticket to fix it. Allowing workflows like that is what the skip flag is for.
Since these are performance tests, the intermittent flag has a similar purpose, but for high performance variation rather than a consistent failure/regression: we run these cases but skip validating the performance results against the golden (except when passing a flag to check them).
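As a rough sketch of that intent (the pattern format, key layout, and names like `plan_case` are illustrative, not the actual implementation), the flags would apply per config + batch + backend combination like so:

```python
import fnmatch

# Illustrative per-case flag patterns; keys look like "config:batch_spec:backend".
SKIP = ["gemma_2b_awq:q8ks1:*"]          # defined, but not benchmarked (e.g. ticket open)
INTERMITTENT = ["*:q8ks1:TRITON_ATTN"]   # benchmarked, but golden check is opt-in

def _matches(key: str, patterns: list[str]) -> bool:
    return any(fnmatch.fnmatch(key, p) for p in patterns)

def plan_case(config: str, batch: str, backend: str, check_intermittent: bool) -> dict:
    """Decide whether to run a case and whether to validate it against the golden."""
    key = f"{config}:{batch}:{backend}"
    if _matches(key, SKIP):
        return {"run": False, "validate": False}
    if _matches(key, INTERMITTENT):
        return {"run": True, "validate": check_intermittent}
    return {"run": True, "validate": True}

# An intermittent case is still benchmarked, but only validated when the
# intermittent-check flag is passed (config name here is a placeholder):
print(plan_case("example_config", "q8ks1", "TRITON_ATTN", check_intermittent=False))
```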
Signed-off-by: Callum Mitchell <callumm@amd.com>
Co-authored-by: Claude
force-pushed from 6bc7e34 to fbfbfd3
PR comments addressed and tests rebased on top of the gfx11 branch.
Purpose
Leverage the existing pytest and (manual) attention backend benchmarking infrastructure to implement automated attention performance regression tests. Each test runs benchmark.py against a YAML config designed to imitate a model of interest (number of heads, head dimensions, etc.) while defining the batch specs (input/output tokens, batch count) and attention backends to run. The current set of tests covers the TRITON_ATTN and ROCM_AITER_UNIFIED_ATTN backends. Each model config includes a long-context prefill-only case, a decode-only case, and one prefill/decode combination of interest for Strix Halo. The YAML file also defines the number of warmup + benchmark iterations to run for each of these cases.
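To illustrate the shape such a config might take (apart from block_size and batch_specs, which appear in the review diff above, the field names and values here are placeholders rather than the actual schema):

```python
import yaml

# Illustrative config only; not the real schema used by benchmark.py.
EXAMPLE_CONFIG = yaml.safe_load("""
num_q_heads: 32
num_kv_heads: 8
head_dim: 128
block_size: 16
backends: [TRITON_ATTN, ROCM_AITER_UNIFIED_ATTN]
warmup_iters: 5
benchmark_iters: 20
batch_specs:
  - "q8k"      # long-context prefill-only
  - "q1s8k"    # decode-only
  - "q4s8k"    # placeholder for the mixed prefill/decode case of interest
""")

print(EXAMPLE_CONFIG["batch_specs"])
```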
The automated tests run each model config + batch spec + backend combination, with a 10-second cooldown between each to minimize the risk of GPU thermal throttling that could lead to unstable results. Each case's results are written to a JSON file under tests/kernels/attention/benchmark/output/<gfx_target>/, which is compared to a golden reference/baseline.
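The golden check amounts to a percentage-change comparison on the mean time-per-iteration; a minimal sketch, assuming a JSON field name and a 10% tolerance that may differ from the actual code:

```python
import json
from pathlib import Path

def pct_change(measured: float, golden: float) -> float:
    """Percent change of the measured mean time-per-iteration vs. the golden."""
    return (measured - golden) / golden * 100.0

def within_golden(result_path: Path, golden_path: Path, tolerance_pct: float = 10.0) -> bool:
    """Compare a case's output JSON to its golden; field name and tolerance are assumptions."""
    measured = json.loads(result_path.read_text())["mean_time_per_iter_ms"]
    golden = json.loads(golden_path.read_text())["mean_time_per_iter_ms"]
    return abs(pct_change(measured, golden)) <= tolerance_pct
```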
Currently, these tests are Strix Halo only, but the infrastructure can easily support other platforms such as Strix Point.
Test cases can be marked as "skip" to avoid running the benchmarks, or "intermittent" to mark tests as working but with unstable performance. Intermittent cases' performance will only be compared to the golden when the --attn-bench-intermittent flag is passed.
For now, these tests are not run in any CI job (similar to @eble-amd, I saw far slower performance running on the CI machine compared to my local one; until this is understood, the CI job will not be useful).
Test command
pytest tests/kernels/attention/benchmark/test_benchmark_attention.py::test_benchmark_regression [--attn-bench-intermittent]

Test Result
Across 5 consecutive runs on my local machine, all 30 test cases (5 model configs * 3 batch specs * 2 backends) showed less than 10% variance in mean time-per-iteration compared to the goldens; 27 of them stayed under 1% variance across all 5 runs.
No tests currently require the skip or intermittent flags, but both of these flags have been manually validated during development.
During test runs, I monitored my Strix Halo machine's GPU temperature at 5-second intervals and found that a 10-second cooldown was sufficient to keep the edge temperature below 65°C, well below typical thermal-throttling thresholds, even with repeated runs of the test suite. I haven't tried to push this interval any lower.
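For reference, the cooldown is just a fixed sleep between benchmark subprocess invocations; a minimal sketch (the benchmark.py arguments are placeholders):

```python
import subprocess
import sys
import time

COOLDOWN_SECONDS = 10      # pause between cases so GPU edge temperature can settle
BENCHMARK_TIMEOUT = 900    # per-subprocess timeout, as in the test

def run_cases(case_args: list[list[str]]) -> None:
    """Run each benchmark case in its own subprocess with a cooldown in between."""
    for i, args in enumerate(case_args):
        subprocess.run([sys.executable, "benchmark.py", *args],
                       check=True, timeout=BENCHMARK_TIMEOUT)
        if i < len(case_args) - 1:
            time.sleep(COOLDOWN_SECONDS)
```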