
CI test for automated attention benchmarking suite#897

Merged
amd-callumm merged 4 commits into gfx11 from callumm.attn_bench_test_ci
May 1, 2026

Conversation


@amd-callumm amd-callumm commented Apr 24, 2026

Purpose

Leverage the existing pytest and (manual) attention backend benchmarking infrastructure to implement automated attention performance regression tests. Each test runs benchmark.py against a YAML config designed to imitate a model of interest (number of heads, head dimensions, etc) while defining certain batch specs (input/output tokens, batch count) and attention backends to run. The current set of tests covers the TRITON_ATTN and ROCM_AITER_UNIFIED_ATTN backends. Each model config includes a long-context prefill-only case, decode-only, and one prefill/decode combination of interest for Strix Halo. The YAML file also defines the number of warmup + benchmark iterations to run for each of these cases.
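As a sketch of the shape such a per-model config might take (field names here are illustrative assumptions, not the actual schema from this PR), combining the model parameters, batch specs, backends, and iteration counts described above:

```yaml
# Hypothetical config sketch -- field names are illustrative, not the real schema.
model: example_model          # model of interest this config imitates
num_heads: 32
num_kv_heads: 8
head_dim: 128
block_size: 16

warmup_iters: 10              # per-case warmup iterations
benchmark_iters: 50           # per-case measured iterations

batch_specs:                  # e.g. long-context prefill-only and decode-only cases
  - "q8k"
  - "q1s8k"

backends:
  - TRITON_ATTN
  - ROCM_AITER_UNIFIED_ATTN
```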

The automated tests run each model config + batch spec + backend combination, with a 10-second cooldown between each to minimize the risk of GPU thermal throttling that could lead to unstable results. Each case's results are written to a JSON file under tests/kernels/attention/benchmark/output/<gfx_target>/, which is compared to a golden reference/baseline.
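A minimal sketch of that golden-comparison step (the function name and tolerance value here are assumptions, not code from the PR):

```python
def within_golden(measured_ms: float, golden_ms: float,
                  tolerance_pct: float = 10.0) -> bool:
    """Return True if the measured mean time-per-iteration has not
    regressed more than tolerance_pct relative to the golden baseline.

    Hypothetical helper illustrating the comparison; the actual test
    reads per-case JSON results from the output directory.
    """
    pct_change = (measured_ms - golden_ms) / golden_ms * 100.0
    return pct_change <= tolerance_pct
```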

Currently, these tests are Strix Halo only, but the infrastructure can easily support other platforms such as Strix Point.

Test cases can be marked as "skip" to avoid running the benchmarks, or "intermittent" to mark tests as working but with unstable performance. Intermittent cases' performance will only be compared to the golden reference when the --attn-bench-intermittent flag is passed.
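The gating logic this implies could look roughly like the following (a sketch under assumed names; `plan_case` and the flag-set representation are not from the PR):

```python
def plan_case(flags: set[str], check_intermittent: bool = False):
    """Decide whether to run a case and whether to compare its results
    to the golden baseline, based on "skip"/"intermittent" flags."""
    run = "skip" not in flags
    # Intermittent cases still run, but their timings are only validated
    # when explicitly requested (e.g. via --attn-bench-intermittent).
    validate = run and ("intermittent" not in flags or check_intermittent)
    return run, validate
```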

For now, these tests are not run in any CI job (similar to @eble-amd, I saw far slower performance running on the CI machine compared to my local one; until this is understood, the CI job will not be useful).

Test command

pytest tests/kernels/attention/benchmark/test_benchmark_attention.py::test_benchmark_regression [--attn-bench-intermittent]

Test Result

After 5 consecutive runs on my local machine, out of 30 test cases (5 model configs * 3 batch cases * 2 backends), all showed less than 10% variance in the mean time-per-iteration compared to goldens. 27 of these showed less than 1% variance in all 5 runs.

No tests currently require the skip or intermittent flags, but both of these flags have been manually validated during development.

During test runs, I monitored my Strix Halo machine's GPU temperature at 5-second intervals and found that a 10-second cooldown was sufficient to keep the edge temperature below 65°C, well below typical thermal throttling thresholds, even with repeated runs of the test suite. I haven't tried to push this interval any lower.

@amd-callumm amd-callumm force-pushed the callumm.attn_bench_test_ci branch 12 times, most recently from ab95a95 to 6bc7e34 Compare April 28, 2026 23:54
@amd-callumm amd-callumm marked this pull request as ready for review April 28, 2026 23:58

@eble-amd eble-amd left a comment


This mostly looks good. Three things I recommend changing:

  • rename the test function
  • rename pct_change
  • clarify config validation

The rest can be ignored.

Comment thread tests/kernels/attention/benchmark/conftest.py Outdated
Comment thread tests/kernels/attention/benchmark/test_benchmark_attention.py Outdated
Comment on lines +414 to +415
)
def test_benchmark_regression(


PR #898 proposes a @pytest.mark.benchmark marker that can be used to filter tests. If you don't wish to bring it into this PR, that's no problem; I'll mark these as benchmarks when I rebase #898 onto your changes.

Comment thread tests/kernels/attention/benchmark/test_benchmark_attention.py Outdated
Comment thread tests/kernels/attention/benchmark/test_benchmark_attention.py Outdated
Comment thread tests/kernels/attention/benchmark/test_benchmark_attention.py Outdated
Comment on lines +87 to +88
# Subprocess timeout (seconds)
BENCHMARK_TIMEOUT = 900


This is not a blocking issue (because you're not running these tests in CI), but this is longer than the --timeout 300 in build-rocm-wheels.yml.

Comment thread tests/kernels/attention/benchmark/test_benchmark_attention.py Outdated
block_size: 16

batch_specs:
- "q8ks128"


It looks like it is impossible to have 8k tokens to process but only 128 entries in the KV cache. Please double check.
For real inference with input tokens = 128/256/4096/8192, output_len=1, and optional spec decode with up to 4 draft tokens, the correct specs are:

| Phase | Input=128 | Input=4096 | Input=8192 |
| --- | --- | --- | --- |
| Prefill | q128 | q4k | q8k |
| Decode | q1s128 | q1s4k | q1s8k |
| Spec decode verify (4 tokens) | q4s128 | q4s4k | q4s8k |

Interestingly, this was also the only configuration that had 6% difference to the golden values on my machine, where all other configurations had less than 0.8% difference.

Author


Right, good catch. The intention was to run a mixed prefill/decode case with 8k input tokens and 128 output ones, matching a use case we're interested in. I had mistakenly thought that s128 referred to output token count, not KV cache size. I'll update all the configs to reflect my actual intentions (and regenerate the relevant goldens).

This case and the q8ks1 case, both with TRITON_ATTN, showed the most variation for me, also around 6%. This probably isn't a coincidence.
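For reference, the spec naming as discussed in this thread (qN = tokens to process, sM = KV-cache entries, with a k suffix meaning ×1024) can be sketched as a small parser. This is my reading of the thread, not code from the PR:

```python
import re

def parse_batch_spec(spec: str):
    """Parse specs like 'q8k' or 'q1s4k' into (query_tokens, kv_cache_entries).

    Returns None for the KV-cache size on prefill-only specs like 'q128'.
    Interpretation of the naming scheme is assumed from the review thread.
    """
    m = re.fullmatch(r"q(\d+)(k?)(?:s(\d+)(k?))?", spec)
    if m is None:
        raise ValueError(f"unrecognized batch spec: {spec!r}")

    def scale(digits, k_suffix):
        return int(digits) * (1024 if k_suffix else 1)

    q = scale(m.group(1), m.group(2))
    s = scale(m.group(3), m.group(4)) if m.group(3) else None
    return q, s
```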

@pytest.mark.parametrize(
"config_name,target_platforms",
[
("gemma_2b_awq", ["gfx1151"]),


This list nicely describes the test cases that we want to run. Why do we need to track "skip" and "intermittent" on top of that?

Author

@amd-callumm amd-callumm Apr 30, 2026


This parameterized list works at YAML-file granularity, but each YAML can define a list of batch specs and backends to benchmark, while the "skip" and "intermittent" flags work on individual config + batch + backend combinations or patterns. For example, each new config currently has 3 batch specs * 2 backends = 6 benchmarks per parameterized test call. To make this list more granular, we'd need to define a separate YAML for every single case, or pass the full configuration as command line parameters to benchmark.py rather than a combined YAML and include the config parameters in this list.

The idea for skip and intermittent flags came from planning discussions between @eble and myself. On prior projects with a similar golden structure (specifically, comparing an encoder's video stream output to a reference to ensure correctness), sometimes only specific cases would break, or might hang/crash/etc on some or all runs. It often made the most sense to keep such a test case defined, but temporarily skip it in CI runs while creating a ticket to fix it. Allowing workflows like this is what the skip flag is based on.

Since these are performance tests, the intermittent flag has a similar purpose, but for high performance variation rather than a consistent failure/regression. We run these cases but skip validating the performance results against the golden (except when passing a flag to check them).

Signed-off-by: Callum Mitchell <callumm@amd.com>
Co-authored-by: Claude
@amd-callumm amd-callumm force-pushed the callumm.attn_bench_test_ci branch from 6bc7e34 to fbfbfd3 Compare May 1, 2026 18:22
@amd-callumm amd-callumm changed the title [WiP] CI test for automated attention benchmarking suite CI test for automated attention benchmarking suite May 1, 2026
@amd-callumm
Author

PR comments addressed and tests are rebased on top of the gfx11 branch.

@amd-callumm amd-callumm merged commit aced523 into gfx11 May 1, 2026
3 of 4 checks passed