CI test for automated attention benchmarking suite #897
Conversation
force-pushed from ab95a95 to 6bc7e34
eble-amd left a comment
This mostly looks good. Three things I recommend changing:
- rename the test function
- rename pct_change
- clarify config validation

The rest can be ignored.
    )
    def test_benchmark_regression(

    # Subprocess timeout (seconds)
    BENCHMARK_TIMEOUT = 900
This is not a blocking issue (because you're not running these tests in CI), but this is longer than the --timeout 300 in build-rocm-wheels.yml.
    block_size: 16

    batch_specs:
      - "q8ks128"
It looks like it is impossible to have 8k tokens to process but only 128 entries in the KV cache. Please double check.
For real inference with input tokens = 128/256/4096/8192, output_len=1, and optional spec decode with up to 4 draft tokens, the correct specs are:
| Phase | Input=128 | Input=4096 | Input=8192 |
|---|---|---|---|
| Prefill | q128 | q4k | q8k |
| Decode | q1s128 | q1s4k | q1s8k |
| Spec decode verify (4 tokens) | q4s128 | q4s4k | q4s8k |
Interestingly, this was also the only configuration that showed a 6% difference from the golden values on my machine, whereas all other configurations were within 0.8%.
Right, good catch. The intention was to run a mixed prefill/decode case with 8k input tokens and 128 output ones, matching a use case we're interested in. I had mistakenly thought that s128 referred to output token count, not KV cache size. I'll update all the configs to reflect my actual intentions (and regenerate the relevant goldens).
This case and the q8ks1 case, both with TRITON_ATTN, showed the most variation for me, also around 6%. This probably isn't a coincidence.
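For reference, the batch-spec naming as described here (qN = query/input tokens per request, sN = KV-cache entries, trailing k = ×1024) could be decoded roughly like the sketch below; `parse_batch_spec` is an illustrative helper, not something that exists in benchmark.py.

```python
import re

def parse_batch_spec(spec: str) -> dict:
    """Illustrative decoder for batch-spec strings such as "q8k", "q1s128",
    or "q4s8k": qN = query/input tokens per request, sN = KV-cache length
    (omitted for prefill-only cases), and a trailing "k" means x1024."""
    match = re.fullmatch(r"q(\d+k?)(?:s(\d+k?))?", spec)
    if match is None:
        raise ValueError(f"unrecognized batch spec: {spec}")

    def to_int(token: str) -> int:
        return int(token[:-1]) * 1024 if token.endswith("k") else int(token)

    query_tokens = to_int(match.group(1))
    kv_len = to_int(match.group(2)) if match.group(2) else None
    return {"query_tokens": query_tokens, "kv_cache_len": kv_len}

# Cases from the table above:
print(parse_batch_spec("q8k"))      # prefill: 8192 query tokens, no prior KV
print(parse_batch_spec("q1s8k"))    # decode: 1 query token, 8192-entry KV cache
print(parse_batch_spec("q4s128"))   # spec-decode verify: 4 query tokens, 128 KV entries
```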
    @pytest.mark.parametrize(
        "config_name,target_platforms",
        [
            ("gemma_2b_awq", ["gfx1151"]),
This list nicely describes the test cases that we want to run. Why do we need to track "skip" and "intermittent" on top of that?
This parameterized list works at YAML-file granularity, but each YAML can define a list of batch specs and backends to benchmark, while the "skip" and "intermittent" flags apply to individual config + batch + backend combinations or patterns. For example, each new config currently has 3 batch specs * 2 backends = 6 benchmarks per parameterized test call. To make this list more granular, we'd need to define a separate YAML for every single case, or pass the full configuration as command-line parameters to benchmark.py rather than as a combined YAML, and include the config parameters in this list.
The idea for the skip and intermittent flags came from planning discussions between @eble and me. On prior projects with a similar golden structure (specifically, comparing an encoder's video stream output to a reference to ensure correctness), sometimes only specific cases would break, or would hang/crash on some or all runs. It often made the most sense to keep such a test case defined but temporarily skip it in CI runs while creating a ticket to fix it. Allowing workflows like that is what the skip flag is for.
Since these are performance tests, the intermittent flag has a similar purpose, but for high performance variation rather than a consistent failure/regression: we run these cases but skip validating the performance results against the golden (except when passing a flag to check them).
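As a rough sketch of that intent (the pattern format, key layout, and names like `plan_case` are illustrative, not the actual implementation), the flags would apply per config + batch + backend combination like so:

```python
import fnmatch

# Illustrative per-case flag patterns; keys look like "config:batch_spec:backend".
SKIP = ["gemma_2b_awq:q8ks1:*"]          # defined, but not benchmarked (e.g. ticket open)
INTERMITTENT = ["*:q8ks1:TRITON_ATTN"]   # benchmarked, but golden check is opt-in

def _matches(key: str, patterns: list[str]) -> bool:
    return any(fnmatch.fnmatch(key, p) for p in patterns)

def plan_case(config: str, batch: str, backend: str, check_intermittent: bool) -> dict:
    """Decide whether to run a case and whether to validate it against the golden."""
    key = f"{config}:{batch}:{backend}"
    if _matches(key, SKIP):
        return {"run": False, "validate": False}
    if _matches(key, INTERMITTENT):
        return {"run": True, "validate": check_intermittent}
    return {"run": True, "validate": True}

# An intermittent case is still benchmarked, but only validated when the
# intermittent-check flag is passed (config name here is a placeholder):
print(plan_case("example_config", "q8ks1", "TRITON_ATTN", check_intermittent=False))
```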
Signed-off-by: Callum Mitchell <callumm@amd.com>
Co-authored-by: Claude
force-pushed from 6bc7e34 to fbfbfd3
PR comments addressed and tests rebased on top of the gfx11 branch.
Purpose
Leverage the existing pytest and (manual) attention backend benchmarking infrastructure to implement automated attention performance regression tests. Each test runs benchmark.py against a YAML config designed to imitate a model of interest (number of heads, head dimensions, etc.) while defining the batch specs (input/output tokens, batch count) and attention backends to run. The current set of tests covers the TRITON_ATTN and ROCM_AITER_UNIFIED_ATTN backends. Each model config includes a long-context prefill-only case, a decode-only case, and one prefill/decode combination of interest for Strix Halo. The YAML file also defines the number of warmup + benchmark iterations to run for each of these cases.
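To illustrate the shape such a config might take (apart from block_size and batch_specs, which appear in the review diff above, the field names and values here are placeholders rather than the actual schema):

```python
import yaml

# Illustrative config only; not the real schema used by benchmark.py.
EXAMPLE_CONFIG = yaml.safe_load("""
num_q_heads: 32
num_kv_heads: 8
head_dim: 128
block_size: 16
backends: [TRITON_ATTN, ROCM_AITER_UNIFIED_ATTN]
warmup_iters: 5
benchmark_iters: 20
batch_specs:
  - "q8k"      # long-context prefill-only
  - "q1s8k"    # decode-only
  - "q4s8k"    # placeholder for the mixed prefill/decode case of interest
""")

print(EXAMPLE_CONFIG["batch_specs"])
```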
The automated tests run each model config + batch spec + backend combination, with a 10-second cooldown between each to minimize the risk of GPU thermal throttling that could lead to unstable results. Each case's results are written to a JSON file under tests/kernels/attention/benchmark/output/<gfx_target>/, which is compared to a golden reference/baseline.
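The golden check amounts to a percentage-change comparison on the mean time-per-iteration; a minimal sketch, assuming a JSON field name and a 10% tolerance that may differ from the actual code:

```python
import json
from pathlib import Path

def pct_change(measured: float, golden: float) -> float:
    """Percent change of the measured mean time-per-iteration vs. the golden."""
    return (measured - golden) / golden * 100.0

def within_golden(result_path: Path, golden_path: Path, tolerance_pct: float = 10.0) -> bool:
    """Compare a case's output JSON to its golden; field name and tolerance are assumptions."""
    measured = json.loads(result_path.read_text())["mean_time_per_iter_ms"]
    golden = json.loads(golden_path.read_text())["mean_time_per_iter_ms"]
    return abs(pct_change(measured, golden)) <= tolerance_pct
```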
Currently, these tests are Strix Halo only, but the infrastructure can easily support other platforms such as Strix Point.
Test cases can be marked as "skip" to avoid running the benchmarks, or "intermittent" to mark tests as working but with unstable performance. Intermittent cases' performance will only be compared to the golden when the --attn-bench-intermittent flag is passed.
For now, these tests are not run in any CI job (similar to @eble-amd, I saw far slower performance running on the CI machine compared to my local one; until this is understood, the CI job will not be useful).
Test command
pytest tests/kernels/attention/benchmark/test_benchmark_attention.py::test_benchmark_regression [--attn-bench-intermittent]

Test Result
Across 5 consecutive runs on my local machine, all 30 test cases (5 model configs * 3 batch specs * 2 backends) showed less than 10% variance in mean time-per-iteration compared to the goldens; 27 of them stayed under 1% variance across all 5 runs.
No tests currently require the skip or intermittent flags, but both of these flags have been manually validated during development.
During test runs, I monitored my Strix Halo machine's GPU temperature at 5-second intervals and found that a 10-second cooldown was sufficient to keep the edge temperature below 65°C, well below typical thermal-throttling thresholds, even with repeated runs of the test suite. I haven't tried to push this interval any lower.
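For reference, the cooldown is just a fixed sleep between benchmark subprocess invocations; a minimal sketch (the benchmark.py arguments are placeholders):

```python
import subprocess
import sys
import time

COOLDOWN_SECONDS = 10      # pause between cases so GPU edge temperature can settle
BENCHMARK_TIMEOUT = 900    # per-subprocess timeout, as in the test

def run_cases(case_args: list[list[str]]) -> None:
    """Run each benchmark case in its own subprocess with a cooldown in between."""
    for i, args in enumerate(case_args):
        subprocess.run([sys.executable, "benchmark.py", *args],
                       check=True, timeout=BENCHMARK_TIMEOUT)
        if i < len(case_args) - 1:
            time.sleep(COOLDOWN_SECONDS)
```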