[bench] wvSplitK skinny GEMM: capture timed iters into a CUDA graph#928
Draft
mgehre-amd wants to merge 1 commit into gfx11 from
Conversation
The previous bench measured each kernel call with its own CUDA event pair and a synchronize() afterward. For sub-100us kernels on Strix Halo, the ~50us idle gap between iters lets the iGPU drop clocks, inflating per-call time by 5-15% versus what the model run actually sees (the model launches back-to-back on a stream, so the GPU never idles). This made the bench under-report bandwidth and overstate "improvements" from heuristic changes that simply pushed kernel time below the DVFS-induced floor.

Switch bench_dynamic to capture iters_per_replay launches (sized so a single replay runs for ~target_replay_ms of wall time) into a CUDA graph and time the replay end-to-end. An adaptive replay count keeps the same target_se_pct convergence behavior. Buffers still rotate via fn(i), so the cache-busting properties of the old loop are preserved.

Validated on bf16 against the in-model profile of
Intel/Qwen3.5-35B-A3B-int4-AutoRound (--no-cudagraph, --profile):

wvSplitK 1x1024x2048    bench old=30.1 us  new=27.0 us  profile=26.8 us
wvSplitK 1x248320x2048  bench old=4357 us  new=4329 us  profile=4430 us

The bench now matches the model-run time within ~1% on both shapes.

Tuning: target_se_pct=0.2, max_replays=40, target_replay_ms=20.0, max_time_s=1.0. Wall time on the full 12-shape x 4-batch sweep is ~30s (was ~9s). Repeated runs (with a 60s cooldown in between to keep the iGPU near 60C) agree on 46/48 shapes within 1%; the remaining outliers are a thermal noise floor that no measurement setting can remove without locking the GPU clock.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
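The adaptive-replay loop described above can be sketched in plain Python. This is a minimal illustration, not the PR's implementation: `bench_replay` and its `replay` callable are hypothetical names, and in the real bench `replay()` would be a CUDA graph replay followed by a device synchronize rather than an arbitrary host callable. The knob names (`target_se_pct`, `max_replays`, `max_time_s`) mirror the tuning parameters quoted in the description; the loop stops once the standard error of the mean replay time falls below `target_se_pct` percent of the mean, or a replay/time budget is exhausted.

```python
import math
import time

def bench_replay(replay, iters_per_replay, target_se_pct=0.2,
                 max_replays=40, max_time_s=1.0):
    """Time replay() end-to-end and derive a per-launch cost.

    Sketch of the adaptive-replay convergence scheme (hypothetical
    helper; in the actual bench, replay() would invoke graph.replay()
    plus a synchronize, covering iters_per_replay captured launches).
    """
    times = []
    deadline = time.perf_counter() + max_time_s
    while len(times) < max_replays:
        t0 = time.perf_counter()
        replay()  # one replay covers iters_per_replay launches
        times.append(time.perf_counter() - t0)
        if time.perf_counter() > deadline:
            break
        if len(times) >= 3:  # need a few samples for a stable SE
            n = len(times)
            mean = sum(times) / n
            var = sum((t - mean) ** 2 for t in times) / (n - 1)
            se = math.sqrt(var / n)
            if mean > 0 and 100.0 * se / mean < target_se_pct:
                break  # mean is converged to the target precision
    mean_replay = sum(times) / len(times)
    return mean_replay / iters_per_replay  # per-launch time in seconds
```

Timing whole replays rather than individual launches is what removes the per-iteration host-side gap: the standard-error check then only has to average out replay-to-replay noise (DVFS, interrupts), not launch overhead.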