Improve int4 AWQ GEMV performance on Radeon 8060S and similar GPUs #886
Draft
eble-amd wants to merge 3 commits into ROCm:gfx11 from
Signed-off-by: Dan Eble <Dan.Eble@amd.com>
When the int4 weight matrix exceeds L2 cache, wider memory loads (ACHUNK=32 vs 16) improve bandwidth by up to 10% on the wvSplitK_int4_g kernel. The L2 size is queried at runtime via hipDeviceProp, so the threshold adapts to different GPUs.

Measured on Radeon 8060S (gfx1151, 2 MiB L2):
- 1x2048x16384: 142 -> 156 GiB/s (+10%)
- 1x32768x2048: 199 -> 199 GiB/s (no change)
- Gemma-2B AWQ decode TPOT: 10.86 -> 10.63 ms (-2%)

Add Gemma-2B AWQ and W4-L2-cache-boundary shapes to benchmark_hybrid_w4a16_gemm.py.

Signed-off-by: Dan Eble <Dan.Eble@amd.com>
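The selection rule described in this commit can be sketched on the host side as follows. This is a minimal illustration, not the code from the PR: `int4_weight_bytes` and `select_achunk` are hypothetical helper names, and the sketch assumes the threshold is a plain comparison of packed weight size against L2 size. In a real build the L2 size would presumably come from the `l2CacheSize` field of `hipDeviceProp_t` (filled by `hipGetDeviceProperties`); here it is passed in as a parameter so the snippet compiles without HIP.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: int4 packs two weights per byte, so an
// n x k weight matrix occupies n * k / 2 bytes.
constexpr std::size_t int4_weight_bytes(std::size_t n, std::size_t k) {
    return n * k / 2;
}

// Hypothetical helper: pick the wider load (ACHUNK=32) only when the
// weights do not fit in L2; otherwise keep the narrower one (ACHUNK=16).
// Assumption: the real threshold may include extra terms not shown in the PR text.
constexpr int select_achunk(std::size_t weight_bytes, std::size_t l2_bytes) {
    return weight_bytes > l2_bytes ? 32 : 16;
}
```

For example, if the last two numbers of the 1x2048x16384 shape are read as the weight dimensions, the packed weights take 16 MiB, which exceeds the 2 MiB L2 of the Radeon 8060S, so this sketch would select ACHUNK=32.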
Experiments showed a performance hit when the weight row stride is a multiple of 4096 bytes. Offset each workgroup's K-loop start by one iteration stride and wrap around. A small residual s_sleep stagger covers cases where multiple workgroups still share an offset.

Measured on Radeon 8060S (gfx1151):
- 1x2048x16384: 156 -> 184 GiB/s (+18%)
- 1x32768x2048: 199 -> 199 GiB/s (no change)
- Gemma-2B AWQ decode TPOT: 10.63 -> 10.34 ms (-3%)

Signed-off-by: Dan Eble <Dan.Eble@amd.com>
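One way to picture the staggered K-loop is the host-side sketch below. It is an interpretation of the commit message, not the kernel code: `staggered_k_order` is a hypothetical function, and it assumes each workgroup shifts its starting iteration by its own index (modulo the iteration count) and that `k_total` is a multiple of `k_step`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the K offsets that workgroup `wg` visits, in order.
// Each workgroup starts `wg` iterations into the loop and wraps around, so
// neighbouring workgroups begin at different K offsets.
std::vector<std::size_t> staggered_k_order(std::size_t wg,
                                           std::size_t k_total,
                                           std::size_t k_step) {
    const std::size_t iters = k_total / k_step;  // assumes k_total % k_step == 0
    const std::size_t start = wg % iters;        // per-workgroup starting iteration
    std::vector<std::size_t> order;
    order.reserve(iters);
    for (std::size_t i = 0; i < iters; ++i) {
        order.push_back(((start + i) % iters) * k_step);  // wrap around at the end
    }
    return order;
}
```

With `k_total = 8` and `k_step = 2`, workgroup 0 visits offsets 0, 2, 4, 6 and workgroup 1 visits 2, 4, 6, 0. Workgroup 5 maps to the same start as workgroup 1, which is the kind of collision the residual s_sleep stagger in the commit is meant to cover.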
Purpose
Improve GEMV performance on Radeon 8060S and similar GPUs.
Test Plan
Test Results
Copied from commit messages:
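For the L2-size change (Radeon 8060S, gfx1151, 2 MiB L2):
- 1x2048x16384: 142 -> 156 GiB/s (+10%)
- 1x32768x2048: 199 -> 199 GiB/s (no change)
- Gemma-2B AWQ decode TPOT: 10.86 -> 10.63 ms (-2%)

For the staggering change (Radeon 8060S, gfx1151):
- 1x2048x16384: 156 -> 184 GiB/s (+18%)
- 1x32768x2048: 199 -> 199 GiB/s (no change)
- Gemma-2B AWQ decode TPOT: 10.63 -> 10.34 ms (-3%)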