add tuned GEMM config for Qwen3.5 bf16 on MI308X by zovonoir · Pull Request #3077 · ROCm/aiter

zovonoir · 2026-05-08T01:26:57Z

Summary

Add a hipBLASLt-tuned GEMM configuration for Qwen3.5-27B bf16 inference on MI308X (80 CUs, TP2)
Covers common M/N/K shapes for dense linear layers

Benchmark

MI308X, Qwen3.5-27B, ATOM SGLang plugin mode, TP2, ISL=60381, OSL=132:

| Version | TPOT (ms) | TTFT (ms) | Tput/GPU (tok/s) |
|---|---|---|---|
| ATOM baseline | 18.9 | 17562 | 1510 |
| + layernorm + qwen3_next optimizations (ROCm/ATOM#708, #3051) | 15.6 | 15300 | 1605 |
| + tuned GEMM (this patch) | 14.8 | 12613 | 2078 |

TTFT reduced by 18% (15300 → 12613 ms)
Throughput improved by 29% (1605 → 2078 tok/s/GPU)
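
The percentages above can be reproduced from the table with a few lines of arithmetic. The sketch below is only a sanity check on the reported numbers; the helper name `pct_change` and the dictionary keys are illustrative and do not come from the PR.

```python
def pct_change(before: float, after: float) -> float:
    """Signed relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Values copied from the benchmark table in the PR description.
# "previous" is the layernorm + qwen3_next row, "tuned" is this patch.
previous = {"tpot_ms": 15.6, "ttft_ms": 15300, "tput_tok_s": 1605}
tuned = {"tpot_ms": 14.8, "ttft_ms": 12613, "tput_tok_s": 2078}

changes = {key: pct_change(previous[key], tuned[key]) for key in previous}

for key, value in changes.items():
    # Negative is better for latencies, positive is better for throughput.
    print(f"{key}: {value:+.1f}%")
```

Both headline figures are measured against the previous row (layernorm + qwen3_next optimizations), not against the ATOM baseline.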

Files Changed

aiter/configs/model_configs/qwen35_bf16_tuned_gemm.csv — tuning config
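
To illustrate the general idea of a shape-keyed tuning table, here is a minimal sketch of a lookup by (M, N, K). It is not aiter's loader: the column names (`M`, `N`, `K`, `libtype`, `solidx`), the sample rows, and the helper functions are all assumptions made for illustration, and none of the values are taken from `qwen35_bf16_tuned_gemm.csv`.

```python
import csv
import io

# Hypothetical rows: column names and values are illustrative only and are
# NOT copied from the config file added in this PR.
SAMPLE_CSV = """M,N,K,libtype,solidx
1,4096,4096,hipblaslt,101
64,4096,4096,hipblaslt,205
"""


def load_tuned_table(text):
    """Map each (M, N, K) shape to the tuned entry recorded for it."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = (int(row["M"]), int(row["N"]), int(row["K"]))
        table[key] = {"libtype": row["libtype"], "solidx": int(row["solidx"])}
    return table


def lookup(table, m, n, k):
    """Return the tuned entry for a shape, or None if the shape was not tuned."""
    return table.get((m, n, k))


table = load_tuned_table(SAMPLE_CSV)
hit = lookup(table, 64, 4096, 4096)   # shape present in the sample table
miss = lookup(table, 7, 4096, 4096)   # untuned shape, caller would fall back
print(hit, miss)
```

In a scheme like this, a shape that is missing from the table simply falls back to the default kernel choice, which is why a config of this kind focuses on the shapes a model actually hits.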

Test plan

End-to-end benchmark on MI308X with Qwen3.5-27B
CI

🤖 Generated with Claude Code

hipblaslt tuned GEMM configuration for Qwen3.5-27B bf16 inference on MI308X (80 CUs, TP2). Covers common M/N/K shapes for dense linear layers.

Benchmark (MI308X, Qwen3.5-27B, TP2, ISL=60381, OSL=132):

| Version | TPOT(ms) | TTFT(ms) | Tput/GPU |
|---------------------------------|----------|----------|----------|
| ATOM baseline | 18.7 | 15266 | 1574 |
| + layernorm + qwen3_next optim | 15.6 | 15300 | 1605 |
| + tuned GEMM (this patch) | 15.0 | 11405 | 2088 |

TTFT reduced by 25% (15300 → 11405 ms), throughput +30% (1605 → 2088).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Copilot wasn't able to review any files in this pull request.

github-actions · 2026-05-08T01:27:39Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
|---|---|
| `ci:triton-300x` | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| `ci:sglang` | SGLang integration tests |
| `ci:atom` | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| `ci:vllm` | vLLM benchmark |
| `ci:all` | All of the above |

Add labels via the sidebar or `gh pr edit 3077 --add-label <label>`

zovonoir · 2026-05-08T03:02:06Z

I will choose other libtypes and try again.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

zovonoir requested review from a team and Copilot May 8, 2026 01:26

Copilot AI reviewed May 8, 2026

zovonoir and others added 2 commits May 8, 2026 19:32

update tuned GEMM config with refined shapes

96ca193

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

update tuned GEMM config with improved kernel selections

3b9ca0d

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

zovonoir mentioned this pull request May 9, 2026

perf: optimize GDN decode with SGLang fused recurrent kernel ROCm/ATOM#727

Open

2 tasks

zovonoir wants to merge 3 commits into ROCm:main from zovonoir:perf/qwen35-bf16-tuned-gemm

2 participants
