
qwen3next/qwen3.5 fix benchmark #681

Closed
ganyi1996ppo wants to merge 1 commit into main from ganyi/fix_benchmark

Conversation

@ganyi1996ppo
Contributor

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Copilot AI review requested due to automatic review settings May 2, 2026 09:45
@ganyi1996ppo force-pushed the ganyi/fix_benchmark branch from 121950d to 403859d on May 2, 2026 09:48

Copilot AI left a comment


Pull request overview

This PR appears to fix benchmark/performance issues for Qwen3-Next and Qwen3.5 by removing preallocated output buffers in the attention/GDN paths, adding a fused Triton GemmaRMSNorm kernel, and adjusting the Triton causal-conv kernels to avoid AMD pointer-canonicalization crashes.

Changes:

  • Remove output out-parameter patterns in Qwen3-Next/Qwen3.5 attention/GatedDeltaNet forward paths and return projected outputs directly.
  • Add a new Triton fused GemmaRMSNorm (+ optional residual add) implementation and route GemmaRMSNorm.forward_cuda to it (a minimal sketch follows this list).
  • Refactor Triton mamba causal conv1d kernels to avoid pointer reassignment, an AMD canonicalization issue (see the sketch after the quoted kernel excerpt further below).
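Below is a minimal sketch of what such a fused GemmaRMSNorm (+ optional residual add) Triton kernel can look like. It is an illustration under assumptions (one program per row, contiguous row-major input, hidden size fitting a single block), and the names _gemma_rmsnorm_kernel / gemma_rmsnorm are hypothetical, not the PR's actual identifiers in atom/model_ops/triton_gemma_rmsnorm.py. The (1.0 + w) factor reflects Gemma's convention of storing the norm weight as an offset from 1.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _gemma_rmsnorm_kernel(
        x_ptr, res_ptr, w_ptr, out_ptr,
        n_cols, eps,
        HAS_RESIDUAL: tl.constexpr,
        BLOCK_N: tl.constexpr,
    ):
        row = tl.program_id(0)
        cols = tl.arange(0, BLOCK_N)
        mask = cols < n_cols
        offs = row * n_cols + cols
        x = tl.load(x_ptr + offs, mask=mask, other=0.0).to(tl.float32)
        if HAS_RESIDUAL:
            res = tl.load(res_ptr + offs, mask=mask, other=0.0).to(tl.float32)
            x = x + res
            # Persist x + residual so the next layer's residual stream sees it.
            tl.store(res_ptr + offs, x.to(res_ptr.dtype.element_ty), mask=mask)
        var = tl.sum(x * x, axis=0) / n_cols
        inv_rms = 1.0 / tl.sqrt(var + eps)
        w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
        y = x * inv_rms * (1.0 + w)  # Gemma applies (1 + weight), not weight
        tl.store(out_ptr + offs, y.to(out_ptr.dtype.element_ty), mask=mask)

    def gemma_rmsnorm(x, weight, residual=None, eps=1e-6):
        # Assumes x (and residual, if given) are contiguous over the last dim.
        x2d = x.reshape(-1, x.shape[-1])
        n_rows, n_cols = x2d.shape
        out = torch.empty_like(x)
        _gemma_rmsnorm_kernel[(n_rows,)](
            x, residual if residual is not None else x, weight, out,
            n_cols, eps,
            HAS_RESIDUAL=residual is not None,
            BLOCK_N=triton.next_power_of_2(n_cols),
        )
        return out

Fusing the residual add into the norm kernel saves one full read-modify-write pass over the hidden states per layer, which is typically the motivation for this kind of change.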

Reviewed changes

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Files changed:

  • atom/models/qwen3_next.py: Removes out-buffer writes in attention/GDN, switches to returning new tensors.
  • atom/models/qwen3_5.py: Same output-buffer removal pattern; adds FP8 path for in_proj_qkvz.
  • atom/model_ops/triton_gemma_rmsnorm.py: New Triton kernel + launcher for fused Gemma RMSNorm (+ optional residual add).
  • atom/model_ops/mamba_ops/causal_conv1d.py: Reworks stores to q/k/v to avoid pointer reassignment in Triton.
  • atom/model_ops/layernorm.py: Routes GemmaRMSNorm.forward_cuda to the new Triton launcher.
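For context, a hedged PyTorch sketch of the out-buffer removal the first two rows describe; the module and method names here are illustrative, not taken from atom/models/qwen3_next.py. In the old style the caller preallocates out and the layer writes into it; in the new style the layer simply returns the projection and lets the allocator manage the buffer.

    import torch

    class GdnBlockBefore(torch.nn.Module):
        """Old pattern: caller preallocates `out` and forward writes into it."""
        def __init__(self, hidden: int):
            super().__init__()
            self.out_proj = torch.nn.Linear(hidden, hidden, bias=False)

        def forward(self, x: torch.Tensor, out: torch.Tensor) -> None:
            torch.matmul(x, self.out_proj.weight.t(), out=out)

    class GdnBlockAfter(torch.nn.Module):
        """New pattern: forward returns a fresh projected tensor."""
        def __init__(self, hidden: int):
            super().__init__()
            self.out_proj = torch.nn.Linear(hidden, hidden, bias=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.out_proj(x)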


Comment on lines +95 to +103
# Pre-compute per-output feature indices and block-level masks.
# BLOCK_N divides k_dim_size evenly, so each program block falls entirely
# within one of q/k/v — only one mask is all-true per block.
q_feat_idx = idx_feats
k_feat_idx = idx_feats - k_start_dim
v_feat_idx = idx_feats - v_start_dim
is_q_block = idx_feats < k_start_dim
is_k_block = (idx_feats >= k_start_dim) & (idx_feats < v_start_dim)
is_v_block = idx_feats >= v_start_dim
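As a hedged illustration of how these masks are then consumed (assuming a computed tile vals, a token-validity mask mask_t, and q/k/v base pointers from the enclosing kernel, none of which appear in the excerpt above): instead of conditionally reassigning a single destination pointer, which is the pattern that crashed AMD's pointer canonicalization, the kernel can issue three unconditional masked stores.

    # Hypothetical continuation of the kernel quoted above. Per block, exactly
    # one of is_q_block / is_k_block / is_v_block is all-true, so two of the
    # three stores are fully masked out and become no-ops.
    tl.store(q_ptr + q_feat_idx, vals, mask=mask_t & is_q_block)
    tl.store(k_ptr + k_feat_idx, vals, mask=mask_t & is_k_block)
    tl.store(v_ptr + v_feat_idx, vals, mask=mask_t & is_v_block)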
Signed-off-by: ganyi <ygan@amd.com>
@ganyi1996ppo force-pushed the ganyi/fix_benchmark branch from 403859d to e80dd0e on May 6, 2026 12:25
@ganyi1996ppo
Contributor Author

put this fix into #682
