
qwen3next/qwen3.5 fix benchmark #681

Closed
ganyi1996ppo wants to merge 1 commit into main from ganyi/fix_benchmark

Conversation

@ganyi1996ppo
Contributor

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Copilot AI review requested due to automatic review settings May 2, 2026 09:45
@ganyi1996ppo force-pushed the ganyi/fix_benchmark branch from 121950d to 403859d on May 2, 2026 09:48

Copilot AI left a comment


Pull request overview

This PR appears to fix benchmark/performance issues for Qwen3-Next and Qwen3.5 by removing preallocated output buffers in the attention/GDN paths, adding a fused Triton GemmaRMSNorm kernel, and adjusting the Triton causal-conv kernels to avoid AMD pointer-canonicalization crashes.

Changes:

  • Remove output out-parameter patterns in Qwen3-Next/Qwen3.5 attention/GatedDeltaNet forward paths and return projected outputs directly.
  • Add a new Triton fused GemmaRMSNorm (+ optional residual add) implementation and route GemmaRMSNorm.forward_cuda to it (a minimal sketch follows this list).
  • Refactor Triton mamba causal conv1d kernels to avoid pointer reassignment, an AMD canonicalization issue (see the sketch after the quoted kernel excerpt further below).
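Below is a minimal sketch of what such a fused GemmaRMSNorm (+ optional residual add) Triton kernel can look like. It is an illustration under assumptions (one program per row, contiguous row-major input, hidden size fitting a single block), and the names _gemma_rmsnorm_kernel / gemma_rmsnorm are hypothetical, not the PR's actual identifiers in atom/model_ops/triton_gemma_rmsnorm.py. The (1.0 + w) factor reflects Gemma's convention of storing the norm weight as an offset from 1.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _gemma_rmsnorm_kernel(
        x_ptr, res_ptr, w_ptr, out_ptr,
        n_cols, eps,
        HAS_RESIDUAL: tl.constexpr,
        BLOCK_N: tl.constexpr,
    ):
        row = tl.program_id(0)
        cols = tl.arange(0, BLOCK_N)
        mask = cols < n_cols
        offs = row * n_cols + cols
        x = tl.load(x_ptr + offs, mask=mask, other=0.0).to(tl.float32)
        if HAS_RESIDUAL:
            res = tl.load(res_ptr + offs, mask=mask, other=0.0).to(tl.float32)
            x = x + res
            # Persist x + residual so the next layer's residual stream sees it.
            tl.store(res_ptr + offs, x.to(res_ptr.dtype.element_ty), mask=mask)
        var = tl.sum(x * x, axis=0) / n_cols
        inv_rms = 1.0 / tl.sqrt(var + eps)
        w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
        y = x * inv_rms * (1.0 + w)  # Gemma applies (1 + weight), not weight
        tl.store(out_ptr + offs, y.to(out_ptr.dtype.element_ty), mask=mask)

    def gemma_rmsnorm(x, weight, residual=None, eps=1e-6):
        # Assumes x (and residual, if given) are contiguous over the last dim.
        x2d = x.reshape(-1, x.shape[-1])
        n_rows, n_cols = x2d.shape
        out = torch.empty_like(x)
        _gemma_rmsnorm_kernel[(n_rows,)](
            x, residual if residual is not None else x, weight, out,
            n_cols, eps,
            HAS_RESIDUAL=residual is not None,
            BLOCK_N=triton.next_power_of_2(n_cols),
        )
        return out

Fusing the residual add into the norm kernel saves one full read-modify-write pass over the hidden states per layer, which is typically the motivation for this kind of change.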

Reviewed changes

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Files changed:

  • atom/models/qwen3_next.py: Removes out-buffer writes in attention/GDN, switches to returning new tensors.
  • atom/models/qwen3_5.py: Same output-buffer removal pattern; adds FP8 path for in_proj_qkvz.
  • atom/model_ops/triton_gemma_rmsnorm.py: New Triton kernel + launcher for fused Gemma RMSNorm (+ optional residual add).
  • atom/model_ops/mamba_ops/causal_conv1d.py: Reworks stores to q/k/v to avoid pointer reassignment in Triton.
  • atom/model_ops/layernorm.py: Routes GemmaRMSNorm.forward_cuda to the new Triton launcher.
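For context, a hedged PyTorch sketch of the out-buffer removal the first two rows describe; the module and method names here are illustrative, not taken from atom/models/qwen3_next.py. In the old style the caller preallocates out and the layer writes into it; in the new style the layer simply returns the projection and lets the allocator manage the buffer.

    import torch

    class GdnBlockBefore(torch.nn.Module):
        """Old pattern: caller preallocates `out` and forward writes into it."""
        def __init__(self, hidden: int):
            super().__init__()
            self.out_proj = torch.nn.Linear(hidden, hidden, bias=False)

        def forward(self, x: torch.Tensor, out: torch.Tensor) -> None:
            torch.matmul(x, self.out_proj.weight.t(), out=out)

    class GdnBlockAfter(torch.nn.Module):
        """New pattern: forward returns a fresh projected tensor."""
        def __init__(self, hidden: int):
            super().__init__()
            self.out_proj = torch.nn.Linear(hidden, hidden, bias=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.out_proj(x)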


Comment on lines +95 to +103
# Pre-compute per-output feature indices and block-level masks.
# BLOCK_N divides k_dim_size evenly, so each program block falls entirely
# within one of q/k/v — only one mask is all-true per block.
q_feat_idx = idx_feats
k_feat_idx = idx_feats - k_start_dim
v_feat_idx = idx_feats - v_start_dim
is_q_block = idx_feats < k_start_dim
is_k_block = (idx_feats >= k_start_dim) & (idx_feats < v_start_dim)
is_v_block = idx_feats >= v_start_dim
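As a hedged illustration of how these masks are then consumed (assuming a computed tile vals, a token-validity mask mask_t, and q/k/v base pointers from the enclosing kernel, none of which appear in the excerpt above): instead of conditionally reassigning a single destination pointer, which is the pattern that crashed AMD's pointer canonicalization, the kernel can issue three unconditional masked stores.

    # Hypothetical continuation of the kernel quoted above. Per block, exactly
    # one of is_q_block / is_k_block / is_v_block is all-true, so two of the
    # three stores are fully masked out and become no-ops.
    tl.store(q_ptr + q_feat_idx, vals, mask=mask_t & is_q_block)
    tl.store(k_ptr + k_feat_idx, vals, mask=mask_t & is_k_block)
    tl.store(v_ptr + v_feat_idx, vals, mask=mask_t & is_v_block)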
Signed-off-by: ganyi <ygan@amd.com>
@ganyi1996ppo force-pushed the ganyi/fix_benchmark branch from 403859d to e80dd0e on May 6, 2026 12:25
@ganyi1996ppo
Contributor Author

put this fix into #682
