qwen3next/qwen3.5 fix benchmark #681
Closed
ganyi1996ppo wants to merge 1 commit into main from
Conversation
Force-pushed 121950d to 403859d
Pull request overview
This PR appears aimed at fixing benchmark/performance issues for Qwen3-Next and Qwen3.5 by removing preallocated output buffers in attention/GDN paths, adding a fused Triton GemmaRMSNorm kernel, and adjusting Triton causal-conv kernels to avoid AMD pointer-canonicalization crashes.
Changes:
- Remove `output` out-parameter patterns in the Qwen3-Next/Qwen3.5 attention/GatedDeltaNet forward paths and return projected outputs directly (sketched below).
- Add a new Triton fused GemmaRMSNorm (+ optional residual add) implementation and route `GemmaRMSNorm.forward_cuda` to it.
- Refactor the Triton mamba causal conv1d kernels to avoid pointer reassignment (AMD canonicalization issue).
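For orientation, a minimal sketch of the out-parameter removal described in the first bullet; function and parameter names here are hypothetical, not the actual atom code:

```python
import torch

def attn_out_before(x, w_o, output):
    # Old pattern: the caller preallocates `output` and the layer
    # writes the final projection into that caller-owned buffer.
    torch.matmul(x, w_o.T, out=output)
    return output

def attn_out_after(x, w_o):
    # New pattern: the layer allocates and returns the projected
    # output directly; there is no caller-owned buffer to keep in sync.
    return x @ w_o.T
```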
Reviewed changes
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| atom/models/qwen3_next.py | Removes out-buffer writes in attention/GDN, switches to returning new tensors. |
| atom/models/qwen3_5.py | Same output-buffer removal pattern; adds FP8 path for in_proj_qkvz. |
| atom/model_ops/triton_gemma_rmsnorm.py | New Triton kernel + launcher for fused Gemma RMSNorm (+ optional residual add). |
| atom/model_ops/mamba_ops/causal_conv1d.py | Reworks stores to q/k/v to avoid pointer reassignment in Triton. |
| atom/model_ops/layernorm.py | Routes GemmaRMSNorm.forward_cuda to the new Triton launcher. |
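For context on the triton_gemma_rmsnorm.py entry, here is a hedged sketch of what a fused Gemma RMSNorm kernel with an optional residual add can look like; the actual file may differ in signatures, dtypes, and tuning. The Gemma convention scales by (1 + weight) rather than weight:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def _gemma_rmsnorm_kernel(x_ptr, res_ptr, w_ptr, out_ptr, n_cols, eps,
                          HAS_RESIDUAL: tl.constexpr, BLOCK_N: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_N)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
    if HAS_RESIDUAL:
        res = tl.load(res_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
        x = x + res  # fused residual add before the norm
    var = tl.sum(x * x, axis=0) / n_cols
    x_hat = x / tl.sqrt(var + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    y = x_hat * (1.0 + w)  # Gemma applies (1 + weight), not weight
    tl.store(out_ptr + row * n_cols + cols, y, mask=mask)

def gemma_rmsnorm(x, weight, eps=1e-6, residual=None):
    hidden = x.shape[-1]
    x2d = x.reshape(-1, hidden).contiguous()
    res2d = residual.reshape(-1, hidden).contiguous() if residual is not None else x2d
    out = torch.empty_like(x2d)
    n_rows, n_cols = x2d.shape
    BLOCK_N = triton.next_power_of_2(n_cols)
    _gemma_rmsnorm_kernel[(n_rows,)](
        x2d, res2d, weight, out, n_cols, eps,
        HAS_RESIDUAL=residual is not None, BLOCK_N=BLOCK_N,
    )
    return out.reshape(x.shape)
```

Fusing the residual add into the same kernel saves one global-memory round trip per layer, which is the usual motivation for this kind of fusion.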
Comment on lines +95 to +103:
```python
# Pre-compute per-output feature indices and block-level masks.
# BLOCK_N divides k_dim_size evenly, so each program block falls entirely
# within one of q/k/v; only one mask is all-true per block.
q_feat_idx = idx_feats
k_feat_idx = idx_feats - k_start_dim
v_feat_idx = idx_feats - v_start_dim
is_q_block = idx_feats < k_start_dim
is_k_block = (idx_feats >= k_start_dim) & (idx_feats < v_start_dim)
is_v_block = idx_feats >= v_start_dim
```
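The snippet above replaces a pointer-reassignment pattern (selecting one of the q/k/v destination pointers in control flow) with fixed pointers plus block-level masks. A self-contained sketch of the reworked store pattern, with assumed names rather than the actual causal_conv1d.py internals:

```python
import triton
import triton.language as tl

@triton.jit
def _split_store_kernel(src_ptr, q_ptr, k_ptr, v_ptr,
                        k_start_dim, v_start_dim, n_feats,
                        BLOCK_N: tl.constexpr):
    idx_feats = tl.program_id(0) * BLOCK_N + tl.arange(0, BLOCK_N)
    mask = idx_feats < n_feats
    val = tl.load(src_ptr + idx_feats, mask=mask, other=0.0)
    # Problematic pattern (avoided here): dst = q_ptr; if in_k: dst = k_ptr; ...
    # Reassigning the pointer across branches can crash AMD pointer
    # canonicalization, so compute block-level masks instead.
    is_q = idx_feats < k_start_dim
    is_k = (idx_feats >= k_start_dim) & (idx_feats < v_start_dim)
    is_v = idx_feats >= v_start_dim
    # Masked stores to fixed base pointers; no pointer variable is reassigned.
    tl.store(q_ptr + idx_feats, val, mask=mask & is_q)
    tl.store(k_ptr + (idx_feats - k_start_dim), val, mask=mask & is_k)
    tl.store(v_ptr + (idx_feats - v_start_dim), val, mask=mask & is_v)
```

Because each store uses a fixed base pointer, the backend never has to canonicalize a pointer whose value depends on control flow.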
Signed-off-by: ganyi <ygan@amd.com>
Force-pushed 403859d to e80dd0e
Contributor (Author)
put this fix into #682
Motivation
Technical Details
Test Plan
Test Result
Submission Checklist