[New feature] integrate causal_conv1d Triton kernel for Ascend NPU #228

Merged
tpx818 merged 12 commits into modelscope:main from ys2025-AI:main on Jun 23, 2026


@ys2025-AI ys2025-AI commented Jun 18, 2026

Collaborator

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Integrate the causal_conv1d Triton kernel for Ascend NPU.

Experiment results

Model: Qwen3.5-4B
Hardware: Atlas 900 A3 (2 x NPU)
Dataset: GSM8K_ZH
Finetuning type: LoRA
Software: CANN 9.0.0 + torch/torch_npu 2.9.0 + MindSpeed 0.12.1 + triton-ascend 3.2.1 + transformers 5.9

| Metric | Baseline | causal_conv1d optimized | Difference |
| --- | --- | --- | --- |
| Speedup | 1.0x | 1.12x | |
| Average loss | 0.6449 | 0.6456 | 0.0007 |

related: https://gitcode.com/Ascend/MindSpeed-Ops

opencode and others added 6 commits June 15, 2026 21:49
Add self-contained causal_conv1d kernel module (no mindspeed_ops dependency)
with full Triton forward/backward implementations adapted from MindSpeed-Ops.
Patch monkey_patch_npu to bind npu_causal_conv1d_fn on NPU-patched modules,
remove torch fallback in linear_attention_sp, and add NPU-aware causal_conv1d
wrapper in gdn_padding_free (no transpose needed, [B,T,D] native format).
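
For reference, the operation this kernel accelerates can be written as a plain-Python causal depthwise convolution over the [B,T,D] layout mentioned above. This is a minimal sketch for checking semantics, not the Triton code from the PR: the function name `causal_conv1d_ref`, the `[D][W]` weight layout, and the absence of an activation are all assumptions made for illustration.

```python
def causal_conv1d_ref(x, weight, bias=None):
    """Reference causal depthwise conv1d over nested lists shaped [B][T][D].

    weight is [D][W]: one filter of width W per channel (assumed layout).
    bias is an optional list of length D.
    """
    B, T, D = len(x), len(x[0]), len(x[0][0])
    W = len(weight[0])
    out = [[[0.0] * D for _ in range(T)] for _ in range(B)]
    for b in range(B):
        for t in range(T):
            for d in range(D):
                acc = bias[d] if bias is not None else 0.0
                for k in range(W):
                    # Tap k reads the input (W - 1 - k) steps in the past,
                    # so the last tap multiplies the current time step.
                    src = t - (W - 1) + k
                    if src >= 0:  # positions before t=0 act as zero padding
                        acc += weight[d][k] * x[b][src][d]
                out[b][t][d] = acc
    return out


# One batch, three time steps, one channel, filter width 2.
x = [[[1.0], [2.0], [3.0]]]
w = [[1.0, 10.0]]
y = causal_conv1d_ref(x, w)
```

Because each output at time `t` only reads inputs at times `<= t`, changing a later input never alters an earlier output, which is the property a forward kernel has to preserve.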

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request introduces a self-contained, NPU-accelerated causal_conv1d Triton kernel module to support Ascend NPUs, integrating it into the monkey patching, sequence parallel, and padding-free GDN mechanisms. The code review identified several critical issues and bugs: a missing HAS_WEIGHT guard in the backward kernel when storing dw, a shape mismatch and argument-dropping bug in the NPU wrapper within gdn_padding_free.py, compatibility issues with smaller feature dimensions due to a hardcoded block size (BD = 256), and an ignored bias parameter in the forward update kernel. Additionally, an optimization was suggested for _prepare_chunk_indices to avoid host-device synchronization.
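
One common way to avoid the kind of problem raised about the hardcoded block size (BD = 256) is to cap the block at the next power of two of the feature dimension. The sketch below illustrates that idea only; the helper names `next_power_of_2`, `pick_block_d`, and `num_blocks` are hypothetical, and this is not necessarily the fix that was merged.

```python
def next_power_of_2(n: int) -> int:
    """Smallest power of two that is >= n (n must be positive)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 << (n - 1).bit_length()


def pick_block_d(d: int, max_block: int = 256) -> int:
    """Choose a block size along the feature dimension.

    Small feature dimensions get a block no larger than they need;
    large ones are capped at max_block (256 mirrors the value in the review).
    """
    return min(max_block, next_power_of_2(d))


def num_blocks(d: int, block: int) -> int:
    """Number of blocks needed to cover d features (ceiling division)."""
    return (d + block - 1) // block


for d in (64, 100, 256, 4096):
    bd = pick_block_d(d)
    print(d, bd, num_blocks(d, bd))
```

When the chosen block is larger than the feature dimension (for example 128 for `d = 100`), the kernel still has to mask the out-of-range lanes on load and store.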


Comment thread src/twinkle/kernel/causal_conv1d.py Outdated
Comment thread src/twinkle/patch/gdn_padding_free.py Outdated
Comment thread src/twinkle/kernel/causal_conv1d.py Outdated
Comment thread src/twinkle/kernel/causal_conv1d.py
Comment thread src/twinkle/kernel/causal_conv1d.py Outdated
@tpx818 tpx818 merged commit fbabfa9 into modelscope:main Jun 23, 2026
1 of 3 checks passed
