feat(frontend): allow local_slot_num on gm_slot_tensor pipe init#650
chenshengxin2026 wants to merge 1 commit into hw-native-sys:main
Conversation
Previously `pto.aic_initialize_pipe` / `pto.aiv_initialize_pipe` rejected
`local_slot_num` whenever they used the address-based `gm_slot_tensor`
operand:
'pto.aic_initialize_pipe' op globaltensor pipe init does not use
'local_slot_num'
That made it impossible for kernels to mirror the manual-flash-attention
TPipe instantiation. The C++ template default in
`include/pto/npu/a2a3/TPush.hpp` is `LocalSlotNum=2`, which is also what
`kernels/manual/common/flash_atten/fa_performance_kernel.cpp` uses on
QK/PV/P pipes. Without an IR-level override, ptoas's address-based pipe
lowering ends up emitting `TPipe<..., 8, 8, false>` (LocalSlotNum=SlotNum)
because `buildTPipeTokenFromInitOp` in `PTOToEmitC.cpp` falls back to
`getSlotNum()` when no `local_slot_num` attribute is present. The
mismatch inflates the FFTS event multiplex (8 local x 8 global per pipe
instead of 2 local x 8 global) and exhausts `--enable-insert-sync`'s
8-event pool at long sequences, e.g. flash-attention `S1>=4096` on a3.
Changes:
1. `lib/PTO/IR/PTO.cpp` - frontend verifier
- Drop the blanket "globaltensor pipe init does not use 'local_slot_num'"
rejection in `verifyFrontendInitCommon`.
  - Validate the attribute the same way the legacy `gm_slot_buffer`
    branch does: it must be > 0, and <= 8 for dir_mask=1/2 or <= 4 for
    dir_mask=3.
2. `lib/PTO/IR/PTO.cpp` - InitializeL2G2LPipeOp::verify
- Allow `local_slot_num` on the no-`local_addr` path when the
`gm_addr` operand is a `!pto.tensor_view<...>` (i.e. the
gm_slot_tensor form). The previous rule
"'local_slot_num' is only allowed when 'local_addr' is present"
was tied to the legacy local-FIFO form and predates the
address-based slot model. The `localSlotNum > 0 && <= slot_num`
bounds check still runs in both branches.
3. `lib/PTO/Transforms/PTOLowerFrontendPipeOpsPass.cpp`
- In `createFrontendPipe`, propagate `initOp.getLocalSlotNumAttr()`
into the lowered `InitializeL2G2LPipeOp` for the gm_slot_tensor
branch (was hard-coded to `IntegerAttr{}`). Combined with the
existing `buildTPipeTokenFromInitOp` logic in `PTOToEmitC.cpp`,
this makes `local_slot_num=N` on the IR flow through to
`TPipe<..., N, ...>` in the generated C++.
4. `test/lit/pto/tpush_tpop_globaltensor_local_slot_num_a3.pto`
- New lit test exercising `local_slot_num=2` on both
`aic_initialize_pipe` and `aiv_initialize_pipe` with the
gm_slot_tensor form, asserting the lowered TPipe carries
`..., 8, 2, ...`.
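The fallback rule described for `buildTPipeTokenFromInitOp` can be summarized as a small selection function. This is an illustrative sketch only: the function name and `std::optional` signature are hypothetical, while the real code reads an `IntegerAttr` off the op.

```cpp
#include <cstdint>
#include <optional>

// Illustrative sketch of the rule buildTPipeTokenFromInitOp applies per the
// description above; the real code works on MLIR attributes, not optionals.
uint32_t resolveLocalSlotNum(std::optional<int32_t> localSlotNumAttr,
                             uint32_t slotNum) {
  if (localSlotNumAttr)
    return static_cast<uint32_t>(*localSlotNumAttr);  // explicit IR override
  return slotNum;  // existing fallback: LocalSlotNum = SlotNum
}
```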
Backwards compatibility: when `local_slot_num` is absent the existing
fallback (LocalSlotNum=SlotNum=8 on the address-based path,
LocalSlotNum=SlotNum on the legacy path) is preserved, so existing tests
under `test/lit/pto/tpush_tpop_globaltensor_*.pto` continue to pass
unchanged.
Note: the wider question of whether the *default* LocalSlotNum on the
address-based path should be 2 (matching the C++ TPipe template default)
instead of SlotNum is left open. That change touches several existing
lit-test expectations and is more invasive; surfacing the attribute is
sufficient to unblock the manual-parity flash-attention kernel in
hw-native-sys/pto-isa#117.
Motivating downstream:
- hw-native-sys/pto-isa#117 - PTO-DSL Flash Attention performance kernel
consumed via PTO-ISA/pto-dsl#8 (companion frontend PR exposing
`local_slot_num=` on the Python wrapper).
Code Review
This pull request enables the 'local_slot_num' attribute for globaltensor pipe initializations, incorporating validation checks in the IR and ensuring propagation during lowering. A new test case is added to verify the generated output. The review feedback recommends refactoring duplicated validation logic in 'lib/PTO/IR/PTO.cpp' to improve maintainability.
```cpp
if (auto localSlotNumAttr = op.getLocalSlotNumAttr()) {
  int32_t localSlotNum = localSlotNumAttr.getInt();
  if (localSlotNum <= 0)
    return op.emitOpError("expects 'local_slot_num' to be greater than 0");
  int32_t loweredSlotNum = dirMask == 3 ? 4 : 8;
  if (localSlotNum > loweredSlotNum) {
    return op.emitOpError()
           << "expects 'local_slot_num' to be less than or equal to "
           << loweredSlotNum << " for dir_mask = " << static_cast<int>(dirMask);
  }
}
```
The validation logic for local_slot_num (including the bounds check and the calculation of loweredSlotNum) is duplicated between the hasGlobalSlotTensor branch and the local pipe branch (lines 10709-10719). This logic should be refactored into a common block at the beginning of the function to improve maintainability and ensure consistency.
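One possible shape for that refactor, sketched with a plain string error channel instead of the real `mlir::LogicalResult` / `emitOpError` machinery. The helper name is illustrative, not part of the actual codebase.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical shared helper for the duplicated local_slot_num validation;
// the real refactor would thread the op through and call op.emitOpError.
std::optional<std::string> verifyLocalSlotNumForDirMask(int32_t localSlotNum,
                                                        int dirMask) {
  if (localSlotNum <= 0)
    return "expects 'local_slot_num' to be greater than 0";
  const int32_t loweredSlotNum = dirMask == 3 ? 4 : 8;
  if (localSlotNum > loweredSlotNum)
    return "expects 'local_slot_num' to be less than or equal to " +
           std::to_string(loweredSlotNum) +
           " for dir_mask = " + std::to_string(dirMask);
  return std::nullopt;  // attribute is valid
}
```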
```cpp
if (auto localSlotNumAttr = getLocalSlotNumAttr()) {
  int32_t localSlotNum = localSlotNumAttr.getInt();
  if (localSlotNum <= 0)
    return emitOpError("expects 'local_slot_num' to be greater than 0");
  if (static_cast<uint32_t>(localSlotNum) > getSlotNum())
    return emitOpError(
        "expects 'local_slot_num' to be less than or equal to slot_num");
}
```
The bounds check for local_slot_num is duplicated. It is added here for the case where local_addr is absent but gm_slot_tensor is present, and it already exists later in the function (lines 11469-11476) for the case where local_addr is present. This logic should be unified before the if (!getLocalAddr()) block to avoid redundancy.
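Under the same caveats (illustrative names, string errors standing in for `emitOpError`), the hoisted check could look like:

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical unified bounds check, run once before the local_addr branch;
// mirrors the duplicated snippet quoted above.
std::optional<std::string> checkLocalSlotNumBounds(int32_t localSlotNum,
                                                   uint32_t slotNum) {
  if (localSlotNum <= 0)
    return "expects 'local_slot_num' to be greater than 0";
  if (static_cast<uint32_t>(localSlotNum) > slotNum)
    return "expects 'local_slot_num' to be less than or equal to slot_num";
  return std::nullopt;  // in bounds
}
```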
Codex Review (this comment is updated automatically by the review bot.)

Findings (truncated in this capture):
- Here the frontend's …
- The new allow condition writes the "globaltensor slot" check as …
Summary

Allow `local_slot_num` on the address-based (gm_slot_tensor) form of `pto.aic_initialize_pipe` / `pto.aiv_initialize_pipe`, so kernels can opt into the manual-aligned `TPipe<..., LocalSlotNum=2, ...>` template instead of being forced into the current `LocalSlotNum=SlotNum=8` default. Non-breaking: when the attribute is absent, the existing lowering behavior is preserved.

Companion issue with full root-cause analysis and reproducer: #649.
Motivation

`include/pto/npu/a2a3/TPush.hpp` defaults `TPipe`'s `LocalSlotNum` to `2`, and `kernels/manual/common/flash_atten/fa_performance_kernel.cpp` uses `LocalSlotNum=2` on QK/PV/P pipes. ptoas-generated address-based pipes today emit `LocalSlotNum=8` because:

1. The frontend verifier rejects `local_slot_num` on the gm_slot_tensor branch (`globaltensor pipe init does not use 'local_slot_num'`).
2. `createFrontendPipe` hard-codes `IntegerAttr{}` for `local_slot_num` on that branch, so the attribute can't flow through.
3. `buildTPipeTokenFromInitOp` then falls back to `initOp.getSlotNum()` (= 8).

The 8/8 vs 2/8 mismatch inflates the FFTS event multiplex (8 local × 8 global per pipe instead of 2 × 8) and overruns `--enable-insert-sync`'s 8-event pool at long sequences (`S1>=4096` on a3 in flash-attention).

Changes
`lib/PTO/IR/PTO.cpp`

- Frontend verifier (`verifyFrontendInitCommon`): drop the blanket `globaltensor pipe init does not use 'local_slot_num'` rejection on the gm_slot_tensor branch. Validate `local_slot_num` the same way the legacy `gm_slot_buffer` branch does: must be `> 0`, and `<= 8` for `dir_mask=1/2` or `<= 4` for `dir_mask=3`.
- `InitializeL2G2LPipeOp::verify`: allow `local_slot_num` on the no-`local_addr` path when `gm_addr` is a `!pto.tensor_view<...>` (the gm_slot_tensor form). The previous rule "'local_slot_num' is only allowed when 'local_addr' is present" was tied to the legacy local-FIFO form and predates the address-based slot model. The bounds check (`> 0`, `<= slot_num`) still runs in both branches.

`lib/PTO/Transforms/PTOLowerFrontendPipeOpsPass.cpp`

- `createFrontendPipe`, gm_slot_tensor branch: propagate `initOp.getLocalSlotNumAttr()` into the lowered `InitializeL2G2LPipeOp` (was hard-coded to `IntegerAttr{}`). Combined with `buildTPipeTokenFromInitOp` in `PTOToEmitC.cpp`, this makes `local_slot_num=N` on the IR flow through to `TPipe<..., N, ...>` in the generated C++.

`test/lit/pto/tpush_tpop_globaltensor_local_slot_num_a3.pto`

- New lit test exercising `local_slot_num=2` on both `aic_initialize_pipe` and `aiv_initialize_pipe` with the gm_slot_tensor form, asserting the lowered TPipe carries `..., 8, 2, ...`.

Backwards compatibility
When `local_slot_num` is absent the existing fallback (LocalSlotNum=SlotNum=8 on the address-based path) is preserved, so existing lit tests under `test/lit/pto/tpush_tpop_globaltensor_*.pto`, `test/lit/pto/tpush_tpop_emitc.pto`, and `test/lit/pto/issue48{1,9}_*.pto` continue to pass unchanged.

Out of scope (open question)
Whether the default `LocalSlotNum` on the address-based path should be `2` (matching the C++ template default) instead of `SlotNum` is left open. That change touches several existing lit-test expectations and is more invasive; surfacing the attribute is sufficient to unblock the downstream flash-attention kernel. See #649 for discussion.

Test plan
- `ninja check-pto-lit`: existing `test/lit/pto/tpush_tpop_globaltensor_*.pto` should pass unchanged; the new `tpush_tpop_globaltensor_local_slot_num_a3.pto` should pass.
- `kernels/python/flash_atten/kernels/fa_builder.py` in hw-native-sys/pto-isa#117, together with the companion pto-dsl PR (feat(api): support address-based pipe slot model + tfree(entry, split), PTO-ISA/pto-dsl#8) exposing `local_slot_num=` on the Python wrapper, currently relies on a `compile.sh` post-process that rewrites `TPipe<..., 8, 8, false>` → `TPipe<..., 8, 2, false>`. With this PR landed, that post-process can be deleted because the IR can now carry `local_slot_num=2` directly.

Related

- #649: root-cause analysis and reproducer
- hw-native-sys/pto-isa#117: downstream flash-attention performance kernel
- PTO-ISA/pto-dsl#8: companion frontend PR
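The `compile.sh` post-process itself is not shown in this PR. Assuming it is a plain textual rewrite of the emitted template token, it presumably amounts to something like the following; the sed expression and the wrapper function are guesses, not the actual script.

```shell
# Hypothetical reconstruction of the compile.sh post-process described above:
# rewrite the emitted TPipe token from LocalSlotNum=8 to the manual kernel's 2.
rewrite_tpipe() {
  sed -e 's/TPipe<\(.*\), 8, 8, false>/TPipe<\1, 8, 2, false>/g'
}

# Example input/output on a placeholder token:
printf 'TPipe<GlobalTensor, 8, 8, false>\n' | rewrite_tpipe
```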