
feat(frontend): allow local_slot_num on gm_slot_tensor pipe init#650

Open
chenshengxin2026 wants to merge 1 commit into hw-native-sys:main from chenshengxin2026:feat/gm-slot-tensor-local-slot-num

Conversation


@chenshengxin2026 chenshengxin2026 commented May 9, 2026

Summary

Allow local_slot_num on the address-based (gm_slot_tensor) form of pto.aic_initialize_pipe / pto.aiv_initialize_pipe, so kernels can opt into the manual-aligned TPipe<..., LocalSlotNum=2, ...> template instead of being forced into the current LocalSlotNum=SlotNum=8 default. Non-breaking — when the attribute is absent, the existing lowering behavior is preserved.

Companion issue with full root-cause analysis and reproducer: #649.

Motivation

include/pto/npu/a2a3/TPush.hpp defaults TPipe's LocalSlotNum to 2:

template <uint8_t FlagID, uint8_t DirType, uint32_t SlotSize, uint32_t SlotNum, uint32_t LocalSlotNum = 2, ...>

and kernels/manual/common/flash_atten/fa_performance_kernel.cpp uses LocalSlotNum=2 on QK/PV/P pipes. ptoas-generated address-based pipes today emit LocalSlotNum=8 because:

  1. The frontend verifier rejects local_slot_num on the gm_slot_tensor branch ("globaltensor pipe init does not use 'local_slot_num'").
  2. Even if it didn't, createFrontendPipe hard-codes IntegerAttr{} for local_slot_num on that branch, so the attribute can't flow through.
  3. buildTPipeTokenFromInitOp then falls back to initOp.getSlotNum() (=8).

The 8/8 vs 2/8 mismatch inflates the FFTS event multiplex (8 local × 8 global per pipe instead of 2 × 8) and overruns --enable-insert-sync's 8-event pool at long sequences (S1>=4096 on a3 in flash-attention).

Changes

lib/PTO/IR/PTO.cpp

Frontend verifier (verifyFrontendInitCommon): drop the blanket

if (op.getLocalSlotNumAttr())
  return op.emitOpError("globaltensor pipe init does not use 'local_slot_num'");

on the gm_slot_tensor branch. Validate local_slot_num the same way the legacy gm_slot_buffer branch does — must be > 0, must be <=8 for dir_mask=1/2 or <=4 for dir_mask=3.

InitializeL2G2LPipeOp::verify: allow local_slot_num on the no-local_addr path when gm_addr is a !pto.tensor_view<...> (the gm_slot_tensor form). The previous rule "'local_slot_num' is only allowed when 'local_addr' is present" was tied to the legacy local-FIFO form and predates the address-based slot model. The bounds check (> 0, <= slot_num) still runs in both branches.

lib/PTO/Transforms/PTOLowerFrontendPipeOpsPass.cpp

createFrontendPipe gm_slot_tensor branch: propagate initOp.getLocalSlotNumAttr() into the lowered InitializeL2G2LPipeOp (was hard-coded to IntegerAttr{}). Combined with buildTPipeTokenFromInitOp in PTOToEmitC.cpp, this makes local_slot_num=N on the IR flow through to TPipe<..., N, ...> in the generated C++.

test/lit/pto/tpush_tpop_globaltensor_local_slot_num_a3.pto

New lit test exercising local_slot_num=2 on both aic_initialize_pipe and aiv_initialize_pipe with the gm_slot_tensor form, asserting the lowered TPipe carries ..., 8, 2, ....

Backwards compatibility

When local_slot_num is absent the existing fallback (LocalSlotNum=SlotNum=8 on the address-based path) is preserved, so existing lit tests under test/lit/pto/tpush_tpop_globaltensor_*.pto, test/lit/pto/tpush_tpop_emitc.pto, and test/lit/pto/issue48{1,9}_*.pto continue to pass unchanged.

Out of scope (open question)

Whether the default LocalSlotNum on the address-based path should be 2 (matching the C++ template default) instead of SlotNum is left open. That change touches several existing lit-test expectations and is more invasive; surfacing the attribute is sufficient to unblock the downstream flash-attention kernel. See #649 for discussion.

Test plan

  • ninja check-pto-lit — existing test/lit/pto/tpush_tpop_globaltensor_*.pto should pass unchanged; new tpush_tpop_globaltensor_local_slot_num_a3.pto should pass.
  • Downstream, kernels/python/flash_atten/kernels/fa_builder.py in hw-native-sys/pto-isa#117 (together with the companion pto-dsl PR exposing local_slot_num= on the Python wrapper: feat(api): support address-based pipe slot model + tfree(entry, split), PTO-ISA/pto-dsl#8) currently relies on a compile.sh post-process that rewrites TPipe<..., 8, 8, false> to TPipe<..., 8, 2, false>. With this PR landed, that post-process can be deleted because the IR can carry local_slot_num=2 directly.
  • Builds cleanly locally (the changes are surgical edits to existing verifier/lowering paths). I do not have CI access on this repository.

Related

Previously `pto.aic_initialize_pipe` / `pto.aiv_initialize_pipe` rejected
`local_slot_num` whenever they used the address-based `gm_slot_tensor`
operand:

    'pto.aic_initialize_pipe' op globaltensor pipe init does not use
    'local_slot_num'

That made it impossible for kernels to mirror the manual-flash-attention
TPipe instantiation. The C++ template default in
`include/pto/npu/a2a3/TPush.hpp` is `LocalSlotNum=2`, which is also what
`kernels/manual/common/flash_atten/fa_performance_kernel.cpp` uses on
QK/PV/P pipes. Without an IR-level override, ptoas's address-based pipe
lowering ends up emitting `TPipe<..., 8, 8, false>` (LocalSlotNum=SlotNum)
because `buildTPipeTokenFromInitOp` in `PTOToEmitC.cpp` falls back to
`getSlotNum()` when no `local_slot_num` attribute is present. The
mismatch inflates the FFTS event multiplex (8 local x 8 global per pipe
instead of 2 local x 8 global) and exhausts `--enable-insert-sync`'s
8-event pool at long sequences, e.g. flash-attention `S1>=4096` on a3.

Changes:

1. `lib/PTO/IR/PTO.cpp` - frontend verifier
   - Drop the blanket "globaltensor pipe init does not use 'local_slot_num'"
     rejection in `verifyFrontendInitCommon`.
   - Validate the attribute the same way the legacy `gm_slot_buffer`
     branch does: must be > 0, and <= 8 for dir_mask=1/2 or <= 4 for
     dir_mask=3.

2. `lib/PTO/IR/PTO.cpp` - InitializeL2G2LPipeOp::verify
   - Allow `local_slot_num` on the no-`local_addr` path when the
     `gm_addr` operand is a `!pto.tensor_view<...>` (i.e. the
     gm_slot_tensor form). The previous rule
     "'local_slot_num' is only allowed when 'local_addr' is present"
     was tied to the legacy local-FIFO form and predates the
     address-based slot model. The `localSlotNum > 0 && <= slot_num`
     bounds check still runs in both branches.

3. `lib/PTO/Transforms/PTOLowerFrontendPipeOpsPass.cpp`
   - In `createFrontendPipe`, propagate `initOp.getLocalSlotNumAttr()`
     into the lowered `InitializeL2G2LPipeOp` for the gm_slot_tensor
     branch (was hard-coded to `IntegerAttr{}`). Combined with the
     existing `buildTPipeTokenFromInitOp` logic in `PTOToEmitC.cpp`,
     this makes `local_slot_num=N` on the IR flow through to
     `TPipe<..., N, ...>` in the generated C++.

4. `test/lit/pto/tpush_tpop_globaltensor_local_slot_num_a3.pto`
   - New lit test exercising `local_slot_num=2` on both
     `aic_initialize_pipe` and `aiv_initialize_pipe` with the
     gm_slot_tensor form, asserting the lowered TPipe carries
     `..., 8, 2, ...`.

Backwards compatibility: when `local_slot_num` is absent the existing
fallback (LocalSlotNum=SlotNum=8 on the address-based path,
LocalSlotNum=SlotNum on the legacy path) is preserved, so existing tests
under `test/lit/pto/tpush_tpop_globaltensor_*.pto` continue to pass
unchanged.

Note: the wider question of whether the *default* LocalSlotNum on the
address-based path should be 2 (matching the C++ TPipe template default)
instead of SlotNum is left open. That change touches several existing
lit-test expectations and is more invasive; surfacing the attribute is
sufficient to unblock the manual-parity flash-attention kernel in
hw-native-sys/pto-isa#117.

Motivating downstream:
- hw-native-sys/pto-isa#117 - PTO-DSL Flash Attention performance kernel
  consumed via PTO-ISA/pto-dsl#8 (companion frontend PR exposing
  `local_slot_num=` on the Python wrapper).

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request enables the 'local_slot_num' attribute for globaltensor pipe initializations, incorporating validation checks in the IR and ensuring propagation during lowering. A new test case is added to verify the generated output. The review feedback recommends refactoring duplicated validation logic in 'lib/PTO/IR/PTO.cpp' to improve maintainability.

Comment thread lib/PTO/IR/PTO.cpp
Comment on lines +10680 to +10690
if (auto localSlotNumAttr = op.getLocalSlotNumAttr()) {
  int32_t localSlotNum = localSlotNumAttr.getInt();
  if (localSlotNum <= 0)
    return op.emitOpError("expects 'local_slot_num' to be greater than 0");
  int32_t loweredSlotNum = dirMask == 3 ? 4 : 8;
  if (localSlotNum > loweredSlotNum) {
    return op.emitOpError()
           << "expects 'local_slot_num' to be less than or equal to "
           << loweredSlotNum << " for dir_mask = " << static_cast<int>(dirMask);
  }
}


Severity: medium

The validation logic for local_slot_num (including the bounds check and the calculation of loweredSlotNum) is duplicated between the hasGlobalSlotTensor branch and the local pipe branch (lines 10709-10719). This logic should be refactored into a common block at the beginning of the function to improve maintainability and ensure consistency.

Comment thread lib/PTO/IR/PTO.cpp
Comment on lines +11458 to +11465
if (auto localSlotNumAttr = getLocalSlotNumAttr()) {
  int32_t localSlotNum = localSlotNumAttr.getInt();
  if (localSlotNum <= 0)
    return emitOpError("expects 'local_slot_num' to be greater than 0");
  if (static_cast<uint32_t>(localSlotNum) > getSlotNum())
    return emitOpError(
        "expects 'local_slot_num' to be less than or equal to slot_num");
}


Severity: medium

The bounds check for local_slot_num is duplicated. It is added here for the case where local_addr is absent but gm_slot_tensor is present, and it already exists later in the function (lines 11469-11476) for the case where local_addr is present. This logic should be unified before the if (!getLocalAddr()) block to avoid redundancy.

@reedhecre

Codex Review

This comment is updated automatically by the review bot.

Summary

The lowering of globaltensor local_slot_num introduces two P2-level compatibility issues: omitting the default value vs. writing it explicitly is misjudged as incompatible, and the lowered memref form is wrongly rejected by the internal verifier.

Findings

  1. P2: the globaltensor branch does not normalize the default `local_slot_num`, so semantically equivalent peer configurations are misjudged as having mismatched signatures (lib/PTO/Transforms/PTOLowerFrontendPipeOpsPass.cpp:130)

Here the frontend's local_slot_num is forwarded to pto.initialize_l2g2l_pipe as-is, instead of first being filled in with a valid default (8 for unidirectional, 4 for bidirectional) as the local-buffer branch does. As a result, whenever one peer writes the default explicitly and the other omits the attribute, PTOResolveReservedBuffersPass::samePipeInitSignature treats nullopt and the integer as different signatures and reports "requires peer pipe init ops to agree on direction and pipe shape", even though PTOToEmitC falls back to slot_num for the missing value, so both sides ultimately generate the same TPipe. Semantically equivalent spellings thus become compile failures, which is a compatibility regression.

  2. P2: the internal verifier only recognizes `TensorViewType`, so it rejects lowered global-only pipes that carry `local_slot_num` (lib/PTO/IR/PTO.cpp:11448)

The new allow condition spells the "globaltensor slot" test as isa<TensorViewType>(getGmAddr().getType()). But PTOViewToMemref rewrites the gm_addr of that same global-only initialize_l2g2l_pipe into a memref<...>, and the repo's existing tpush_tpop_globaltensor_frontend_a3.pto already exercises exactly that IR form. Consequently, once the local_slot_num added by this PR is carried onto the lowered op, any later flow that re-runs the verifier will wrongly reject the op with "'local_slot_num' is only allowed when 'local_addr' is present", leaving the lowering result inconsistent with the internal op contract.

