feat(frontend): allow local_slot_num on gm_slot_tensor pipe init#650
chenshengxin2026 wants to merge 1 commit into hw-native-sys:main
Conversation
Previously `pto.aic_initialize_pipe` / `pto.aiv_initialize_pipe` rejected
`local_slot_num` whenever they used the address-based `gm_slot_tensor`
operand:
'pto.aic_initialize_pipe' op globaltensor pipe init does not use
'local_slot_num'
That made it impossible for kernels to mirror the manual-flash-attention
TPipe instantiation. The C++ template default in
`include/pto/npu/a2a3/TPush.hpp` is `LocalSlotNum=2`, which is also what
`kernels/manual/common/flash_atten/fa_performance_kernel.cpp` uses on
QK/PV/P pipes. Without an IR-level override, ptoas's address-based pipe
lowering ends up emitting `TPipe<..., 8, 8, false>` (LocalSlotNum=SlotNum)
because `buildTPipeTokenFromInitOp` in `PTOToEmitC.cpp` falls back to
`getSlotNum()` when no `local_slot_num` attribute is present. The
mismatch inflates the FFTS event multiplex (8 local x 8 global per pipe
instead of 2 local x 8 global) and exhausts `--enable-insert-sync`'s
8-event pool at long sequences, e.g. flash-attention `S1>=4096` on a3.
Changes:
1. `lib/PTO/IR/PTO.cpp` - frontend verifier
- Drop the blanket "globaltensor pipe init does not use 'local_slot_num'"
rejection in `verifyFrontendInitCommon`.
  - Validate the attribute the same way the legacy `gm_slot_buffer`
    branch does: it must be > 0, and <= 8 for dir_mask=1/2 or <= 4 for
    dir_mask=3.
2. `lib/PTO/IR/PTO.cpp` - InitializeL2G2LPipeOp::verify
- Allow `local_slot_num` on the no-`local_addr` path when the
`gm_addr` operand is a `!pto.tensor_view<...>` (i.e. the
gm_slot_tensor form). The previous rule
"'local_slot_num' is only allowed when 'local_addr' is present"
was tied to the legacy local-FIFO form and predates the
address-based slot model. The `localSlotNum > 0 && <= slot_num`
bounds check still runs in both branches.
3. `lib/PTO/Transforms/PTOLowerFrontendPipeOpsPass.cpp`
- In `createFrontendPipe`, propagate `initOp.getLocalSlotNumAttr()`
into the lowered `InitializeL2G2LPipeOp` for the gm_slot_tensor
branch (was hard-coded to `IntegerAttr{}`). Combined with the
existing `buildTPipeTokenFromInitOp` logic in `PTOToEmitC.cpp`,
this makes `local_slot_num=N` on the IR flow through to
`TPipe<..., N, ...>` in the generated C++.
4. `test/lit/pto/tpush_tpop_globaltensor_local_slot_num_a3.pto`
- New lit test exercising `local_slot_num=2` on both
`aic_initialize_pipe` and `aiv_initialize_pipe` with the
gm_slot_tensor form, asserting the lowered TPipe carries
`..., 8, 2, ...`.
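The fallback rule described for `buildTPipeTokenFromInitOp` can be summarized as a small selection function. This is an illustrative sketch only: the function name and `std::optional` signature are hypothetical, while the real code reads an `IntegerAttr` off the op.

```cpp
#include <cstdint>
#include <optional>

// Illustrative sketch of the rule buildTPipeTokenFromInitOp applies per the
// description above; the real code works on MLIR attributes, not optionals.
uint32_t resolveLocalSlotNum(std::optional<int32_t> localSlotNumAttr,
                             uint32_t slotNum) {
  if (localSlotNumAttr)
    return static_cast<uint32_t>(*localSlotNumAttr);  // explicit IR override
  return slotNum;  // existing fallback: LocalSlotNum = SlotNum
}
```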
Backwards compatibility: when `local_slot_num` is absent the existing
fallback (LocalSlotNum=SlotNum=8 on the address-based path,
LocalSlotNum=SlotNum on the legacy path) is preserved, so existing tests
under `test/lit/pto/tpush_tpop_globaltensor_*.pto` continue to pass
unchanged.
Note: the wider question of whether the *default* LocalSlotNum on the
address-based path should be 2 (matching the C++ TPipe template default)
instead of SlotNum is left open. That change touches several existing
lit-test expectations and is more invasive; surfacing the attribute is
sufficient to unblock the manual-parity flash-attention kernel in
hw-native-sys/pto-isa#117.
Motivating downstream:
- hw-native-sys/pto-isa#117 - PTO-DSL Flash Attention performance kernel
consumed via PTO-ISA/pto-dsl#8 (companion frontend PR exposing
`local_slot_num=` on the Python wrapper).
Code Review
This pull request enables the 'local_slot_num' attribute for globaltensor pipe initializations, incorporating validation checks in the IR and ensuring propagation during lowering. A new test case is added to verify the generated output. The review feedback recommends refactoring duplicated validation logic in 'lib/PTO/IR/PTO.cpp' to improve maintainability.
```cpp
if (auto localSlotNumAttr = op.getLocalSlotNumAttr()) {
  int32_t localSlotNum = localSlotNumAttr.getInt();
  if (localSlotNum <= 0)
    return op.emitOpError("expects 'local_slot_num' to be greater than 0");
  int32_t loweredSlotNum = dirMask == 3 ? 4 : 8;
  if (localSlotNum > loweredSlotNum) {
    return op.emitOpError()
           << "expects 'local_slot_num' to be less than or equal to "
           << loweredSlotNum << " for dir_mask = " << static_cast<int>(dirMask);
  }
}
```
The validation logic for local_slot_num (including the bounds check and the calculation of loweredSlotNum) is duplicated between the hasGlobalSlotTensor branch and the local pipe branch (lines 10709-10719). This logic should be refactored into a common block at the beginning of the function to improve maintainability and ensure consistency.
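One possible shape for that refactor, sketched with a plain string error channel instead of the real `mlir::LogicalResult` / `emitOpError` machinery. The helper name is illustrative, not part of the actual codebase.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical shared helper for the duplicated local_slot_num validation;
// the real refactor would thread the op through and call op.emitOpError.
std::optional<std::string> verifyLocalSlotNumForDirMask(int32_t localSlotNum,
                                                        int dirMask) {
  if (localSlotNum <= 0)
    return "expects 'local_slot_num' to be greater than 0";
  const int32_t loweredSlotNum = dirMask == 3 ? 4 : 8;
  if (localSlotNum > loweredSlotNum)
    return "expects 'local_slot_num' to be less than or equal to " +
           std::to_string(loweredSlotNum) +
           " for dir_mask = " + std::to_string(dirMask);
  return std::nullopt;  // attribute is valid
}
```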
```cpp
if (auto localSlotNumAttr = getLocalSlotNumAttr()) {
  int32_t localSlotNum = localSlotNumAttr.getInt();
  if (localSlotNum <= 0)
    return emitOpError("expects 'local_slot_num' to be greater than 0");
  if (static_cast<uint32_t>(localSlotNum) > getSlotNum())
    return emitOpError(
        "expects 'local_slot_num' to be less than or equal to slot_num");
}
```
The bounds check for local_slot_num is duplicated. It is added here for the case where local_addr is absent but gm_slot_tensor is present, and it already exists later in the function (lines 11469-11476) for the case where local_addr is present. This logic should be unified before the if (!getLocalAddr()) block to avoid redundancy.
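Under the same caveats (illustrative names, string errors standing in for `emitOpError`), the hoisted check could look like:

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical unified bounds check, run once before the local_addr branch;
// mirrors the duplicated snippet quoted above.
std::optional<std::string> checkLocalSlotNumBounds(int32_t localSlotNum,
                                                   uint32_t slotNum) {
  if (localSlotNum <= 0)
    return "expects 'local_slot_num' to be greater than 0";
  if (static_cast<uint32_t>(localSlotNum) > slotNum)
    return "expects 'local_slot_num' to be less than or equal to slot_num";
  return std::nullopt;  // in bounds
}
```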
Codex Review (this comment is updated automatically by the review bot.)

Findings (truncated in this capture):
- Here the frontend's …
- The new allow condition writes the "globaltensor slot" check as …
Summary

Allow `local_slot_num` on the address-based (gm_slot_tensor) form of `pto.aic_initialize_pipe` / `pto.aiv_initialize_pipe`, so kernels can opt into the manual-aligned `TPipe<..., LocalSlotNum=2, ...>` template instead of being forced into the current `LocalSlotNum=SlotNum=8` default. Non-breaking: when the attribute is absent, the existing lowering behavior is preserved.

Companion issue with full root-cause analysis and reproducer: #649.
Motivation

`include/pto/npu/a2a3/TPush.hpp` defaults `TPipe`'s `LocalSlotNum` to `2`, and `kernels/manual/common/flash_atten/fa_performance_kernel.cpp` uses `LocalSlotNum=2` on QK/PV/P pipes. ptoas-generated address-based pipes today emit `LocalSlotNum=8` because:

1. The frontend verifier rejects `local_slot_num` on the gm_slot_tensor branch (`globaltensor pipe init does not use 'local_slot_num'`).
2. `createFrontendPipe` hard-codes `IntegerAttr{}` for `local_slot_num` on that branch, so the attribute can't flow through.
3. `buildTPipeTokenFromInitOp` then falls back to `initOp.getSlotNum()` (= 8).

The 8/8 vs 2/8 mismatch inflates the FFTS event multiplex (8 local × 8 global per pipe instead of 2 × 8) and overruns `--enable-insert-sync`'s 8-event pool at long sequences (`S1>=4096` on a3 in flash-attention).

Changes
`lib/PTO/IR/PTO.cpp`

- Frontend verifier (`verifyFrontendInitCommon`): drop the blanket `globaltensor pipe init does not use 'local_slot_num'` rejection on the gm_slot_tensor branch. Validate `local_slot_num` the same way the legacy `gm_slot_buffer` branch does: must be `> 0`, and `<= 8` for `dir_mask=1/2` or `<= 4` for `dir_mask=3`.
- `InitializeL2G2LPipeOp::verify`: allow `local_slot_num` on the no-`local_addr` path when `gm_addr` is a `!pto.tensor_view<...>` (the gm_slot_tensor form). The previous rule "'local_slot_num' is only allowed when 'local_addr' is present" was tied to the legacy local-FIFO form and predates the address-based slot model. The bounds check (`> 0`, `<= slot_num`) still runs in both branches.

`lib/PTO/Transforms/PTOLowerFrontendPipeOpsPass.cpp`

- `createFrontendPipe`, gm_slot_tensor branch: propagate `initOp.getLocalSlotNumAttr()` into the lowered `InitializeL2G2LPipeOp` (was hard-coded to `IntegerAttr{}`). Combined with `buildTPipeTokenFromInitOp` in `PTOToEmitC.cpp`, this makes `local_slot_num=N` on the IR flow through to `TPipe<..., N, ...>` in the generated C++.

`test/lit/pto/tpush_tpop_globaltensor_local_slot_num_a3.pto`

- New lit test exercising `local_slot_num=2` on both `aic_initialize_pipe` and `aiv_initialize_pipe` with the gm_slot_tensor form, asserting the lowered TPipe carries `..., 8, 2, ...`.

Backwards compatibility
When `local_slot_num` is absent the existing fallback (LocalSlotNum=SlotNum=8 on the address-based path) is preserved, so existing lit tests under `test/lit/pto/tpush_tpop_globaltensor_*.pto`, `test/lit/pto/tpush_tpop_emitc.pto`, and `test/lit/pto/issue48{1,9}_*.pto` continue to pass unchanged.

Out of scope (open question)
Whether the default `LocalSlotNum` on the address-based path should be `2` (matching the C++ template default) instead of `SlotNum` is left open. That change touches several existing lit-test expectations and is more invasive; surfacing the attribute is sufficient to unblock the downstream flash-attention kernel. See #649 for discussion.

Test plan
- `ninja check-pto-lit`: existing `test/lit/pto/tpush_tpop_globaltensor_*.pto` should pass unchanged; the new `tpush_tpop_globaltensor_local_slot_num_a3.pto` should pass.
- `kernels/python/flash_atten/kernels/fa_builder.py` in hw-native-sys/pto-isa#117, together with the companion pto-dsl PR (feat(api): support address-based pipe slot model + tfree(entry, split), PTO-ISA/pto-dsl#8) exposing `local_slot_num=` on the Python wrapper, currently relies on a `compile.sh` post-process that rewrites `TPipe<..., 8, 8, false>` → `TPipe<..., 8, 2, false>`. With this PR landed, that post-process can be deleted because the IR can now carry `local_slot_num=2` directly.

Related

- #649: root-cause analysis and reproducer
- hw-native-sys/pto-isa#117: downstream flash-attention performance kernel
- PTO-ISA/pto-dsl#8: companion frontend PR
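The `compile.sh` post-process itself is not shown in this PR. Assuming it is a plain textual rewrite of the emitted template token, it presumably amounts to something like the following; the sed expression and the wrapper function are guesses, not the actual script.

```shell
# Hypothetical reconstruction of the compile.sh post-process described above:
# rewrite the emitted TPipe token from LocalSlotNum=8 to the manual kernel's 2.
rewrite_tpipe() {
  sed -e 's/TPipe<\(.*\), 8, 8, false>/TPipe<\1, 8, 2, false>/g'
}

# Example input/output on a placeholder token:
printf 'TPipe<GlobalTensor, 8, 8, false>\n' | rewrite_tpipe
```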