Skip to content

Mutli Buffer Support #615

Open
zhangstevenunity wants to merge 15 commits intomainfrom
enhance-pto-plan-memory-spec-level3-retry
Open

Mutli Buffer Support #615
zhangstevenunity wants to merge 15 commits intomainfrom
enhance-pto-plan-memory-spec-level3-retry

Conversation

@zhangstevenunity
Copy link
Copy Markdown
Collaborator

No description provided.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces comprehensive support for multi-buffering and automated memory planning within the PTO framework. Key changes include the addition of the PTOEnableMultiBuffer pass to lower variadic pointer casts, dynamic event ID selection in SyncCodegen using modulo arithmetic, and a significant refactor of the PlanMemory pass. The memory planner now supports N-buffer siblings, automatic sizing of reserve_buffer based on FIFO requirements, and a retry mechanism with deterministic shuffling to mitigate order-sensitive allocation failures. Feedback identifies critical issues regarding missing loop-invariance checks when hoisting operations, ambiguous tracking of planned offsets using zero as a sentinel, and the need for graceful error diagnostics instead of compiler crashes when encountering invalid input IR.

Comment on lines +42 to +53
rewriter.setInsertionPoint(forOp);
SmallVector<Value> slotBufs;
slotBufs.reserve(n);
for (unsigned i = 0; i < n; ++i) {
auto oneAddr = addrs.slice(i, 1);
PointerCastOp slot = rewriter.create<PointerCastOp>(
loc, resTy, oneAddr, validRow, validCol,
config.has_value()
? static_cast<Attribute>(*config)
: Attribute());
slotBufs.push_back(slot.getResult());
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The pass hoists pto.pointer_cast operations outside the enclosing scf.for loop without verifying that all operands (addresses, validRow, validCol) are loop-invariant. If any operand is defined inside the loop, this hoisting will result in invalid IR where a value is used before its definition. You should check if all operands are defined outside the loop before attempting to hoist.

Comment thread lib/PTO/Transforms/PTOPlanMemory.cpp Outdated
Comment on lines +2152 to +2156
bool hasRelationSlots =
!e->relationOtherBuffers.empty() &&
llvm::all_of(e->relationOtherBuffers, [](StorageEntry *s) {
return s->bitsOffset != 0;
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using bitsOffset != 0 as a check to determine if a StorageEntry has been planned is unreliable because 0 is a valid planned offset (e.g., for the first buffer in a memory space or for zero-sized entries as seen in line 2052). This ambiguity can cause hasRelationSlots to be incorrectly false, leading to skipped conflict checks in VerifyConflictStage1. Consider using a sentinel value like std::numeric_limits<uint64_t>::max() or an explicit boolean flag to track whether an entry has been planned.

Comment on lines +1081 to +1082
llvm_unreachable("The buffer memory has been released and cannot be "
"used again before it is redefined! ");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using llvm_unreachable to handle potential use-after-free scenarios in the input IR is too aggressive as it will crash the compiler. It is better to emit a diagnostic error using op->emitError() and return a failure status to allow the compiler to exit gracefully.

@reedhecre
Copy link
Copy Markdown

reedhecre commented May 3, 2026

Codex Review

该评论由 review 机器人自动更新。

  • PR: Mutli Buffer Support  #615 Mutli Buffer Support
  • Author: zhangstevenunity
  • Base/Head: main / enhance-pto-plan-memory-spec-level3-retry
  • Head SHA: bfacd43f863a
  • Trigger: PR 有新提交
  • Generated At: 2026-05-09T08:13:08Z
  • Previous Head SHA: 9a7047c419df
  • Status: completed

Summary

PR #615 introduces wrong-code risks in multi-buffer lowering and dynamic event-id selection, plus a regression in InsertSync's multi-buffer dedup path.

Findings

  1. P1 Variadic `pto.pointer_cast` can reach EmitC and lose every slot except slot 0 tools/ptoas/ptoas.cpp:1190

The new pipeline only runs pto-enable-multi-buffer when --enable-multi-buffer-lowering is passed, and the pass itself also continues on unsupported casts. Any surviving variadic pto.pointer_cast then goes through PointerCastConversion, which still lowers adaptor.getAddrs()[0] only (lib/PTO/Transforms/PTOToEmitC.cpp:4036). That silently compiles {pto.multi_buffer = N} kernels as single-slot buffers while InsertSync/GSS may already have emitted multi-buffer synchronization or skipped same-pipe barriers based on N > 1, so this is wrong-code rather than a missed optimization.

  1. P1 `PTOEnableMultiBuffer` rotates slots with the raw IV instead of iteration number lib/PTO/Transforms/PTOEnableMultiBuffer.cpp:45

getOrCreateLoopCounter uses remui %iv, %cN directly. For loops with lowerBound != 0 or step != 1, iv mod N is not the iteration index: e.g. scf.for %i = %c4 to %c12 step %c2 with N=2 always yields 0, so every iteration selects slot 0 and the multi-buffered buffer never rotates. The new GSS code normalizes (iv - lowerBound) / step before remui, so the data-path lowering and sync-path lowering also disagree on these loops.

  1. P1 InsertSync dynamic event-id selection has the same non-normalized IV bug lib/PTO/Transforms/InsertSync/SyncCodegen.cpp:397

SyncCodegen::GetBufferSelected also computes remui %iv, %cN on the raw induction variable. On non-zero-lower-bound or step>1 loops this picks a shifted or constant EVENT_ID sequence, unlike the GSS path. That is not just cosmetic: InsertSyncAnalysis now suppresses same-pipe back-edge barriers whenever eventIdNum >= 2 (lib/PTO/Transforms/InsertSync/InsertSyncAnalysis.cpp:380), so a step-2 PIPE_M/PIPE_V loop can lose the barrier under the assumption that iterations rotate across slots even though the generated selector keeps reusing the same slot/event id.

  1. P2 Backward multi-buffer syncs never hit the new already-synced fast path lib/PTO/Transforms/InsertSync/InsertSyncAnalysis.cpp:298

GetEventIdNum was changed to require the back-edge scf.for, but MemAnalyze still calls the old no-loop form. After this PR that call always returns 1 for backward deps, so the for (i = 1; i < eventIdNum; ++i) probe at lines 298-304 never runs. That combines badly with UpdateSyncRecordInfo, which intentionally leaves slot 0 unmarked for eventIdNum > 1 (lines 508-510): multi-buffer syncs stop participating in transitive-elimination, extra set/wait pairs are emitted, and event-id pressure can regress all the way to the new single-id / PIPE_ALL fallbacks.

@zhangstevenunity zhangstevenunity changed the title Mutli Buffer Support and Enhance pto plan memory spec level3 retry Mutli Buffer Support May 3, 2026
zhangstevenunity and others added 3 commits May 3, 2026 15:05
PR #615 added the SyncCodegen scaffolding for multi-buffer set/wait_flag_dyn
ops, but the upstream analysis path was inert: GetEventIdNum hard-coded
return 1, and PTOIRTranslator dropped every pto.pointer_cast addr operand
beyond the first. The N>1 multi-buffer code path was therefore unreachable.

This change makes the PlanMemory -> InsertSync handshake actually work:

- PTOIRTranslator::UpdatePointerCastOpMemInfo now translates *all* N
  pto.pointer_cast i64 addr operands into BaseMemInfo.baseAddresses
  (constants -> bit offsets, non-constants -> hasVariableAddress=true).
  Subview/alias deltaOffset is applied to every slot, not just slot 0.

- MemoryDependentAnalyzer gains getMultiBufferSlotCount(a, b) implementing
  HIVM's per-slot geometry check: same-index slots must overlap (real
  back-edge dep on the same physical buffer) and different-index slots must
  NOT overlap (so consecutive iterations land in disjoint physical buffers).

- InsertSyncAnalysis::GetEventIdNum is rewritten to follow HIVM semantics:
  every dependent pair must be multi-buffer-eligible, all pairs must agree
  on N, and every involved buffer must hang off the same scf.for.

Verified end-to-end: a `pto.multi_buffer = 2` alloc inside scf.for now
emits 2 reserved event ids, an `iv mod 2` arith.select chain, and
`pto.{set,wait}_flag_dyn` ops driven by the selected idx. New regression
guards this in test/lit/pto/multi_buffer_insert_sync_dyn_event_id.pto.

Co-Authored-By: Claude <noreply@anthropic.com>
Builds on P0 to make the multi-buffer pipeline robust under realistic
pressure:

- SyncEventIdAllocation: HIVM-style TryFallbackMultiBuffer. When the
  event-id pool can't satisfy N, fall back to N=2 first (when N is even
  and >2) then to 1, instead of silently returning with an empty
  eventIds vec. Both set/wait siblings get their `eventIdNum` updated
  in lockstep so SyncCodegen takes the same code path on each side.

- PTOPlanMemory: add `StorageEntry::isMultiBufferSlot` flag and rewrite
  `UpdateBuffer2Offsets` to walk via primaries and emit slots in
  declared `relationOtherBuffers` order. This makes the slot-ordering
  contract that EnableMultiBuffer relies on (`offsets[i]` <-> iv mod N
  selector index `i`) explicit and verifies it via a runtime assertion.

- PTOEnableMultiBuffer: add scope guard (skip non-VEC/MAT casts where
  the iv-mod-N selector is not meaningful) and loop-invariance guard
  (skip casts whose addrs are not loop-invariant - hoisting them above
  the for loop would break SSA dominance).

- New regression at test/lit/pto/multi_buffer_n4_insert_sync.pto: N=4
  slot ordering [0, 1024, 2048, 3072] in pto.pointer_cast, 3-way
  arith.select chain, and 4 distinct event ids drained at function end.

Co-Authored-By: Claude <noreply@anthropic.com>
Final hardening pass tightening multi-buffer correctness around the edges
PR-615 and earlier P0/P1 commits left ambiguous.

- InsertSyncAnalysis: GetEventIdNum now takes the back-edge scf.for and
  requires every involved buffer's enclosing loop to *equal* that loop.
  Previously a multi-buffer alloc nested inside an inner loop could
  silently rotate slots on the wrong induction variable when a back-edge
  on an outer loop carried the dep. Forward deps unconditionally collapse
  to single-buffer.

- PTOEnableMultiBuffer: cache `iv mod N` per (loop, N) so several
  multi-buffer pointer_casts sharing a loop reuse one counter, mirroring
  SyncCodegen::loop2BufferCounter. Avoids redundant arith.remui +
  constant ops in the loop body.

- PTOPlanMemory:
  * `MemPlan::emitMultiBufferError` + `multiBufferDiagnosticEmitted_`
    flag converts multi-buffer-specific `report_fatal_error` calls
    (slot-order mismatch, pong outline placement failure) into
    diagnostics that bubble through `plan()` as `failure()`. This lets
    the existing PlanMemoryPass retry loop re-seed instead of aborting
    the compiler under heavy multi-buffer memory pressure.
  * `VerifyConflictStage1` now enumerates *every* historical
    multi-buffer slot offset (via `CollectMultiRelationPongAnchors`)
    as a SPEC_LEVEL_1 reuse anchor candidate. PR-615 only used the last
    slot, silently dropping reuse for N > 2.

- New regression at test/lit/pto/multi_buffer_nested_loop.pto: nested
  scf.for with multi-buffer alloc in the inner loop must rotate slots on
  the inner iv, not the outer one.

Co-Authored-By: Claude <noreply@anthropic.com>
@zhangstevenunity zhangstevenunity marked this pull request as ready for review May 3, 2026 07:14
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d7db3f876e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread lib/PTO/Transforms/PTOPlanMemory.cpp Outdated
Comment on lines +2462 to +2467
auto parentLoop1 = GetBufferParentLoop(e1->inplaceBuffers);
auto parentLoop2 = GetBufferParentLoop(e2->inplaceBuffers);
if (parentLoop1 != parentLoop2) {
return false;
}
return true;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Require actual pipe conflict in SPEC_LEVEL_2 filtering

VerifyConflictStage2 says it should reject reuse only when buffers both share a parent loop and have a DMA pipe conflict, but PipeConflictInSameLoop currently returns true for any pair in the same loop (including both top-level buffers with nullptr parent loops). This turns almost all same-scope candidates into conflicts, which can unnecessarily block reuse and cause local-memory planning to fail/overflow in cases that previously fit.

Useful? React with 👍 / 👎.

Comment on lines +70 to +72
PointerCastOp slot = rewriter.create<PointerCastOp>(
loc, resTy, oneAddr, validRow, validCol,
config.has_value()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Check hoisted pointer-cast dims are loop-invariant

The multi-buffer lowering hoists new single-address pto.pointer_cast ops before the loop, but those ops still consume valid_row/valid_col. Only addrs are checked for loop-invariance, so if either valid-dim operand is defined inside the loop (e.g., loop-dependent dynamic bounds), the transformed IR will reference out-of-scope values and break SSA correctness. The pass should either validate these operands are defined outside the loop or keep the cast inside the loop when they are not.

Useful? React with 👍 / 👎.

zhangstevenunity and others added 3 commits May 3, 2026 15:25
`VerifyConflictStage2`'s contract (per its own comment) is "block reuse
only when buffers (a) share a parent loop AND (b) have a DMA pipe
conflict". `PipeConflictInSameLoop` violated that on two counts:

1. After confirming same parent loop, it returned true unconditionally
   instead of querying `PipeConflict` for the actual DMA pipe-conflict
   relation. SPEC_LEVEL_2 was therefore stricter than SPEC_LEVEL_3
   semantically: any same-loop pair was rejected, even when no DMA
   pipe conflict existed.

2. `GetBufferParentLoop` returns nullptr for top-level buffers; two
   top-level buffers compared parentLoop1 == parentLoop2 == nullptr,
   so the "different loops -> allow reuse" early-return was bypassed
   and the function fell through to "return true". Almost every
   cross-buffer pair at function scope was getting marked as
   conflicting, blocking valid reuse and causing local-memory
   planning to fail/overflow in cases that previously fit.

Fix: reject the nullptr case explicitly, then fall through to the
real `PipeConflict` query before declaring a conflict.

166/166 lit tests still pass; the change can only widen the set of
allowed reuses, never narrow it.

Co-Authored-By: Claude <noreply@anthropic.com>
After merging upstream/main, multi-buffer-enabled loops emitted spurious
`pto.barrier <PIPE_MTE2>` and `pto.barrier <PIPE_MTE3>` ops in addition
to the multi-buffer `set_flag_dyn` / `wait_flag_dyn` pair, e.g.:

  pto.wait_flag_dyn[<PIPE_MTE3>, <PIPE_MTE2>, %2]
  pto.barrier <PIPE_MTE2>            <- redundant
  pto.tload                           ; (MTE2)
  pto.set_flag[<PIPE_MTE2>, <PIPE_MTE3>, <EVENT_ID0>]
  pto.wait_flag[<PIPE_MTE2>, <PIPE_MTE3>, <EVENT_ID0>]
  pto.barrier <PIPE_MTE3>            <- redundant
  pto.tstore                          ; (MTE3)
  pto.set_flag_dyn[<PIPE_MTE3>, <PIPE_MTE2>, %2]

Both barriers came from `InsertSyncOperation`'s same-pipe back-edge
branch unconditionally emitting `PIPE_BARRIER` for any cross-iteration
dep on a single pipe. Two of these were truly redundant:

1. Same-pipe back-edge whose dep is multi-buffer eligible. The whole
   point of multi-buffer is that consecutive iterations land in
   disjoint physical slots, so the cross-iter "same-address" dep is
   fundamentally false; the multi-buffer dyn flag pair already does
   the cross-iter ordering. No barrier needed.

2. Same-pipe back-edge on a DMA pipe (MTE1/MTE2/MTE3/MTE4/MTE5). DMA
   pipes are simple in-order command queues; the hardware itself
   serializes consecutive ops on the same DMA pipe across iterations.
   PIPE_BARRIER on a DMA pipe is a no-op and just clutters the IR.

Fix: in `InsertSyncOperation`, compute the multi-buffer eventIdNum once
(also resolves the back-edge scf.for so the cross-pipe path can reuse
it - small refactor of the existing logic) and skip the barrier when
either condition above holds. PIPE_M / PIPE_V keep the conservative
PIPE_BARRIER to preserve the "bar_m" / "bar_v" intra-pipe semantics
that higher-level frontends rely on (the existing comment in
`IsNoNeedToInsertSync` calls this out).

168/168 lit pass, including all multi-buffer regressions and the
upstream-merged comm/sync tests.

Co-Authored-By: Claude <noreply@anthropic.com>
@zhangstevenunity
Copy link
Copy Markdown
Collaborator Author

/run all

@reedhecre
Copy link
Copy Markdown

已接收 /run all,A3 板测器会处理这条请求。

页面会自动刷新,可以直接看当前阶段、排队情况和最近结果。

zhangstevenunity and others added 5 commits May 4, 2026 08:56
…ehmann-807dab

# Conflicts:
#	lib/PTO/Transforms/CMakeLists.txt
#	tools/ptoas/ptoas.cpp
Wire multi-buffer support into the GraphSyncSolver path so it matches the
behaviour the InsertSync path got in earlier P0/P1/P2 commits. Mirrors the
intra-core subset of `bishengir/lib/Dialect/HIVM/Transforms/GraphSyncSolver`
(`SyncSolver::getEventIdInfo`, `getMultiBufferEventIdInfo`,
`getMultiBufferLoop`, `SyncSolverCodeGen::getMultiBufferSelectOp`).

Detection (Utility.h, SyncSolver.h/.cpp):
- `EventIdInfo` (eventIdNum + multibufferLoop) carried per ConflictPair.
- `getMultiBufferLoop` finds the common scf.for whose iv drives the slot
  rotation; reuses the per-slot overlap helper added in P0
  (MemoryDependentAnalyzer::getMultiBufferSlotCount), so the same
  same-index-overlap / different-index-disjoint geometry rule applies.
- `getMultiBufferEventIdInfo` requires every dependent pair to agree on N
  and share the same scf.for; failure -> single-buffer.
- `getEventIdInfo` is the top-level wrapper with the backward-only gate
  (mirrors HIVM's "only optimise back-edge deps").

Solver wiring (handleSetWaitConflict):
- Threads rwOp1/rwOp2 in so MB info is computable.
- Allocates N event ids via the existing Welsh-Powell coloring node
  (EventIdNode already supports eventIdNum > 1).
- Falls back to N=1, then PIPE_ALL barrier, if N ids cannot be coloured.

Codegen (SyncSolverCodeGen.h/.cpp):
- New `emitMultiBufferSetWait` emits the HIVM/InsertSync output shape:
  pre-loop N pto.set_flag (queue prime), in-body iv-mod-N counter +
  N-way arith.select chain + pto.{wait,set}_flag_dyn, post-loop N
  pto.wait_flag (queue drain). The dyn ops live INSIDE the loop body so
  they share the selector's dominance (GSS's default backward-sync
  anchors are at the loop boundary, which works for single-buffer but
  not for a per-iteration selector).
- `loop2BufferCounter_` cache reuses one `iv mod N` across multiple
  ConflictPairs sharing a (loop, N) tuple.

Test:
- New `multi_buffer_gss_dyn_event_id.pto` is the GSS counterpart of the
  existing InsertSync N=2 regression; checks pre-loop primes, the
  selector chain, dyn set/wait, and post-loop drains.

191/191 lit pass.

Co-Authored-By: Claude <noreply@anthropic.com>
Removed references to AscendNPU-IR and clarified the design scope for PTOAS PR615 multi-buffer.
@reedhecre
Copy link
Copy Markdown

A3 板测失败

  • 触发方式:manual
  • 源码提交:d10a9ac918a7
  • 结果汇总:OK 0 / FAIL 0 / SKIP 0
  • 日志:/home/zhongxuan/ptoas-board-monitor/runtime/logs/20260504_082905_manual_pr615.log
  • 手动指令:/run all
  • 触发人:zhangstevenunity
  • 触发评论:Mutli Buffer Support  #615 (comment)
  • 失败阶段:board-validation / exit=143

日志尾部

==========
===== END STAGE sample-build-and-test rc=0 @ 2026-05-04 08:34:33 =====
pto-isa vendor cache hit: repo=https://gitcode.com/cann/pto-isa.git requested_commit=662d7f2a916d6bbde3109ce4a16ed5c28f5d900a actual_commit=662d7f2a916d6bbde3109ce4a16ed5c28f5d900a

===== STAGE board-validation @ 2026-05-04 08:34:34 =====
task-submit cwd=/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260504_082905_manual_pr615/payload
task-submit env-file=/home/zhongxuan/ptoas-board-monitor/runtime/runs/20260504_082905_manual_pr615/board-validation.env
task-submit run-script:
set -euo pipefail
cd /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260504_082905_manual_pr615/payload
export DEVICE_ID=${TASK_DEVICE:-auto}
bash ./test/npu_validation/scripts/run_remote_npu_validation.sh
task-submit wrapped-command: bash -lc "set -euo pipefail; cd /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260504_082905_manual_pr615/payload; export DEVICE_ID=${TASK_DEVICE:-auto}; bash ./test/npu_validation/scripts/run_remote_npu_validation.sh"
task-submit submit-cmd: /usr/local/bin/task-submit --device auto --max-time 0 --env-file /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260504_082905_manual_pr615/board-validation.env 'bash -lc "set -euo pipefail; cd /home/zhongxuan/ptoas-board-monitor/runtime/runs/20260504_082905_manual_pr615/payload; export DEVICE_ID=${TASK_DEVICE:-auto}; bash ./test/npu_validation/scripts/run_remote_npu_validation.sh"'
task_20260504_083434_215235327755
task-submit task-id: task_20260504_083434_215235327755
等待任务执行: task_20260504_083434_215235327755 (Ctrl+C 终止任务)
[npu-lock] 获取设备 5 的锁 (无超时)...

Session terminated, killing shell... ...killed.
=== 任务已终止 (killed) ===
task-submit wait rc=143
completed (exit=143)
===== END STAGE board-validation rc=143 @ 2026-05-06 09:40:56 =====

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants