Multi Buffer Support #615
Conversation
Code Review
This pull request introduces comprehensive support for multi-buffering and automated memory planning within the PTO framework. Key changes include the addition of the PTOEnableMultiBuffer pass to lower variadic pointer casts, dynamic event ID selection in SyncCodegen using modulo arithmetic, and a significant refactor of the PlanMemory pass. The memory planner now supports N-buffer siblings, automatic sizing of reserve_buffer based on FIFO requirements, and a retry mechanism with deterministic shuffling to mitigate order-sensitive allocation failures. Feedback identifies critical issues regarding missing loop-invariance checks when hoisting operations, ambiguous tracking of planned offsets using zero as a sentinel, and the need for graceful error diagnostics instead of compiler crashes when encountering invalid input IR.
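The retry mechanism with deterministic shuffling mentioned above can be modeled as re-seeding an RNG with the attempt index, so each retry permutes the allocation order reproducibly. A minimal stand-alone sketch — `shuffledOrder` and its role are illustrative assumptions, not the actual PlanMemory API:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical sketch: when planning fails, re-seed a deterministic RNG with
// the attempt number and shuffle the buffer allocation order before retrying.
// Seeding with the attempt index makes every retry reproducible across runs.
std::vector<int> shuffledOrder(std::size_t numBuffers, unsigned attempt) {
  std::vector<int> order(numBuffers);
  std::iota(order.begin(), order.end(), 0); // identity order 0..numBuffers-1
  std::mt19937 rng(attempt);                // deterministic per-attempt seed
  std::shuffle(order.begin(), order.end(), rng);
  return order;
}
```

Because the seed is the attempt index, two runs of the compiler produce the same retry sequence, which keeps allocation failures reproducible while still escaping order-sensitive corner cases.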
```cpp
rewriter.setInsertionPoint(forOp);
SmallVector<Value> slotBufs;
slotBufs.reserve(n);
for (unsigned i = 0; i < n; ++i) {
  auto oneAddr = addrs.slice(i, 1);
  PointerCastOp slot = rewriter.create<PointerCastOp>(
      loc, resTy, oneAddr, validRow, validCol,
      config.has_value()
          ? static_cast<Attribute>(*config)
          : Attribute());
  slotBufs.push_back(slot.getResult());
}
```
The pass hoists pto.pointer_cast operations outside the enclosing scf.for loop without verifying that all operands (addresses, validRow, validCol) are loop-invariant. If any operand is defined inside the loop, this hoisting will result in invalid IR where a value is used before its definition. You should check if all operands are defined outside the loop before attempting to hoist.
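In real MLIR code this guard would typically use `isDefinedOutsideOfLoop` from `LoopLikeOpInterface` on the `scf.for`. As a self-contained illustration, the toy model below records only whether each value is defined inside the loop; all names are illustrative stand-ins, not the actual pass API:

```cpp
#include <algorithm>
#include <vector>

// Toy stand-in for an SSA value: records whether its defining op
// sits inside the enclosing scf.for.
struct Value {
  bool definedInsideLoop;
};

// Hoisting the pointer_cast above the loop is only legal when every
// operand (addrs, validRow, validCol) dominates the loop, i.e. is
// defined outside it.
bool canHoistPointerCast(const std::vector<Value> &operands) {
  return std::all_of(operands.begin(), operands.end(),
                     [](const Value &v) { return !v.definedInsideLoop; });
}
```

When the check fails, the safe fallback is to keep the cast inside the loop rather than emit IR that violates dominance.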
```cpp
bool hasRelationSlots =
    !e->relationOtherBuffers.empty() &&
    llvm::all_of(e->relationOtherBuffers, [](StorageEntry *s) {
      return s->bitsOffset != 0;
    });
```
Using bitsOffset != 0 as a check to determine if a StorageEntry has been planned is unreliable because 0 is a valid planned offset (e.g., for the first buffer in a memory space or for zero-sized entries as seen in line 2052). This ambiguity can cause hasRelationSlots to be incorrectly false, leading to skipped conflict checks in VerifyConflictStage1. Consider using a sentinel value like std::numeric_limits<uint64_t>::max() or an explicit boolean flag to track whether an entry has been planned.
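A minimal sketch of the suggested fix, using `std::optional` so "not yet planned" is distinguishable from a legitimate offset of 0. `StorageEntry` here is a toy stand-in for the pass's real struct, not its actual definition:

```cpp
#include <cstdint>
#include <optional>

// An empty optional means "not planned yet"; a present value of 0 is a
// perfectly valid planned offset (first buffer in a memory space).
struct StorageEntry {
  std::optional<uint64_t> bitsOffset;
  bool isPlanned() const { return bitsOffset.has_value(); }
};
```

The `hasRelationSlots` predicate would then call `s->isPlanned()` instead of comparing against 0, removing the ambiguity entirely.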
```cpp
llvm_unreachable("The buffer memory has been released and cannot be "
                 "used again before it is redefined! ");
```
Codex Review (this comment is automatically updated by the review bot).
Summary

PR #615 introduces wrong-code risks in multi-buffer lowering and dynamic event-id selection, plus a regression in InsertSync's multi-buffer dedup path.

Findings
The new pipeline only runs
PR #615 added the SyncCodegen scaffolding for multi-buffer set/wait_flag_dyn ops, but the upstream analysis path was inert: GetEventIdNum hard-coded return 1, and PTOIRTranslator dropped every pto.pointer_cast addr operand beyond the first. The N>1 multi-buffer code path was therefore unreachable. This change makes the PlanMemory -> InsertSync handshake actually work:

- PTOIRTranslator::UpdatePointerCastOpMemInfo now translates *all* N pto.pointer_cast i64 addr operands into BaseMemInfo.baseAddresses (constants -> bit offsets, non-constants -> hasVariableAddress=true). Subview/alias deltaOffset is applied to every slot, not just slot 0.
- MemoryDependentAnalyzer gains getMultiBufferSlotCount(a, b) implementing HIVM's per-slot geometry check: same-index slots must overlap (a real back-edge dep on the same physical buffer) and different-index slots must NOT overlap (so consecutive iterations land in disjoint physical buffers).
- InsertSyncAnalysis::GetEventIdNum is rewritten to follow HIVM semantics: every dependent pair must be multi-buffer-eligible, all pairs must agree on N, and every involved buffer must hang off the same scf.for.

Verified end-to-end: a `pto.multi_buffer = 2` alloc inside scf.for now emits 2 reserved event ids, an `iv mod 2` arith.select chain, and `pto.{set,wait}_flag_dyn` ops driven by the selected idx. New regression guards this in test/lit/pto/multi_buffer_insert_sync_dyn_event_id.pto.

Co-Authored-By: Claude <noreply@anthropic.com>
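The `iv mod N` selector described above boils down to indexing a reserved event-id pool by the loop induction variable modulo N. A minimal stand-alone model of what the emitted arith.remui + arith.select chain computes — the function name is hypothetical:

```cpp
#include <cstdint>
#include <vector>

// For a multi-buffer depth of N = eventIds.size(), iteration iv lands in
// physical slot iv % N, so the set/wait_flag_dyn ops must be driven by
// eventIds[iv % N].
uint64_t selectEventId(const std::vector<uint64_t> &eventIds, uint64_t iv) {
  return eventIds[iv % eventIds.size()];
}
```

For N=2 this alternates between the two reserved ids on even/odd iterations, which is exactly the ping/pong ordering the dyn flag pair enforces.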
Builds on P0 to make the multi-buffer pipeline robust under realistic pressure:

- SyncEventIdAllocation: HIVM-style TryFallbackMultiBuffer. When the event-id pool can't satisfy N, fall back to N=2 first (when N is even and >2), then to 1, instead of silently returning with an empty eventIds vec. Both set/wait siblings get their `eventIdNum` updated in lockstep so SyncCodegen takes the same code path on each side.
- PTOPlanMemory: add `StorageEntry::isMultiBufferSlot` flag and rewrite `UpdateBuffer2Offsets` to walk via primaries and emit slots in declared `relationOtherBuffers` order. This makes the slot-ordering contract that EnableMultiBuffer relies on (`offsets[i]` <-> iv mod N selector index `i`) explicit and verifies it via a runtime assertion.
- PTOEnableMultiBuffer: add a scope guard (skip non-VEC/MAT casts where the iv-mod-N selector is not meaningful) and a loop-invariance guard (skip casts whose addrs are not loop-invariant; hoisting them above the for loop would break SSA dominance).
- New regression at test/lit/pto/multi_buffer_n4_insert_sync.pto: N=4 slot ordering [0, 1024, 2048, 3072] in pto.pointer_cast, a 3-way arith.select chain, and 4 distinct event ids drained at function end.

Co-Authored-By: Claude <noreply@anthropic.com>
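The TryFallbackMultiBuffer policy above can be sketched as a small pure function. This is a toy model of the described behavior, not the actual allocator code; in the real pass both set/wait siblings also get their `eventIdNum` updated in lockstep:

```cpp
// If the event-id pool cannot supply the requested N ids, degrade to N=2
// first (only when N is even and greater than 2), otherwise fall back to
// single-buffer rather than returning an empty event-id set.
unsigned fallbackEventIdNum(unsigned requestedN, unsigned availableIds) {
  if (availableIds >= requestedN)
    return requestedN; // pool satisfies the request as-is
  if (requestedN > 2 && requestedN % 2 == 0 && availableIds >= 2)
    return 2;          // even N > 2: halve to the classic ping/pong shape
  return 1;            // last resort: plain single-buffer synchronization
}
```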
Final hardening pass tightening multi-buffer correctness around the edges
PR-615 and earlier P0/P1 commits left ambiguous.
- InsertSyncAnalysis: GetEventIdNum now takes the back-edge scf.for and
requires every involved buffer's enclosing loop to *equal* that loop.
Previously a multi-buffer alloc nested inside an inner loop could
silently rotate slots on the wrong induction variable when a back-edge
on an outer loop carried the dep. Forward deps unconditionally collapse
to single-buffer.
- PTOEnableMultiBuffer: cache `iv mod N` per (loop, N) so several
multi-buffer pointer_casts sharing a loop reuse one counter, mirroring
SyncCodegen::loop2BufferCounter. Avoids redundant arith.remui +
constant ops in the loop body.
- PTOPlanMemory:
* `MemPlan::emitMultiBufferError` + `multiBufferDiagnosticEmitted_`
flag converts multi-buffer-specific `report_fatal_error` calls
(slot-order mismatch, pong outline placement failure) into
diagnostics that bubble through `plan()` as `failure()`. This lets
the existing PlanMemoryPass retry loop re-seed instead of aborting
the compiler under heavy multi-buffer memory pressure.
* `VerifyConflictStage1` now enumerates *every* historical
multi-buffer slot offset (via `CollectMultiRelationPongAnchors`)
as a SPEC_LEVEL_1 reuse anchor candidate. PR-615 only used the last
slot, silently dropping reuse for N > 2.
- New regression at test/lit/pto/multi_buffer_nested_loop.pto: nested
scf.for with multi-buffer alloc in the inner loop must rotate slots on
the inner iv, not the outer one.
Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d7db3f876e
```cpp
auto parentLoop1 = GetBufferParentLoop(e1->inplaceBuffers);
auto parentLoop2 = GetBufferParentLoop(e2->inplaceBuffers);
if (parentLoop1 != parentLoop2) {
  return false;
}
return true;
```
Require actual pipe conflict in SPEC_LEVEL_2 filtering
VerifyConflictStage2 says it should reject reuse only when buffers both share a parent loop and have a DMA pipe conflict, but PipeConflictInSameLoop currently returns true for any pair in the same loop (including both top-level buffers with nullptr parent loops). This turns almost all same-scope candidates into conflicts, which can unnecessarily block reuse and cause local-memory planning to fail/overflow in cases that previously fit.
```cpp
PointerCastOp slot = rewriter.create<PointerCastOp>(
    loc, resTy, oneAddr, validRow, validCol,
    config.has_value()
```
Check hoisted pointer-cast dims are loop-invariant
The multi-buffer lowering hoists new single-address pto.pointer_cast ops before the loop, but those ops still consume valid_row/valid_col. Only addrs are checked for loop-invariance, so if either valid-dim operand is defined inside the loop (e.g., loop-dependent dynamic bounds), the transformed IR will reference out-of-scope values and break SSA correctness. The pass should either validate these operands are defined outside the loop or keep the cast inside the loop when they are not.
`VerifyConflictStage2`'s contract (per its own comment) is "block reuse only when buffers (a) share a parent loop AND (b) have a DMA pipe conflict". `PipeConflictInSameLoop` violated that on two counts:

1. After confirming the same parent loop, it returned true unconditionally instead of querying `PipeConflict` for the actual DMA pipe-conflict relation. SPEC_LEVEL_2 was therefore semantically stricter than SPEC_LEVEL_3: any same-loop pair was rejected, even when no DMA pipe conflict existed.
2. `GetBufferParentLoop` returns nullptr for top-level buffers; two top-level buffers compared parentLoop1 == parentLoop2 == nullptr, so the "different loops -> allow reuse" early-return was bypassed and the function fell through to "return true". Almost every cross-buffer pair at function scope was getting marked as conflicting, blocking valid reuse and causing local-memory planning to fail/overflow in cases that previously fit.

Fix: reject the nullptr case explicitly, then fall through to the real `PipeConflict` query before declaring a conflict. 166/166 lit tests still pass; the change can only widen the set of allowed reuses, never narrow it.

Co-Authored-By: Claude <noreply@anthropic.com>
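The corrected contract can be modeled as a small predicate. `Loop` and the boolean pipe-conflict input below are toy stand-ins for `GetBufferParentLoop` / `PipeConflict`, not the pass's real types:

```cpp
#include <cstddef>

struct Loop {}; // stand-in for the enclosing scf.for

// Reject reuse only when both buffers hang off the same *non-null* parent
// loop AND an actual DMA pipe conflict exists.
bool pipeConflictInSameLoop(const Loop *parent1, const Loop *parent2,
                            bool haveDmaPipeConflict) {
  if (parent1 != parent2)
    return false; // different loops: reuse is fine
  if (parent1 == nullptr)
    return false; // both top-level: "same loop" does not apply
  return haveDmaPipeConflict; // only now consult the real conflict relation
}
```

The nullptr guard is the key line: without it, two top-level buffers compare equal (nullptr == nullptr) and fall through to the conflict branch.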
After merging upstream/main, multi-buffer-enabled loops emitted spurious `pto.barrier <PIPE_MTE2>` and `pto.barrier <PIPE_MTE3>` ops in addition to the multi-buffer `set_flag_dyn` / `wait_flag_dyn` pair, e.g.:

```
pto.wait_flag_dyn[<PIPE_MTE3>, <PIPE_MTE2>, %2]
pto.barrier <PIPE_MTE2>                              <- redundant
pto.tload                                            ; (MTE2)
pto.set_flag[<PIPE_MTE2>, <PIPE_MTE3>, <EVENT_ID0>]
pto.wait_flag[<PIPE_MTE2>, <PIPE_MTE3>, <EVENT_ID0>]
pto.barrier <PIPE_MTE3>                              <- redundant
pto.tstore                                           ; (MTE3)
pto.set_flag_dyn[<PIPE_MTE3>, <PIPE_MTE2>, %2]
```

Both barriers came from `InsertSyncOperation`'s same-pipe back-edge branch unconditionally emitting `PIPE_BARRIER` for any cross-iteration dep on a single pipe. Two cases were truly redundant:

1. A same-pipe back-edge whose dep is multi-buffer eligible. The whole point of multi-buffer is that consecutive iterations land in disjoint physical slots, so the cross-iter "same-address" dep is fundamentally false; the multi-buffer dyn flag pair already does the cross-iter ordering. No barrier needed.
2. A same-pipe back-edge on a DMA pipe (MTE1/MTE2/MTE3/MTE4/MTE5). DMA pipes are simple in-order command queues; the hardware itself serializes consecutive ops on the same DMA pipe across iterations. PIPE_BARRIER on a DMA pipe is a no-op and just clutters the IR.

Fix: in `InsertSyncOperation`, compute the multi-buffer eventIdNum once (also resolving the back-edge scf.for so the cross-pipe path can reuse it; a small refactor of the existing logic) and skip the barrier when either condition above holds. PIPE_M / PIPE_V keep the conservative PIPE_BARRIER to preserve the "bar_m" / "bar_v" intra-pipe semantics that higher-level frontends rely on (the existing comment in `IsNoNeedToInsertSync` calls this out).

168/168 lit pass, including all multi-buffer regressions and the upstream-merged comm/sync tests.

Co-Authored-By: Claude <noreply@anthropic.com>
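The two elision rules above reduce to a small predicate on the pipe kind and the multi-buffer depth. This is a toy model under the assumptions stated in the commit; the enum members mirror the pipe names, and the function name is hypothetical:

```cpp
enum class Pipe { M, V, MTE1, MTE2, MTE3, MTE4, MTE5 };

// MTE1-MTE5 are the in-order DMA command queues; PIPE_M / PIPE_V are not.
bool isDmaPipe(Pipe p) { return p != Pipe::M && p != Pipe::V; }

// A same-pipe back-edge needs a PIPE_BARRIER only when the dep is NOT
// multi-buffer eligible AND the pipe is NOT an in-order DMA queue.
bool needsSamePipeBarrier(Pipe p, unsigned eventIdNum) {
  if (eventIdNum > 1)
    return false; // multi-buffer: consecutive iterations use disjoint slots
  if (isDmaPipe(p))
    return false; // the DMA queue serializes its own commands in order
  return true;    // PIPE_M / PIPE_V keep the conservative bar_m / bar_v
}
```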
/run all
Received.
The page refreshes automatically; you can see the current stage, queue status, and recent results directly.
…ehmann-807dab

# Conflicts:
#	lib/PTO/Transforms/CMakeLists.txt
#	tools/ptoas/ptoas.cpp
Wire multi-buffer support into the GraphSyncSolver path so it matches the
behaviour the InsertSync path got in earlier P0/P1/P2 commits. Mirrors the
intra-core subset of `bishengir/lib/Dialect/HIVM/Transforms/GraphSyncSolver`
(`SyncSolver::getEventIdInfo`, `getMultiBufferEventIdInfo`,
`getMultiBufferLoop`, `SyncSolverCodeGen::getMultiBufferSelectOp`).
Detection (Utility.h, SyncSolver.h/.cpp):
- `EventIdInfo` (eventIdNum + multibufferLoop) carried per ConflictPair.
- `getMultiBufferLoop` finds the common scf.for whose iv drives the slot
rotation; reuses the per-slot overlap helper added in P0
(MemoryDependentAnalyzer::getMultiBufferSlotCount), so the same
same-index-overlap / different-index-disjoint geometry rule applies.
- `getMultiBufferEventIdInfo` requires every dependent pair to agree on N
and share the same scf.for; failure -> single-buffer.
- `getEventIdInfo` is the top-level wrapper with the backward-only gate
(mirrors HIVM's "only optimise back-edge deps").
Solver wiring (handleSetWaitConflict):
- Threads rwOp1/rwOp2 in so MB info is computable.
- Allocates N event ids via the existing Welsh-Powell coloring node
(EventIdNode already supports eventIdNum > 1).
- Falls back to N=1, then PIPE_ALL barrier, if N ids cannot be coloured.
Codegen (SyncSolverCodeGen.h/.cpp):
- New `emitMultiBufferSetWait` emits the HIVM/InsertSync output shape:
pre-loop N pto.set_flag (queue prime), in-body iv-mod-N counter +
N-way arith.select chain + pto.{wait,set}_flag_dyn, post-loop N
pto.wait_flag (queue drain). The dyn ops live INSIDE the loop body so
they share the selector's dominance (GSS's default backward-sync
anchors are at the loop boundary, which works for single-buffer but
not for a per-iteration selector).
- `loop2BufferCounter_` cache reuses one `iv mod N` across multiple
ConflictPairs sharing a (loop, N) tuple.
Test:
- New `multi_buffer_gss_dyn_event_id.pto` is the GSS counterpart of the
existing InsertSync N=2 regression; checks pre-loop primes, the
selector chain, dyn set/wait, and post-loop drains.
191/191 lit pass.
Co-Authored-By: Claude <noreply@anthropic.com>
Removed references to AscendNPU-IR and clarified the design scope for PTOAS PR615 multi-buffer.
A3 board test failed.
Log tail
No description provided.