Problem
On the main branch, pto.tands rejects IR where src and dst refer to the same tile storage:
'pto.tands' op expects src and dst to use different storage
The restriction appears to come from the shared verifier helper used by scalar bitwise tile ops:
static FailureOr<Type> verifyDistinctRowMajorUnaryTileOpCommon(
Operation *op, Value src, Value dst, StringRef srcName = "src",
StringRef dstName = "dst") {
if (src == dst) {
op->emitOpError("expects src and dst to use different storage");
return failure();
}
...
}
TAndSOp::verify, TOrSOp::verify, and TXorSOp::verify all call this helper.
Reproducer
module attributes {pto.target_arch = "a5"} {
func.func @tands_inplace() {
%tile = pto.alloc_tile
: !pto.tile_buf<loc=vec, dtype=i32, rows=16, cols=64, v_row=16, v_col=64,
blayout=row_major, slayout=none_box, fractal=512, pad=0>
%scalar = arith.constant 0xFF : i32
pto.tands ins(%tile, %scalar : !pto.tile_buf<loc=vec, dtype=i32, rows=16, cols=64, v_row=16, v_col=64,
blayout=row_major, slayout=none_box, fractal=512, pad=0>, i32)
outs(%tile : !pto.tile_buf<loc=vec, dtype=i32, rows=16, cols=64, v_row=16, v_col=64,
blayout=row_major, slayout=none_box, fractal=512, pad=0>)
return
}
}
Command:
ptoas --pto-arch=a5 --pto-backend=vpto tands_inplace.pto -o -
Actual result:
loc("tands_inplace.pto":6:5): error: 'pto.tands' op expects src and dst to use different storage
Error: Failed to parse MLIR.
Why this restriction seems too strong
The IR-level restriction seems to reflect one possible C++/tile implementation strategy rather than an inherent semantic requirement of tands.
For example, an implementation that expands the scalar into dst first and then runs a binary tile op like:
TEXPANDS_IMPL(dst, scalar);
TAND_IMPL(dst, src, dst);
cannot safely support src == dst, because broadcasting the scalar into dst overwrites the original source before the and operation reads it.
However, tands can be implemented in a src/dst-reuse-safe way by loading source data before storing the result. The VPTO-style lowering naturally has this shape:
vec = vlds(src)
scalar_vec = vbr(scalar)
result = vand(vec, scalar_vec, mask)
vsts(result, dst)
With that lowering order, src == dst is well-defined: each vector chunk is read before the corresponding result is written back.
So the current verifier rejects a semantically valid in-place form because one backend implementation strategy is not in-place safe.
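The difference between the two strategies can be illustrated with a small host-side model. This is only a minimal C++ sketch, not the actual PTO implementation: Tile, tandsExpandThenAnd, tandsLoadThenStore, and the chunk width kChunk are hypothetical names introduced for illustration, and a std::vector stands in for tile storage.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a tile: a flat buffer of i32 elements (assumption).
using Tile = std::vector<std::int32_t>;

// Models the expand-then-and strategy:
//   TEXPANDS_IMPL(dst, scalar); TAND_IMPL(dst, src, dst);
// If src and dst alias, the broadcast overwrites the source before it is read.
void tandsExpandThenAnd(const Tile &src, Tile &dst, std::int32_t scalar) {
  assert(src.size() == dst.size());
  for (std::size_t i = 0; i < dst.size(); ++i)
    dst[i] = scalar;            // broadcast scalar into dst
  for (std::size_t i = 0; i < dst.size(); ++i)
    dst[i] = src[i] & dst[i];   // binary and; src may already be clobbered
}

// Models the load/compute/store shape (vlds, vbr + vand, vsts):
// every chunk is copied out of src before its result is written to dst.
void tandsLoadThenStore(const Tile &src, Tile &dst, std::int32_t scalar) {
  assert(src.size() == dst.size());
  constexpr std::size_t kChunk = 4;  // hypothetical vector width in elements
  for (std::size_t base = 0; base < src.size(); base += kChunk) {
    const std::size_t n = std::min(kChunk, src.size() - base);
    std::int32_t vec[kChunk] = {};
    for (std::size_t j = 0; j < n; ++j) vec[j] = src[base + j];  // load
    for (std::size_t j = 0; j < n; ++j) vec[j] &= scalar;        // and
    for (std::size_t j = 0; j < n; ++j) dst[base + j] = vec[j];  // store
  }
}
```

Passing the same vector as both src and dst, for example {0x1234, 0x00F0, -1} with scalar 0xFF, the load-then-store model produces {0x34, 0xF0, 0xFF}, while the expand-then-and model produces {0xFF, 0xFF, 0xFF} because the broadcast has already replaced the source values.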
Expected behavior
pto.tands should allow src and dst to reuse the same tile storage when the selected lowering/backend can implement it safely.
At minimum, the verifier should not unconditionally reject src == dst for pto.tands at the IR level.
The same question likely applies to pto.tors, and possibly pto.txors depending on the intended role of its tmp operand.
Suggested fix direction
- Revisit verifyDistinctRowMajorUnaryTileOpCommon usage for TAndSOp, TOrSOp, and TXorSOp.
- Split the verifier concerns:
  - Keep shape/layout/element-type checks in the common helper.
  - Do not make src != dst a universal IR invariant unless the op semantics truly require it.
- Update the C++/tile lowering path for tands/tors to use an in-place-safe sequence, or route these ops through the VPTO-style vlds + vbr + vand/vor + vsts lowering where available.
- Add regression tests for in-place scalar bitwise tile ops, e.g.:
  - pto.tands ins(%tile, %scalar) outs(%tile) should verify and lower on A5 VPTO.
  - pto.tors ins(%tile, %scalar) outs(%tile) should verify and lower if the same reasoning applies.
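One possible shape for the verifier split is sketched below as a standalone C++ model. It does not use the real MLIR types or the actual helper body (which is elided above): TileType, TileValue, Aliasing, and verifyRowMajorUnaryTileOp are hypothetical names, and the specific type checks are assumptions standing in for whatever the common helper already verifies. The point is only that the storage check becomes an opt-in policy of each op rather than part of the shared checks.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical, simplified stand-ins for what the real verifier inspects.
struct TileType {
  std::string dtype;
  int rows;
  int cols;
  bool rowMajor;
  bool operator==(const TileType &o) const {
    return dtype == o.dtype && rows == o.rows && cols == o.cols &&
           rowMajor == o.rowMajor;
  }
};

struct TileValue {
  int storageId;  // identifies the underlying tile storage (assumption)
  TileType type;
};

// Per-op policy: does this op's semantics require distinct storage?
enum class Aliasing { Forbidden, Allowed };

// Returns an error message, or std::nullopt when verification succeeds.
std::optional<std::string> verifyRowMajorUnaryTileOp(const TileValue &src,
                                                     const TileValue &dst,
                                                     Aliasing aliasing) {
  // Common checks that every caller keeps (illustrative only).
  if (!src.type.rowMajor || !dst.type.rowMajor)
    return std::string("expects row_major src and dst");
  if (!(src.type == dst.type))
    return std::string("expects src and dst to have the same tile type");
  // The storage check applies only when the op opts in.
  if (aliasing == Aliasing::Forbidden && src.storageId == dst.storageId)
    return std::string("expects src and dst to use different storage");
  return std::nullopt;
}
```

Under this model, an op such as tands would pass Aliasing::Allowed once its lowering is known to be in-place safe, while an op that genuinely needs distinct storage would keep passing Aliasing::Forbidden.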
Impact
The current restriction also interacts badly with memory planning: if PlanMemory legally reuses two short-lived tile buffers by assigning the same local offset, later verification can fail with the same src/dst different storage diagnostic. Relaxing the IR-level restriction, or making lowering explicitly safe for in-place use, would avoid rejecting valid programs and reduce pressure on the memory planner to preserve an implementation-specific non-aliasing constraint.