Relax unnecessary src/dst storage restriction for scalar bitwise tile ops #614

@Zhendong404

Problem

On the main branch, pto.tands rejects IR in which src and dst refer to the same tile storage:

'pto.tands' op expects src and dst to use different storage

The restriction appears to come from the shared verifier helper used by scalar bitwise tile ops:

static FailureOr<Type> verifyDistinctRowMajorUnaryTileOpCommon(
    Operation *op, Value src, Value dst, StringRef srcName = "src",
    StringRef dstName = "dst") {
  if (src == dst) {
    op->emitOpError("expects src and dst to use different storage");
    return failure();
  }
  ...
}

TAndSOp::verify, TOrSOp::verify, and TXorSOp::verify all call this helper.

Reproducer

module attributes {pto.target_arch = "a5"} {
  func.func @tands_inplace() {
    %tile = pto.alloc_tile
      : !pto.tile_buf<loc=vec, dtype=i32, rows=16, cols=64, v_row=16, v_col=64,
                      blayout=row_major, slayout=none_box, fractal=512, pad=0>
    %scalar = arith.constant 0xFF : i32
    pto.tands ins(%tile, %scalar : !pto.tile_buf<loc=vec, dtype=i32, rows=16, cols=64, v_row=16, v_col=64,
                                      blayout=row_major, slayout=none_box, fractal=512, pad=0>, i32)
              outs(%tile : !pto.tile_buf<loc=vec, dtype=i32, rows=16, cols=64, v_row=16, v_col=64,
                                blayout=row_major, slayout=none_box, fractal=512, pad=0>)
    return
  }
}

Command:

ptoas --pto-arch=a5 --pto-backend=vpto tands_inplace.pto -o -

Actual result:

loc("tands_inplace.pto":6:5): error: 'pto.tands' op expects src and dst to use different storage
Error: Failed to parse MLIR.

Why this restriction seems too strong

The IR-level restriction seems to reflect one possible C++/tile implementation strategy rather than an inherent semantic requirement of tands.

For example, an implementation that expands the scalar into dst first and then runs a binary tile op like:

TEXPANDS_IMPL(dst, scalar);
TAND_IMPL(dst, src, dst);

cannot safely support src == dst, because broadcasting the scalar into dst overwrites the original source data before the AND operation reads it.

However, tands can be implemented in a src/dst-reuse-safe way by loading source data before storing the result. The VPTO-style lowering naturally has this shape:

vec        = vlds(src)
scalar_vec = vbr(scalar)
result     = vand(vec, scalar_vec, mask)
vsts(result, dst)

With that lowering order, src == dst is well-defined: each vector chunk is read before the corresponding result is written back.

So the current verifier rejects a semantically valid in-place form because one backend implementation strategy is not in-place safe.
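
The difference between the two strategies can be reproduced with a small host-side model. This is only a sketch: the Tile alias and both function names are hypothetical stand-ins rather than PTO APIs, each element stands for one vector chunk, and uint32_t is used in place of i32 to keep the bit patterns easy to read.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a tile buffer; not a PTO type.
using Tile = std::vector<std::uint32_t>;

// Models the TEXPANDS_IMPL + TAND_IMPL sequence: broadcast the scalar into
// dst, then AND src with dst. When src and dst are the same storage, the
// broadcast overwrites the source data before the AND reads it.
void tandsExpandThenAnd(const Tile &src, Tile &dst, std::uint32_t scalar) {
  for (std::size_t i = 0; i < dst.size(); ++i)
    dst[i] = scalar;            // TEXPANDS_IMPL(dst, scalar)
  for (std::size_t i = 0; i < dst.size(); ++i)
    dst[i] = src[i] & dst[i];   // TAND_IMPL(dst, src, dst)
}

// Models the VPTO-style order: each chunk is loaded before its result is
// stored, so src and dst may refer to the same storage.
void tandsLoadBeforeStore(const Tile &src, Tile &dst, std::uint32_t scalar) {
  for (std::size_t i = 0; i < dst.size(); ++i) {
    const std::uint32_t vec = src[i];        // vlds
    const std::uint32_t scalarVec = scalar;  // vbr
    dst[i] = vec & scalarVec;                // vand + vsts
  }
}
```

Calling tandsExpandThenAnd(t, t, 0xFF) on a tile holding {0x1234, 0xABCD} leaves every element equal to 0xFF, while tandsLoadBeforeStore(t, t, 0xFF) produces {0x34, 0xCD}, which matches the result obtained with distinct buffers.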

Expected behavior

pto.tands should allow src and dst to reuse the same tile storage when the selected lowering/backend can implement it safely.

At minimum, the verifier should not unconditionally reject src == dst for pto.tands at the IR level.

The same question likely applies to pto.tors, and possibly pto.txors depending on the intended role of its tmp operand.

Suggested fix direction

  1. Revisit verifyDistinctRowMajorUnaryTileOpCommon usage for TAndSOp, TOrSOp, and TXorSOp.
  2. Split the verifier concerns:
    • Keep shape/layout/element-type checks in the common helper.
    • Do not make src != dst a universal IR invariant unless the op semantics truly require it.
  3. Update the C++/tile lowering path for tands/tors to use an in-place-safe sequence, or route these ops through the VPTO-style vlds + vbr + vand/vor + vsts lowering where available.
  4. Add regression tests for in-place scalar bitwise tile ops, e.g.:
    • pto.tands ins(%tile, %scalar) outs(%tile) should verify and lower on A5 VPTO.
    • pto.tors ins(%tile, %scalar) outs(%tile) should verify and lower if the same reasoning applies.
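
One possible shape for the verifier split in step 2 is sketched below as a standalone model. Everything here is an assumption for illustration: TileDesc, verifyRowMajorUnaryTileOpCommon, and verifyDistinctStorage are hypothetical names, the model uses plain structs instead of MLIR's Operation/Value/FailureOr types, and the specific shape/layout/dtype checks are guesses, since the body of the real helper is elided above.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-in for the tile properties a verifier would inspect.
struct TileDesc {
  int storageId;      // identifies the underlying tile storage
  int rows;
  int cols;
  std::string dtype;
  bool rowMajor;
};

// Common checks only (layout, shape, element type). Returns an error
// message on failure, or std::nullopt on success. No aliasing check here.
std::optional<std::string>
verifyRowMajorUnaryTileOpCommon(const TileDesc &src, const TileDesc &dst) {
  if (!src.rowMajor || !dst.rowMajor)
    return std::string("expects row-major src and dst");
  if (src.rows != dst.rows || src.cols != dst.cols)
    return std::string("expects src and dst to have the same shape");
  if (src.dtype != dst.dtype)
    return std::string("expects src and dst to have the same element type");
  return std::nullopt;
}

// Opt-in check, to be called only by ops whose semantics really require
// distinct storage.
std::optional<std::string>
verifyDistinctStorage(const TileDesc &src, const TileDesc &dst) {
  if (src.storageId == dst.storageId)
    return std::string("expects src and dst to use different storage");
  return std::nullopt;
}
```

Under this split, the verifiers for tands/tors would call only the common helper, so an in-place operand pair passes verification, while any op that cannot tolerate aliasing would additionally call the distinct-storage check.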

Impact

The current restriction also interacts badly with memory planning. If PlanMemory legally reuses two short-lived tile buffers by assigning them the same local offset, later verification can fail with the same "expects src and dst to use different storage" diagnostic. Relaxing the IR-level restriction, or making the lowering explicitly safe for in-place use, would avoid rejecting valid programs. It would also relieve the memory planner of having to preserve an implementation-specific non-aliasing constraint.
