Skip to content

[AIROCMLIR-825] Correct kernelId calculation for backwards_data_conv ops#2363

Open
justinrosner wants to merge 2 commits intodevelopfrom
825-port-changes
Open

[AIROCMLIR-825] Correct kernelId calculation for backwards_data_conv ops#2363
justinrosner wants to merge 2 commits intodevelopfrom
825-port-changes

Conversation

@justinrosner
Copy link
Copy Markdown
Contributor

@justinrosner justinrosner commented May 6, 2026

Motivation

This is a port of https://github.com/ROCm/rocmlirTriton/pull/188 from
rocmlirTriton. Because rocMLIR still has the V4R1 flag (rocmlirTriton has removed it) and a separate copy of the bwd-data slice-extent math, the port carries four rocMLIR-only changes on top of that PR':

  1. commonConvRewrite (mlir/lib/Dialect/Rock/Transforms/ConvToGemm.cpp) iterates the filtered ids. The pre-port usesV4R1=false arm walked numKernels.size() and passed the loop index i as the kernel id, so e.g. ids {0, 3} were lowered as {0, 1} and the second GEMM was an invalid kernel. The port computes kernelIds once and iterates for (int64_t kernelId : kernelIds). rocmlirTriton does not need this because its bwd-data lowering already iterates filtered ids through a different backwardDataGemmForKernelId flow.
  2. commonConvRewrite validates the user-supplied id on the V4R1 path. When usesV4R1=true we do not iterate backwardDataKernelIds, we dispatch a single GEMM for the user-supplied --kernel_id. Without validation, an out-of-range id would silently flow into backwardDataV4R1 where the unsigned llvm::divideCeil would wrap a negative numerator into a multi-exabyte slice extent. The port intersects the supplied id with backwardDataKernelIds(...) and emits an error when it is out of range. rocmlirTriton has no V4R1 path, so this is rocMLIR-specific.
  3. backwardDataV4R1 slice-extent math switched to llvm::divideCeil + precondition assert. backwardDataV4R1 independently re-derives iDotSlice for its own use. The previous code used signed math_util::integer_divide_ceil, which papered over the missing input validation by harmlessly producing 0 for negative numerators. Switching to llvm::divideCeil matches the slice math in backwardDataKernelIds and makes the precondition explicit (assert(iTilda[i] < convDims.fil[i] && "kernelId not pre-filtered by backwardDataKernelIds")). Combined with 2. this guarantees that a malformed id is a hard, diagnosable failure in both debug and release builds rather than a silent OOB.
  4. New lit test mlir/test/Dialect/Rock/conv_to_gemm_bwd_data_empty_filter_slice.mlir. Pins the lowering behavior for all three of the deltas above with three sub-checks against --mlir-print-ir-after=rock-conv-to-gemm.

The remaining changes; the TransposeConvStridedConverter low-side crop fix, the rocmlir-custom-tosa-decompose.mlir lit additions, andthe mixr-bwd-data-conv-empty-filter-slice.mlir E2E coverage are direct ports of the rocmlirTriton change set with no semantic divergence.

This PR fixes two small bugs that were found when running the parameterSweeps for rocmlirTriton. Fixes: https://amd-hub.atlassian.net/browse/AIROCMLIR-825

Technical Details

  • Changes in loweringUtils:
    • backwardsDataKernelsIds enumerates the per-stride phase GEMMs that make up a single strided backwards data conv op, and skips any phase whose filter slice is empty.
    • The previous implementation was expressing the slice size as llvm::divideCeil(filterDims[i] - iTilda[i], filTilda[i]), and when filTilda[i] exceeds filterDims[i], iTilda[i] can grow larger than filterDims[i] making the numerator negative. When this would happen we would end up with a large positive "slice size" and cause an invalid kernel ID to be emitted as if it had real work to do
    • The fix detects iTilda[i] >= filterDims[i] explicitly, sets gemmKproduct = 0, and breaks out of the loop, so empty phases are correctly excluded.

While working on these changes I uncovered a secondary issue in the CPU lowering:

  • The TransposeConvStridedConverter lowers a strided bwd-data conv into a stride-1 conv by factoring stride phases out of the filter channels, running the stride-1 conv on a padded gradient-output, and then re-interleaving the phases into an expanded result that has to be cropped/padded into the final shape
  • The old low-side crop calculation was effectively oldCrop = padLow * stride - outPadLow - (K - stride - 1), where K is the effective filter size after dilation but before padding the filter up to a multiple of stride
  • The corrected geometry uses the reduced stride-1 filter size kPrime = ceil(K / stride): output position 0 maps to expanded-conv position correctOffset = padLow * (stride + 1) - stride * (kPrime - 1) - outPadLow.
  • Comparing the two formulas before clamping gives correctOffset - oldCrop = K + padLow - 1 - stride * kPrime. For example, in the failing height dimension, K=2, stride=2, kPrime=1, padLow=0, and outPadLow=0, so the old formula produced oldCrop = 1 while the correct formula gives correctOffset = 0; the old lowering incorrectly dropped the first row of the expanded result.

A new e2e test was added that exercises both fixes. The first RUN line drives the original MIGraphX bwd-data conv that initially surfaced the bug through the migraphx -> TOSA -> linalg CPU lowering, which exercises the TransposeConvStridedConverter offset fix. However, that CPU lowering does not currently support group > 1, so it can only verify a group=1 reduction of the original problem. To make sure the actual failing shape (groupsize=4) is also covered, a second RUN line invokes rocmlir-gen --operation conv_bwd_data directly with groupsize=4, which drives the rock kernel-ID path and exercises the backwardDataKernelIds fix end-to-end on the GPU. I've run all the other existing bwd_conv tests and they are passing with no problems.

Test Plan

Run Nightly and PR CI

Test Result

  • Nightly CI
  • PR CI

Submission Checklist

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes backward-data convolution lowering around empty filter slices and strided transpose-conv cropping, touching both Rock lowering and the custom TOSA decomposition path.

Changes:

  • Filter invalid backward-data kernel IDs so empty stride phases are skipped, and use the actual filtered IDs when lowering non-V4R1 bwd-data convs.
  • Replace the old low-side crop heuristic in TransposeConvStridedConverter with an explicit offset-based calculation, including a zero-overlap fallback.
  • Add Rock, TOSA-decompose, and e2e regression tests for the empty-filter-slice and CPU offset cases.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
mlir/test/fusion/pr-e2e/mixr-bwd-data-conv-empty-filter-slice.mlir Adds end-to-end regression coverage for the reported backward-data conv case.
mlir/test/Dialect/Rock/conv_to_gemm_bwd_data_empty_filter_slice.mlir Adds pass-level checks that empty filter slices no longer emit phantom rock.gemm ops.
mlir/test/Conversion/RocmlirCustomTosaDecompose/rocmlir-custom-tosa-decompose.mlir Adds regression checks for the strided transpose-conv offset and zero-overlap CPU lowering cases.
mlir/lib/Dialect/Rock/utility/loweringUtils.cpp Filters out kernel IDs whose filter slice is empty before returning the per-phase ID list.
mlir/lib/Dialect/Rock/Transforms/ConvToGemm.cpp Uses filtered kernel IDs in non-V4R1 lowering and tightens the slice calculation in backwardDataV4R1.
mlir/lib/Conversion/RocmlirCustomTosaDecompose/RocmlirCustomTosaDecompose.cpp Reworks low-side offset/crop handling and adds safe zero-result materialization for empty overlap.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mlir/lib/Dialect/Rock/Transforms/ConvToGemm.cpp
Comment thread mlir/test/fusion/pr-e2e/mixr-bwd-data-conv-empty-filter-slice.mlir
@justinrosner
Copy link
Copy Markdown
Contributor Author

justinrosner commented May 6, 2026

I did the AI review for the rocmlirTriton PR, but based on the Copilot comments it seems like there are some slight changes in rocMLIR that will require an additional review here (because of the additional v4r1 behavior here)

UPDATE: Ready for review

@justinrosner justinrosner marked this pull request as draft May 6, 2026 02:39
@justinrosner justinrosner marked this pull request as ready for review May 6, 2026 13:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants