[AIROCMLIR-825] Correct kernelId calculation for backwards_data_conv ops by justinrosner · Pull Request #2363 · ROCm/rocMLIR

justinrosner · 2026-05-06T02:24:43Z

Motivation

This is a port of https://github.com/ROCm/rocmlirTriton/pull/188 from
rocmlirTriton. Because rocMLIR still has the V4R1 flag (rocmlirTriton has removed it) and a separate copy of the bwd-data slice-extent math, the port carries four rocMLIR-only changes on top of that PR':

commonConvRewrite (mlir/lib/Dialect/Rock/Transforms/ConvToGemm.cpp) iterates the filtered ids. The pre-port usesV4R1=false arm walked numKernels.size() and passed the loop index i as the kernel id, so e.g. ids {0, 3} were lowered as {0, 1} and the second GEMM was an invalid kernel. The port computes kernelIds once and iterates for (int64_t kernelId : kernelIds). rocmlirTriton does not need this because its bwd-data lowering already iterates filtered ids through a different backwardDataGemmForKernelId flow.
commonConvRewrite validates the user-supplied id on the V4R1 path. When usesV4R1=true we do not iterate backwardDataKernelIds, we dispatch a single GEMM for the user-supplied --kernel_id. Without validation, an out-of-range id would silently flow into backwardDataV4R1 where the unsigned llvm::divideCeil would wrap a negative numerator into a multi-exabyte slice extent. The port intersects the supplied id with backwardDataKernelIds(...) and emits an error when it is out of range. rocmlirTriton has no V4R1 path, so this is rocMLIR-specific.
backwardDataV4R1 slice-extent math switched to llvm::divideCeil + precondition assert. backwardDataV4R1 independently re-derives iDotSlice for its own use. The previous code used signed math_util::integer_divide_ceil, which papered over the missing input validation by harmlessly producing 0 for negative numerators. Switching to llvm::divideCeil matches the slice math in backwardDataKernelIds and makes the precondition explicit (assert(iTilda[i] < convDims.fil[i] && "kernelId not pre-filtered by backwardDataKernelIds")). Combined with 2. this guarantees that a malformed id is a hard, diagnosable failure in both debug and release builds rather than a silent OOB.
New lit test mlir/test/Dialect/Rock/conv_to_gemm_bwd_data_empty_filter_slice.mlir. Pins the lowering behavior for all three of the deltas above with three sub-checks against --mlir-print-ir-after=rock-conv-to-gemm.

The remaining changes; the TransposeConvStridedConverter low-side crop fix, the rocmlir-custom-tosa-decompose.mlir lit additions, andthe mixr-bwd-data-conv-empty-filter-slice.mlir E2E coverage are direct ports of the rocmlirTriton change set with no semantic divergence.

This PR fixes two small bugs that were found when running the parameterSweeps for rocmlirTriton. Fixes: https://amd-hub.atlassian.net/browse/AIROCMLIR-825

Technical Details

Changes in loweringUtils:
- backwardsDataKernelsIds enumerates the per-stride phase GEMMs that make up a single strided backwards data conv op, and skips any phase whose filter slice is empty.
- The previous implementation was expressing the slice size as llvm::divideCeil(filterDims[i] - iTilda[i], filTilda[i]), and when filTilda[i] exceeds filterDims[i], iTilda[i] can grow larger than filterDims[i] making the numerator negative. When this would happen we would end up with a large positive "slice size" and cause an invalid kernel ID to be emitted as if it had real work to do
- The fix detects iTilda[i] >= filterDims[i] explicitly, sets gemmKproduct = 0, and breaks out of the loop, so empty phases are correctly excluded.

While working on these changes I uncovered a secondary issue in the CPU lowering:

The TransposeConvStridedConverter lowers a strided bwd-data conv into a stride-1 conv by factoring stride phases out of the filter channels, running the stride-1 conv on a padded gradient-output, and then re-interleaving the phases into an expanded result that has to be cropped/padded into the final shape
The old low-side crop calculation was effectively oldCrop = padLow * stride - outPadLow - (K - stride - 1), where K is the effective filter size after dilation but before padding the filter up to a multiple of stride
The corrected geometry uses the reduced stride-1 filter size kPrime = ceil(K / stride): output position 0 maps to expanded-conv position correctOffset = padLow * (stride + 1) - stride * (kPrime - 1) - outPadLow.
Comparing the two formulas before clamping gives correctOffset - oldCrop = K + padLow - 1 - stride * kPrime. For example, in the failing height dimension, K=2, stride=2, kPrime=1, padLow=0, and outPadLow=0, so the old formula produced oldCrop = 1 while the correct formula gives correctOffset = 0; the old lowering incorrectly dropped the first row of the expanded result.

A new e2e test was added that exercises both fixes. The first RUN line drives the original MIGraphX bwd-data conv that initially surfaced the bug through the migraphx -> TOSA -> linalg CPU lowering, which exercises the TransposeConvStridedConverter offset fix. However, that CPU lowering does not currently support group > 1, so it can only verify a group=1 reduction of the original problem. To make sure the actual failing shape (groupsize=4) is also covered, a second RUN line invokes rocmlir-gen --operation conv_bwd_data directly with groupsize=4, which drives the rock kernel-ID path and exercises the backwardDataKernelIds fix end-to-end on the GPU. I've run all the other existing bwd_conv tests and they are passing with no problems.

Test Plan

Run Nightly and PR CI

Test Result

Nightly CI
PR CI

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull request overview

Fixes backward-data convolution lowering around empty filter slices and strided transpose-conv cropping, touching both Rock lowering and the custom TOSA decomposition path.

Changes:

Filter invalid backward-data kernel IDs so empty stride phases are skipped, and use the actual filtered IDs when lowering non-V4R1 bwd-data convs.
Replace the old low-side crop heuristic in TransposeConvStridedConverter with an explicit offset-based calculation, including a zero-overlap fallback.
Add Rock, TOSA-decompose, and e2e regression tests for the empty-filter-slice and CPU offset cases.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
`mlir/test/fusion/pr-e2e/mixr-bwd-data-conv-empty-filter-slice.mlir`	Adds end-to-end regression coverage for the reported backward-data conv case.
`mlir/test/Dialect/Rock/conv_to_gemm_bwd_data_empty_filter_slice.mlir`	Adds pass-level checks that empty filter slices no longer emit phantom `rock.gemm` ops.
`mlir/test/Conversion/RocmlirCustomTosaDecompose/rocmlir-custom-tosa-decompose.mlir`	Adds regression checks for the strided transpose-conv offset and zero-overlap CPU lowering cases.
`mlir/lib/Dialect/Rock/utility/loweringUtils.cpp`	Filters out kernel IDs whose filter slice is empty before returning the per-phase ID list.
`mlir/lib/Dialect/Rock/Transforms/ConvToGemm.cpp`	Uses filtered kernel IDs in non-V4R1 lowering and tightens the slice calculation in `backwardDataV4R1`.
`mlir/lib/Conversion/RocmlirCustomTosaDecompose/RocmlirCustomTosaDecompose.cpp`	Reworks low-side offset/crop handling and adds safe zero-result materialization for empty overlap.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

justinrosner · 2026-05-06T02:39:31Z

I did the AI review for the rocmlirTriton PR, but based on the Copilot comments it seems like there are some slight changes in rocMLIR that will require an additional review here (because of the additional v4r1 behavior here)

UPDATE: Ready for review

Port changes to rocMLIR

8bbd978

justinrosner requested review from Copilot, dhernandez0, erizheng-amd, pabloantoniom and umangyadav May 6, 2026 02:24

justinrosner requested a review from causten as a code owner May 6, 2026 02:24

Copilot started reviewing on behalf of justinrosner May 6, 2026 02:27 View session

Copilot AI reviewed May 6, 2026

View reviewed changes

Comment thread mlir/lib/Dialect/Rock/Transforms/ConvToGemm.cpp

Comment thread mlir/test/fusion/pr-e2e/mixr-bwd-data-conv-empty-filter-slice.mlir

justinrosner marked this pull request as draft May 6, 2026 02:39

Attend to AI review comments

1ba6b3e

justinrosner marked this pull request as ready for review May 6, 2026 13:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[AIROCMLIR-825] Correct kernelId calculation for backwards_data_conv ops#2363

[AIROCMLIR-825] Correct kernelId calculation for backwards_data_conv ops#2363
justinrosner wants to merge 2 commits intodevelopfrom
825-port-changes

justinrosner commented May 6, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

justinrosner commented May 6, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

justinrosner commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

justinrosner commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

justinrosner commented May 6, 2026 •

edited

Loading

justinrosner commented May 6, 2026 •

edited

Loading