[AIROCMLIR-825] Correct kernelId calculation for backwards_data_conv ops#2363
Open
justinrosner wants to merge 2 commits intodevelopfrom
Open
[AIROCMLIR-825] Correct kernelId calculation for backwards_data_conv ops#2363justinrosner wants to merge 2 commits intodevelopfrom
justinrosner wants to merge 2 commits intodevelopfrom
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
Fixes backward-data convolution lowering around empty filter slices and strided transpose-conv cropping, touching both Rock lowering and the custom TOSA decomposition path.
Changes:
- Filter invalid backward-data kernel IDs so empty stride phases are skipped, and use the actual filtered IDs when lowering non-V4R1 bwd-data convs.
- Replace the old low-side crop heuristic in
TransposeConvStridedConverterwith an explicit offset-based calculation, including a zero-overlap fallback. - Add Rock, TOSA-decompose, and e2e regression tests for the empty-filter-slice and CPU offset cases.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
mlir/test/fusion/pr-e2e/mixr-bwd-data-conv-empty-filter-slice.mlir |
Adds end-to-end regression coverage for the reported backward-data conv case. |
mlir/test/Dialect/Rock/conv_to_gemm_bwd_data_empty_filter_slice.mlir |
Adds pass-level checks that empty filter slices no longer emit phantom rock.gemm ops. |
mlir/test/Conversion/RocmlirCustomTosaDecompose/rocmlir-custom-tosa-decompose.mlir |
Adds regression checks for the strided transpose-conv offset and zero-overlap CPU lowering cases. |
mlir/lib/Dialect/Rock/utility/loweringUtils.cpp |
Filters out kernel IDs whose filter slice is empty before returning the per-phase ID list. |
mlir/lib/Dialect/Rock/Transforms/ConvToGemm.cpp |
Uses filtered kernel IDs in non-V4R1 lowering and tightens the slice calculation in backwardDataV4R1. |
mlir/lib/Conversion/RocmlirCustomTosaDecompose/RocmlirCustomTosaDecompose.cpp |
Reworks low-side offset/crop handling and adds safe zero-result materialization for empty overlap. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Contributor
Author
|
I did the AI review for the rocmlirTriton PR, but based on the Copilot comments it seems like there are some slight changes in rocMLIR that will require an additional review here (because of the additional v4r1 behavior here) UPDATE: Ready for review |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
This is a port of https://github.com/ROCm/rocmlirTriton/pull/188 from
rocmlirTriton. Because rocMLIR still has the V4R1 flag (rocmlirTriton has removed it) and a separate copy of the bwd-data slice-extent math, the port carries four rocMLIR-only changes on top of that PR':
commonConvRewrite(mlir/lib/Dialect/Rock/Transforms/ConvToGemm.cpp) iterates the filtered ids. The pre-portusesV4R1=falsearm walkednumKernels.size()and passed the loop indexias the kernel id, so e.g. ids{0, 3}were lowered as{0, 1}and the second GEMM was an invalid kernel. The port computeskernelIdsonce and iteratesfor (int64_t kernelId : kernelIds). rocmlirTriton does not need this because its bwd-data lowering already iterates filtered ids through a differentbackwardDataGemmForKernelIdflow.commonConvRewritevalidates the user-supplied id on the V4R1 path. WhenusesV4R1=truewe do not iteratebackwardDataKernelIds, we dispatch a single GEMM for the user-supplied--kernel_id. Without validation, an out-of-range id would silently flow intobackwardDataV4R1where the unsignedllvm::divideCeilwould wrap a negative numerator into a multi-exabyte slice extent. The port intersects the supplied id withbackwardDataKernelIds(...)and emits an error when it is out of range. rocmlirTriton has no V4R1 path, so this is rocMLIR-specific.backwardDataV4R1slice-extent math switched tollvm::divideCeil+ precondition assert.backwardDataV4R1independently re-derivesiDotSlicefor its own use. The previous code used signedmath_util::integer_divide_ceil, which papered over the missing input validation by harmlessly producing0for negative numerators. Switching tollvm::divideCeilmatches the slice math inbackwardDataKernelIdsand makes the precondition explicit (assert(iTilda[i] < convDims.fil[i] && "kernelId not pre-filtered by backwardDataKernelIds")). Combined with 2. this guarantees that a malformed id is a hard, diagnosable failure in both debug and release builds rather than a silent OOB.mlir/test/Dialect/Rock/conv_to_gemm_bwd_data_empty_filter_slice.mlir. Pins the lowering behavior for all three of the deltas above with three sub-checks against--mlir-print-ir-after=rock-conv-to-gemm.The remaining changes; the
TransposeConvStridedConverterlow-side crop fix, therocmlir-custom-tosa-decompose.mlirlit additions, andthemixr-bwd-data-conv-empty-filter-slice.mlirE2E coverage are direct ports of the rocmlirTriton change set with no semantic divergence.This PR fixes two small bugs that were found when running the parameterSweeps for rocmlirTriton. Fixes: https://amd-hub.atlassian.net/browse/AIROCMLIR-825
Technical Details
loweringUtils:backwardsDataKernelsIdsenumerates the per-stride phase GEMMs that make up a single strided backwards data conv op, and skips any phase whose filter slice is empty.llvm::divideCeil(filterDims[i] - iTilda[i], filTilda[i]), and whenfilTilda[i]exceedsfilterDims[i],iTilda[i]can grow larger thanfilterDims[i]making the numerator negative. When this would happen we would end up with a large positive "slice size" and cause an invalid kernel ID to be emitted as if it had real work to doTilda[i] >= filterDims[i]explicitly, setsgemmKproduct = 0, and breaks out of the loop, so empty phases are correctly excluded.While working on these changes I uncovered a secondary issue in the CPU lowering:
TransposeConvStridedConverterlowers a strided bwd-data conv into a stride-1 conv by factoring stride phases out of the filter channels, running the stride-1 conv on a padded gradient-output, and then re-interleaving the phases into an expanded result that has to be cropped/padded into the final shapeA new e2e test was added that exercises both fixes. The first RUN line drives the original MIGraphX bwd-data conv that initially surfaced the bug through the migraphx -> TOSA -> linalg CPU lowering, which exercises the
TransposeConvStridedConverteroffset fix. However, that CPU lowering does not currently support group > 1, so it can only verify a group=1 reduction of the original problem. To make sure the actual failing shape (groupsize=4) is also covered, a second RUN line invokes rocmlir-gen --operation conv_bwd_data directly with groupsize=4, which drives the rock kernel-ID path and exercises the backwardDataKernelIds fix end-to-end on the GPU. I've run all the other existing bwd_conv tests and they are passing with no problems.Test Plan
Run Nightly and PR CI
Test Result
Submission Checklist