Add nightly regression test for FP4 scaled-GEMM tuning crash #2368
Open
umangyadav wants to merge 4 commits into develop
Pins the previously-crashing perfConfigs from ROCm/rocMLIR-internal#2124 so that any regression in the AMDGPU backend codegen path for FP4 scaled-GEMMs at optLevel=3 is caught by nightly CI on gfx950.

The new directory mlir/test/rocmlir-tuning-driver/nightly/ is gated on:
- the nightly CI configuration (config.enable_rock_driver_e2e_test), matching the convention in mlir/test/hipblaslt/lit.local.cfg, and
- gfx950 (FP4 scaled-MFMA), matching the convention used by every other FP4 test in the repo (mlir/test/e2e/*F4.cfg, *Scaled*.cfg, mlir/test/fusion/pr-e2e/mixr-gemm-fp4/lit.local.cfg).

The test runs three originally-crashing perfConfigs (one from the issue body, two from the comments) across two GEMM shapes via rocmlir-tuning-driver, which keeps optLevel=3, the same setting that triggered the crash.

Co-authored-by: Cursor <cursoragent@cursor.com>
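A minimal sketch of the two-condition gate described in this commit message, written as a plain Python function so it can be run outside lit. In a real lit.local.cfg the `config` object is injected by lit; here a `SimpleNamespace` stands in for it, and the helper name `is_unsupported` is hypothetical. The attribute names `enable_rock_driver_e2e_test` and `arch` are taken from this thread, and the flag is assumed to be a truthy/falsy value; the actual file in the PR may be written differently.

```python
from types import SimpleNamespace


def is_unsupported(config):
    """Return True when the nightly FP4 directory should be skipped."""
    # Gate 1: only run under the nightly CI configuration.
    nightly = bool(getattr(config, "enable_rock_driver_e2e_test", False))
    # Gate 2: only run on gfx950, the arch named in this thread for FP4 scaled-MFMA.
    arch = getattr(config, "arch", None) or ""
    on_gfx950 = "gfx950" in arch
    return not (nightly and on_gfx950)


if __name__ == "__main__":
    # Stand-in configs; the non-gfx950 arch string is only a placeholder.
    cases = {
        "nightly on gfx950": SimpleNamespace(enable_rock_driver_e2e_test=1, arch="gfx950"),
        "PR CI on gfx950": SimpleNamespace(enable_rock_driver_e2e_test=0, arch="gfx950"),
        "nightly on other arch": SimpleNamespace(enable_rock_driver_e2e_test=1, arch="other-arch"),
    }
    for name, cfg in cases.items():
        print(name, "->", "UNSUPPORTED" if is_unsupported(cfg) else "RUN")
```

In a lit.local.cfg, the equivalent would be to set `config.unsupported = True` whenever this condition holds.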
justinrosner
reviewed
May 7, 2026
Contributor
Q: Why not move this to GemmVariantsF4.toml and just add a new perf_config axis to that existing test?
Member
Author
I could do that, but this test only checks for compilation success, not E2E correctness. Let me know your thoughts. I can move it to GemmVariantsF4 as well, but in the future some architectures may not be compatible with these particular perf configs.
Contributor
GemmVariantsF4 looks like it's already confined to just running on gfx950:
if not (hasattr(config, 'arch') and config.arch and "gfx950" in config.arch):
    config.unsupported = True

If we can get a check that verifies compilation as well as E2E functionality, then I'm in favor of that.
Per @justinrosner's review on #2368, replace the standalone rocmlir-tuning-driver-based regression test with a focused E2E toml (GemmScaledF4Regression) so the previously-crashing perfConfigs are exercised end-to-end (compile + run) on gfx950 instead of compile-only.

Each [[suite.test]] pins one (shape, perf_config) pair from the original failing scenarios in ROCm/rocMLIR-internal#2124, giving exactly one generated test per scenario (no combinatorial explosion) while staying faithful to the scaled-GEMM context where the AMDGPU codegen crash originally occurred.

Co-authored-by: Cursor <cursoragent@cursor.com>
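To illustrate why pinning one (shape, perf_config) pair per [[suite.test]] avoids a combinatorial blow-up, here is a small sketch contrasting pinned pairs with a full cross product of two axes. All shape and perfConfig names below are placeholders, and which config is paired with which shape is assumed purely for illustration; the real pairs are the ones in GemmScaledF4Regression.

```python
from itertools import product

# Placeholder names: the real shapes and perfConfigs live in the toml.
shapes = ["shape_A", "shape_B"]
perf_configs = ["perf_config_1", "perf_config_2", "perf_config_3"]

# Axis-style suite: every shape is crossed with every perfConfig.
cross_product = list(product(shapes, perf_configs))

# Pinned-style suite: one explicit pair per originally failing scenario.
# The pairing below is hypothetical.
pinned = [
    ("shape_A", "perf_config_1"),
    ("shape_A", "perf_config_2"),
    ("shape_B", "perf_config_3"),
]

print("cross product:", len(cross_product), "generated tests")
print("pinned pairs: ", len(pinned), "generated tests")
```

The pinned list grows linearly with the number of failing scenarios, whereas adding an axis multiplies the test count by the size of that axis.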
Summary
Fixes https://github.com/ROCm/rocMLIR-internal/issues/2124
Adds a nightly-only regression test that pins the previously-crashing perfConfigs from ROCm/rocMLIR-internal#2124 so any regression in the AMDGPU backend codegen path for FP4 scaled-GEMMs at optLevel=3 is caught in nightly CI on gfx950.

Background
Issue 2124 reported that rocmlir-tuning-driver crashed during compilation of FP4 scaled-GEMMs at optLevel=3 for specific perfConfigs (e.g. v3:128,256,8,32,16,32,1,1,2,1,1) on gfx950. The bug was in the AMDGPU backend (tracked as SWDEV-566229 / SWDEV-564678), and the fix has since been picked up via upstream LLVM merges; running the original repro on current develop no longer crashes for any of the previously failing perfConfigs.

This PR adds a regression guard so the issue stays fixed.
What it tests
A new directory mlir/test/rocmlir-tuning-driver/nightly/ containing:
- lit.local.cfg - gates the directory on:
  - the nightly CI configuration (config.enable_rock_driver_e2e_test), matching mlir/test/hipblaslt/lit.local.cfg, and
  - gfx950 (FP4 scaled-MFMA), matching mlir/test/e2e/*F4.cfg, mlir/test/e2e/*Scaled*.cfg, and mlir/test/fusion/pr-e2e/mixr-gemm-fp4/lit.local.cfg.
- fp4-scaled-gemm-tuning-regression.mlir - exercises three originally-crashing perfConfigs (one from the issue body, two from the comments) across two GEMM shapes via rocmlir-tuning-driver. The driver hard-codes optLevel=3 (mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp:684), so this exercises the same optimization level that originally triggered the crash.

If a regression returns, the test will fail loudly: the driver would either crash (non-zero exit) or print N/A for the perfConfig, both of which FileCheck rejects.

Why nightly only
- [...] gfx950 hardware and pulls real AMDGPU codegen at optLevel=3 for non-trivial GEMM shapes (one is 1 x 50272 x 768), which is heavier than typical PR CI tests.
- [...] mlir/test/hipblaslt/).

Coverage check
The regression isn't already covered:
- [...] rocmlir-tuning-driver with f4E2M1/-scaledGemm. The existing mlir/test/rocmlir-tuning-driver/benchmark-config.mlir uses f16.
- [...] PrGemmScaled.toml, GemmScaled*.toml, GemmDirectToLDSF4.toml, GemmVariantsF4.toml, gemm_scaled_split_k_f4.toml, mixr-gemm-fp4/*) all go through mlir-runner with auto-picked perfConfigs, not rocmlir-tuning-driver with the originally failing configs.

Test plan
- [...] enable_rock_driver_e2e_test=1) on a gfx950 host: the test is discovered, runs, and passes (~3s on 8x MI355X; all three previously-crashing perfConfigs produce numeric runtimes).
- [...] enable_rock_driver_e2e_test=0): the test is correctly reported UNSUPPORTED and skipped.

Made with Cursor
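The pass condition in the test plan (every pinned perfConfig yields a numeric runtime, with no crash and no N/A) can be sketched as a small checker. This is not the FileCheck pattern from the PR. The function name and the assumed output layout (one line per perfConfig, with the runtime or N/A as the last whitespace-separated field) are illustrative assumptions about the driver's output, and the runtime value in the usage lines is made up.

```python
def regression_detected(exit_code, stdout, perf_configs):
    """Return True if the driver crashed or a pinned perfConfig lacks a numeric runtime."""
    # A crash shows up as a non-zero exit code.
    if exit_code != 0:
        return True

    # Assumed layout: "<perfConfig> ... <runtime or N/A>" on each result line.
    results = {}
    for line in stdout.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            results[fields[0]] = fields[-1]

    for cfg in perf_configs:
        value = results.get(cfg)
        if value is None:
            return True  # the perfConfig was never reported
        try:
            float(value)
        except ValueError:
            return True  # e.g. "N/A" instead of a runtime
    return False


if __name__ == "__main__":
    cfg = "v3:128,256,8,32,16,32,1,1,2,1,1"
    print(regression_detected(0, f"{cfg}\t0.042\n", [cfg]))  # numeric runtime
    print(regression_detected(0, f"{cfg}\tN/A\n", [cfg]))    # N/A reported
    print(regression_detected(139, "", [cfg]))               # driver crashed
```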