Add nightly regression test for FP4 scaled-GEMM tuning crash by umangyadav · Pull Request #2368 · ROCm/rocMLIR

umangyadav · 2026-05-07T20:43:54Z

Summary

Fixes https://github.com/ROCm/rocMLIR-internal/issues/2124
Adds a nightly-only regression test that pins the previously-crashing perfConfigs from ROCm/rocMLIR-internal#2124 so any regression in the AMDGPU backend codegen path for FP4 scaled-GEMMs at optLevel=3 is caught in nightly CI on gfx950.

Background

Issue 2124 reported that rocmlir-tuning-driver crashed during compilation of FP4 scaled-GEMMs at optLevel=3 for specific perfConfigs (e.g. v3:128,256,8,32,16,32,1,1,2,1,1) on gfx950. The bug was in the AMDGPU backend (tracked as SWDEV-566229 / SWDEV-564678), and the fix has since been picked up via upstream LLVM merges. Running the original repro on current develop no longer crashes for any of the previously failing perfConfigs.

This PR adds a regression guard so the issue stays fixed.

What it tests

A new directory mlir/test/rocmlir-tuning-driver/nightly/ containing:

- lit.local.cfg: gates the directory on
  - the nightly CI configuration (config.enable_rock_driver_e2e_test), matching mlir/test/hipblaslt/lit.local.cfg, and
  - gfx950 (FP4 scaled-MFMA), matching mlir/test/e2e/*F4.cfg, mlir/test/e2e/*Scaled*.cfg, and mlir/test/fusion/pr-e2e/mixr-gemm-fp4/lit.local.cfg.
- fp4-scaled-gemm-tuning-regression.mlir: exercises three originally-crashing perfConfigs (one from the issue body, two from the comments) across two GEMM shapes via rocmlir-tuning-driver. The driver hard-codes optLevel=3 (mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp:684), so this exercises the same optimization level that originally triggered the crash.
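The two gates described above can be sketched in Python, the language lit.local.cfg files are written in. This is a minimal illustration, not the file from the PR: the helper name is_unsupported and the SimpleNamespace stand-ins for lit's config object are assumptions, and only the attribute names enable_rock_driver_e2e_test and arch come from this thread.

```python
from types import SimpleNamespace


def is_unsupported(config):
    """Return True when the nightly FP4 directory should be skipped.

    Hypothetical helper; in a real lit.local.cfg the result would be
    assigned to config.unsupported.
    """
    # Gate 1: only run in the nightly CI configuration.
    nightly = bool(getattr(config, "enable_rock_driver_e2e_test", False))
    # Gate 2: only run when the target arch string mentions gfx950.
    arch = getattr(config, "arch", None)
    on_gfx950 = bool(arch) and "gfx950" in arch
    return not (nightly and on_gfx950)


# Stand-ins for lit's config object (assumed values, for illustration only).
pr_run = SimpleNamespace(enable_rock_driver_e2e_test=0, arch="gfx950")
nightly_gfx950 = SimpleNamespace(enable_rock_driver_e2e_test=1, arch="gfx950")
nightly_other = SimpleNamespace(enable_rock_driver_e2e_test=1, arch="gfx942")

for name, cfg in [("pr_run", pr_run),
                  ("nightly_gfx950", nightly_gfx950),
                  ("nightly_other", nightly_other)]:
    print(name, "UNSUPPORTED" if is_unsupported(cfg) else "RUN")
```

Both conditions have to hold for the test to run, so a PR-mode run on gfx950 and a nightly run on another architecture are each reported as unsupported.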

If the bug regresses, the test fails loudly: the driver either crashes (non-zero exit) or prints N/A for the perfConfig, and FileCheck rejects both.
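The pass/fail rule can be restated as a small Python predicate. The real test relies on FileCheck, so this is only a sketch of the logic: the function name tuning_run_passed is hypothetical, and the one-line-per-perfConfig output layout with a made-up runtime value is an assumption about the driver's output, not a documented format.

```python
def tuning_run_passed(returncode, stdout, perf_configs):
    """Return True only if every pinned perfConfig produced a result.

    Hypothetical helper mirroring the two failure modes named above.
    """
    # Failure mode 1: the driver crashed (non-zero exit status).
    if returncode != 0:
        return False
    for cfg in perf_configs:
        # Assumed layout: each result line contains the perfConfig string.
        lines = [ln for ln in stdout.splitlines() if cfg in ln]
        # A pinned perfConfig that never shows up is also a failure.
        if not lines:
            return False
        # Failure mode 2: the driver printed N/A instead of a runtime.
        if any("N/A" in ln for ln in lines):
            return False
    return True


cfg = "v3:128,256,8,32,16,32,1,1,2,1,1"
good_output = cfg + "\t0.0123\n"  # illustrative runtime, not a measurement
bad_output = cfg + "\tN/A\n"

print(tuning_run_passed(0, good_output, [cfg]))
print(tuning_run_passed(0, bad_output, [cfg]))
print(tuning_run_passed(1, good_output, [cfg]))
```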

Why nightly only

Requires gfx950 hardware and pulls real AMDGPU codegen at optLevel=3 for non-trivial GEMM shapes (one is 1 x 50272 x 768), which is heavier than typical PR CI tests.
Mirrors the existing nightly-only pattern for hardware-specific regression tests (mlir/test/hipblaslt/).

Coverage check

The regression isn't already covered:

The 3 failing perfConfig strings appear nowhere else in the repo.
No existing test combines rocmlir-tuning-driver with f4E2M1/-scaledGemm. The existing mlir/test/rocmlir-tuning-driver/benchmark-config.mlir uses f16.
Existing FP4 e2e tests (PrGemmScaled.toml, GemmScaled*.toml, GemmDirectToLDSF4.toml, GemmVariantsF4.toml, gemm_scaled_split_k_f4.toml, mixr-gemm-fp4/*) all go through mlir-runner with auto-picked perfConfigs, not rocmlir-tuning-driver with the originally failing configs.
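The first coverage claim (the perfConfig strings appear nowhere else in the repo) is the kind of check a short script can reproduce. The sketch below is one way to do it, not the method used for this PR: find_occurrences and demo are hypothetical names, and the demo runs against a temporary directory rather than a real checkout.

```python
import os
import tempfile


def find_occurrences(root, needles, skip_dirs=(".git",)):
    """Map each needle string to the list of files under root containing it."""
    hits = {needle: [] for needle in needles}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune directories we never want to search (e.g. VCS metadata).
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for fname in filenames:
            path = os.path.join(dirpath, fname)
            try:
                with open(path, "r", encoding="utf-8", errors="ignore") as fh:
                    text = fh.read()
            except OSError:
                continue  # unreadable file: skip it
            for needle in needles:
                if needle in text:
                    hits[needle].append(path)
    return hits


def demo():
    """Run the search on a throwaway tree with one matching file."""
    needle = "v3:128,256,8,32,16,32,1,1,2,1,1"
    with tempfile.TemporaryDirectory() as root:
        test_dir = os.path.join(root, "mlir", "test")
        os.makedirs(test_dir)
        with open(os.path.join(test_dir, "a.mlir"), "w", encoding="utf-8") as fh:
            fh.write("// unrelated f16 test\n")
        with open(os.path.join(test_dir, "b.mlir"), "w", encoding="utf-8") as fh:
            fh.write("// perf_config = " + needle + "\n")
        hits = find_occurrences(root, [needle])
        return [os.path.basename(p) for p in hits[needle]]


print(demo())
```

Run against a checkout, an empty list for a perfConfig string would mean it is not pinned anywhere yet.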

Test plan

In nightly mode (enable_rock_driver_e2e_test=1) on a gfx950 host: the test is discovered, runs, and passes (~3s on 8x MI355X; all three previously-crashing perfConfigs produce numeric runtimes).
In PR mode (enable_rock_driver_e2e_test=0): the test is correctly reported UNSUPPORTED and skipped.
CI nightly run on the next scheduled trigger.

Made with Cursor

Pins the previously-crashing perfConfigs from ROCm/rocMLIR-internal#2124 so that any regression in the AMDGPU backend codegen path for FP4 scaled-GEMMs at optLevel=3 is caught by nightly CI on gfx950.

The new directory mlir/test/rocmlir-tuning-driver/nightly/ is gated on:
- the nightly CI configuration (config.enable_rock_driver_e2e_test), matching the convention in mlir/test/hipblaslt/lit.local.cfg, and
- gfx950 (FP4 scaled-MFMA), matching the convention used by every other FP4 test in the repo (mlir/test/e2e/*F4.cfg, *Scaled*.cfg, mlir/test/fusion/pr-e2e/mixr-gemm-fp4/lit.local.cfg).

The test runs three originally-crashing perfConfigs (one from the issue body, two from the comments) across two GEMM shapes via rocmlir-tuning-driver, which keeps optLevel=3, the same setting that triggered the crash.

Co-authored-by: Cursor <cursoragent@cursor.com>

justinrosner · 2026-05-07T20:56:00Z

Q: Why not move this to GemmVariantsF4.toml and just add a new perf_config axis to that existing test?

I could do that, but this test only checks for compilation success, not E2E correctness. Let me know your thoughts. I can move it to GemmVariantsF4 as well, but in the future some architectures may not be compatible with these particular perfConfigs.

GemmVariantsF4 looks like it's already confined to just running on gfx950:

    if not (hasattr(config, 'arch') and config.arch and "gfx950" in config.arch):
        config.unsupported = True

If we can get a check that verifies compilation as well as E2E functionality, then I'm in favor of that.

@justinrosner

Per @justinrosner's review on #2368, replace the standalone rocmlir-tuning-driver-based regression test with a focused E2E toml (GemmScaledF4Regression) so the previously-crashing perfConfigs are exercised end-to-end (compile + run) on gfx950 instead of compile-only.

Each [[suite.test]] pins one (shape, perf_config) pair from the original failing scenarios in ROCm/rocMLIR-internal#2124, giving exactly one generated test per scenario (no combinatorial explosion) while staying faithful to the scaled-GEMM context where the AMDGPU codegen crash originally occurred.

Co-authored-by: Cursor <cursoragent@cursor.com>
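The "no combinatorial explosion" point can be illustrated by counting generated tests. The sketch below uses placeholder labels: the shape and perfConfig names, the choice of three pinned pairs, and which perfConfig goes with which shape are all assumptions for illustration, since the thread does not list the actual pairs.

```python
from itertools import product

# Placeholders for the two GEMM shapes and three perfConfigs from the issue.
shapes = ["shape_a", "shape_b"]
perf_configs = ["cfg_1", "cfg_2", "cfg_3"]

# Adding perf_config as an axis would test every shape with every perfConfig.
axis_tests = list(product(shapes, perf_configs))

# Pinning explicit pairs generates one test per originally failing scenario.
# This particular pairing is hypothetical.
pinned_tests = [
    ("shape_a", "cfg_1"),
    ("shape_a", "cfg_2"),
    ("shape_b", "cfg_3"),
]

print(len(axis_tests), "tests from the axis cross product")
print(len(pinned_tests), "tests from pinned pairs")
```

Under these assumptions the axis approach produces six tests, half of which pair a perfConfig with a shape it was never reported against, while the pinned list stays at one test per scenario.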

umangyadav requested a review from causten as a code owner May 7, 2026 20:43

umangyadav requested review from dhernandez0, justinrosner and stefankoncarevic May 7, 2026 20:47

umangyadav self-assigned this May 7, 2026

Merge branch 'develop' into add-fp4-scaled-gemm-tuning-regression-test

fc0f2d0

justinrosner reviewed May 7, 2026


umangyadav and others added 2 commits May 7, 2026 22:39

Merge branch 'develop' into add-fp4-scaled-gemm-tuning-regression-test

d0bde5f

umangyadav wants to merge 4 commits into develop from add-fp4-scaled-gemm-tuning-regression-test
