
Add nightly regression test for FP4 scaled-GEMM tuning crash #2368

Open
umangyadav wants to merge 4 commits into develop from add-fp4-scaled-gemm-tuning-regression-test

Conversation

@umangyadav
Member

@umangyadav umangyadav commented May 7, 2026

Summary

Fixes https://github.com/ROCm/rocMLIR-internal/issues/2124
Adds a nightly-only regression test that pins the previously-crashing perfConfigs from ROCm/rocMLIR-internal#2124 so any regression in the AMDGPU backend codegen path for FP4 scaled-GEMMs at optLevel=3 is caught in nightly CI on gfx950.

Background

Issue 2124 reported that rocmlir-tuning-driver crashed during compilation of FP4 scaled-GEMMs at optLevel=3 for specific perfConfigs (e.g. v3:128,256,8,32,16,32,1,1,2,1,1) on gfx950. The bug was in the AMDGPU backend (tracked as SWDEV-566229 / SWDEV-564678) and has since been picked up via upstream LLVM merges; running the original repro on current develop no longer crashes for any of the previously failing perfConfigs.

This PR adds a regression guard so the issue stays fixed.

What it tests

A new directory mlir/test/rocmlir-tuning-driver/nightly/ containing:

  • lit.local.cfg - gates the directory on
    • the nightly CI configuration (config.enable_rock_driver_e2e_test), matching mlir/test/hipblaslt/lit.local.cfg, and
    • gfx950 (FP4 scaled-MFMA), matching mlir/test/e2e/*F4.cfg, mlir/test/e2e/*Scaled*.cfg, and mlir/test/fusion/pr-e2e/mixr-gemm-fp4/lit.local.cfg.
  • fp4-scaled-gemm-tuning-regression.mlir - exercises three originally-crashing perfConfigs (one from the issue body, two from the comments) across two GEMM shapes via rocmlir-tuning-driver. The driver hard-codes optLevel=3 (mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp:684), so this exercises the same optimization level that originally triggered the crash.
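For reference, the gating described above could look roughly like the following lit.local.cfg sketch. The gfx950 check mirrors the snippet quoted from GemmVariantsF4 later in this thread; the nightly check on `config.enable_rock_driver_e2e_test` follows the hipblaslt convention. Exact attribute spellings are assumptions, not verified contents of the new file:

```python
# Sketch only: gates this directory on nightly CI and on gfx950.
if not getattr(config, "enable_rock_driver_e2e_test", False):
    config.unsupported = True
if not (hasattr(config, "arch") and config.arch and "gfx950" in config.arch):
    config.unsupported = True
```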

If the regression returns, the test will fail loudly: the driver will either crash (non-zero exit) or print N/A for the perfConfig, both of which FileCheck rejects.
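Concretely, the failure mode could be sketched in the .mlir test like this (the RUN line is elided and hypothetical; the point is that lit rejects a non-zero exit while FileCheck rejects the N/A fallback):

```
// RUN: rocmlir-tuning-driver ... %s | FileCheck %s
// A backend crash exits non-zero, which lit reports as a failure; a
// compile fallback prints "N/A" for the perfConfig, which the check
// below rejects.
// CHECK-NOT: N/A
```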

Why nightly only

  • Requires gfx950 hardware and pulls real AMDGPU codegen at optLevel=3 for non-trivial GEMM shapes (one is 1 x 50272 x 768), which is heavier than typical PR CI tests.
  • Mirrors the existing nightly-only pattern for hardware-specific regression tests (mlir/test/hipblaslt/).

Coverage check

The regression isn't already covered:

  • The 3 failing perfConfig strings appear nowhere else in the repo.
  • No existing test combines rocmlir-tuning-driver with f4E2M1/-scaledGemm. The existing mlir/test/rocmlir-tuning-driver/benchmark-config.mlir uses f16.
  • Existing FP4 e2e tests (PrGemmScaled.toml, GemmScaled*.toml, GemmDirectToLDSF4.toml, GemmVariantsF4.toml, gemm_scaled_split_k_f4.toml, mixr-gemm-fp4/*) all go through mlir-runner with auto-picked perfConfigs, not rocmlir-tuning-driver with the originally failing configs.

Test plan

  • In nightly mode (enable_rock_driver_e2e_test=1) on a gfx950 host: the test is discovered, runs, and passes (~3s on 8x MI355X; all three previously-crashing perfConfigs produce numeric runtimes).
  • In PR mode (enable_rock_driver_e2e_test=0): the test is correctly reported UNSUPPORTED and skipped.
  • CI nightly run on the next scheduled trigger.

Made with Cursor

Pins the previously-crashing perfConfigs from ROCm/rocMLIR-internal#2124 so
that any regression in the AMDGPU backend codegen path for FP4 scaled-GEMMs
at optLevel=3 is caught by nightly CI on gfx950.

The new directory mlir/test/rocmlir-tuning-driver/nightly/ is gated on:
- the nightly CI configuration (config.enable_rock_driver_e2e_test),
  matching the convention in mlir/test/hipblaslt/lit.local.cfg, and
- gfx950 (FP4 scaled-MFMA), matching the convention used by every other
  FP4 test in the repo (mlir/test/e2e/*F4.cfg, *Scaled*.cfg,
  mlir/test/fusion/pr-e2e/mixr-gemm-fp4/lit.local.cfg).

The test runs three originally-crashing perfConfigs (one from the issue
body, two from the comments) across two GEMM shapes via
rocmlir-tuning-driver, which keeps optLevel=3 - the same setting that
triggered the crash.

Co-authored-by: Cursor <cursoragent@cursor.com>
Contributor


Q: Why not move this to GemmVariantsF4.toml and just add a new perf_config axis to that existing test?

Member Author


I could do that, but this test only checks for compilation success, not E2E correctness. Let me know your thoughts. I can move it to GemmVariantsF4 as well, but in the future some arch may not be compatible with these particular perfConfigs.

Contributor


GemmVariantsF4 looks like it's already confined to just running on gfx950:

if not (hasattr(config, 'arch') and config.arch and "gfx950" in config.arch):
    config.unsupported = True

If we can get a check that verifies compilation as well as E2E functionality, then I'm in favor of that.

umangyadav and others added 2 commits May 7, 2026 22:39
Per @justinrosner's review on #2368, replace the standalone
rocmlir-tuning-driver-based regression test with a focused E2E toml
(GemmScaledF4Regression) so the previously-crashing perfConfigs are
exercised end-to-end (compile + run) on gfx950 instead of compile-only.

Each [[suite.test]] pins one (shape, perf_config) pair from the original
failing scenarios in ROCm/rocMLIR-internal#2124, giving exactly one
generated test per scenario (no combinatorial explosion) while staying
faithful to the scaled-GEMM context where the AMDGPU codegen crash
originally occurred.
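
As a sketch, one pinned scenario in the new toml might look like this. Key names beyond [[suite.test]] are illustrative, not verified against the suite schema; the perfConfig string and shape are taken from the issue, though their pairing here is illustrative:

```toml
# One [[suite.test]] entry per originally-failing (shape, perf_config) pair.
[[suite.test]]
# 1 x 50272 x 768 scaled-GEMM; perfConfig from ROCm/rocMLIR-internal#2124.
perf_config = "v3:128,256,8,32,16,32,1,1,2,1,1"
```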

Co-authored-by: Cursor <cursoragent@cursor.com>
