Lazy load HIP library to avoid LLVM versions incompatibilities by umangyadav · Pull Request #2357 · ROCm/rocMLIR

umangyadav · 2026-04-24T12:33:46Z

Motivation

Running any of rocMLIR's compiler tools (rocmlir-driver, rocmlir-opt, xmir-runner, rocmlir-tuning-driver, MLIR's mlir-runner JIT) against a system-installed with TheRock 7.12 crashes at process startup with one of:

LLVM ERROR: Option 'default' already exists!

or, in release builds:

SmallPtrSet.h:301: Assertion 'Bucket < End' failed.

before main() is even reached.

The root cause is a static-initializer collision between two LLVMs in the same address space:

rocMLIR's tools embed the in-tree libLLVMSupport.so.<MAJOR>git, etc.
Linking libamdhip64 transitively pulls in libamd_comgr, which in turn pulls in ROCm's monolithic libLLVM.so.<MAJOR>.

Both LLVMs run their cl::opt static initializers; the dynamic linker unifies the cl::opt global registry across the split-vs-monolithic libraries; the second registration trips the duplicate-option assertion. The same problem also manifests in mlir-runner / xmir-runner JIT paths that dlopen mlir_rocm_runtime.so (which itself linked libamdhip64).

This PR removes every link-time dependency on libamdhip64 / libhsa-runtime64 / libhiprtc from rocMLIR-side libraries and from upstream MLIR's mlir_rocm_runtime and MLIRRocmExecutionEngineUtils, replacing them with a delay-load helper that opens the ROCm runtime in a private link-map namespace via dlmopen(LM_ID_NEWLM, ...) on glibc (and the equivalent on other platforms). The two LLVMs no longer share a symbol scope, the cl::opt registry stays single-instance, and the tools start cleanly against any ROCm install on LD_LIBRARY_PATH.

The end-to-end command that reproduced the original crash (./bin/rocmlir-driver --kernel-pipeline=migraphx,highlevel --arch gfx942 ../test.mlir) now runs without aborting.

Technical Details

The branch is 33 commits, 15 of them tagged [EXTERNAL] for upstream LLVM submission. Net diff: +2618 / −561 lines across 46 files.

Architecture

A new public API in mlir::rocm_loader::loadRocmLibrary(...) (header mlir/ExecutionEngine/RocmRuntimeLoader.h, implementation in lib/ExecutionEngine/RocmRuntimeLoader.cpp, built as a small MLIRRocmRuntimeLoader library) opens HIP / HIPRTC / HSA without ever bringing them into the host's symbol scope:

glibc: dlmopen(LM_ID_NEWLM, ..., RTLD_LAZY) — fresh link-map namespace; ROCm's libLLVM cannot interpose ours.
Other POSIX (musl, ...): dlopen(..., RTLD_LAZY | RTLD_LOCAL) with a one-time LLVM_DEBUG advisory that isolation is incomplete and the host must hide its own LLVM exports.
Windows: LoadLibraryW (DLLs have private scopes per-DLL natively); SONAME is converted UTF-8 → UTF-16 to match llvm/lib/Support/Windows/DynamicLibrary.inc.

Cross-library coordination

KFD (the AMD kernel-fusion-driver) only allows one user-space HSA session per process. Multiple consumers (AmdArchDb for rock.arch="native", mlir_rocm_runtime's JIT wrappers, rocmlir-tuning-driver's benchmark loop) must therefore share one HIP handle. RocmSystemDetect.cpp is the canonical owner and exports extern "C" LLVM_ALWAYS_EXPORT void *mlirRocmSystemDetectGetHipHandle(). Other consumers' loadRocmLibrary(Hip, Auto) looks this symbol up via RTLD_DEFAULT and reuses the returned handle. HIPRTC and HSA load into HIP's link-map namespace via the loader's relatedHandle parameter, so all three share the same KFD session.

Defense-in-depth

libmlir_rocm_runtime.so gets a version script (mlir_rocm_runtime.map) that exports only the mgpu* C ABI entry points. nm -D confirms 25 mgpu* symbols, nothing else.
All rocMLIR-produced shared libraries get -Wl,-Bsymbolic-functions so intra-library cl::* calls bind to in-library definitions and cannot be interposed by a later-loaded libLLVM.
All rocMLIR tools (rocmlir-driver, rocmlir-opt, ...) get LINKER:--exclude-libs,ALL so they do not re-export LLVM statics from libLLVMSupport.a.

Version compatibility

The loader is intentionally version-agnostic: it tries the unversioned SONAME first (libamdhip64.so / amdhip64.dll, the standard find_package(hip) resolution), then iterates libamdhip64.so.<MAJOR> for descending MAJOR up to a generous kMaxProbedRocmMajor = 99. There is no compile-time floor or ceiling on the ROCm version this code supports — ROCm 4.x through any future release works with no source changes.

Other improvements in this PR

While auditing the affected paths, the branch also fixes a handful of pre-existing latent bugs and missing test coverage:

AmdArchDb::nativeArchInfo cache key was gcnArchName, which would silently return device 0's CU/XCC/per-CU-shared-mem data for every later device on a same-arch multi-GPU system. Changed to deviceId (llvm::DenseMap<unsigned, AmdArchInfo>).
parseArchString / ParamLookupTable::normalizeArch silently treated native:foo, native:1abc, native: as native:0. Now report_fatal_errors with a precise diagnostic.
rocmlir-tuning-driver's HipDelayLoad previously returned a partially-initialized symbol table on missing libamdhip64 / libhiprtc; the macros in HipDelayLoadMacros.h then dispatched through null function pointers and segfaulted. Tightened the contract to "fully loaded or process aborts" via [[noreturn]] abortMissing* helpers.
MLIRRocmExecutionEngineUtils CMake gating consumed ${hip_INCLUDE_DIR} outside the if(MLIR_ENABLE_ROCM_RUNNER) gate that ran find_package(hip). In BUILD_FAT_LIBROCKCOMPILER=ON builds (which force the runner off) this would silently fall through to ambient state. Moved the entire target definition inside the gate.

Notable touched files

Area	Files
Loader (upstream LLVM)	`external/llvm-project/mlir/include/mlir/ExecutionEngine/RocmRuntimeLoader.h`, `lib/ExecutionEngine/RocmRuntimeLoader.cpp`, `RocmSystemDetect.{h,cpp}`, `RocmRuntimeWrappers.cpp`, `mlir_rocm_runtime.map`, `lib/ExecutionEngine/CMakeLists.txt`
rocMLIR consumers	`mlir/lib/Dialect/Rock/IR/AmdArchDb.cpp`, `mlir/lib/Dialect/Rock/IR/CMakeLists.txt`, `mlir/lib/Dialect/Rock/Tuning/ParamLookupTable.cpp`, `mlir/tools/rocmlir-tuning-driver/{HipDelayLoad.{h,cpp},HipDelayLoadMacros.h,CMakeLists.txt,CacheFlush.cpp,rocmlir-tuning-driver.cpp}`, `mlir/tools/rocmlir-lib/CMakeLists.txt`
Build infrastructure	`CMakeLists.txt` (root), `cmake/llvm-project.cmake`, `mlir/utils/jenkins/static-checks/get_fat_library_deps_list.pl`
Tests	`mlir/unittests/Dialect/Rock/{AmdArchDbTests.cpp,RocmRuntimeLoaderTests.cpp,CMakeLists.txt}`, `mlir/test/Dialect/Rock/{native_arch.mlir,native_arch_invalid.mlir,Loader/*}`, `mlir/test/{lit.cfg.py,lit.site.cfg.py.in}`

Test Plan

New regression tests (added in this PR)

Test	Type	Pins
`Dialect/Rock/Loader/no_rocm_neededs.test`	lit / shell	No shipped binary or ROCm-aware shared library has `libamdhip64`, `libhiprtc`, `libamd_comgr`, or ROCm's monolithic `libLLVM.so` in its `DT_NEEDED` set. Audits 8 artefacts (default build) / 4 (fat-lib).
`Dialect/Rock/Loader/dynsym_only_mgpu.test`	lit / shell	`libmlir_rocm_runtime.so` exports nothing but `mgpu*` entry points (the version-script contract).
`Dialect/Rock/native_arch.mlir` (extended)	lit	`rock.arch = "native"` and `"native:0"` both flow through the rock pipeline and produce the same lowered IR as a kernel pinned to the concrete arch.
`Dialect/Rock/native_arch_invalid.mlir` (new)	lit	`rock.arch = "native:foo"` aborts with the precise diagnostic via `not --crash`.
`RocmRuntimeLoaderTests` (4 cases)	gtest	`Auto` policy reuses `RocmSystemDetect`'s HIP handle (proves the `RTLD_DEFAULT` cross-library coordination); `Owned` policy returns a usable handle; `resolveRocmSymbol` is null-safe; loader resolves HIP without compile-time version knowledge.
`NativeArchParseTest` (2 cases)	gtest / death tests	Four malformed `native:*` inputs each abort with the precise diagnostic; bare `native` is well-formed.
`NativeArchCacheTest.SameArchMultiGpuDistinct`	gtest	Per-device cache contract on multi-GPU systems (skipped when fewer than 2 same-arch GPUs visible).

Manual / verification steps

The original failing command runs end-to-end: bin/rocmlir-driver --kernel-pipeline=migraphx,highlevel --arch gfx942 ../test.mlir.
bin/rocmlir-gen -p -t f32 --arch gfx942 -mfma=on | bin/rocmlir-driver --kernel-pipeline=full --arch gfx942 produces a valid gpu.binary on both build configurations.
nm -D libmlir_rocm_runtime.so | grep -v '^mgpu' returns nothing.
readelf -d on every shipped binary / shared library confirms no libamdhip64, libhiprtc, libamd_comgr, or ROCm's libLLVM.so in NEEDED.
Synthesized "future LLVM 99 + ROCm 99" tree (symlinks pointing the build's artefacts at fake-version filenames) confirms the version-agnostic globs and SONAME fallback still work.
Failure-injection: a non-ELF text file passed as a synthetic artefact correctly trips the readelf -d ... failed / nm -D ... failed diagnostics.

Build configurations exercised

Default shared build: cmake -G Ninja .. -DCMAKE_BUILD_TYPE=RelWithDebInfo (default BUILD_SHARED_LIBS=ON, MLIR_ENABLE_ROCM_RUNNER=ON).
Fat-library build: cmake -G Ninja .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DBUILD_FAT_LIBROCKCOMPILER=On (forces BUILD_SHARED_LIBS=OFF, MLIR_ENABLE_ROCM_RUNNER=0). Used by the AMDMIGraphX integration.

Test Result

Build	Suite	Result
Default (shared)	`ninja check-rocmlir` (1559 tests)	1347 passed, 0 failed (207 unsupported, 5 expectedly failed)
Default (shared)	`MLIRRockUnitTests` (118 cases)	118 / 118 pass
Fat-lib	`ninja check-rocmlir` (933 tests)	478 passed, 0 failed (452 unsupported, 3 expectedly failed)
Fat-lib	`MLIRRockUnitTests` (114 cases)	114 / 114 pass
`nm`/`readelf` cleanliness audit (default)	8 artefacts	All clean — no forbidden NEEDED, only `mgpu*` exported
`nm`/`readelf` cleanliness audit (fat-lib)	4 artefacts	All clean
`clang-format` on touched files	—	clean
`ReadLints` on touched files	—	no errors

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

`MLIRRockOps` linked `hip::host`/`hip::amdhip64` (and `hsa-runtime64`) directly so it could query GPU properties for `--arch native:N`. On recent ROCm, `libamdhip64` transitively depends on `libamd_comgr.so`, which in turn depends on ROCm's `libLLVM.so.23`. Any binary that embeds rocMLIR's own LLVM (e.g. `rocmlir-driver`, `rocmlir-opt`) therefore ended up with two LLVMs in the same address space. Their `cl::opt` static initializers fight over the global option registry and the process aborts with `LLVM ERROR: Option '...' already exists!`. Confine HIP/HSA to a tiny standalone shared library (`mlir_rocm_arch_runtime`) and load it lazily from `AmdArchDb.cpp` via `dlmopen(LM_ID_NEWLM, ...)` on glibc / `LoadLibraryW` on Windows. The fresh dynamic-linker namespace keeps ROCm's `libLLVM.so` from snapping onto rocMLIR's embedded LLVM, so duplicate `cl::opt` registration can no longer occur. Tools that never request `--arch native:N` never load the runtime and stay completely free of HIP/comgr/LLVM-23. Highlights: - New `mlir/lib/ExecutionEngine/RocmArchRuntime.{h,cpp}`: stable C ABI (`mlirRocmArchRuntimeAbiVersion`, `mlirRocmArchRuntimeDeviceCount`, `mlirRocmArchRuntimeGetProperties`) implemented on top of HIP+HSA. Built with plain `add_library(... SHARED)` so it does NOT pull in `LLVMSupport`. - `MLIRRockOps` drops all HIP/HSA link-time deps. New runtime loader in `AmdArchDb.cpp` searches `<exe-dir>`, `<exe-dir>/../lib/`, `ROCMLIR_BUILD_RUNTIME_DIR` (compile-time path for nested test binaries), then the system loader path. - Public helpers `nativeDeviceCount()` and `nativeArchName()` added so callers (unit tests, Python bindings) can use the runtime without including HIP headers. - `ParamLookupTable::normalizeArch` learns to canonicalise `native[:N]` by routing through the runtime, so `rock.arch = "native"` flows through the full backend pipeline end-to-end. - `mlir/lib/CMakeLists.txt` wires `add_dependencies(MLIRRockOps mlir_rocm_arch_runtime)` after both subdirectories are processed, and exposes `ROCMLIR_BUILD_RUNTIME_DIR` as a compile def pointing at `LIBRARY_OUTPUT_DIRECTORY` (POSIX `.so`) or `RUNTIME_OUTPUT_DIRECTORY` (Windows `.dll`). - Fat-build (`BUILD_FAT_LIBROCKCOMPILER=ON`): the runtime stays as a separate shared library (intentionally NOT bundled into `librockCompiler.a`) and is excluded from `librockcompiler_deps.cmake` via an explicit allowlist filter in `get_fat_library_deps_list.pl`. Tests: - `MLIRRockUnitTests::AmdArchDbTests` rewritten to use the public AmdArchDb API; no longer includes `hip/hip_runtime_api.h` and is built unconditionally on all platforms. - New `mlir/test/Dialect/Rock/native_arch.mlir` lit test exercising `rock.arch = "native"` through `-rock-affix-params`. Gated on a new `amd-gpu-present` lit feature so it skips on no-GPU CI machines. Validated: - Original repro `rocmlir-driver --kernel-pipeline=migraphx,highlevel --arch gfx942` produces output identical to baseline. - `rock.arch = "native"` end-to-end through `--kernel-pipeline=full` emits a `gpu.binary` with `chip = "gfx942"`. - `ldd` of `MLIRRockOps.so`, `rocmlir-driver`, `rocmlir-opt`, and `MLIRRockUnitTests` is clean: no `libamdhip64`, no `libamd_comgr`, no `libLLVM.so.23`, no `libclang-cpp.so.23`. - Unit tests: 111/111 pass; 8 NativeArch tests run against real MI300X hardware and match presets. - Rock lit suite (excl. `integration/`): 102/102 pass. - Fat build configured with locally-installed rocm-cmake builds the static `librockCompiler.a` plus the separate `mlir_rocm_arch_runtime.so` and reproduces both the original `gfx942` case and the `rock.arch = "native"` case. Made-with: Cursor

When a process embeds rocMLIR's LLVM and also loads libmlir_rocm_runtime.so (e.g. mlir-runner --shared-libs=libmlir_rocm_runtime.so), the runtime wrapper brings ROCm's libamdhip64 into the address space. libamdhip64 transitively pulls /opt/rocm/lib/llvm/lib/libLLVM.so.23 via libamd_comgr, and the dynamic linker unifies llvm::cl::SubCommand::getTopLevel() and the rest of the cl::opt global registry across both LLVM instances. Static initializers in ROCm's LLVM then race with rocMLIR's LLVM and the process aborts with "LLVM ERROR: Option '...' already exists!". Hide everything that is not mgpu* from libmlir_rocm_runtime.so's dynsym: * CXX_VISIBILITY_PRESET=hidden and VISIBILITY_INLINES_HIDDEN=ON so non-annotated symbols default to STV_HIDDEN at compile time. * -Wl,--exclude-libs,ALL so symbols pulled in from any static archive (hip device runtime, clang-rt, libcxx internals) are made STB_LOCAL at link time even if they were compiled with default visibility. * -Wl,--version-script=mlir_rocm_runtime.map to assert the contract at link time: only mgpu* may be exported. Anything else fails the link rather than silently re-exposing LLVM helpers. Windows is already default-hidden for DLL symbols, so the guard is a no-op there. Apple's ld does not support GNU version scripts, so it is also skipped. Made-with: Cursor

Now that libmlir_rocm_runtime.so hides its LLVM symbols at link time (preceding commit), the `cl::opt` collision that motivated a separate, dlmopen-isolated shim library no longer manifests. Fold the HIP/HSA delay-loader back into MLIRRockOps itself using plain dlopen (POSIX) / LoadLibraryW (Windows). Why this is safe without dlmopen: * MLIRRockOps does not link libamdhip64 / libhsa-runtime64; the shared objects are resolved at run time by SONAME only, so the dynamic linker only searches for them when `rock.arch = "native[:N]"` is asked for. * When HIP is loaded this way, RTLD_LOCAL keeps its transitive deps (including ROCm's libLLVM.so via libamd_comgr) in HIP's own scope: they are not re-exposed to us, and they will not satisfy undefined references from rocMLIR's embedded LLVM. * rocMLIR binaries embedding LLVM do not re-export cl:: state through their dynsym either (the LLVM libraries are built with default -fvisibility-inlines-hidden), so even when HIP's libLLVM.so mmap's into the process, its static initializers query the rocMLIR-side cl::opt globals through HIP-local symbol resolution and never see rocMLIR's own registry. The two LLVMs stay disjoint. Why in-process delay-load is better than the shim: * No separate install artifact for downstream consumers (MIOpen, MIGraphX) to ship; a ROCm install visible on LD_LIBRARY_PATH / RPATH is sufficient. * No ABI boundary to version (the shim had MLIR_ROCM_ARCH_RUNTIME_ABI_VERSION plus a POD properties struct); fewer moving parts. * No dlmopen, which is a glibc-specific extension with known gotchas (TLS, static destructors, fork()) and no counterpart on musl. Details: * AmdArchDb.cpp: two tiny lazy loaders (HipRuntime, HsaRuntime) resolve the few symbols we need (hipGetDeviceCount, hipGetDevicePropertiesR0600 with a fallback to hipGetDeviceProperties, hsa_init, hsa_iterate_agents, hsa_agent_get_info). SONAME candidates cover ROCm 6 and 7 and unversioned installs. The HIP and HSA headers are included only for POD layouts; they are not link-time deps. * CMakeLists.txt for MLIRRockOps now does `find_package(hip QUIET)` and only picks up include dirs; no hip::host / hip::amdhip64 link. HSA headers are likewise include-only. * Navi WGP-as-CU correction for shared-memory per CU runs before checkAndSetInfo() so the HSA-assisted value flows through the same code path as before. * The mlir_rocm_arch_runtime target, RocmArchRuntime.{h,cpp}, ROCMLIR_BUILD_RUNTIME_DIR compile def, and the Perl allowlist for the fat-lib Perl filter are all removed. * Unit test and lit test comments are updated to match the new loader path. Their behaviour is unchanged. Verified: readelf -d build/lib/libMLIRRockOps.so.2.0 shows no NEEDED entries for libamdhip64, libhsa-runtime64, libamd_comgr, or libLLVM.so.*. The full chain will be re-validated in the next commit. Made-with: Cursor

Plain `dlopen(RTLD_LAZY | RTLD_LOCAL)` is insufficient to break the `cl::opt` collision against ROCm's libLLVM.so.23 in practice. Even though RTLD_LOCAL stops us from re-exporting symbols from the dlopened module, it does not hide rocMLIR's own process-global LLVM symbols from the dynamic linker when libLLVM.so.23 runs its static initializers. On glibc, libLLVM.so.23 is pulled in transitively by libamd_comgr, and its static-init cl::opt registrations resolve against rocMLIR's unversioned cl::* symbols, which registers the same option twice in rocMLIR's registry. `rocmlir-opt -rock-affix-params` on mlir/test/Dialect/Rock/native_arch.mlir reproduces this as: LLVM ERROR: Option 'default' already exists! ... libLLVMCodeGen.so.22.0git llvm::report_fatal_error(...) libLLVM.so.23.0git <static init> ld-linux-x86-64.so.2 _dl_catch_exception dlmopen(LM_ID_NEWLM, ...) places libamdhip64 and its transitive deps into a fresh link-map namespace where rocMLIR's symbols are not visible, so libLLVM.so.23's static init can only resolve its cl::* references against its own copies. The two LLVM instances stay completely disjoint. The non-glibc POSIX fallback keeps plain dlopen + RTLD_LOCAL. That branch only works if the process also hides its LLVM surface via build-side discipline (`-Wl,--exclude-libs,ALL`); see the follow-up commits that apply that to rocMLIR executables. Windows is unchanged: LoadLibraryW already gives a per-DLL private scope. Verified: * mlir/test/Dialect/Rock/native_arch.mlir passes (was crashing). * Original repro (rocmlir-driver --kernel-pipeline=migraphx,highlevel --arch gfx942) still passes. * `rock.arch = "native"` via MLIR attribute still produces a valid gpu.binary. Made-with: Cursor

Add -Wl,--exclude-libs,ALL to add_rocmlir_tool(). Any static archive that gets pulled into a tool (e.g. libLLVMSupport.a in the fat-lib build, or intermediates on a conventional static build) is turned into STB_LOCAL by the linker rather than being re-exported from the executable's dynsym. This is load-bearing on the non-dlmopen branches of AmdArchDb.cpp and on musl / FreeBSD where dlmopen is unavailable: if the tool re-exports llvm::cl::* globals, a later dlopen of libamd_comgr / libLLVM.so.23 will interpose against them and crash with "Option '...' already exists" at static-init time. The flag is a no-op when every LLVM/MLIR dependency is already a shared library (the BUILD_SHARED_LIBS=ON default), so landing it now does not change the shape of the built tools on this developer machine. It still measurably reduces their dynamic symbol surface: dynsym on rocmlir-opt drops from ~34 template weak symbols to 1, making future regressions easier to spot with `nm -D`. Apple's linker does not accept --exclude-libs; the guard matches the one used for the libmlir_rocm_runtime.so version script. Made-with: Cursor

The preceding commit paired --version-script=mlir_rocm_runtime.map with CXX_VISIBILITY_PRESET=hidden. The combination over-hides: the mgpu* entry points in upstream RocmRuntimeWrappers.cpp are not annotated with __attribute__((visibility("default"))), so the hidden preset marks them STV_HIDDEN at compile time and the version script has nothing to export (`global: mgpu*;` only overrides global/local partitioning, it does not un-hide compile-time-hidden symbols). The result is a libmlir_rocm_runtime.so with an empty dynsym, which breaks any mlir-runner / xmir-runner invocation that tries to dlsym an mgpu* entry point for JIT launches. Fix: use the version script alone and rely on default visibility. The script still hides everything that is not mgpu*, which is all we need to stop llvm::* leakage. Upstream CudaRuntimeWrappers.cpp solves the same problem with explicit MLIR_CUDA_WRAPPERS_EXPORT annotations on each entry point; we avoid that here to keep the rocMLIR-side patch minimal. Verified: nm -D --defined-only on libmlir_rocm_runtime.so now shows exactly the 25 mgpu* T entries and zero llvm::* symbols. Made-with: Cursor

Set CMAKE_SHARED_LINKER_FLAGS / CMAKE_MODULE_LINKER_FLAGS to include -Wl,-Bsymbolic-functions for the entire build on non-Windows, non-Apple targets. This affects both rocMLIR's own libMLIR*.so and the embedded LLVM's libLLVM*.so.22.0git. Effect: intra-library cl::* function calls (e.g. within libLLVMSupport.so.22.0git, libLLVMCodeGen.so.22.0git, ...) are resolved to local definitions at link time rather than going through the PLT. The dynamic linker cannot later interpose them with any other loaded shared object's cl::* implementation, in particular ROCm's libLLVM.so.23.0git (pulled in transitively via libamdhip64 -> libamd_comgr when a runner dlopens libmlir_rocm_runtime.so). Verified with `objdump -d --section=.plt` on libLLVMSupport.so.22.0git: zero cl::* PLT entries after this change. Scope caveat: -Bsymbolic-functions addresses function-symbol interposition only. Data-symbol interposition (vtables, the cl::SubCommand process-global singleton) is unaffected, so the mlir-runner / xmir-runner JIT path still hits the "Option '...' already exists!" collision during static-init. Those tests remain pre-existing upstream MLIR failures; closing them requires either a version script hiding llvm::cl::* from the dynsym of libLLVMSupport.so.* or a full CXX_VISIBILITY_PRESET=hidden on the embedded LLVM build. Both are tracked as follow-ups. Regression guard: 102/102 rock lit tests (excl. integration/) and 111/111 MLIRRockUnitTests still pass with this flag active. Made-with: Cursor

…mopen Hiding cl::* from libmlir_rocm_runtime.so's dynsym (commit 8010fdf) was necessary but not sufficient: the runtime still pulled libamdhip64 in as a direct NEEDED, which transitively dragged libamd_comgr and ROCm's libLLVM.so.23 into mlir-runner's / xmir-runner's main namespace at dlopen time. ROCm's libLLVM.so.23 then ran static initializers that unified the cl::SubCommand singleton across the two LLVMs, aborting JIT execution with "Option 'default' already exists!". Apply the same isolation pattern that mlir/lib/Dialect/Rock/IR/AmdArchDb.cpp already uses for the rocMLIR compiler side: * Remove `hip::host hip::amdhip64` from `mlir_rocm_runtime`'s link line. Keep the HIP headers reachable via `hip_INCLUDE_DIR` so we can still see hipDeviceProp_t / hipMemcpyKind layouts. * Resolve every hipXXX entry point through a function-pointer table populated by dlmopen(LM_ID_NEWLM, "libamdhip64.so.7", RTLD_LAZY) on glibc, plain dlopen elsewhere on POSIX, and LoadLibraryA on Windows. * Bare hipXXX call sites are kept verbatim by routing them through object-like macros (`#define hipModuleLoadData(...) (::getHip().moduleLoadData(__VA_ARGS__))` etc.); the rest of the file stays byte-for-byte identical to upstream. * mgpu* entry points pick up explicit MLIR_HIP_WRAPPERS_EXPORT visibility annotations so the existing libmlir_rocm_runtime.so version script keeps exporting them after the next merge from upstream (where default-visibility was the only thing exporting them). Effect: `readelf -d build/.../libmlir_rocm_runtime.so` no longer lists libamdhip64 as NEEDED. ROCm's libamd_comgr and libLLVM.so.23 land in the private dlmopen namespace and are completely invisible to the host process's LLVM. Verified end-to-end: * mlir/test/rocmlir-driver/conv2d_harness/conv2d_harness.mlir (and siblings that previously crashed in mlir-runner): PASS. * Rock integration suite: 111/115 pass, up from 102/115. The remaining 4 failures (rock-reduce-max-case{1..4}.mlir) abort with SmallPtrSet "Bucket < End", an unrelated downstream bug in the reduce-max lowering -- they pre-date this branch. * Original gfx942 repro and rock.arch="native" attribute path still pass. * 102/102 rock lit (excl. integration) and 111/111 MLIRRockUnitTests still pass. Made-with: Cursor

…_rocm_runtime `xmir-runner` link-pulls libMLIRRocmExecutionEngineUtils.so for RocmSystemDetect, which in turn link-pulled libamdhip64 -> libamd_comgr -> ROCm's libLLVM.so.23. The chain was loaded at process start before main() ran, and libLLVM.so.23's static initializers then unified cl::SubCommand against rocMLIR's embedded libLLVMSupport.so.22, aborting in `_dl_init` with a SmallPtrSet "Bucket < End" assertion (the iterator walked off the end of a SmallPtrSet that the two LLVMs disagreed about). Fix: 1. Drop the link-time `hip::host hip::amdhip64` dependency from MLIRRocmExecutionEngineUtils. Only the HIP headers are kept, via `hip_INCLUDE_DIR`. RocmSystemDetect.cpp now resolves the two HIP entry points it actually uses (`hipGetDeviceCount`, `hipGetDevicePropertiesR0600`) through a dlmopen(LM_ID_NEWLM, ...) handle, mirroring the pattern in mlir/lib/Dialect/Rock/IR/AmdArchDb.cpp and the preceding patch to libmlir_rocm_runtime.so. 2. Coordinate with libmlir_rocm_runtime.so so we never dlmopen libamdhip64 twice in the same process. KFD allows only one user-space HSA session per process; a second dlmopen lands in a fresh link-map namespace, gets its own HIP/HSA copy, and every call from that copy returns hipErrorNoDevice. RocmSystemDetect runs first (during JitRunner's mlirTransformer step, before --shared-libs are dlopen'd) and is therefore the canonical owner of the HIP namespace. It exports a small C ABI `mlirRocmSystemDetectGetHipHandle()` returning its dlmopen'd handle. RocmRuntimeWrappers.cpp dlsyms that symbol via RTLD_DEFAULT when its `mgpu*` table is being built, and reuses the handle if present. If the symbol is absent (host binary that does not link RocmSystemDetect), it transparently falls back to its own dlmopen. Verified end-to-end: * `xmir-runner` ldd is now clean -- no libamdhip64, no libamd_comgr, no libLLVM.so.23 in the transitive closure. * mlir/test/Dialect/Rock integration suite: 115/115 pass (was 102/115; the 4 reduce_max + ~9 other JIT tests are now green). * mlir/test/xmir suite: 59/61 pass, 1 XFAIL, 1 FAIL. The remaining FAIL (bert_part_48) is "Error loading library: ocml.bc: Unknown attribute kind (106) (Producer: 'LLVM23.0.0git' Reader: 'LLVM 22.0.0git')" -- an ROCm-vs-rocMLIR LLVM-IR version skew, not related to this work. * mlir/test/{Dialect,Conversion,CAPI,rocmlir-*,migraphx_models, xmir,mlir-rock-lib} broader suite: 368/380 (96.84%) pass, was 300/380. The remaining failures are 1 bert_part_48 above and 1 rocmlir-tuning-driver/benchmark-config.mlir; the latter has the same root cause but lives in a separate tool that direct-links ~30 HIP/HIPRTC entry points and needs its own architectural refactor. * 102/102 rock lit (excl. integration) and 111/111 MLIRRockUnitTests still pass. * Original repro (`rocmlir-driver --kernel-pipeline=migraphx,highlevel --arch gfx942`) and `rock.arch="native"` both pass. Made-with: Cursor

`rocmlir-tuning-driver` link-pulled `hip::host hip::amdhip64 hiprtc::hiprtc` directly. The transitive chain `libamdhip64 -> libamd_comgr -> libLLVM.so.<major>` made the dynamic linker map ROCm's monolithic libLLVM into the host process at startup. ROCm's libLLVM ran its `cl::opt` static initializers before main(), unifying its `cl::SubCommand` registry with rocMLIR's split-shared-libs libLLVMSupport, and the resulting iterator inconsistency aborted the process in `_dl_init` with "SmallPtrSet: Bucket < End". This affected the only previously-failing lit test: mlir/test/rocmlir-tuning-driver/benchmark-config.mlir. Resolve every hipXXX/hiprtcXXX entry point at run time instead: * Add `HipDelayLoad.{h,cpp}` (rocmlir::tuningdriver namespace) that exposes `getHipSymbols()` / `getHiprtcSymbols()` accessors. `HipDelayLoad.cpp` resolves libamdhip64 / libhiprtc via `dlmopen(LM_ID_NEWLM, ...)` on glibc, plain `dlopen(RTLD_LAZY | RTLD_LOCAL)` on other POSIX, and `LoadLibraryA` on Windows. * Coordinate with libMLIRRocmExecutionEngineUtils via the existing `mlirRocmSystemDetectGetHipHandle` cross-library back-channel: if RocmSystemDetect has already loaded HIP into a private namespace, reuse its handle. KFD permits exactly one user-space HSA session per process, so a second dlmopen would otherwise return `hipErrorNoDevice` for every call. HIPRTC is loaded into the same link-map namespace as HIP via `dlinfo(RTLD_DI_LMID)` so the JITed instruction-cache flush kernel sees the same device handles. * Add `HipDelayLoadMacros.h`, included LAST in each consumer .cpp to redirect bare `hipMalloc(...)` / `hiprtcCompileProgram(...)` etc. to the function-pointer table dispatch. This keeps rocmlir-tuning-driver.cpp and CacheFlush.cpp byte-equivalent to their upstream-style HIP code; only the include block at the top grows by two lines per file. The `hipExtModuleLaunchKernel` macro injects the trailing `flags=0` argument that the HIP header would otherwise supply as a default (default arguments do not apply when calling through a function pointer). * CMake: drop `target_link_libraries(... hip::host hip::amdhip64 hiprtc::hiprtc)`; keep the find_package calls and switch to include-only wiring via `hip_INCLUDE_DIR` / `hiprtc::hiprtc INTERFACE_INCLUDE_DIRECTORIES` plus the `__HIP_PLATFORM_AMD__=1` define. Add `${CMAKE_DL_LIBS}` for the dlmopen/dlsym calls in HipDelayLoad.cpp. * CMake: drop the `benchmark-driver-utils` link-line entry. It was dead code -- rocmlir-tuning-driver does not include `benchmarkUtils.h` or call any `benchmark::*` symbol. Keeping it pulled libamdhip64 in transitively (via `target_link_libraries( benchmark-driver-utils PUBLIC hip::host hip::amdhip64)`), so its removal was load-bearing for breaking the HIP transitive chain. Verified end-to-end on the dlopenHip branch over LLVM 23: * `readelf -d build/bin/rocmlir-tuning-driver` is now clean -- no libamdhip64, libamd_comgr, libhiprtc, or libLLVM.so.23 in the direct or transitive closure. * mlir/test/rocmlir-tuning-driver/benchmark-config.mlir now PASSES. * Full broad lit (Dialect, Conversion, CAPI, rocmlir-*, xmir, migraphx_models, mlir-rock-lib): 371/381 pass, 2 XFAIL, 8 Unsupported, 0 failures (was 370/381 + 1 fail). * rock integration suite: 115/115 pass. * xmir suite: 60/61 pass + 1 XFAIL. * MLIRRockUnitTests: 111/111 pass. * Original repro `rocmlir-driver --kernel-pipeline=migraphx,highlevel --arch gfx942` still passes; `rock.arch="native"` still passes. Made-with: Cursor

…helper Header-only utility that consolidates the dlmopen / dlopen / LoadLibraryA scaffolding that was independently re-implemented in four places on the dlopenHip branch: * external/.../ExecutionEngine/RocmRuntimeWrappers.cpp * external/.../ExecutionEngine/RocmSystemDetect.cpp * mlir/lib/Dialect/Rock/IR/AmdArchDb.cpp * mlir/tools/rocmlir-tuning-driver/HipDelayLoad.cpp Each had its own `kHipCandidates` SONAME list, `osOpen`/`osSym` shim, and (in three of the four) its own copy of the `mlirRocmSystemDetectGetHipHandle` RTLD_DEFAULT-coordination dance. Provides: * `loadRocmLibrary(Library, relatedHandle, CoordinationPolicy)` - `Library::Hip` defaults to `CoordinationPolicy::Auto`, which consults `mlirRocmSystemDetectGetHipHandle` via RTLD_DEFAULT first so all callers share one HSA session per process (KFD only permits one). RocmSystemDetect itself opts out via `CoordinationPolicy::Owned`. - `relatedHandle` is used for HIPRTC/HSA so they are dlmopen'd into the *same* link-map namespace as HIP via dlinfo(RTLD_DI_LMID). That preserves the single-HSA-session invariant for satellite libraries. * `resolveRocmSymbol(LoadedLibrary, name)` -- thin dlsym/GetProcAddress wrapper. * Platform candidate lists (`libamdhip64.so.{7,6}` / decorated `amdhip64_*.dll`, etc.) declared once. Header-only on purpose: no new CMake target, no new linkage edge, no ABI to version. Each consumer TU gets its own static copy of the inline functions; the only cross-TU contract is the `mlirRocmSystemDetectGetHipHandle` extern-C symbol that `RocmSystemDetect.cpp` exports for RTLD_DEFAULT lookup. That stays unchanged. Two correctness improvements vs the per-file copies: 1. AmdArchDb (in MLIRRockOps) previously did its own dlmopen, bypassing the RocmSystemDetect coordination. In a process that loads both libraries (e.g. a future xmir-runner that also lowers `rock.arch = "native"`), KFD's one-session limit would have caused the second dlmopen to return `hipErrorNoDevice` for every call. The new helper makes the coordination automatic for every consumer. 2. HSA was previously dlmopen'd into a fresh namespace by AmdArchDb, even though HIP was already loaded into another namespace by the same TU. With `relatedHandle`, HSA shares HIP's namespace via dlinfo(RTLD_DI_LMID), keeping both on a single KFD session. Made-with: Cursor

Replace the per-file dlmopen / dlopen / LoadLibraryA scaffolding in RocmRuntimeWrappers.cpp and RocmSystemDetect.cpp with calls to the shared `mlir::rocm_loader` helpers. The two files now share: * One copy of the SONAME candidate lists. * One copy of the `osOpen` / `osSym` / `_GNU_SOURCE` / `<dlfcn.h>` boilerplate. * One copy of the `mlirRocmSystemDetectGetHipHandle` RTLD_DEFAULT coordination dance (RocmSystemDetect uses `CoordinationPolicy::Owned` to opt out, since it *is* the canonical owner of the handle). Net delta per file: * RocmRuntimeWrappers.cpp: -91 lines (was 75 lines of scaffolding, now 1 call to `loadRocmLibrary`) * RocmSystemDetect.cpp: -68 lines (was 60 lines of scaffolding + handle-coordination comment, now 3 lines) Drive-by cleanups in the same diff: * Remove the unused `#include <mutex>` from RocmRuntimeWrappers.cpp. * Guard `#pragma GCC diagnostic` in RocmSystemDetect.cpp behind `__GNUC__ && !__clang__` so MSVC does not emit `C4068: unknown pragma` on a Windows build. Behaviour is unchanged on glibc Linux (same dlmopen LM_ID_NEWLM, same symbol search order); test results are identical (`ninja check-rocmlir` 1337/1337 pass, fat-lib build 472/472 pass). Made-with: Cursor

Replace the per-file dlmopen scaffolding in `MLIRRockOps`'s `AmdArchDb.cpp` and `rocmlir-tuning-driver`'s `HipDelayLoad.{h,cpp}` with calls to the shared `mlir::rocm_loader` helpers (added in the preceding [EXTERNAL] commit). Both rocMLIR-side TUs now coordinate automatically with `RocmSystemDetect` through `mlirRocmSystemDetectGetHipHandle`, so a process that loads `MLIRRockOps` (for `rock.arch = "native"` lowering) and `MLIRRocmExecutionEngineUtils` (for runner target detection) shares a single HSA session instead of racing against KFD's one-session-per-process limit. Behavioural changes: 1. AmdArchDb's HIP loader now consults `mlirRocmSystemDetectGetHipHandle` via RTLD_DEFAULT before dlmopen'ing its own libamdhip64. This was the latent bug identified in review: with the prior code, a future xmir-runner path that lowered `rock.arch = "native"` would have hit `hipErrorNoDevice` for every HIP call from RockOps. The coordination now matches what `libmlir_rocm_runtime.so` and the tuning driver already do. 2. AmdArchDb's HSA loader now passes HIP's handle to `loadRocmLibrary(Library::Hsa, hipHandle)`, so HSA dlmopens into HIP's link-map namespace via dlinfo(RTLD_DI_LMID) and shares the same KFD session. Previously HSA would dlmopen LM_ID_NEWLM and `hsa_init()` could (silently) end up tied to a separate session. Drive-by cleanups in the same diff: * AmdArchDb.cpp: replace `std::unordered_map<std::string,...>` with `llvm::StringMap` per LLVM coding standards. * AmdArchDb.cpp: drop the `#include <unordered_map>` and `<cstring>` that became dead after the refactor. * AmdArchDb.cpp: fix the file-banner typo (`Dtabase` -> `Database`). * HipDelayLoad.cpp: drop the redundant `#include <hip/hip_ext.h>` (already pulled in via HipDelayLoad.h) and the now-unused `<dlfcn.h>` / `<windows.h>` blocks. * MLIRRockOps CMakeLists: guard `target_include_directories( obj.MLIRRockOps ...)` with `if (TARGET ...)` for symmetry with the upstream ExecutionEngine CMake; iterate over both target names so HIP/HSA include dirs land on whichever is the actual compile target. Net delta: * AmdArchDb.cpp: -116 lines (180 lines of dlmopen scaffolding + HipRuntime/HsaRuntime ctors collapsed to 1 call each). * HipDelayLoad.cpp: -118 lines (the entire SONAME + dlmopen + dlinfo + osSym block deleted; only the function-pointer table population remains). * HipDelayLoad.h: switch `void *handle` to `mlir::rocm_loader::LoadedLibrary lib` so the helper's opaque type is the contract. Verified: build/ (BUILD_SHARED_LIBS=ON) `check-rocmlir` 1337/1337 pass + 5 XFAIL; build_migx/ (BUILD_FAT_LIBROCKCOMPILER=ON) `check-rocmlir` 472/472 pass + 3 XFAIL. MLIRRockUnitTests 111/111. Original gfx942 repro and `rock.arch="native"` paths both still pass. Made-with: Cursor

Splits the formerly header-only `RocmDynamicLoader.h` into a small public header and an implementation file so it is fit for upstream LLVM submission: - Public header (`RocmRuntimeLoader.h`) is platform-agnostic. It no longer pulls in `<windows.h>` (which leaks `min`/`max`/`ERROR` macros) nor defines `_GNU_SOURCE` (a feature-test macro that has no business in an installed MLIR header). - Implementation (`RocmRuntimeLoader.cpp`) keeps `<windows.h>` and `_GNU_SOURCE` confined and matches upstream LLVM's `Support/Windows/DynamicLibrary.inc` pattern: UTF-8 SONAMEs are converted to UTF-16 via `convertUTF8ToUTF16String` and passed to `LoadLibraryW`. POSIX paths use `dlmopen(LM_ID_NEWLM, ...)` on glibc and degrade to `dlopen(RTLD_LAZY|RTLD_LOCAL)` elsewhere with a one-time `LLVM_DEBUG` advisory. - Replaces a zero-element `kHsaCandidates[]` on Windows (ill-formed in standard C++; rejected by MSVC) with a `(nullptr, 0)` sentinel. - Adds an `llvm_unreachable` for the impossible `Library` enum case. - Builds the new TU as a small static-library target (`MLIRRocmRuntimeLoader`) so consumers do not pay the cost of a new dynamic dependency. `mlir_rocm_runtime` and `MLIRRocmExecutionEngineUtils` link against it. Documents the cross-library coordination ABI on the canonical declaration: `mlirRocmSystemDetectGetHipHandle` is now declared in `RocmSystemDetect.h` with full rationale (KFD's one-HSA-session-per- process limit) and is used by the loader via `RTLD_DEFAULT` lookup so no link-time dependency on `MLIRRocmExecutionEngineUtils` is needed. Removes the now-stale `DISABLE_PCH_REUSE` "TODO: workaround" comment on `mlir_rocm_runtime`. Symbol containment is enforced at link time by the version script (`mlir_rocm_runtime.map`); the dynsym contract is verified by a regression test added in a follow-up commit. Made-with: Cursor

Updates the rocMLIR-side delay-load consumers (`AmdArchDb.cpp` for `rock.arch = "native"`, and `rocmlir-tuning-driver`'s `HipDelayLoad` helpers) to include the new header `mlir/ExecutionEngine/RocmRuntimeLoader.h` and to link against the new static library target `MLIRRocmRuntimeLoader`. This is purely a rename + link-target switch; behaviour is identical. The transitive `${CMAKE_DL_LIBS}` dependency that was previously declared on each consumer is now exposed `PUBLIC`-ly by `MLIRRocmRuntimeLoader` itself, so the redundant per-consumer link lines are dropped. Made-with: Cursor

Closes the test-coverage gap on three independent invariants of the multi-LLVM isolation design: 1. `Dialect/Rock/Loader/no_rocm_neededs.test` -- asserts that the shipped binaries (`rocmlir-driver`, `rocmlir-opt`, `rocmlir-gen`, `rocmlir-tuning-driver`, `xmir-runner`, `mlir-runner`) and the ROCm-aware shared libraries (`libMLIRRockOps`, `libMLIRRocmRuntimeLoader`, `libmlir_rocm_runtime`, `libMLIRRocmExecutionEngineUtils`) never grow a `DT_NEEDED` entry for `libamdhip64`, `libhiprtc`, `libamd_comgr`, or ROCm's monolithic `libLLVM.so`. Any such regression would defeat the entire delay-load design by re-introducing the `cl::opt` / SmallPtrSet collisions at process startup. 2. `Dialect/Rock/Loader/dynsym_only_mgpu.test` -- asserts that `libmlir_rocm_runtime.so` exports nothing but the `mgpu*` C ABI entry points required by upstream MLIR's GPU lowering. This is what the `mlir_rocm_runtime.map` version script is supposed to guarantee; the test pins that contract so an accidental new source file or transitive LLVM dep does not silently widen the dynsym. 3. `RocmRuntimeLoaderTests` (gtest) -- pins the `CoordinationPolicy::Auto` contract: the second consumer to call `loadRocmLibrary(Hip)` MUST observe the same handle as `RocmSystemDetect`, otherwise KFD's "one HSA session per process" rule starts handing out `hipErrorNoDevice`. Also covers the `Owned` policy (used by the canonical owner to break recursion) and the soft-failure semantics of `resolveRocmSymbol` on a null handle. Extends `native_arch.mlir` with a `rock.arch = "native:0"` case to cover the explicit-device-id parsing path of the `--arch native:N` runtime resolver. Adds a `linux` lit feature (set when `platform.system() == "Linux"`) for the two ELF-format tests; they `// REQUIRES: linux` so the suite stays portable to macOS / Windows configurations. Also adds a `%rocmlir_shlib_dir` lit substitution derived from the `host_ldflags` rpath so the cleanliness scripts can locate the build's library output directory without hard-coding `lib/` vs `lib64/`. The new gtest sub-suite is conditionally compiled: it requires `MLIRRocmExecutionEngineUtils` as a separate target, which only exists in the standard shared-library configuration. Under `BUILD_FAT_LIBROCKCOMPILER=ON` the symbols are not exposed separately and the sub-suite is omitted; the contract is identical in both configurations and is exercised end-to-end by the broader regression suite. Made-with: Cursor

Removes the hardcoded `libamdhip64.so.7` / `libamdhip64.so.6` / `hiprtc0700.dll` etc. SONAME lists and replaces them with a generated candidate list: 1. The bare unversioned SONAME (`libamdhip64.so` / `amdhip64.dll`) comes first. This is what `find_package(hip)`, IREE's HIP HAL and Triton's HIP loader use. Standard ROCm installs always ship the unversioned symlink, so this is the fast path on every supported configuration. 2. Numeric versioned SONAMEs (`libamdhip64.so.<MAJOR>` for MAJOR descending from `kMaxProbedRocmMajor` (99) to 1) are tried as a fallback for runtime-only deployments where the unversioned alias has been stripped. A failing `dlopen` of a missing SONAME costs ~5us on glibc, so the worst-case 99-element walk is sub- millisecond and is paid only once at startup. 3. Windows HIPRTC is decorated as `hiprtc<MM>00.dll` (e.g. `hiprtc0700.dll` for ROCm 7.0); we generate that pattern for each candidate major instead of hardcoding two majors. The same approach extends to HSA: `libhsa-runtime64.so` first, then descending numeric majors. (HSA has stayed on `.so.1` for the entire ROCm 4.x-7.x window, but iterating costs nothing if AMD ever bumps it.) Result: this loader is built once and works against any ROCm major version present at runtime -- ROCm 4.x through any future ROCm release, without code changes (so long as AMD stays at or below `kMaxProbedRocmMajor`, which is `static_assert`-ably easy to bump). Documents the version-compatibility contract on the public header under "ROCm version compatibility" so reviewers can verify the forward/backward-compatibility story without reading the implementation. Made-with: Cursor

Removes hardcoded LLVM- and rocMLIR-version filename suffixes from the cleanliness lit tests, so a future LLVM merge or rocMLIR version bump does not silently turn the assertions into no-ops. Test infrastructure: - `check_no_rocm_neededs.sh` now globs library *base names* (e.g. `libMLIRRockOps.so*`, `libmlir_rocm_runtime.so*`) and audits every match. The script reports the artefact count it actually checked, fails loudly if zero artefacts were found, and skips bare `.so` developer symlinks to avoid double-reporting the same NEEDED set. Tested against a synthesized "future LLVM 99 + rocMLIR 42" tree to confirm it picks up arbitrary version suffixes. - `check_dynsym_only_mgpu.sh` likewise globs `libmlir_rocm_runtime.so*` instead of pinning `.so.23.0git`, falling back to the unversioned dev symlink as a last resort. - `lit.site.cfg.py.in`'s `%rocmlir_shlib_dir` substitution now parses the rpath out of `host_ldflags` and falls back through four well-known build-tree library directories (`<obj>/lib`, `<obj>/lib64`, `<obj>/external/llvm-project/lib`, `<llvm_obj>/lib`) instead of hardcoding `lib`. Unit test: - Adds `RocmRuntimeLoaderVersionContract.FindsHipWithoutVersionHardcoding`, which proves the loader resolves HIP without compile-time knowledge of the ROCm major version. The test uses the `Auto` policy (the path downstream consumers actually take) to avoid KFD's "one HSA session per process" limit, which would otherwise mask a true loader failure under a spurious `dlmopen` failure from repeated `Owned` calls. Verifies that `hipGetDeviceCount` -- a symbol present on every HIP major since ROCm 1.x -- resolves through the returned handle, pinning that the loaded library is actually HIP and not some other library that happened to match the SONAME pattern. Made-with: Cursor

The header was renamed to `RocmRuntimeLoader.h` in commit e61ae4a, but three file-level docblocks still referenced the pre-rename name. They are comments only -- no code or include-path impact -- but the stale text confuses a reviewer who tries to follow the cited path. Updated in `RocmRuntimeWrappers.cpp`, `RocmSystemDetect.cpp` and the rocMLIR-side `AmdArchDb.cpp`. Made-with: Cursor

Strengthens the `CoordinationPolicy::Owned` doc comment to spell out that the policy is reserved for `RocmSystemDetect` and must NOT be called from elsewhere. Otherwise a downstream reader would see the parameter and assume it is the more direct, simpler policy to use -- when in fact a second `Owned` invocation in the same process opens HIP into a fresh `dlmopen(LM_ID_NEWLM, ...)` namespace, hits KFD's "one HSA session per process" limit, and starts handing out `hipErrorNoDevice` from every subsequent HIP call. This caveat surfaced while writing the version-agnosticism contract test: an `Owned`-using test happened to run after the canonical owner had already loaded HIP, and the loader silently produced a "working" handle that returned no devices. The test was switched to use `Auto` (the policy downstream consumers should use); the doc now blocks anyone else from making the same mistake. Made-with: Cursor

Two upstream-style polish fixes to make the loader feel native to a LLVM/MLIR reviewer: - Replace the hand-rolled `MLIR_ROCM_SHARED_HIP_EXPORT` macro (`__declspec(dllexport)` on Windows / `visibility("default")` elsewhere) with the canonical `LLVM_ALWAYS_EXPORT` from `llvm/Support/Compiler.h`. Bit-for-bit identical expansion, but the upstream macro is what reviewers expect and is consistent with how MLIR runtime wrappers (`CudaRuntimeWrappers.cpp`, etc.) publish their entry points. Updated the public docstring on `mlirRocmSystemDetectGetHipHandle` to reference the macro by name so a reader can grep its definition. - Document why `RocmRuntimeLoader.cpp` does NOT use the standard `llvm::sys::DynamicLibrary` wrapper. `sys::DynamicLibrary` always passes `RTLD_LAZY | RTLD_GLOBAL` to `dlopen` (see `lib/Support/Unix/DynamicLibrary.inc`) and there is no public knob to swap in `RTLD_LOCAL` or `dlmopen(LM_ID_NEWLM, ...)`. `RTLD_GLOBAL` is exactly the thing the loader is built to avoid: it lets the dynamic linker unify ROCm's `libLLVM.so` symbols with the host's embedded LLVM, which is the original `cl::opt` collision being fixed. The implementation below intentionally drops to raw OS APIs with the flags we actually need (`dlmopen(LM_ID_NEWLM, ..., RTLD_LAZY)` on glibc, `dlopen(..., RTLD_LAZY | RTLD_LOCAL)` on other POSIX, `LoadLibraryW` on Windows). The new file-level comment makes this deliberate deviation explicit so a future reviewer does not "fix" it by switching to `sys::DynamicLibrary`. No behavioural change. Symbol table is unchanged (`mlirRocmSystemDetectGetHipHandle` still exported with default visibility), all 4 loader unit tests still pass, the dynsym / NEEDED-cleanliness lit suite is unchanged. Made-with: Cursor

Removes accumulated indirection that the file picked up across several review-driven rewrites. No behavioural change. - Collapse 6 helper functions (`windowsLoadLibrary`, `posixOpenIsolated`, `posixOpenInSameNamespace`, `warnIfWeakIsolation`, `openIsolatedImpl`, `openInSameNamespaceImpl`) into 2 (`openIsolated`, `openInRelatedNamespace`). The two `*Impl` wrappers were dispatching to a single named impl per platform, adding a layer of named indirection without value; the platform branches now live directly inside `openIsolated`. - Drop the unreachable defensive `if (existingHandle && ...)` inside `posixOpenInSameNamespace`. The caller (now `openInRelatedNamespace`) already routes the null case to `openIsolated`, so the inner check could not fire. - Inline `warnIfWeakIsolation` into `openIsolated`'s POSIX branch -- the function existed only to guard a one-time advisory whose body and call site together cost more than the advisory itself. - Factor the bare + numeric-fallback enumeration into a small `appendCandidates(out, bare, joiner)` helper. Each `Library` case in `candidatesFor` shrinks from a 4-line `emplace_back + for-loop` block to a 3-line lambda call, and the platform- specific `joiner` lambda is the only place that knows how the versioned SONAME is composed. - Use `DEBUG_TYPE` string literal in `LLVM_DEBUG` messages instead of repeating the literal `"rocm-runtime-loader: "` prefix. The literal-concatenation idiom matches what `lib/Support/*.cpp` do. - `using namespace mlir::rocm_loader;` at file scope so the enum / struct names don't need the prefix in every `case`. Net: -28 lines, -3 helper functions, -1 redundant indirection layer, -1 unreachable code path. All 4 loader unit tests + the dynsym / NEEDED-cleanliness lit suite still pass on both build trees (default shared and `BUILD_FAT_LIBROCKCOMPILER=ON`). Made-with: Cursor

Both scripts collapse a triple-nested bash loop (artefact x forbidden-prefix x readelf-line / artefact x export-symbol x case-pattern) into a single `awk` / `grep -E` pipeline per artefact. - `check_no_rocm_neededs.sh`: 111 -> 72 lines (-35%). Forbidden SONAMEs are now a single anchored regex (`^lib(amdhip64|hiprtc|amd_comgr|LLVM\.so)`) instead of an array of prefixes consumed by an inner `case ${forb}*` (which only worked because bash unquoted-expanded `${forb}*` as a glob, an easy thing for the next reader to misunderstand). Tool/library artefact lists become inline `for` arguments. - `check_dynsym_only_mgpu.sh`: 70 -> 61 lines. The bash `case mgpu* / *` partition becomes a single `awk` script that emits an `ok=N` count and a multi-line `BAD\n sym1\n sym2` block; the bash code then just greps for `^BAD`. Verified end-to-end: success, skip-on-missing, and forbidden-NEEDED detection all still work (the latter checked by linking `/opt/rocm/lib/libamdhip64.so.7` -- which carries `libamd_comgr.so.3` in its NEEDED set -- as `libMLIRRocmRuntimeLoader.so.99` and confirming the script reports `FAIL: ... declares NEEDED libamd_comgr.so.3` and exits non-zero). Made-with: Cursor

A close re-read of the previous two simplification commits (68bb69e, 8195f8) surfaced five issues, three of them real bugs: 1. (REAL) `check_no_rocm_neededs.sh` silently passed when both directory arguments were bogus. The "no artefacts found" guard could not fire because the tool slots were always added to the `artefacts` array unconditionally, then later silently skipped by a `[ -e "${art}" ] || continue` check. A run with `tools_dir=/nonexistent shlib_dir=/nonexistent` reported "0 artefacts checked, all clean" and exited 0, masking any misconfigured lit substitution. Tools are now added only when they actually exist, mirroring the shlib-glob pattern, and the wrong-directory failure message names the lit substitutions to check. 2. (REAL) `check_no_rocm_neededs.sh` `forbidden_re` could false-positive-match SONAMEs that share a prefix but a different basename, e.g. a hypothetical `libamd_comgr_helper.so.1` would match because the previous `[._-]` separator class accepted `_` (which is part of the basename, not a version separator). Tightened the regex to anchor on `<name>.{so,dll}` followed by either end-of-string or a real version separator (`.` / `-`). Added a tabulation of true/false-positive cases to the comment so the discriminator is explicit. 3. (REAL, cosmetic) `check_dynsym_only_mgpu.sh` emitted a stray blank line in its failure output -- the `BAD\n sym1\n sym2` awk format, after `sed 's/^BAD//'`, left an empty first line. Reformatted the awk to emit `OK <sym>` and `BAD <sym>` per line, which removes the sentinel and lets the bash side use plain `grep -c` / `cut`. 4. (Code quality) Dropped the `static bool once = []() { LLVM_DEBUG(...); }();` advisory wrapper in `RocmRuntimeLoader .cpp`'s non-glibc POSIX path. `LLVM_DEBUG` is itself gated by `-debug-only=rocm-runtime-loader`, so the inner advisory already only fires when a developer has opted into the noise; suppressing it to once-per-process via a static bool is over-engineered. Replaced with a plain `LLVM_DEBUG(...)` that prints once on each load attempt under `-debug-only=`, which is exactly what every other LLVM_DEBUG advisory does. 5. (Code quality) Replaced `#if defined(__GLIBC__) && !defined(_WIN32)` with the single-condition `#if defined(__GLIBC__)` -- glibc is never defined on Windows, so the second clause was dead. Comment updated to make the implicit assertion explicit. Verification: 115/115 unit tests still pass on the default build, 111/111 on fat-lib; the cleanliness lit suite passes on both. The B1 fix exercised by `check_no_rocm_neededs.sh /nonexistent_lib /nonexistent_tools` (now correctly exits 1 with a diagnostic). The B2 fix exercised by a tabulated regex test covering 6 false positives and 10 true positives. The B3 fix exercised by injecting `/opt/rocm/lib/libamdhip64.so.7` as a fake `libmlir_rocm_runtime .so.99` (now reports each non-mgpu symbol on its own indented line without the empty-spacer artefact). Made-with: Cursor

Cleanups identified by a fresh code review pass against the LLVM Coding Standards. No behavioural change. - `RocmRuntimeLoader.h`: drop dead `<cstddef>` include (no `size_t` or `ptrdiff_t` referenced). - `RocmRuntimeLoader.cpp`: replace the file-scope `using namespace` pair (`mlir`, `mlir::rocm_loader`) with a single `namespace mlir::rocm_loader { ... }` block surrounding both the anonymous-namespace helpers and the public API. This is the convention used elsewhere in `mlir/lib/ExecutionEngine/` (`RocmSystemDetect.cpp`, `JitRunner.cpp`, ...) and avoids the second using-directive that no other TU in the directory uses. - `RocmRuntimeLoader.cpp`: drop the `auto *fn = reinterpret_cast< GetHandleFn>(...)` form for the `RTLD_DEFAULT` lookup. `auto *` with a function pointer compiles correctly but is unusual -- readers expect `auto *` to mean "pointer to object". Replaced with plain `auto fn = ...` followed by an explicit `fn ? fn() : nullptr` ternary, which keeps the call-side null check obvious. - `RocmSystemDetect.cpp`: drop `#include "llvm/Support/Error.h"` (no `Expected`/`Error`/`cantFail`/etc. used) and the dead `#define DEBUG_TYPE "execution-engine-rocm-system-detect"` (no `LLVM_DEBUG()` calls in this TU). - `external/llvm-project/mlir/lib/ExecutionEngine/CMakeLists.txt`: correct the misleading "Static library so it adds no transitive NEEDED entries" comment on `MLIRRocmRuntimeLoader`. With `BUILD_SHARED_LIBS=ON`, `add_mlir_library(... STATIC ...)` produces a `.so`; the `STATIC` keyword is honored only in fat- library configurations. Reworded to explain that either form is safe because the only transitive dependency is `LLVMSupport`. - Same CMake file: update two version-coupled comments (`libLLVM.so.23` -> `libLLVM.so.<MAJOR>`) so a future LLVM version bump does not leave stale text in the file. Also fixed a `LoadLibraryA` -> `LoadLibraryW` doc typo (the implementation has always called `LoadLibraryW`). Made-with: Cursor

Cleanups identified by a fresh code review pass. No behavioural change to the build; the lit scripts gain a new diagnostic but pass the same tests. CMake comments - Root `CMakeLists.txt`, `mlir/lib/Dialect/Rock/IR/CMakeLists.txt`: update version-coupled doc strings (`libLLVM.so.23` -> `libLLVM.so.<MAJOR>`, `libLLVMSupport.so.*` -> `libLLVMSupport.so.<MAJOR>git`) so a future LLVM bump does not leave stale references in comments. - `mlir/test/lit.site.cfg.py.in`: the comment claimed `_resolve_rocmlir_shlib_dir` probes "the two well-known candidate directories" but the implementation tries four (`<mlir_obj>/lib`, `<mlir_obj>/lib64`, `<mlir_obj>/external/llvm-project/lib`, `<llvm_obj>/lib`). Updated. - `mlir/test/Dialect/Rock/Loader/no_rocm_neededs.test`: the doc referred to `libMLIRRocmRuntimeLoader.so` as "our static delay- load helper" -- only true in fat-lib mode (in the default shared build it is a `.so`). Reworded to explain that either form is safe because the loader's transitive closure is just LLVMSupport. Also corrected "pure shell + grep" -> "awk + grep" and tightened the false-positive-rejection note. Lit scripts - `check_no_rocm_neededs.sh`, `check_dynsym_only_mgpu.sh`: split the previous `readelf | awk | grep` (resp. `nm | awk`) pipelines so that a `readelf` / `nm` failure no longer feeds empty input through the rest of the pipe and gets silently reported as "all clean". Each tool is now invoked with its exit status captured; a failure emits an explicit `FAIL: readelf -d <art> failed` (or the `nm` equivalent) and the script exits non-zero. Verified by feeding a non-ELF text file as a synthetic artefact: the script now correctly reports `FAIL: readelf -d ... failed` and exits 1, instead of the previous behaviour of silently succeeding. Made-with: Cursor

Two latent bugs surfaced by a deeper review pass. 1. (Real) The artefact globs in `check_no_rocm_neededs.sh` and `check_dynsym_only_mgpu.sh` matched `libfoo.so.*`, which happily catches debug-info companions like `libfoo.so.<version>.dwo` (split-DWARF, enabled by `LLVM_USE_SPLIT_DWARF=ON`), `libfoo.so.<version>.debug` (separate debug info) and `libfoo.so.<version>.dbg`. Feeding such a file to `readelf -d` / `nm -D` produces unhelpful output, and a `.dwo` is a stripped DWARF blob with no NEEDED entries -- it would silently report "all clean" while skipping the real shared object. Tightened the glob to `libfoo.so.[0-9]*` (the version suffix must start with a digit, which excludes `.dwo`/`.debug`/`.dbg`) and added a defensive `case "${cand}" in *.dwo|*.debug|*.dbg) continue ;; esac` for any unusual decoration that slips through. The brackets must stay outside double-quotes for bash to treat them as a character class -- documented in a comment so the next reader does not "fix" the quoting and re-introduce the bug. Verified by creating a synthetic build directory containing `libmlir_rocm_runtime.so.23.0git`, `.so.23.0git.dwo`, and `.so.23.0git.debug`: both scripts now correctly select the real `.so.23.0git` and report 25 mgpu* symbols / artefact-clean. 2. (Portability) `_resolve_rocmlir_shlib_dir` in `lit.site.cfg.py.in` parsed only the `-Wl,-rpath -Wl,/path` two-argument spelling of rpath-on-link, which is what GCC/clang emit on Linux when CMake materialises `INSTALL_RPATH`. macOS and some clang invocations emit the comma-form `-Wl,-rpath,/path` instead, which the regex missed. The function then fell through to the four-way directory probe -- on macOS that happens to find `<mlir_obj>/lib` so the test still works, but the implicit fallback is fragile. Added a second pattern for the comma-form. The first match in either form wins. Verified against four host-ldflags shapes: Linux space-form, macOS comma-form, mixed (both present), and no-rpath. Made-with: Cursor

Two correctness bugs in the `rock.arch = "native[:N]"` path, both caught by a second-pair-of-eyes review of this branch. Bug 1 (HIGH, multi-GPU correctness): `nativeArchInfo` cached queried `AmdArchInfo` by `gcnArchName`. On a same-arch multi-GPU system (e.g. two `gfx942` cards), the second device's call would hit the cache entry populated by the FIRST device and silently return its CU count, XCC count, and per-CU shared-memory -- masking real per-device variance from binning, fused-die configurations, etc. The fix keys the cache by `deviceId` (an `unsigned`, hence `llvm::DenseMap<unsigned, AmdArchInfo>`), and uses an explicit lock-acquire / find / drop-lock / query / re-acquire / `try_emplace` pattern so the expensive HIP/HSA query does not block other threads. Bug 4 (MEDIUM, parser): `parseArchString` (in `AmdArchDb.cpp`) and `ParamLookupTable<...>::normalizeArch` (in `ParamLookupTable.cpp`) both silently fell back to `deviceId = 0` on a malformed `native:<suffix>`. Inputs like `native:foo`, `native:1abc`, `native:`, or `native:-1` were treated as `native:0`, which on multi-GPU systems silently targeted the wrong device. The fixed parser distinguishes `native` (no colon, deviceId implicitly 0) from `native:` (separator with empty/invalid suffix), validates that the suffix parses as an unsigned integer that fits in `unsigned`, and `report_fatal_error`s with a precise diagnostic otherwise. Made-with: Cursor

The previous implementation of `loadHipSymbols()` / `loadHiprtcSymbols()` returned a partially-populated symbol table with `lib.handle = nullptr` when libamdhip64 / libhiprtc was missing on the loader path. The macros in `HipDelayLoadMacros.h` then dispatched call sites like `hipMalloc(...)` / `hiprtcCompileProgram(...)` straight through the (still null) function pointer, segfaulting at the call site instead of giving the user a controlled diagnostic. Tightened the contract to "either fully loaded or process aborts": both loaders now invoke a `[[noreturn]] abortMissing*` helper at the first sign of trouble (library missing, symbol missing). The helper writes a clear diagnostic (which library, which symbol) to stderr and `std::abort()`s. The tuning driver is fundamentally useless without HIP/HIPRTC -- every benchmark needs to launch a kernel on an AMD GPU -- so converting "silent crash later" into "loud diagnostic up front" is strictly an improvement. The header docstrings on `HipSymbols` / `HiprtcSymbols` / `getHipSymbols()` / `getHiprtcSymbols()` are updated to advertise the new fail-fast contract: `lib.handle` is non-null and every function pointer is non-null whenever these accessors return. Made-with: Cursor

`MLIRRocmExecutionEngineUtils` was unconditionally built and unconditionally referenced `${hip_INCLUDE_DIR}` -- but `find_package(hip)` (which populates `hip_INCLUDE_DIR`) was gated by `MLIR_ENABLE_ROCM_RUNNER`, which is forced to `0` when `BUILD_FAT_LIBROCKCOMPILER=ON`. Result: in fat-lib configurations, the target relied on an empty / leftover `hip_INCLUDE_DIR` and either failed to build or silently picked up an ambient include path from cached CMake state. Move the entire `MLIRRocmExecutionEngineUtils` definition (target, warning-flag wiring, include dirs, link libs) inside the `if (LLVM_ENABLE_PIC) ... if (MLIR_ENABLE_ROCM_RUNNER) ...` block where `find_package(hip)` already lives. The fat-lib build does not need this target at all -- `xmir-runner` is also gated off in that mode -- and `mlir/unittests/Dialect/Rock/CMakeLists.txt` already conditions its `RocmRuntimeLoaderTests` source list on `if (TARGET MLIRRocmExecutionEngineUtils)`, so the missing target in fat-lib is handled gracefully. Verified by reconfiguring + building both `build/` (default shared, `MLIR_ENABLE_ROCM_RUNNER=ON`) -- target present, libMLIRRocmExecutionEngineUtils.so produced -- and `build_migx/` (`BUILD_FAT_LIBROCKCOMPILER=ON`) -- target absent, build clean, `MLIRRockUnitTests` correctly omits `RocmRuntimeLoaderTests` and runs the rest. Made-with: Cursor

Pin the contracts behind the four bug fixes from the prior commits: - `mlir/test/Dialect/Rock/native_arch_invalid.mlir`: lit test that runs `rocmlir-opt -rock-affix-params` on a function with `rock.arch = "native:foo"` and asserts a fatal-error diagnostic. Uses `not --crash` (rather than plain `not`) because `report_fatal_error` aborts via SIGABRT, which `not` only accepts under `--crash`. The parser fails before any HIP query so the test runs anywhere (just `REQUIRES: linux` for the `not --crash` semantics). - `mlir/unittests/Dialect/Rock/AmdArchDbTests.cpp`: three new gtest cases. * `NativeArchParseTest.MalformedSuffixAborts` uses `EXPECT_DEATH` to check that `lookupArchInfo("native:foo")`, `"native:1abc"`, `"native:"`, and `"native:-1"` each abort with the precise diagnostic. Death tests do not need a real GPU since the parser fires before HIP is touched. * `NativeArchParseTest.BareNativeIsDeviceZero` confirms that `lookupArchInfo("native")` (no colon) is well-formed and does NOT abort. Skipped when no GPU is present so the call actually completes without aborting elsewhere. * `NativeArchCacheTest.SameArchMultiGpuDistinct` exercises the per-device cache contract on systems with two same-arch GPUs: two distinct device ids must each return a value from an independently-queried cache entry, never share device 0's copy. Skipped when fewer than 2 same-arch GPUs are visible (typical CI). Made-with: Cursor

Copilot

Pull request overview

This PR removes link-time dependencies on ROCm runtime libraries (notably libamdhip64 and transitively ROCm’s monolithic libLLVM.so) by introducing an isolated, delay-loading mechanism. This avoids the “two LLVMs in one process” static-initializer collision that crashes rocMLIR/MLIR tools at process startup on some ROCm installations.

Changes:

Add mlir::rocm_loader::loadRocmLibrary(...) + resolveRocmSymbol(...) and wire ROCm-aware components to resolve HIP/HIPRTC/HSA symbols at runtime (with link-map isolation via dlmopen on glibc).
Update build/linker settings (version script for mlir_rocm_runtime, --exclude-libs,ALL for tools, -Bsymbolic-functions for shared libs) to reduce symbol interposition/ODR collisions.
Add lit + gtest coverage to enforce “no ROCm DT_NEEDED” and “only mgpu* exports” contracts, and improve native-arch parsing/caching behavior.

Reviewed changes

Copilot reviewed 32 out of 33 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
mlir/utils/jenkins/static-checks/get_fat_library_deps_list.pl	Exclude certain libs from fat-archive dependency list generation.
mlir/unittests/Dialect/Rock/RocmRuntimeLoaderTests.cpp	New gtests for loader coordination and version-agnostic behavior.
mlir/unittests/Dialect/Rock/CMakeLists.txt	Conditionally add loader tests + link required targets.
mlir/unittests/Dialect/Rock/AmdArchDbTests.cpp	Refactor tests to avoid link-pulling HIP; add native parsing/cache tests.
mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp	Switch HIP calls to delay-load dispatch via macros.
mlir/tools/rocmlir-tuning-driver/HipDelayLoadMacros.h	Macro redirects for hip/hiprtc callsites through function tables.
mlir/tools/rocmlir-tuning-driver/HipDelayLoad.h	Delay-load HIP/HIPRTC API surface for tuning driver.
mlir/tools/rocmlir-tuning-driver/HipDelayLoad.cpp	Implementation of symbol resolution + fail-fast diagnostics.
mlir/tools/rocmlir-tuning-driver/CacheFlush.cpp	Use delay-loaded HIP/HIPRTC entry points.
mlir/tools/rocmlir-tuning-driver/CMakeLists.txt	Stop linking HIP/HIPRTC; use includes/defs + link runtime loader helper.
mlir/tools/rocmlir-lib/CMakeLists.txt	Document fat-lib runtime expectations for native arch.
mlir/test/lit.site.cfg.py.in	Add features/substitutions to support new loader/ELF auditing tests.
mlir/test/lit.cfg.py	Add `.test` suffix and exclude helper shell scripts from test discovery.
mlir/test/Dialect/Rock/native_arch_invalid.mlir	New lit test for invalid `native:*` arch handling.
mlir/test/Dialect/Rock/native_arch.mlir	Lit coverage for `rock.arch="native"` and `"native:0"` pipeline flow.
mlir/test/Dialect/Rock/Loader/no_rocm_neededs.test	Lit shell test asserting no forbidden ROCm runtime DT_NEEDED deps.
mlir/test/Dialect/Rock/Loader/dynsym_only_mgpu.test	Lit shell test asserting only `mgpu*` exports from runtime wrapper.
mlir/test/Dialect/Rock/Loader/check_no_rocm_neededs.sh	Helper script to audit DT_NEEDED entries via `readelf -d`.
mlir/test/Dialect/Rock/Loader/check_dynsym_only_mgpu.sh	Helper script to audit exports via `nm -D`.
mlir/lib/ExecutionEngine/CMakeLists.txt	Mark `conv-validation-wrappers` as partial sources intended.
mlir/lib/Dialect/Rock/Tuning/ParamLookupTable.cpp	Normalize `native[:N]` by resolving to hardware arch and caching.
mlir/lib/Dialect/Rock/IR/CMakeLists.txt	Remove HIP/HSA link deps; add include-only wiring and loader dependency.
mlir/lib/Dialect/Rock/IR/AmdArchDb.cpp	Implement HIP/HSA delay-load, stricter parsing, and deviceId-keyed cache.
mlir/include/mlir/Dialect/Rock/IR/AmdArchDb.h	Expose `nativeDeviceCount()` and `nativeArchName()` APIs.
external/llvm-project/mlir/lib/ExecutionEngine/mlir_rocm_runtime.map	New linker version script to export only `mgpu*`.
external/llvm-project/mlir/lib/ExecutionEngine/RocmSystemDetect.cpp	Delay-load HIP and export shared HIP handle for coordination.
external/llvm-project/mlir/lib/ExecutionEngine/RocmRuntimeWrappers.cpp	Delay-load HIP inside runtime wrappers; explicitly export mgpu APIs.
external/llvm-project/mlir/lib/ExecutionEngine/RocmRuntimeLoader.cpp	New isolated loader implementation (`dlmopen`/`dlopen`/`LoadLibraryW`).
external/llvm-project/mlir/lib/ExecutionEngine/CMakeLists.txt	Add `MLIRRocmRuntimeLoader`; remove link-time HIP deps; apply version script.
external/llvm-project/mlir/include/mlir/ExecutionEngine/RocmSystemDetect.h	Declare exported shared-HIP-handle accessor for coordination.
external/llvm-project/mlir/include/mlir/ExecutionEngine/RocmRuntimeLoader.h	Public API for ROCm delay-loading and symbol resolution.
cmake/llvm-project.cmake	Add `--exclude-libs,ALL` to tools to avoid re-exporting LLVM symbols.
CMakeLists.txt	Apply `-Bsymbolic-functions` globally for shared/module libs (non-Apple/non-Win).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-24T13:35:34Z

+my %excludedLibs = map { $_ => 1 } qw(
+  conv-validation-wrappers
+);
+
 my @rocmlirLibs;
 my @mlirLibs;

 my @deps = split /\n/,`ninja -t query lib/libMLIRRockThin.a`;
 not $? or die "failed to get target dependencies";

 foreach (@deps) {
  last if /outputs:/;
  if (m#external/llvm-project/llvm/lib/lib(\w+)\.a#) {
-    push @mlirLibs, $1;
+    push @mlirLibs, $1 unless $excludedLibs{$1};
  } elsif (m#lib/lib(\w+)\.a#) {
-    push @rocmlirLibs, $1;
+    push @rocmlirLibs, $1 unless $excludedLibs{$1};
  }


%excludedLibs won’t actually exclude conv-validation-wrappers here because the regexes only capture lib(\w+)\.a (\w doesn’t match -). As a result, hyphenated library names are skipped entirely and the exclusion list is misleading/ineffective. Consider widening the capture to include - (e.g. [\w-]+) so the dependency is parsed and then filtered intentionally (or adjust the parsing to match Ninja’s actual output format).

Copilot · 2026-04-24T13:35:34Z

+void appendCandidates(std::vector<std::string> &out, llvm::StringRef bare,
+                      Joiner joiner) {
+  out.emplace_back(bare.str());
+  for (unsigned m = kMaxProbedRocmMajor; m >= 1; --m)


appendCandidates uses for (unsigned m = kMaxProbedRocmMajor; m >= 1; --m), which never terminates for an unsigned loop variable (it wraps from 0 to UINT_MAX and m >= 1 remains true). This will hang when building the candidate list. Use a termination-safe form like for (unsigned m = kMaxProbedRocmMajor; m > 0; --m) or a signed counter.

Suggested change

for (unsigned m = kMaxProbedRocmMajor; m >= 1; --m)

for (unsigned m = kMaxProbedRocmMajor; m > 0; --m)

Copilot · 2026-04-24T13:35:35Z

+  if (arch.consume_front("native")) {
+    static std::mutex m;
+    static llvm::StringMap<std::string> resolvedCache;
+    std::lock_guard<std::mutex> lock(m);
+    // Re-form the original key for the cache lookup.
+    std::string cacheKey = ("native" + arch).str();
+    auto [it, inserted] = resolvedCache.try_emplace(cacheKey);
+    if (inserted) {
+      unsigned deviceId = 0;
+      if (arch.consume_front(":")) {
+        // Parse must succeed and fit in `unsigned`. Silently treating
+        // `native:foo` or `native:` as `native:0` would mask user error
+        // and target the wrong GPU on multi-GPU systems.
+        unsigned long long parsed = 0;
+        if (arch.empty() || llvm::getAsUnsignedInteger(arch, 0, parsed) ||
+            parsed > std::numeric_limits<unsigned>::max())
+          llvm::report_fatal_error(
+              Twine("Invalid `") + cacheKey +
+              "`: the suffix after `native:` must be a non-negative integer "
+              "device id (got `" +
+              arch + "`).");


normalizeArch treats any string with prefix "native" as the special native-arch syntax because it uses arch.consume_front("native"). That means inputs like "nativefoo" would be interpreted as native (device 0) instead of being rejected, which is inconsistent with parseArchString in AmdArchDb and could mask user typos. Consider checking arch == "native" or arch.starts_with("native:") (and leaving other strings to the regular gfx... parsing).

Suggested change

if (arch.consume_front("native")) {

static std::mutex m;

static llvm::StringMap<std::string> resolvedCache;

std::lock_guard<std::mutex> lock(m);

// Re-form the original key for the cache lookup.

std::string cacheKey = ("native" + arch).str();

auto [it, inserted] = resolvedCache.try_emplace(cacheKey);

if (inserted) {

unsigned deviceId = 0;

if (arch.consume_front(":")) {

// Parse must succeed and fit in `unsigned`. Silently treating

// `native:foo` or `native:` as `native:0` would mask user error

// and target the wrong GPU on multi-GPU systems.

unsigned long long parsed = 0;

if (arch.empty() || llvm::getAsUnsignedInteger(arch, 0, parsed) ||

parsed > std::numeric_limits<unsigned>::max())

llvm::report_fatal_error(

Twine("Invalid `") + cacheKey +

"`: the suffix after `native:` must be a non-negative integer "

"device id (got `" +

arch + "`).");

if (arch == "native" || arch.starts_with("native:")) {

StringRef nativeSuffix = arch;

[[maybe_unused]] bool consumed = nativeSuffix.consume_front("native");

assert(consumed && "native arch prefix must be present");

static std::mutex m;

static llvm::StringMap<std::string> resolvedCache;

std::lock_guard<std::mutex> lock(m);

// Re-form the original key for the cache lookup.

std::string cacheKey = ("native" + nativeSuffix).str();

auto [it, inserted] = resolvedCache.try_emplace(cacheKey);

if (inserted) {

unsigned deviceId = 0;

if (nativeSuffix.consume_front(":")) {

// Parse must succeed and fit in `unsigned`. Silently treating

// `native:foo` or `native:` as `native:0` would mask user error

// and target the wrong GPU on multi-GPU systems.

unsigned long long parsed = 0;

if (nativeSuffix.empty() ||

llvm::getAsUnsignedInteger(nativeSuffix, 0, parsed) ||

parsed > std::numeric_limits<unsigned>::max())

llvm::report_fatal_error(

Twine("Invalid `") + cacheKey +

"`: the suffix after `native:` must be a non-negative integer "

"device id (got `" +

nativeSuffix + "`).");

Copilot · 2026-04-24T13:35:35Z

+# Apply to both the object library (the actual compile target) and the
+# aggregate shared library, for parity with how upstream ExecutionEngine
+# wires include dirs for its HIP-touching TUs.
+foreach(_t MLIRRockOps obj.MLIRRockOps)
+  if (TARGET ${_t})
+    target_include_directories(${_t} SYSTEM PRIVATE ${hip_INCLUDE_DIRS})
+  endif()
+endforeach()
+unset(_t)


This block uses ${hip_INCLUDE_DIRS}, but elsewhere in the repo (e.g. external/llvm-project/mlir/lib/ExecutionEngine/CMakeLists.txt) the HIP package is treated as providing ${hip_INCLUDE_DIR}. If ${hip_INCLUDE_DIRS} is unset for the ROCm HIP CMake package on some installations, this will fail to add the include path and break the build. Consider switching to ${hip_INCLUDE_DIR} (or guarding/falling back to whichever variable is defined).

Suggested change

# Apply to both the object library (the actual compile target) and the

# aggregate shared library, for parity with how upstream ExecutionEngine

# wires include dirs for its HIP-touching TUs.

foreach(_t MLIRRockOps obj.MLIRRockOps)

if (TARGET ${_t})

target_include_directories(${_t} SYSTEM PRIVATE ${hip_INCLUDE_DIRS})

endif()

endforeach()

unset(_t)

if (hip_INCLUDE_DIRS)

set(_hip_include_dirs ${hip_INCLUDE_DIRS})

elseif(hip_INCLUDE_DIR)

set(_hip_include_dirs ${hip_INCLUDE_DIR})

else()

message(FATAL_ERROR

"HIP CMake package was found, but neither `hip_INCLUDE_DIRS` nor "

"`hip_INCLUDE_DIR` is defined.")

endif()

# Apply to both the object library (the actual compile target) and the

# aggregate shared library, for parity with how upstream ExecutionEngine

# wires include dirs for its HIP-touching TUs.

foreach(_t MLIRRockOps obj.MLIRRockOps)

if (TARGET ${_t})

target_include_directories(${_t} SYSTEM PRIVATE ${_hip_include_dirs})

endif()

endforeach()

unset(_t)

unset(_hip_include_dirs)

umangyadav added 30 commits April 23, 2026 17:06

umangyadav self-assigned this Apr 24, 2026

umangyadav changed the title ~~Dlopen hip~~ Lazy load HIP library to avoid LLVM versions incompatibilities Apr 24, 2026

umangyadav requested a review from Copilot April 24, 2026 13:27

Copilot started reviewing on behalf of umangyadav April 24, 2026 13:30 View session

Copilot AI reviewed Apr 24, 2026

View reviewed changes

Base automatically changed from upstream-merge-jan-26 to develop April 28, 2026 13:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lazy load HIP library to avoid LLVM versions incompatibilities#2357

Lazy load HIP library to avoid LLVM versions incompatibilities#2357
umangyadav wants to merge 31 commits intodevelopfrom
dlopenHip

umangyadav commented Apr 24, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Copilot AI Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	for (unsigned m = kMaxProbedRocmMajor; m >= 1; --m)
	for (unsigned m = kMaxProbedRocmMajor; m > 0; --m)

Conversation

umangyadav commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Technical Details

Architecture

Cross-library coordination

Defense-in-depth

Version compatibility

Other improvements in this PR

Notable touched files

Test Plan

New regression tests (added in this PR)

Manual / verification steps

Build configurations exercised

Test Result

Submission Checklist

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

umangyadav commented Apr 24, 2026 •

edited

Loading