Skip to content

Lazy load HIP library to avoid LLVM versions incompatibilities#2357

Draft
umangyadav wants to merge 31 commits intodevelopfrom
dlopenHip
Draft

Lazy load HIP library to avoid LLVM versions incompatibilities#2357
umangyadav wants to merge 31 commits intodevelopfrom
dlopenHip

Conversation

@umangyadav
Copy link
Copy Markdown
Member

@umangyadav umangyadav commented Apr 24, 2026

Motivation

Running any of rocMLIR's compiler tools (rocmlir-driver, rocmlir-opt, xmir-runner, rocmlir-tuning-driver, MLIR's mlir-runner JIT) against a system-installed with TheRock 7.12 crashes at process startup with one of:

LLVM ERROR: Option 'default' already exists!

or, in release builds:

SmallPtrSet.h:301: Assertion 'Bucket < End' failed.

before main() is even reached.

The root cause is a static-initializer collision between two LLVMs in the same address space:

  1. rocMLIR's tools embed the in-tree libLLVMSupport.so.<MAJOR>git, etc.
  2. Linking libamdhip64 transitively pulls in libamd_comgr, which in turn pulls in ROCm's monolithic libLLVM.so.<MAJOR>.

Both LLVMs run their cl::opt static initializers; the dynamic linker unifies the cl::opt global registry across the split-vs-monolithic libraries; the second registration trips the duplicate-option assertion. The same problem also manifests in mlir-runner / xmir-runner JIT paths that dlopen mlir_rocm_runtime.so (which itself linked libamdhip64).

This PR removes every link-time dependency on libamdhip64 / libhsa-runtime64 / libhiprtc from rocMLIR-side libraries and from upstream MLIR's mlir_rocm_runtime and MLIRRocmExecutionEngineUtils, replacing them with a delay-load helper that opens the ROCm runtime in a private link-map namespace via dlmopen(LM_ID_NEWLM, ...) on glibc (and the equivalent on other platforms). The two LLVMs no longer share a symbol scope, the cl::opt registry stays single-instance, and the tools start cleanly against any ROCm install on LD_LIBRARY_PATH.

The end-to-end command that reproduced the original crash (./bin/rocmlir-driver --kernel-pipeline=migraphx,highlevel --arch gfx942 ../test.mlir) now runs without aborting.

Technical Details

The branch is 33 commits, 15 of them tagged [EXTERNAL] for upstream LLVM submission. Net diff: +2618 / −561 lines across 46 files.

Architecture

A new public API in mlir::rocm_loader::loadRocmLibrary(...) (header mlir/ExecutionEngine/RocmRuntimeLoader.h, implementation in lib/ExecutionEngine/RocmRuntimeLoader.cpp, built as a small MLIRRocmRuntimeLoader library) opens HIP / HIPRTC / HSA without ever bringing them into the host's symbol scope:

  • glibc: dlmopen(LM_ID_NEWLM, ..., RTLD_LAZY) — fresh link-map namespace; ROCm's libLLVM cannot interpose ours.
  • Other POSIX (musl, ...): dlopen(..., RTLD_LAZY | RTLD_LOCAL) with a one-time LLVM_DEBUG advisory that isolation is incomplete and the host must hide its own LLVM exports.
  • Windows: LoadLibraryW (DLLs have private scopes per-DLL natively); SONAME is converted UTF-8 → UTF-16 to match llvm/lib/Support/Windows/DynamicLibrary.inc.

Cross-library coordination

KFD (the AMD kernel-fusion-driver) only allows one user-space HSA session per process. Multiple consumers (AmdArchDb for rock.arch="native", mlir_rocm_runtime's JIT wrappers, rocmlir-tuning-driver's benchmark loop) must therefore share one HIP handle. RocmSystemDetect.cpp is the canonical owner and exports extern "C" LLVM_ALWAYS_EXPORT void *mlirRocmSystemDetectGetHipHandle(). Other consumers' loadRocmLibrary(Hip, Auto) looks this symbol up via RTLD_DEFAULT and reuses the returned handle. HIPRTC and HSA load into HIP's link-map namespace via the loader's relatedHandle parameter, so all three share the same KFD session.

Defense-in-depth

  • libmlir_rocm_runtime.so gets a version script (mlir_rocm_runtime.map) that exports only the mgpu* C ABI entry points. nm -D confirms 25 mgpu* symbols, nothing else.
  • All rocMLIR-produced shared libraries get -Wl,-Bsymbolic-functions so intra-library cl::* calls bind to in-library definitions and cannot be interposed by a later-loaded libLLVM.
  • All rocMLIR tools (rocmlir-driver, rocmlir-opt, ...) get LINKER:--exclude-libs,ALL so they do not re-export LLVM statics from libLLVMSupport.a.

Version compatibility

The loader is intentionally version-agnostic: it tries the unversioned SONAME first (libamdhip64.so / amdhip64.dll, the standard find_package(hip) resolution), then iterates libamdhip64.so.<MAJOR> for descending MAJOR up to a generous kMaxProbedRocmMajor = 99. There is no compile-time floor or ceiling on the ROCm version this code supports — ROCm 4.x through any future release works with no source changes.

Other improvements in this PR

While auditing the affected paths, the branch also fixes a handful of pre-existing latent bugs and missing test coverage:

  • AmdArchDb::nativeArchInfo cache key was gcnArchName, which would silently return device 0's CU/XCC/per-CU-shared-mem data for every later device on a same-arch multi-GPU system. Changed to deviceId (llvm::DenseMap<unsigned, AmdArchInfo>).
  • parseArchString / ParamLookupTable::normalizeArch silently treated native:foo, native:1abc, native: as native:0. Now report_fatal_errors with a precise diagnostic.
  • rocmlir-tuning-driver's HipDelayLoad previously returned a partially-initialized symbol table on missing libamdhip64 / libhiprtc; the macros in HipDelayLoadMacros.h then dispatched through null function pointers and segfaulted. Tightened the contract to "fully loaded or process aborts" via [[noreturn]] abortMissing* helpers.
  • MLIRRocmExecutionEngineUtils CMake gating consumed ${hip_INCLUDE_DIR} outside the if(MLIR_ENABLE_ROCM_RUNNER) gate that ran find_package(hip). In BUILD_FAT_LIBROCKCOMPILER=ON builds (which force the runner off) this would silently fall through to ambient state. Moved the entire target definition inside the gate.

Notable touched files

Area Files
Loader (upstream LLVM) external/llvm-project/mlir/include/mlir/ExecutionEngine/RocmRuntimeLoader.h, lib/ExecutionEngine/RocmRuntimeLoader.cpp, RocmSystemDetect.{h,cpp}, RocmRuntimeWrappers.cpp, mlir_rocm_runtime.map, lib/ExecutionEngine/CMakeLists.txt
rocMLIR consumers mlir/lib/Dialect/Rock/IR/AmdArchDb.cpp, mlir/lib/Dialect/Rock/IR/CMakeLists.txt, mlir/lib/Dialect/Rock/Tuning/ParamLookupTable.cpp, mlir/tools/rocmlir-tuning-driver/{HipDelayLoad.{h,cpp},HipDelayLoadMacros.h,CMakeLists.txt,CacheFlush.cpp,rocmlir-tuning-driver.cpp}, mlir/tools/rocmlir-lib/CMakeLists.txt
Build infrastructure CMakeLists.txt (root), cmake/llvm-project.cmake, mlir/utils/jenkins/static-checks/get_fat_library_deps_list.pl
Tests mlir/unittests/Dialect/Rock/{AmdArchDbTests.cpp,RocmRuntimeLoaderTests.cpp,CMakeLists.txt}, mlir/test/Dialect/Rock/{native_arch.mlir,native_arch_invalid.mlir,Loader/*}, mlir/test/{lit.cfg.py,lit.site.cfg.py.in}

Test Plan

New regression tests (added in this PR)

Test Type Pins
Dialect/Rock/Loader/no_rocm_neededs.test lit / shell No shipped binary or ROCm-aware shared library has libamdhip64, libhiprtc, libamd_comgr, or ROCm's monolithic libLLVM.so in its DT_NEEDED set. Audits 8 artefacts (default build) / 4 (fat-lib).
Dialect/Rock/Loader/dynsym_only_mgpu.test lit / shell libmlir_rocm_runtime.so exports nothing but mgpu* entry points (the version-script contract).
Dialect/Rock/native_arch.mlir (extended) lit rock.arch = "native" and "native:0" both flow through the rock pipeline and produce the same lowered IR as a kernel pinned to the concrete arch.
Dialect/Rock/native_arch_invalid.mlir (new) lit rock.arch = "native:foo" aborts with the precise diagnostic via not --crash.
RocmRuntimeLoaderTests (4 cases) gtest Auto policy reuses RocmSystemDetect's HIP handle (proves the RTLD_DEFAULT cross-library coordination); Owned policy returns a usable handle; resolveRocmSymbol is null-safe; loader resolves HIP without compile-time version knowledge.
NativeArchParseTest (2 cases) gtest / death tests Four malformed native:* inputs each abort with the precise diagnostic; bare native is well-formed.
NativeArchCacheTest.SameArchMultiGpuDistinct gtest Per-device cache contract on multi-GPU systems (skipped when fewer than 2 same-arch GPUs visible).

Manual / verification steps

  • The original failing command runs end-to-end: bin/rocmlir-driver --kernel-pipeline=migraphx,highlevel --arch gfx942 ../test.mlir.
  • bin/rocmlir-gen -p -t f32 --arch gfx942 -mfma=on | bin/rocmlir-driver --kernel-pipeline=full --arch gfx942 produces a valid gpu.binary on both build configurations.
  • nm -D libmlir_rocm_runtime.so | grep -v '^mgpu' returns nothing.
  • readelf -d on every shipped binary / shared library confirms no libamdhip64, libhiprtc, libamd_comgr, or ROCm's libLLVM.so in NEEDED.
  • Synthesized "future LLVM 99 + ROCm 99" tree (symlinks pointing the build's artefacts at fake-version filenames) confirms the version-agnostic globs and SONAME fallback still work.
  • Failure-injection: a non-ELF text file passed as a synthetic artefact correctly trips the readelf -d ... failed / nm -D ... failed diagnostics.

Build configurations exercised

  • Default shared build: cmake -G Ninja .. -DCMAKE_BUILD_TYPE=RelWithDebInfo (default BUILD_SHARED_LIBS=ON, MLIR_ENABLE_ROCM_RUNNER=ON).
  • Fat-library build: cmake -G Ninja .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DBUILD_FAT_LIBROCKCOMPILER=On (forces BUILD_SHARED_LIBS=OFF, MLIR_ENABLE_ROCM_RUNNER=0). Used by the AMDMIGraphX integration.

Test Result

Build Suite Result
Default (shared) ninja check-rocmlir (1559 tests) 1347 passed, 0 failed (207 unsupported, 5 expectedly failed)
Default (shared) MLIRRockUnitTests (118 cases) 118 / 118 pass
Fat-lib ninja check-rocmlir (933 tests) 478 passed, 0 failed (452 unsupported, 3 expectedly failed)
Fat-lib MLIRRockUnitTests (114 cases) 114 / 114 pass
nm/readelf cleanliness audit (default) 8 artefacts All clean — no forbidden NEEDED, only mgpu* exported
nm/readelf cleanliness audit (fat-lib) 4 artefacts All clean
clang-format on touched files clean
ReadLints on touched files no errors

Submission Checklist

`MLIRRockOps` linked `hip::host`/`hip::amdhip64` (and `hsa-runtime64`)
directly so it could query GPU properties for `--arch native:N`. On
recent ROCm, `libamdhip64` transitively depends on `libamd_comgr.so`,
which in turn depends on ROCm's `libLLVM.so.23`. Any binary that embeds
rocMLIR's own LLVM (e.g. `rocmlir-driver`, `rocmlir-opt`) therefore
ended up with two LLVMs in the same address space. Their `cl::opt`
static initializers fight over the global option registry and the
process aborts with `LLVM ERROR: Option '...' already exists!`.

Confine HIP/HSA to a tiny standalone shared library
(`mlir_rocm_arch_runtime`) and load it lazily from `AmdArchDb.cpp` via
`dlmopen(LM_ID_NEWLM, ...)` on glibc / `LoadLibraryW` on Windows. The
fresh dynamic-linker namespace keeps ROCm's `libLLVM.so` from snapping
onto rocMLIR's embedded LLVM, so duplicate `cl::opt` registration can
no longer occur. Tools that never request `--arch native:N` never
load the runtime and stay completely free of HIP/comgr/LLVM-23.

Highlights:
- New `mlir/lib/ExecutionEngine/RocmArchRuntime.{h,cpp}`: stable C ABI
  (`mlirRocmArchRuntimeAbiVersion`, `mlirRocmArchRuntimeDeviceCount`,
  `mlirRocmArchRuntimeGetProperties`) implemented on top of HIP+HSA.
  Built with plain `add_library(... SHARED)` so it does NOT pull in
  `LLVMSupport`.
- `MLIRRockOps` drops all HIP/HSA link-time deps. New runtime loader
  in `AmdArchDb.cpp` searches `<exe-dir>`, `<exe-dir>/../lib/`,
  `ROCMLIR_BUILD_RUNTIME_DIR` (compile-time path for nested test
  binaries), then the system loader path.
- Public helpers `nativeDeviceCount()` and `nativeArchName()` added so
  callers (unit tests, Python bindings) can use the runtime without
  including HIP headers.
- `ParamLookupTable::normalizeArch` learns to canonicalise `native[:N]`
  by routing through the runtime, so `rock.arch = "native"` flows
  through the full backend pipeline end-to-end.
- `mlir/lib/CMakeLists.txt` wires `add_dependencies(MLIRRockOps
  mlir_rocm_arch_runtime)` after both subdirectories are processed,
  and exposes `ROCMLIR_BUILD_RUNTIME_DIR` as a compile def pointing at
  `LIBRARY_OUTPUT_DIRECTORY` (POSIX `.so`) or `RUNTIME_OUTPUT_DIRECTORY`
  (Windows `.dll`).
- Fat-build (`BUILD_FAT_LIBROCKCOMPILER=ON`): the runtime stays as a
  separate shared library (intentionally NOT bundled into
  `librockCompiler.a`) and is excluded from
  `librockcompiler_deps.cmake` via an explicit allowlist filter in
  `get_fat_library_deps_list.pl`.

Tests:
- `MLIRRockUnitTests::AmdArchDbTests` rewritten to use the public
  AmdArchDb API; no longer includes `hip/hip_runtime_api.h` and is
  built unconditionally on all platforms.
- New `mlir/test/Dialect/Rock/native_arch.mlir` lit test exercising
  `rock.arch = "native"` through `-rock-affix-params`. Gated on a new
  `amd-gpu-present` lit feature so it skips on no-GPU CI machines.

Validated:
- Original repro `rocmlir-driver --kernel-pipeline=migraphx,highlevel
  --arch gfx942` produces output identical to baseline.
- `rock.arch = "native"` end-to-end through `--kernel-pipeline=full`
  emits a `gpu.binary` with `chip = "gfx942"`.
- `ldd` of `MLIRRockOps.so`, `rocmlir-driver`, `rocmlir-opt`, and
  `MLIRRockUnitTests` is clean: no `libamdhip64`, no `libamd_comgr`,
  no `libLLVM.so.23`, no `libclang-cpp.so.23`.
- Unit tests: 111/111 pass; 8 NativeArch tests run against real MI300X
  hardware and match presets.
- Rock lit suite (excl. `integration/`): 102/102 pass.
- Fat build configured with locally-installed rocm-cmake builds the
  static `librockCompiler.a` plus the separate `mlir_rocm_arch_runtime.so`
  and reproduces both the original `gfx942` case and the
  `rock.arch = "native"` case.

Made-with: Cursor
When a process embeds rocMLIR's LLVM and also loads libmlir_rocm_runtime.so
(e.g. mlir-runner --shared-libs=libmlir_rocm_runtime.so), the runtime
wrapper brings ROCm's libamdhip64 into the address space. libamdhip64
transitively pulls /opt/rocm/lib/llvm/lib/libLLVM.so.23 via libamd_comgr,
and the dynamic linker unifies llvm::cl::SubCommand::getTopLevel() and the
rest of the cl::opt global registry across both LLVM instances. Static
initializers in ROCm's LLVM then race with rocMLIR's LLVM and the process
aborts with "LLVM ERROR: Option '...' already exists!".

Hide everything that is not mgpu* from libmlir_rocm_runtime.so's dynsym:

  * CXX_VISIBILITY_PRESET=hidden and VISIBILITY_INLINES_HIDDEN=ON so
    non-annotated symbols default to STV_HIDDEN at compile time.
  * -Wl,--exclude-libs,ALL so symbols pulled in from any static archive
    (hip device runtime, clang-rt, libcxx internals) are made STB_LOCAL
    at link time even if they were compiled with default visibility.
  * -Wl,--version-script=mlir_rocm_runtime.map to assert the contract
    at link time: only mgpu* may be exported. Anything else fails the
    link rather than silently re-exposing LLVM helpers.

Windows is already default-hidden for DLL symbols, so the guard is a
no-op there. Apple's ld does not support GNU version scripts, so it is
also skipped.

Made-with: Cursor
Now that libmlir_rocm_runtime.so hides its LLVM symbols at link time
(preceding commit), the `cl::opt` collision that motivated a separate,
dlmopen-isolated shim library no longer manifests. Fold the HIP/HSA
delay-loader back into MLIRRockOps itself using plain dlopen (POSIX) /
LoadLibraryW (Windows).

Why this is safe without dlmopen:
  * MLIRRockOps does not link libamdhip64 / libhsa-runtime64; the shared
    objects are resolved at run time by SONAME only, so the dynamic linker
    only searches for them when `rock.arch = "native[:N]"` is asked for.
  * When HIP is loaded this way, RTLD_LOCAL keeps its transitive deps
    (including ROCm's libLLVM.so via libamd_comgr) in HIP's own scope:
    they are not re-exposed to us, and they will not satisfy undefined
    references from rocMLIR's embedded LLVM.
  * rocMLIR binaries embedding LLVM do not re-export cl:: state through
    their dynsym either (the LLVM libraries are built with default
    -fvisibility-inlines-hidden), so even when HIP's libLLVM.so mmap's
    into the process, its static initializers query the rocMLIR-side
    cl::opt globals through HIP-local symbol resolution and never see
    rocMLIR's own registry. The two LLVMs stay disjoint.

Why in-process delay-load is better than the shim:
  * No separate install artifact for downstream consumers (MIOpen,
    MIGraphX) to ship; a ROCm install visible on LD_LIBRARY_PATH / RPATH
    is sufficient.
  * No ABI boundary to version (the shim had MLIR_ROCM_ARCH_RUNTIME_ABI_VERSION
    plus a POD properties struct); fewer moving parts.
  * No dlmopen, which is a glibc-specific extension with known gotchas
    (TLS, static destructors, fork()) and no counterpart on musl.

Details:
  * AmdArchDb.cpp: two tiny lazy loaders (HipRuntime, HsaRuntime) resolve
    the few symbols we need (hipGetDeviceCount,
    hipGetDevicePropertiesR0600 with a fallback to hipGetDeviceProperties,
    hsa_init, hsa_iterate_agents, hsa_agent_get_info). SONAME candidates
    cover ROCm 6 and 7 and unversioned installs. The HIP and HSA headers
    are included only for POD layouts; they are not link-time deps.
  * CMakeLists.txt for MLIRRockOps now does `find_package(hip QUIET)` and
    only picks up include dirs; no hip::host / hip::amdhip64 link. HSA
    headers are likewise include-only.
  * Navi WGP-as-CU correction for shared-memory per CU runs before
    checkAndSetInfo() so the HSA-assisted value flows through the same
    code path as before.
  * The mlir_rocm_arch_runtime target, RocmArchRuntime.{h,cpp},
    ROCMLIR_BUILD_RUNTIME_DIR compile def, and the Perl allowlist for
    the fat-lib Perl filter are all removed.
  * Unit test and lit test comments are updated to match the new loader
    path. Their behaviour is unchanged.

Verified: readelf -d build/lib/libMLIRRockOps.so.2.0 shows no NEEDED
entries for libamdhip64, libhsa-runtime64, libamd_comgr, or
libLLVM.so.*. The full chain will be re-validated in the next commit.

Made-with: Cursor
Plain `dlopen(RTLD_LAZY | RTLD_LOCAL)` is insufficient to break the
`cl::opt` collision against ROCm's libLLVM.so.23 in practice. Even
though RTLD_LOCAL stops us from re-exporting symbols from the dlopened
module, it does not hide rocMLIR's own process-global LLVM symbols from
the dynamic linker when libLLVM.so.23 runs its static initializers.

On glibc, libLLVM.so.23 is pulled in transitively by libamd_comgr, and
its static-init cl::opt registrations resolve against rocMLIR's
unversioned cl::* symbols, which registers the same option twice in
rocMLIR's registry. `rocmlir-opt -rock-affix-params` on
mlir/test/Dialect/Rock/native_arch.mlir reproduces this as:

  LLVM ERROR: Option 'default' already exists!
  ...
  libLLVMCodeGen.so.22.0git  llvm::report_fatal_error(...)
  libLLVM.so.23.0git         <static init>
  ld-linux-x86-64.so.2       _dl_catch_exception

dlmopen(LM_ID_NEWLM, ...) places libamdhip64 and its transitive deps
into a fresh link-map namespace where rocMLIR's symbols are not
visible, so libLLVM.so.23's static init can only resolve its cl::*
references against its own copies. The two LLVM instances stay
completely disjoint.

The non-glibc POSIX fallback keeps plain dlopen + RTLD_LOCAL. That
branch only works if the process also hides its LLVM surface via
build-side discipline (`-Wl,--exclude-libs,ALL`); see the follow-up
commits that apply that to rocMLIR executables. Windows is unchanged:
LoadLibraryW already gives a per-DLL private scope.

Verified:
  * mlir/test/Dialect/Rock/native_arch.mlir passes (was crashing).
  * Original repro (rocmlir-driver --kernel-pipeline=migraphx,highlevel
    --arch gfx942) still passes.
  * `rock.arch = "native"` via MLIR attribute still produces a valid
    gpu.binary.

Made-with: Cursor
Add -Wl,--exclude-libs,ALL to add_rocmlir_tool(). Any static archive that
gets pulled into a tool (e.g. libLLVMSupport.a in the fat-lib build, or
intermediates on a conventional static build) is turned into STB_LOCAL by
the linker rather than being re-exported from the executable's dynsym.

This is load-bearing on the non-dlmopen branches of AmdArchDb.cpp and on
musl / FreeBSD where dlmopen is unavailable: if the tool re-exports
llvm::cl::* globals, a later dlopen of libamd_comgr / libLLVM.so.23 will
interpose against them and crash with "Option '...' already exists" at
static-init time.

The flag is a no-op when every LLVM/MLIR dependency is already a shared
library (the BUILD_SHARED_LIBS=ON default), so landing it now does not
change the shape of the built tools on this developer machine. It still
measurably reduces their dynamic symbol surface: dynsym on rocmlir-opt
drops from ~34 template weak symbols to 1, making future regressions
easier to spot with `nm -D`.

Apple's linker does not accept --exclude-libs; the guard matches the one
used for the libmlir_rocm_runtime.so version script.

Made-with: Cursor
The preceding commit paired --version-script=mlir_rocm_runtime.map with
CXX_VISIBILITY_PRESET=hidden. The combination over-hides: the mgpu*
entry points in upstream RocmRuntimeWrappers.cpp are not annotated with
__attribute__((visibility("default"))), so the hidden preset marks them
STV_HIDDEN at compile time and the version script has nothing to export
(`global: mgpu*;` only overrides global/local partitioning, it does not
un-hide compile-time-hidden symbols).

The result is a libmlir_rocm_runtime.so with an empty dynsym, which
breaks any mlir-runner / xmir-runner invocation that tries to dlsym an
mgpu* entry point for JIT launches.

Fix: use the version script alone and rely on default visibility. The
script still hides everything that is not mgpu*, which is all we need
to stop llvm::* leakage. Upstream CudaRuntimeWrappers.cpp solves the
same problem with explicit MLIR_CUDA_WRAPPERS_EXPORT annotations on
each entry point; we avoid that here to keep the rocMLIR-side patch
minimal.

Verified: nm -D --defined-only on libmlir_rocm_runtime.so now shows
exactly the 25 mgpu* T entries and zero llvm::* symbols.

Made-with: Cursor
Set CMAKE_SHARED_LINKER_FLAGS / CMAKE_MODULE_LINKER_FLAGS to include
-Wl,-Bsymbolic-functions for the entire build on non-Windows, non-Apple
targets. This affects both rocMLIR's own libMLIR*.so and the embedded
LLVM's libLLVM*.so.22.0git.

Effect: intra-library cl::* function calls (e.g. within
libLLVMSupport.so.22.0git, libLLVMCodeGen.so.22.0git, ...) are resolved
to local definitions at link time rather than going through the PLT.
The dynamic linker cannot later interpose them with any other loaded
shared object's cl::* implementation, in particular ROCm's
libLLVM.so.23.0git (pulled in transitively via
libamdhip64 -> libamd_comgr when a runner dlopens
libmlir_rocm_runtime.so).

Verified with `objdump -d --section=.plt` on
libLLVMSupport.so.22.0git: zero cl::* PLT entries after this change.

Scope caveat: -Bsymbolic-functions addresses function-symbol interposition
only. Data-symbol interposition (vtables, the cl::SubCommand
process-global singleton) is unaffected, so the mlir-runner /
xmir-runner JIT path still hits the "Option '...' already exists!"
collision during static-init. Those tests remain pre-existing upstream
MLIR failures; closing them requires either a version script hiding
llvm::cl::* from the dynsym of libLLVMSupport.so.* or a full
CXX_VISIBILITY_PRESET=hidden on the embedded LLVM build. Both are
tracked as follow-ups.

Regression guard: 102/102 rock lit tests (excl. integration/) and
111/111 MLIRRockUnitTests still pass with this flag active.

Made-with: Cursor
…mopen

Hiding cl::* from libmlir_rocm_runtime.so's dynsym (commit 8010fdf) was
necessary but not sufficient: the runtime still pulled libamdhip64 in as
a direct NEEDED, which transitively dragged libamd_comgr and ROCm's
libLLVM.so.23 into mlir-runner's / xmir-runner's main namespace at
dlopen time. ROCm's libLLVM.so.23 then ran static initializers that
unified the cl::SubCommand singleton across the two LLVMs, aborting
JIT execution with "Option 'default' already exists!".

Apply the same isolation pattern that mlir/lib/Dialect/Rock/IR/AmdArchDb.cpp
already uses for the rocMLIR compiler side:

  * Remove `hip::host hip::amdhip64` from `mlir_rocm_runtime`'s link
    line. Keep the HIP headers reachable via `hip_INCLUDE_DIR` so we can
    still see hipDeviceProp_t / hipMemcpyKind layouts.
  * Resolve every hipXXX entry point through a function-pointer table
    populated by dlmopen(LM_ID_NEWLM, "libamdhip64.so.7", RTLD_LAZY) on
    glibc, plain dlopen elsewhere on POSIX, and LoadLibraryA on Windows.
  * Bare hipXXX call sites are kept verbatim by routing them through
    object-like macros (`#define hipModuleLoadData(...)
    (::getHip().moduleLoadData(__VA_ARGS__))` etc.); the rest of the
    file stays byte-for-byte identical to upstream.
  * mgpu* entry points pick up explicit MLIR_HIP_WRAPPERS_EXPORT
    visibility annotations so the existing libmlir_rocm_runtime.so
    version script keeps exporting them after the next merge from
    upstream (where default-visibility was the only thing exporting
    them).

Effect: `readelf -d build/.../libmlir_rocm_runtime.so` no longer lists
libamdhip64 as NEEDED. ROCm's libamd_comgr and libLLVM.so.23 land in
the private dlmopen namespace and are completely invisible to the host
process's LLVM.

Verified end-to-end:
  * mlir/test/rocmlir-driver/conv2d_harness/conv2d_harness.mlir (and
    siblings that previously crashed in mlir-runner): PASS.
  * Rock integration suite: 111/115 pass, up from 102/115. The
    remaining 4 failures (rock-reduce-max-case{1..4}.mlir) abort with
    SmallPtrSet "Bucket < End", an unrelated downstream bug in the
    reduce-max lowering -- they pre-date this branch.
  * Original gfx942 repro and rock.arch="native" attribute path still
    pass.
  * 102/102 rock lit (excl. integration) and 111/111 MLIRRockUnitTests
    still pass.

Made-with: Cursor
…_rocm_runtime

`xmir-runner` link-pulls libMLIRRocmExecutionEngineUtils.so for
RocmSystemDetect, which in turn link-pulled libamdhip64 -> libamd_comgr
-> ROCm's libLLVM.so.23. The chain was loaded at process start before
main() ran, and libLLVM.so.23's static initializers then unified
cl::SubCommand against rocMLIR's embedded libLLVMSupport.so.22, aborting
in `_dl_init` with a SmallPtrSet "Bucket < End" assertion (the iterator
walked off the end of a SmallPtrSet that the two LLVMs disagreed about).

Fix:

1. Drop the link-time `hip::host hip::amdhip64` dependency from
   MLIRRocmExecutionEngineUtils. Only the HIP headers are kept, via
   `hip_INCLUDE_DIR`. RocmSystemDetect.cpp now resolves the two HIP
   entry points it actually uses (`hipGetDeviceCount`,
   `hipGetDevicePropertiesR0600`) through a dlmopen(LM_ID_NEWLM, ...)
   handle, mirroring the pattern in
   mlir/lib/Dialect/Rock/IR/AmdArchDb.cpp and the preceding patch to
   libmlir_rocm_runtime.so.

2. Coordinate with libmlir_rocm_runtime.so so we never dlmopen
   libamdhip64 twice in the same process. KFD allows only one
   user-space HSA session per process; a second dlmopen lands in a
   fresh link-map namespace, gets its own HIP/HSA copy, and every
   call from that copy returns hipErrorNoDevice.

   RocmSystemDetect runs first (during JitRunner's mlirTransformer
   step, before --shared-libs are dlopen'd) and is therefore the
   canonical owner of the HIP namespace. It exports a small C ABI
   `mlirRocmSystemDetectGetHipHandle()` returning its dlmopen'd
   handle. RocmRuntimeWrappers.cpp dlsyms that symbol via RTLD_DEFAULT
   when its `mgpu*` table is being built, and reuses the handle if
   present. If the symbol is absent (host binary that does not link
   RocmSystemDetect), it transparently falls back to its own dlmopen.

Verified end-to-end:
  * `xmir-runner` ldd is now clean -- no libamdhip64, no libamd_comgr,
    no libLLVM.so.23 in the transitive closure.
  * mlir/test/Dialect/Rock integration suite: 115/115 pass (was
    102/115; the 4 reduce_max + ~9 other JIT tests are now green).
  * mlir/test/xmir suite: 59/61 pass, 1 XFAIL, 1 FAIL. The remaining
    FAIL (bert_part_48) is "Error loading library: ocml.bc: Unknown
    attribute kind (106) (Producer: 'LLVM23.0.0git' Reader: 'LLVM
    22.0.0git')" -- an ROCm-vs-rocMLIR LLVM-IR version skew, not
    related to this work.
  * mlir/test/{Dialect,Conversion,CAPI,rocmlir-*,migraphx_models,
    xmir,mlir-rock-lib} broader suite: 368/380 (96.84%) pass, was
    300/380. The remaining failures are 1 bert_part_48 above and 1
    rocmlir-tuning-driver/benchmark-config.mlir; the latter has the
    same root cause but lives in a separate tool that direct-links
    ~30 HIP/HIPRTC entry points and needs its own architectural
    refactor.
  * 102/102 rock lit (excl. integration) and 111/111 MLIRRockUnitTests
    still pass.
  * Original repro (`rocmlir-driver --kernel-pipeline=migraphx,highlevel
    --arch gfx942`) and `rock.arch="native"` both pass.

Made-with: Cursor
`rocmlir-tuning-driver` link-pulled `hip::host hip::amdhip64
hiprtc::hiprtc` directly. The transitive chain `libamdhip64 ->
libamd_comgr -> libLLVM.so.<major>` made the dynamic linker map ROCm's
monolithic libLLVM into the host process at startup. ROCm's libLLVM ran
its `cl::opt` static initializers before main(), unifying its
`cl::SubCommand` registry with rocMLIR's split-shared-libs
libLLVMSupport, and the resulting iterator inconsistency aborted the
process in `_dl_init` with "SmallPtrSet: Bucket < End". This affected
the only previously-failing lit test:
mlir/test/rocmlir-tuning-driver/benchmark-config.mlir.

Resolve every hipXXX/hiprtcXXX entry point at run time instead:

  * Add `HipDelayLoad.{h,cpp}` (rocmlir::tuningdriver namespace) that
    exposes `getHipSymbols()` / `getHiprtcSymbols()` accessors.
    `HipDelayLoad.cpp` resolves libamdhip64 / libhiprtc via
    `dlmopen(LM_ID_NEWLM, ...)` on glibc, plain `dlopen(RTLD_LAZY |
    RTLD_LOCAL)` on other POSIX, and `LoadLibraryA` on Windows.

  * Coordinate with libMLIRRocmExecutionEngineUtils via the existing
    `mlirRocmSystemDetectGetHipHandle` cross-library back-channel: if
    RocmSystemDetect has already loaded HIP into a private namespace,
    reuse its handle. KFD permits exactly one user-space HSA session
    per process, so a second dlmopen would otherwise return
    `hipErrorNoDevice` for every call. HIPRTC is loaded into the same
    link-map namespace as HIP via `dlinfo(RTLD_DI_LMID)` so the JITed
    instruction-cache flush kernel sees the same device handles.

  * Add `HipDelayLoadMacros.h`, included LAST in each consumer .cpp to
    redirect bare `hipMalloc(...)` / `hiprtcCompileProgram(...)` etc.
    to the function-pointer table dispatch. This keeps
    rocmlir-tuning-driver.cpp and CacheFlush.cpp byte-equivalent to
    their upstream-style HIP code; only the include block at the top
    grows by two lines per file. The `hipExtModuleLaunchKernel` macro
    injects the trailing `flags=0` argument that the HIP header would
    otherwise supply as a default (default arguments do not apply when
    calling through a function pointer).

  * CMake: drop `target_link_libraries(... hip::host hip::amdhip64
    hiprtc::hiprtc)`; keep the find_package calls and switch to
    include-only wiring via `hip_INCLUDE_DIR` /
    `hiprtc::hiprtc INTERFACE_INCLUDE_DIRECTORIES` plus the
    `__HIP_PLATFORM_AMD__=1` define. Add `${CMAKE_DL_LIBS}` for the
    dlmopen/dlsym calls in HipDelayLoad.cpp.

  * CMake: drop the `benchmark-driver-utils` link-line entry. It was
    dead code -- rocmlir-tuning-driver does not include
    `benchmarkUtils.h` or call any `benchmark::*` symbol. Keeping it
    pulled libamdhip64 in transitively (via `target_link_libraries(
    benchmark-driver-utils PUBLIC hip::host hip::amdhip64)`), so its
    removal was load-bearing for breaking the HIP transitive chain.

Verified end-to-end on the dlopenHip branch over LLVM 23:
  * `readelf -d build/bin/rocmlir-tuning-driver` is now clean -- no
    libamdhip64, libamd_comgr, libhiprtc, or libLLVM.so.23 in the
    direct or transitive closure.
  * mlir/test/rocmlir-tuning-driver/benchmark-config.mlir now PASSES.
  * Full broad lit (Dialect, Conversion, CAPI, rocmlir-*, xmir,
    migraphx_models, mlir-rock-lib): 371/381 pass, 2 XFAIL, 8
    Unsupported, 0 failures (was 370/381 + 1 fail).
  * rock integration suite: 115/115 pass.
  * xmir suite: 60/61 pass + 1 XFAIL.
  * MLIRRockUnitTests: 111/111 pass.
  * Original repro
    `rocmlir-driver --kernel-pipeline=migraphx,highlevel --arch gfx942`
    still passes; `rock.arch="native"` still passes.

Made-with: Cursor
…helper

Header-only utility that consolidates the dlmopen / dlopen / LoadLibraryA
scaffolding that was independently re-implemented in four places on the
dlopenHip branch:

  * external/.../ExecutionEngine/RocmRuntimeWrappers.cpp
  * external/.../ExecutionEngine/RocmSystemDetect.cpp
  * mlir/lib/Dialect/Rock/IR/AmdArchDb.cpp
  * mlir/tools/rocmlir-tuning-driver/HipDelayLoad.cpp

Each had its own `kHipCandidates` SONAME list, `osOpen`/`osSym` shim,
and (in three of the four) its own copy of the
`mlirRocmSystemDetectGetHipHandle` RTLD_DEFAULT-coordination dance.

Provides:

  * `loadRocmLibrary(Library, relatedHandle, CoordinationPolicy)`
    - `Library::Hip` defaults to `CoordinationPolicy::Auto`, which
      consults `mlirRocmSystemDetectGetHipHandle` via RTLD_DEFAULT
      first so all callers share one HSA session per process (KFD only
      permits one). RocmSystemDetect itself opts out via
      `CoordinationPolicy::Owned`.
    - `relatedHandle` is used for HIPRTC/HSA so they are dlmopen'd into
      the *same* link-map namespace as HIP via dlinfo(RTLD_DI_LMID).
      That preserves the single-HSA-session invariant for satellite
      libraries.
  * `resolveRocmSymbol(LoadedLibrary, name)` -- thin dlsym/GetProcAddress
    wrapper.
  * Platform candidate lists (`libamdhip64.so.{7,6}` / decorated
    `amdhip64_*.dll`, etc.) declared once.

Header-only on purpose: no new CMake target, no new linkage edge, no
ABI to version. Each consumer TU gets its own static copy of the
inline functions; the only cross-TU contract is the
`mlirRocmSystemDetectGetHipHandle` extern-C symbol that
`RocmSystemDetect.cpp` exports for RTLD_DEFAULT lookup. That stays
unchanged.

Two correctness improvements vs the per-file copies:

  1. AmdArchDb (in MLIRRockOps) previously did its own dlmopen,
     bypassing the RocmSystemDetect coordination. In a process that
     loads both libraries (e.g. a future xmir-runner that also lowers
     `rock.arch = "native"`), KFD's one-session limit would have
     caused the second dlmopen to return `hipErrorNoDevice` for every
     call. The new helper makes the coordination automatic for every
     consumer.
  2. HSA was previously dlmopen'd into a fresh namespace by AmdArchDb,
     even though HIP was already loaded into another namespace by the
     same TU. With `relatedHandle`, HSA shares HIP's namespace via
     dlinfo(RTLD_DI_LMID), keeping both on a single KFD session.

Made-with: Cursor
Replace the per-file dlmopen / dlopen / LoadLibraryA scaffolding in
RocmRuntimeWrappers.cpp and RocmSystemDetect.cpp with calls to the
shared `mlir::rocm_loader` helpers. The two files now share:

  * One copy of the SONAME candidate lists.
  * One copy of the `osOpen` / `osSym` / `_GNU_SOURCE` / `<dlfcn.h>`
    boilerplate.
  * One copy of the `mlirRocmSystemDetectGetHipHandle` RTLD_DEFAULT
    coordination dance (RocmSystemDetect uses
    `CoordinationPolicy::Owned` to opt out, since it *is* the
    canonical owner of the handle).

Net delta per file:
  * RocmRuntimeWrappers.cpp: -91 lines (was 75 lines of scaffolding,
    now 1 call to `loadRocmLibrary`)
  * RocmSystemDetect.cpp: -68 lines (was 60 lines of scaffolding +
    handle-coordination comment, now 3 lines)

Drive-by cleanups in the same diff:

  * Remove the unused `#include <mutex>` from RocmRuntimeWrappers.cpp.
  * Guard `#pragma GCC diagnostic` in RocmSystemDetect.cpp behind
    `__GNUC__ && !__clang__` so MSVC does not emit `C4068: unknown
    pragma` on a Windows build.

Behaviour is unchanged on glibc Linux (same dlmopen LM_ID_NEWLM, same
symbol search order); test results are identical (`ninja check-rocmlir`
1337/1337 pass, fat-lib build 472/472 pass).

Made-with: Cursor
Replace the per-file dlmopen scaffolding in `MLIRRockOps`'s
`AmdArchDb.cpp` and `rocmlir-tuning-driver`'s `HipDelayLoad.{h,cpp}`
with calls to the shared `mlir::rocm_loader` helpers (added in the
preceding [EXTERNAL] commit). Both rocMLIR-side TUs now coordinate
automatically with `RocmSystemDetect` through
`mlirRocmSystemDetectGetHipHandle`, so a process that loads
`MLIRRockOps` (for `rock.arch = "native"` lowering) and
`MLIRRocmExecutionEngineUtils` (for runner target detection) shares a
single HSA session instead of racing against KFD's
one-session-per-process limit.

Behavioural changes:

  1. AmdArchDb's HIP loader now consults
     `mlirRocmSystemDetectGetHipHandle` via RTLD_DEFAULT before
     dlmopen'ing its own libamdhip64. This was the latent bug
     identified in review: with the prior code, a future xmir-runner
     path that lowered `rock.arch = "native"` would have hit
     `hipErrorNoDevice` for every HIP call from RockOps. The
     coordination now matches what `libmlir_rocm_runtime.so` and the
     tuning driver already do.
  2. AmdArchDb's HSA loader now passes HIP's handle to
     `loadRocmLibrary(Library::Hsa, hipHandle)`, so HSA dlmopens into
     HIP's link-map namespace via dlinfo(RTLD_DI_LMID) and shares the
     same KFD session. Previously HSA would dlmopen LM_ID_NEWLM and
     `hsa_init()` could (silently) end up tied to a separate session.

Drive-by cleanups in the same diff:

  * AmdArchDb.cpp: replace `std::unordered_map<std::string,...>` with
    `llvm::StringMap` per LLVM coding standards.
  * AmdArchDb.cpp: drop the `#include <unordered_map>` and `<cstring>`
    that became dead after the refactor.
  * AmdArchDb.cpp: fix the file-banner typo (`Dtabase` -> `Database`).
  * HipDelayLoad.cpp: drop the redundant `#include <hip/hip_ext.h>`
    (already pulled in via HipDelayLoad.h) and the now-unused
    `<dlfcn.h>` / `<windows.h>` blocks.
  * MLIRRockOps CMakeLists: guard `target_include_directories(
    obj.MLIRRockOps ...)` with `if (TARGET ...)` for symmetry with the
    upstream ExecutionEngine CMake; iterate over both target names so
    HIP/HSA include dirs land on whichever is the actual compile
    target.

Net delta:
  * AmdArchDb.cpp: -116 lines (180 lines of dlmopen scaffolding +
    HipRuntime/HsaRuntime ctors collapsed to 1 call each).
  * HipDelayLoad.cpp: -118 lines (the entire SONAME + dlmopen +
    dlinfo + osSym block deleted; only the function-pointer table
    population remains).
  * HipDelayLoad.h: switch `void *handle` to
    `mlir::rocm_loader::LoadedLibrary lib` so the helper's opaque type
    is the contract.

Verified: build/ (BUILD_SHARED_LIBS=ON) `check-rocmlir` 1337/1337 pass
+ 5 XFAIL; build_migx/ (BUILD_FAT_LIBROCKCOMPILER=ON) `check-rocmlir`
472/472 pass + 3 XFAIL. MLIRRockUnitTests 111/111. Original gfx942
repro and `rock.arch="native"` paths both still pass.

Made-with: Cursor
Splits the formerly header-only `RocmDynamicLoader.h` into a small
public header and an implementation file so it is fit for upstream
LLVM submission:

  - Public header (`RocmRuntimeLoader.h`) is platform-agnostic. It no
    longer pulls in `<windows.h>` (which leaks `min`/`max`/`ERROR`
    macros) nor defines `_GNU_SOURCE` (a feature-test macro that has
    no business in an installed MLIR header).
  - Implementation (`RocmRuntimeLoader.cpp`) keeps `<windows.h>` and
    `_GNU_SOURCE` confined and matches upstream LLVM's
    `Support/Windows/DynamicLibrary.inc` pattern: UTF-8 SONAMEs are
    converted to UTF-16 via `convertUTF8ToUTF16String` and passed to
    `LoadLibraryW`. POSIX paths use `dlmopen(LM_ID_NEWLM, ...)` on
    glibc and degrade to `dlopen(RTLD_LAZY|RTLD_LOCAL)` elsewhere
    with a one-time `LLVM_DEBUG` advisory.
  - Replaces a zero-element `kHsaCandidates[]` on Windows (ill-formed
    in standard C++; rejected by MSVC) with a `(nullptr, 0)` sentinel.
  - Adds an `llvm_unreachable` for the impossible `Library` enum
    case.
  - Builds the new TU as a small static-library target
    (`MLIRRocmRuntimeLoader`) so consumers do not pay the cost of a
    new dynamic dependency. `mlir_rocm_runtime` and
    `MLIRRocmExecutionEngineUtils` link against it.

Documents the cross-library coordination ABI on the canonical
declaration: `mlirRocmSystemDetectGetHipHandle` is now declared in
`RocmSystemDetect.h` with full rationale (KFD's one-HSA-session-per-
process limit) and is used by the loader via `RTLD_DEFAULT` lookup so
no link-time dependency on `MLIRRocmExecutionEngineUtils` is needed.

Removes the now-stale `DISABLE_PCH_REUSE` "TODO: workaround" comment
on `mlir_rocm_runtime`. Symbol containment is enforced at link time
by the version script (`mlir_rocm_runtime.map`); the dynsym contract
is verified by a regression test added in a follow-up commit.

Made-with: Cursor
Updates the rocMLIR-side delay-load consumers (`AmdArchDb.cpp` for
`rock.arch = "native"`, and `rocmlir-tuning-driver`'s `HipDelayLoad`
helpers) to include the new header
`mlir/ExecutionEngine/RocmRuntimeLoader.h` and to link against the
new static library target `MLIRRocmRuntimeLoader`.

This is purely a rename + link-target switch; behaviour is identical.
The transitive `${CMAKE_DL_LIBS}` dependency that was previously
declared on each consumer is now exposed `PUBLIC`-ly by
`MLIRRocmRuntimeLoader` itself, so the redundant per-consumer link
lines are dropped.

Made-with: Cursor
Closes the test-coverage gap on three independent invariants of the
multi-LLVM isolation design:

1. `Dialect/Rock/Loader/no_rocm_neededs.test` -- asserts that the
   shipped binaries (`rocmlir-driver`, `rocmlir-opt`, `rocmlir-gen`,
   `rocmlir-tuning-driver`, `xmir-runner`, `mlir-runner`) and the
   ROCm-aware shared libraries (`libMLIRRockOps`,
   `libMLIRRocmRuntimeLoader`, `libmlir_rocm_runtime`,
   `libMLIRRocmExecutionEngineUtils`) never grow a `DT_NEEDED` entry
   for `libamdhip64`, `libhiprtc`, `libamd_comgr`, or ROCm's
   monolithic `libLLVM.so`. Any such regression would defeat the
   entire delay-load design by re-introducing the `cl::opt` /
   SmallPtrSet collisions at process startup.

2. `Dialect/Rock/Loader/dynsym_only_mgpu.test` -- asserts that
   `libmlir_rocm_runtime.so` exports nothing but the `mgpu*` C ABI
   entry points required by upstream MLIR's GPU lowering. This is
   what the `mlir_rocm_runtime.map` version script is supposed to
   guarantee; the test pins that contract so an accidental new
   source file or transitive LLVM dep does not silently widen the
   dynsym.

3. `RocmRuntimeLoaderTests` (gtest) -- pins the
   `CoordinationPolicy::Auto` contract: the second consumer to call
   `loadRocmLibrary(Hip)` MUST observe the same handle as
   `RocmSystemDetect`, otherwise KFD's "one HSA session per process"
   rule starts handing out `hipErrorNoDevice`. Also covers the
   `Owned` policy (used by the canonical owner to break recursion)
   and the soft-failure semantics of `resolveRocmSymbol` on a null
   handle.

Extends `native_arch.mlir` with a `rock.arch = "native:0"` case to
cover the explicit-device-id parsing path of the `--arch native:N`
runtime resolver.

Adds a `linux` lit feature (set when `platform.system() == "Linux"`)
for the two ELF-format tests; they `// REQUIRES: linux` so the suite
stays portable to macOS / Windows configurations. Also adds a
`%rocmlir_shlib_dir` lit substitution derived from the `host_ldflags`
rpath so the cleanliness scripts can locate the build's library
output directory without hard-coding `lib/` vs `lib64/`.

The new gtest sub-suite is conditionally compiled: it requires
`MLIRRocmExecutionEngineUtils` as a separate target, which only
exists in the standard shared-library configuration. Under
`BUILD_FAT_LIBROCKCOMPILER=ON` the symbols are not exposed
separately and the sub-suite is omitted; the contract is identical in
both configurations and is exercised end-to-end by the broader
regression suite.

Made-with: Cursor
Removes the hardcoded `libamdhip64.so.7` / `libamdhip64.so.6` /
`hiprtc0700.dll` etc. SONAME lists and replaces them with a generated
candidate list:

  1. The bare unversioned SONAME (`libamdhip64.so` / `amdhip64.dll`)
     comes first. This is what `find_package(hip)`, IREE's HIP HAL
     and Triton's HIP loader use. Standard ROCm installs always ship
     the unversioned symlink, so this is the fast path on every
     supported configuration.

  2. Numeric versioned SONAMEs (`libamdhip64.so.<MAJOR>` for MAJOR
     descending from `kMaxProbedRocmMajor` (99) to 1) are tried as a
     fallback for runtime-only deployments where the unversioned
     alias has been stripped. A failing `dlopen` of a missing SONAME
     costs ~5us on glibc, so the worst-case 99-element walk is sub-
     millisecond and is paid only once at startup.

  3. Windows HIPRTC is decorated as `hiprtc<MM>00.dll` (e.g.
     `hiprtc0700.dll` for ROCm 7.0); we generate that pattern for
     each candidate major instead of hardcoding two majors.

The same approach extends to HSA: `libhsa-runtime64.so` first, then
descending numeric majors. (HSA has stayed on `.so.1` for the entire
ROCm 4.x-7.x window, but iterating costs nothing if AMD ever bumps
it.)

Result: this loader is built once and works against any ROCm major
version present at runtime -- ROCm 4.x through any future ROCm
release, without code changes (so long as AMD stays at or below
`kMaxProbedRocmMajor`, which is `static_assert`-ably easy to bump).

Documents the version-compatibility contract on the public header
under "ROCm version compatibility" so reviewers can verify the
forward/backward-compatibility story without reading the
implementation.

Made-with: Cursor
Removes hardcoded LLVM- and rocMLIR-version filename suffixes from
the cleanliness lit tests, so a future LLVM merge or rocMLIR version
bump does not silently turn the assertions into no-ops.

Test infrastructure:
  - `check_no_rocm_neededs.sh` now globs library *base names* (e.g.
    `libMLIRRockOps.so*`, `libmlir_rocm_runtime.so*`) and audits
    every match. The script reports the artefact count it actually
    checked, fails loudly if zero artefacts were found, and skips
    bare `.so` developer symlinks to avoid double-reporting the same
    NEEDED set. Tested against a synthesized "future LLVM 99 +
    rocMLIR 42" tree to confirm it picks up arbitrary version
    suffixes.
  - `check_dynsym_only_mgpu.sh` likewise globs
    `libmlir_rocm_runtime.so*` instead of pinning `.so.23.0git`,
    falling back to the unversioned dev symlink as a last resort.
  - `lit.site.cfg.py.in`'s `%rocmlir_shlib_dir` substitution now
    parses the rpath out of `host_ldflags` and falls back through
    four well-known build-tree library directories
    (`<obj>/lib`, `<obj>/lib64`, `<obj>/external/llvm-project/lib`,
    `<llvm_obj>/lib`) instead of hardcoding `lib`.

Unit test:
  - Adds `RocmRuntimeLoaderVersionContract.FindsHipWithoutVersionHardcoding`,
    which proves the loader resolves HIP without compile-time
    knowledge of the ROCm major version. The test uses the `Auto`
    policy (the path downstream consumers actually take) to avoid
    KFD's "one HSA session per process" limit, which would otherwise
    mask a true loader failure under a spurious `dlmopen` failure
    from repeated `Owned` calls. Verifies that `hipGetDeviceCount`
    -- a symbol present on every HIP major since ROCm 1.x --
    resolves through the returned handle, pinning that the loaded
    library is actually HIP and not some other library that happened
    to match the SONAME pattern.

Made-with: Cursor
The header was renamed to `RocmRuntimeLoader.h` in commit e61ae4a, but
three file-level docblocks still referenced the pre-rename name. They
are comments only -- no code or include-path impact -- but the stale
text confuses a reviewer who tries to follow the cited path. Updated
in `RocmRuntimeWrappers.cpp`, `RocmSystemDetect.cpp` and the
rocMLIR-side `AmdArchDb.cpp`.

Made-with: Cursor
Strengthens the `CoordinationPolicy::Owned` doc comment to spell out
that the policy is reserved for `RocmSystemDetect` and must NOT be
called from elsewhere. Otherwise a downstream reader would see the
parameter and assume it is the more direct, simpler policy to use --
when in fact a second `Owned` invocation in the same process opens
HIP into a fresh `dlmopen(LM_ID_NEWLM, ...)` namespace, hits KFD's
"one HSA session per process" limit, and starts handing out
`hipErrorNoDevice` from every subsequent HIP call.

This caveat surfaced while writing the version-agnosticism contract
test: an `Owned`-using test happened to run after the canonical
owner had already loaded HIP, and the loader silently produced a
"working" handle that returned no devices. The test was switched to
use `Auto` (the policy downstream consumers should use); the doc
now blocks anyone else from making the same mistake.

Made-with: Cursor
Two upstream-style polish fixes to make the loader feel native to a
LLVM/MLIR reviewer:

  - Replace the hand-rolled `MLIR_ROCM_SHARED_HIP_EXPORT` macro
    (`__declspec(dllexport)` on Windows / `visibility("default")`
    elsewhere) with the canonical `LLVM_ALWAYS_EXPORT` from
    `llvm/Support/Compiler.h`. Bit-for-bit identical expansion, but
    the upstream macro is what reviewers expect and is consistent
    with how MLIR runtime wrappers (`CudaRuntimeWrappers.cpp`, etc.)
    publish their entry points. Updated the public docstring on
    `mlirRocmSystemDetectGetHipHandle` to reference the macro by name
    so a reader can grep its definition.

  - Document why `RocmRuntimeLoader.cpp` does NOT use the standard
    `llvm::sys::DynamicLibrary` wrapper. `sys::DynamicLibrary` always
    passes `RTLD_LAZY | RTLD_GLOBAL` to `dlopen` (see
    `lib/Support/Unix/DynamicLibrary.inc`) and there is no public
    knob to swap in `RTLD_LOCAL` or `dlmopen(LM_ID_NEWLM, ...)`.
    `RTLD_GLOBAL` is exactly the thing the loader is built to avoid:
    it lets the dynamic linker unify ROCm's `libLLVM.so` symbols
    with the host's embedded LLVM, which is the original `cl::opt`
    collision being fixed. The implementation below intentionally
    drops to raw OS APIs with the flags we actually need
    (`dlmopen(LM_ID_NEWLM, ..., RTLD_LAZY)` on glibc,
    `dlopen(..., RTLD_LAZY | RTLD_LOCAL)` on other POSIX,
    `LoadLibraryW` on Windows). The new file-level comment makes
    this deliberate deviation explicit so a future reviewer does not
    "fix" it by switching to `sys::DynamicLibrary`.

No behavioural change. Symbol table is unchanged
(`mlirRocmSystemDetectGetHipHandle` still exported with default
visibility), all 4 loader unit tests still pass, the dynsym /
NEEDED-cleanliness lit suite is unchanged.

Made-with: Cursor
Removes accumulated indirection that the file picked up across several
review-driven rewrites. No behavioural change.

  - Collapse 6 helper functions (`windowsLoadLibrary`,
    `posixOpenIsolated`, `posixOpenInSameNamespace`,
    `warnIfWeakIsolation`, `openIsolatedImpl`,
    `openInSameNamespaceImpl`) into 2 (`openIsolated`,
    `openInRelatedNamespace`). The two `*Impl` wrappers were
    dispatching to a single named impl per platform, adding a layer
    of named indirection without value; the platform branches now
    live directly inside `openIsolated`.

  - Drop the unreachable defensive `if (existingHandle && ...)`
    inside `posixOpenInSameNamespace`. The caller (now
    `openInRelatedNamespace`) already routes the null case to
    `openIsolated`, so the inner check could not fire.

  - Inline `warnIfWeakIsolation` into `openIsolated`'s POSIX branch
    -- the function existed only to guard a one-time advisory whose
    body and call site together cost more than the advisory itself.

  - Factor the bare + numeric-fallback enumeration into a small
    `appendCandidates(out, bare, joiner)` helper. Each `Library`
    case in `candidatesFor` shrinks from a 4-line `emplace_back +
    for-loop` block to a 3-line lambda call, and the platform-
    specific `joiner` lambda is the only place that knows how the
    versioned SONAME is composed.

  - Use `DEBUG_TYPE` string literal in `LLVM_DEBUG` messages instead
    of repeating the literal `"rocm-runtime-loader: "` prefix. The
    literal-concatenation idiom matches what `lib/Support/*.cpp`
    do.

  - `using namespace mlir::rocm_loader;` at file scope so the
    enum / struct names don't need the prefix in every `case`.

Net: -28 lines, -3 helper functions, -1 redundant indirection layer,
-1 unreachable code path. All 4 loader unit tests + the dynsym /
NEEDED-cleanliness lit suite still pass on both build trees
(default shared and `BUILD_FAT_LIBROCKCOMPILER=ON`).

Made-with: Cursor
Both scripts collapse a triple-nested bash loop (artefact x forbidden-prefix x
readelf-line / artefact x export-symbol x case-pattern) into a single
`awk` / `grep -E` pipeline per artefact.

  - `check_no_rocm_neededs.sh`: 111 -> 72 lines (-35%). Forbidden
    SONAMEs are now a single anchored regex
    (`^lib(amdhip64|hiprtc|amd_comgr|LLVM\.so)`) instead of an array
    of prefixes consumed by an inner `case ${forb}*` (which only
    worked because bash unquoted-expanded `${forb}*` as a glob, an
    easy thing for the next reader to misunderstand). Tool/library
    artefact lists become inline `for` arguments.

  - `check_dynsym_only_mgpu.sh`: 70 -> 61 lines. The
    bash `case mgpu* / *` partition becomes a single `awk` script
    that emits an `ok=N` count and a multi-line `BAD\n  sym1\n  sym2`
    block; the bash code then just greps for `^BAD`.

Verified end-to-end: success, skip-on-missing, and forbidden-NEEDED
detection all still work (the latter checked by linking
`/opt/rocm/lib/libamdhip64.so.7` -- which carries `libamd_comgr.so.3`
in its NEEDED set -- as `libMLIRRocmRuntimeLoader.so.99` and
confirming the script reports `FAIL: ... declares NEEDED
libamd_comgr.so.3` and exits non-zero).

Made-with: Cursor
A close re-read of the previous two simplification commits (68bb69e,
8195f8) surfaced five issues, three of them real bugs:

  1. (REAL) `check_no_rocm_neededs.sh` silently passed when both
     directory arguments were bogus. The "no artefacts found" guard
     could not fire because the tool slots were always added to the
     `artefacts` array unconditionally, then later silently skipped
     by a `[ -e "${art}" ] || continue` check. A run with
     `tools_dir=/nonexistent shlib_dir=/nonexistent` reported
     "0 artefacts checked, all clean" and exited 0, masking any
     misconfigured lit substitution. Tools are now added only when
     they actually exist, mirroring the shlib-glob pattern, and the
     wrong-directory failure message names the lit substitutions to
     check.

  2. (REAL) `check_no_rocm_neededs.sh` `forbidden_re` could
     false-positive-match SONAMEs that share a prefix but a
     different basename, e.g. a hypothetical
     `libamd_comgr_helper.so.1` would match because the previous
     `[._-]` separator class accepted `_` (which is part of the
     basename, not a version separator). Tightened the regex to
     anchor on `<name>.{so,dll}` followed by either end-of-string or
     a real version separator (`.` / `-`). Added a tabulation of
     true/false-positive cases to the comment so the discriminator
     is explicit.

  3. (REAL, cosmetic) `check_dynsym_only_mgpu.sh` emitted a stray
     blank line in its failure output -- the `BAD\n  sym1\n  sym2`
     awk format, after `sed 's/^BAD//'`, left an empty first line.
     Reformatted the awk to emit `OK <sym>` and `BAD <sym>` per
     line, which removes the sentinel and lets the bash side use
     plain `grep -c` / `cut`.

  4. (Code quality) Dropped the `static bool once = []() {
     LLVM_DEBUG(...); }();` advisory wrapper in `RocmRuntimeLoader
     .cpp`'s non-glibc POSIX path. `LLVM_DEBUG` is itself gated by
     `-debug-only=rocm-runtime-loader`, so the inner advisory
     already only fires when a developer has opted into the noise;
     suppressing it to once-per-process via a static bool is
     over-engineered. Replaced with a plain `LLVM_DEBUG(...)` that
     prints once on each load attempt under `-debug-only=`, which
     is exactly what every other LLVM_DEBUG advisory does.

  5. (Code quality) Replaced `#if defined(__GLIBC__) &&
     !defined(_WIN32)` with the single-condition `#if
     defined(__GLIBC__)` -- glibc is never defined on Windows, so
     the second clause was dead. Comment updated to make the
     implicit assertion explicit.

Verification: 115/115 unit tests still pass on the default build,
111/111 on fat-lib; the cleanliness lit suite passes on both. The
B1 fix exercised by `check_no_rocm_neededs.sh /nonexistent_lib
/nonexistent_tools` (now correctly exits 1 with a diagnostic). The
B2 fix exercised by a tabulated regex test covering 6 false
positives and 10 true positives. The B3 fix exercised by injecting
`/opt/rocm/lib/libamdhip64.so.7` as a fake `libmlir_rocm_runtime
.so.99` (now reports each non-mgpu symbol on its own indented line
without the empty-spacer artefact).

Made-with: Cursor
Cleanups identified by a fresh code review pass against the LLVM
Coding Standards. No behavioural change.

  - `RocmRuntimeLoader.h`: drop dead `<cstddef>` include (no `size_t`
    or `ptrdiff_t` referenced).

  - `RocmRuntimeLoader.cpp`: replace the file-scope `using namespace`
    pair (`mlir`, `mlir::rocm_loader`) with a single
    `namespace mlir::rocm_loader { ... }` block surrounding both the
    anonymous-namespace helpers and the public API. This is the
    convention used elsewhere in `mlir/lib/ExecutionEngine/`
    (`RocmSystemDetect.cpp`, `JitRunner.cpp`, ...) and avoids the
    second using-directive that no other TU in the directory uses.

  - `RocmRuntimeLoader.cpp`: drop the `auto *fn = reinterpret_cast<
    GetHandleFn>(...)` form for the `RTLD_DEFAULT` lookup. `auto *`
    with a function pointer compiles correctly but is unusual --
    readers expect `auto *` to mean "pointer to object". Replaced
    with plain `auto fn = ...` followed by an explicit `fn ? fn() :
    nullptr` ternary, which keeps the call-side null check obvious.

  - `RocmSystemDetect.cpp`: drop `#include "llvm/Support/Error.h"`
    (no `Expected`/`Error`/`cantFail`/etc. used) and the dead
    `#define DEBUG_TYPE "execution-engine-rocm-system-detect"`
    (no `LLVM_DEBUG()` calls in this TU).

  - `external/llvm-project/mlir/lib/ExecutionEngine/CMakeLists.txt`:
    correct the misleading "Static library so it adds no transitive
    NEEDED entries" comment on `MLIRRocmRuntimeLoader`. With
    `BUILD_SHARED_LIBS=ON`, `add_mlir_library(... STATIC ...)`
    produces a `.so`; the `STATIC` keyword is honored only in fat-
    library configurations. Reworded to explain that either form is
    safe because the only transitive dependency is `LLVMSupport`.

  - Same CMake file: update two version-coupled comments
    (`libLLVM.so.23` -> `libLLVM.so.<MAJOR>`) so a future LLVM
    version bump does not leave stale text in the file. Also fixed a
    `LoadLibraryA` -> `LoadLibraryW` doc typo (the implementation
    has always called `LoadLibraryW`).

Made-with: Cursor
Cleanups identified by a fresh code review pass. No behavioural
change to the build; the lit scripts gain a new diagnostic but pass
the same tests.

CMake comments
  - Root `CMakeLists.txt`, `mlir/lib/Dialect/Rock/IR/CMakeLists.txt`:
    update version-coupled doc strings (`libLLVM.so.23` ->
    `libLLVM.so.<MAJOR>`, `libLLVMSupport.so.*` ->
    `libLLVMSupport.so.<MAJOR>git`) so a future LLVM bump does not
    leave stale references in comments.

  - `mlir/test/lit.site.cfg.py.in`: the comment claimed
    `_resolve_rocmlir_shlib_dir` probes "the two well-known
    candidate directories" but the implementation tries four
    (`<mlir_obj>/lib`, `<mlir_obj>/lib64`,
    `<mlir_obj>/external/llvm-project/lib`, `<llvm_obj>/lib`).
    Updated.

  - `mlir/test/Dialect/Rock/Loader/no_rocm_neededs.test`: the doc
    referred to `libMLIRRocmRuntimeLoader.so` as "our static delay-
    load helper" -- only true in fat-lib mode (in the default shared
    build it is a `.so`). Reworded to explain that either form is
    safe because the loader's transitive closure is just LLVMSupport.
    Also corrected "pure shell + grep" -> "awk + grep" and tightened
    the false-positive-rejection note.

Lit scripts
  - `check_no_rocm_neededs.sh`, `check_dynsym_only_mgpu.sh`: split
    the previous `readelf | awk | grep` (resp. `nm | awk`) pipelines
    so that a `readelf` / `nm` failure no longer feeds empty input
    through the rest of the pipe and gets silently reported as
    "all clean". Each tool is now invoked with its exit status
    captured; a failure emits an explicit `FAIL: readelf -d <art>
    failed` (or the `nm` equivalent) and the script exits non-zero.

    Verified by feeding a non-ELF text file as a synthetic artefact:
    the script now correctly reports `FAIL: readelf -d ... failed`
    and exits 1, instead of the previous behaviour of silently
    succeeding.

Made-with: Cursor
Two latent bugs surfaced by a deeper review pass.

  1. (Real) The artefact globs in `check_no_rocm_neededs.sh` and
     `check_dynsym_only_mgpu.sh` matched `libfoo.so.*`, which
     happily catches debug-info companions like
     `libfoo.so.<version>.dwo` (split-DWARF, enabled by
     `LLVM_USE_SPLIT_DWARF=ON`), `libfoo.so.<version>.debug`
     (separate debug info) and `libfoo.so.<version>.dbg`. Feeding
     such a file to `readelf -d` / `nm -D` produces unhelpful
     output, and a `.dwo` is a stripped DWARF blob with no NEEDED
     entries -- it would silently report "all clean" while
     skipping the real shared object.

     Tightened the glob to `libfoo.so.[0-9]*` (the version suffix
     must start with a digit, which excludes `.dwo`/`.debug`/`.dbg`)
     and added a defensive `case "${cand}" in *.dwo|*.debug|*.dbg)
     continue ;; esac` for any unusual decoration that slips
     through. The brackets must stay outside double-quotes for bash
     to treat them as a character class -- documented in a comment
     so the next reader does not "fix" the quoting and re-introduce
     the bug.

     Verified by creating a synthetic build directory containing
     `libmlir_rocm_runtime.so.23.0git`, `.so.23.0git.dwo`, and
     `.so.23.0git.debug`: both scripts now correctly select the
     real `.so.23.0git` and report 25 mgpu* symbols / artefact-clean.

  2. (Portability) `_resolve_rocmlir_shlib_dir` in `lit.site.cfg.py.in`
     parsed only the `-Wl,-rpath -Wl,/path` two-argument spelling of
     rpath-on-link, which is what GCC/clang emit on Linux when CMake
     materialises `INSTALL_RPATH`. macOS and some clang invocations
     emit the comma-form `-Wl,-rpath,/path` instead, which the regex
     missed. The function then fell through to the four-way directory
     probe -- on macOS that happens to find `<mlir_obj>/lib` so the
     test still works, but the implicit fallback is fragile.

     Added a second pattern for the comma-form. The first match in
     either form wins. Verified against four host-ldflags shapes:
     Linux space-form, macOS comma-form, mixed (both present), and
     no-rpath.

Made-with: Cursor
Two correctness bugs in the `rock.arch = "native[:N]"` path, both
caught by a second-pair-of-eyes review of this branch.

Bug 1 (HIGH, multi-GPU correctness): `nativeArchInfo` cached
queried `AmdArchInfo` by `gcnArchName`. On a same-arch multi-GPU
system (e.g. two `gfx942` cards), the second device's call would
hit the cache entry populated by the FIRST device and silently
return its CU count, XCC count, and per-CU shared-memory --
masking real per-device variance from binning, fused-die
configurations, etc. The fix keys the cache by `deviceId`
(an `unsigned`, hence `llvm::DenseMap<unsigned, AmdArchInfo>`),
and uses an explicit lock-acquire / find / drop-lock / query /
re-acquire / `try_emplace` pattern so the expensive HIP/HSA query
does not block other threads.

Bug 4 (MEDIUM, parser): `parseArchString` (in `AmdArchDb.cpp`)
and `ParamLookupTable<...>::normalizeArch` (in `ParamLookupTable.cpp`)
both silently fell back to `deviceId = 0` on a malformed
`native:<suffix>`. Inputs like `native:foo`, `native:1abc`,
`native:`, or `native:-1` were treated as `native:0`, which on
multi-GPU systems silently targeted the wrong device. The fixed
parser distinguishes `native` (no colon, deviceId implicitly 0)
from `native:` (separator with empty/invalid suffix), validates
that the suffix parses as an unsigned integer that fits in
`unsigned`, and `report_fatal_error`s with a precise diagnostic
otherwise.

Made-with: Cursor
The previous implementation of `loadHipSymbols()` /
`loadHiprtcSymbols()` returned a partially-populated symbol table
with `lib.handle = nullptr` when libamdhip64 / libhiprtc was
missing on the loader path. The macros in `HipDelayLoadMacros.h`
then dispatched call sites like `hipMalloc(...)` /
`hiprtcCompileProgram(...)` straight through the (still null)
function pointer, segfaulting at the call site instead of giving
the user a controlled diagnostic.

Tightened the contract to "either fully loaded or process aborts":
both loaders now invoke a `[[noreturn]] abortMissing*` helper at
the first sign of trouble (library missing, symbol missing). The
helper writes a clear diagnostic (which library, which symbol)
to stderr and `std::abort()`s. The tuning driver is fundamentally
useless without HIP/HIPRTC -- every benchmark needs to launch a
kernel on an AMD GPU -- so converting "silent crash later" into
"loud diagnostic up front" is strictly an improvement.

The header docstrings on `HipSymbols` / `HiprtcSymbols` /
`getHipSymbols()` / `getHiprtcSymbols()` are updated to advertise
the new fail-fast contract: `lib.handle` is non-null and every
function pointer is non-null whenever these accessors return.

Made-with: Cursor
`MLIRRocmExecutionEngineUtils` was unconditionally built and
unconditionally referenced `${hip_INCLUDE_DIR}` -- but
`find_package(hip)` (which populates `hip_INCLUDE_DIR`) was gated
by `MLIR_ENABLE_ROCM_RUNNER`, which is forced to `0` when
`BUILD_FAT_LIBROCKCOMPILER=ON`. Result: in fat-lib configurations,
the target relied on an empty / leftover `hip_INCLUDE_DIR` and
either failed to build or silently picked up an ambient include
path from cached CMake state.

Move the entire `MLIRRocmExecutionEngineUtils` definition (target,
warning-flag wiring, include dirs, link libs) inside the
`if (LLVM_ENABLE_PIC) ... if (MLIR_ENABLE_ROCM_RUNNER) ...` block
where `find_package(hip)` already lives. The fat-lib build does
not need this target at all -- `xmir-runner` is also gated off in
that mode -- and `mlir/unittests/Dialect/Rock/CMakeLists.txt`
already conditions its `RocmRuntimeLoaderTests` source list on
`if (TARGET MLIRRocmExecutionEngineUtils)`, so the missing target
in fat-lib is handled gracefully.

Verified by reconfiguring + building both
`build/` (default shared, `MLIR_ENABLE_ROCM_RUNNER=ON`) -- target
present, libMLIRRocmExecutionEngineUtils.so produced -- and
`build_migx/` (`BUILD_FAT_LIBROCKCOMPILER=ON`) -- target absent,
build clean, `MLIRRockUnitTests` correctly omits
`RocmRuntimeLoaderTests` and runs the rest.

Made-with: Cursor
Pin the contracts behind the four bug fixes from the prior commits:

  - `mlir/test/Dialect/Rock/native_arch_invalid.mlir`: lit test
    that runs `rocmlir-opt -rock-affix-params` on a function with
    `rock.arch = "native:foo"` and asserts a fatal-error diagnostic.
    Uses `not --crash` (rather than plain `not`) because
    `report_fatal_error` aborts via SIGABRT, which `not` only
    accepts under `--crash`. The parser fails before any HIP query
    so the test runs anywhere (just `REQUIRES: linux` for the
    `not --crash` semantics).

  - `mlir/unittests/Dialect/Rock/AmdArchDbTests.cpp`: three new
    gtest cases.
      * `NativeArchParseTest.MalformedSuffixAborts` uses
        `EXPECT_DEATH` to check that `lookupArchInfo("native:foo")`,
        `"native:1abc"`, `"native:"`, and `"native:-1"` each abort
        with the precise diagnostic. Death tests do not need a real
        GPU since the parser fires before HIP is touched.
      * `NativeArchParseTest.BareNativeIsDeviceZero` confirms that
        `lookupArchInfo("native")` (no colon) is well-formed and
        does NOT abort. Skipped when no GPU is present so the call
        actually completes without aborting elsewhere.
      * `NativeArchCacheTest.SameArchMultiGpuDistinct` exercises
        the per-device cache contract on systems with two same-arch
        GPUs: two distinct device ids must each return a value from
        an independently-queried cache entry, never share device 0's
        copy. Skipped when fewer than 2 same-arch GPUs are visible
        (typical CI).

Made-with: Cursor
@umangyadav umangyadav self-assigned this Apr 24, 2026
@umangyadav umangyadav changed the title Dlopen hip Lazy load HIP library to avoid LLVM versions incompatibilities Apr 24, 2026
@umangyadav umangyadav requested a review from Copilot April 24, 2026 13:27
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR removes link-time dependencies on ROCm runtime libraries (notably libamdhip64 and transitively ROCm’s monolithic libLLVM.so) by introducing an isolated, delay-loading mechanism. This avoids the “two LLVMs in one process” static-initializer collision that crashes rocMLIR/MLIR tools at process startup on some ROCm installations.

Changes:

  • Add mlir::rocm_loader::loadRocmLibrary(...) + resolveRocmSymbol(...) and wire ROCm-aware components to resolve HIP/HIPRTC/HSA symbols at runtime (with link-map isolation via dlmopen on glibc).
  • Update build/linker settings (version script for mlir_rocm_runtime, --exclude-libs,ALL for tools, -Bsymbolic-functions for shared libs) to reduce symbol interposition/ODR collisions.
  • Add lit + gtest coverage to enforce “no ROCm DT_NEEDED” and “only mgpu* exports” contracts, and improve native-arch parsing/caching behavior.

Reviewed changes

Copilot reviewed 32 out of 33 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
mlir/utils/jenkins/static-checks/get_fat_library_deps_list.pl Exclude certain libs from fat-archive dependency list generation.
mlir/unittests/Dialect/Rock/RocmRuntimeLoaderTests.cpp New gtests for loader coordination and version-agnostic behavior.
mlir/unittests/Dialect/Rock/CMakeLists.txt Conditionally add loader tests + link required targets.
mlir/unittests/Dialect/Rock/AmdArchDbTests.cpp Refactor tests to avoid link-pulling HIP; add native parsing/cache tests.
mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp Switch HIP calls to delay-load dispatch via macros.
mlir/tools/rocmlir-tuning-driver/HipDelayLoadMacros.h Macro redirects for hip*/hiprtc* callsites through function tables.
mlir/tools/rocmlir-tuning-driver/HipDelayLoad.h Delay-load HIP/HIPRTC API surface for tuning driver.
mlir/tools/rocmlir-tuning-driver/HipDelayLoad.cpp Implementation of symbol resolution + fail-fast diagnostics.
mlir/tools/rocmlir-tuning-driver/CacheFlush.cpp Use delay-loaded HIP/HIPRTC entry points.
mlir/tools/rocmlir-tuning-driver/CMakeLists.txt Stop linking HIP/HIPRTC; use includes/defs + link runtime loader helper.
mlir/tools/rocmlir-lib/CMakeLists.txt Document fat-lib runtime expectations for native arch.
mlir/test/lit.site.cfg.py.in Add features/substitutions to support new loader/ELF auditing tests.
mlir/test/lit.cfg.py Add .test suffix and exclude helper shell scripts from test discovery.
mlir/test/Dialect/Rock/native_arch_invalid.mlir New lit test for invalid native:* arch handling.
mlir/test/Dialect/Rock/native_arch.mlir Lit coverage for rock.arch="native" and "native:0" pipeline flow.
mlir/test/Dialect/Rock/Loader/no_rocm_neededs.test Lit shell test asserting no forbidden ROCm runtime DT_NEEDED deps.
mlir/test/Dialect/Rock/Loader/dynsym_only_mgpu.test Lit shell test asserting only mgpu* exports from runtime wrapper.
mlir/test/Dialect/Rock/Loader/check_no_rocm_neededs.sh Helper script to audit DT_NEEDED entries via readelf -d.
mlir/test/Dialect/Rock/Loader/check_dynsym_only_mgpu.sh Helper script to audit exports via nm -D.
mlir/lib/ExecutionEngine/CMakeLists.txt Mark conv-validation-wrappers as partial sources intended.
mlir/lib/Dialect/Rock/Tuning/ParamLookupTable.cpp Normalize native[:N] by resolving to hardware arch and caching.
mlir/lib/Dialect/Rock/IR/CMakeLists.txt Remove HIP/HSA link deps; add include-only wiring and loader dependency.
mlir/lib/Dialect/Rock/IR/AmdArchDb.cpp Implement HIP/HSA delay-load, stricter parsing, and deviceId-keyed cache.
mlir/include/mlir/Dialect/Rock/IR/AmdArchDb.h Expose nativeDeviceCount() and nativeArchName() APIs.
external/llvm-project/mlir/lib/ExecutionEngine/mlir_rocm_runtime.map New linker version script to export only mgpu*.
external/llvm-project/mlir/lib/ExecutionEngine/RocmSystemDetect.cpp Delay-load HIP and export shared HIP handle for coordination.
external/llvm-project/mlir/lib/ExecutionEngine/RocmRuntimeWrappers.cpp Delay-load HIP inside runtime wrappers; explicitly export mgpu APIs.
external/llvm-project/mlir/lib/ExecutionEngine/RocmRuntimeLoader.cpp New isolated loader implementation (dlmopen/dlopen/LoadLibraryW).
external/llvm-project/mlir/lib/ExecutionEngine/CMakeLists.txt Add MLIRRocmRuntimeLoader; remove link-time HIP deps; apply version script.
external/llvm-project/mlir/include/mlir/ExecutionEngine/RocmSystemDetect.h Declare exported shared-HIP-handle accessor for coordination.
external/llvm-project/mlir/include/mlir/ExecutionEngine/RocmRuntimeLoader.h Public API for ROCm delay-loading and symbol resolution.
cmake/llvm-project.cmake Add --exclude-libs,ALL to tools to avoid re-exporting LLVM symbols.
CMakeLists.txt Apply -Bsymbolic-functions globally for shared/module libs (non-Apple/non-Win).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +19 to 35
my %excludedLibs = map { $_ => 1 } qw(
conv-validation-wrappers
);

my @rocmlirLibs;
my @mlirLibs;

my @deps = split /\n/,`ninja -t query lib/libMLIRRockThin.a`;
not $? or die "failed to get target dependencies";

foreach (@deps) {
last if /outputs:/;
if (m#external/llvm-project/llvm/lib/lib(\w+)\.a#) {
push @mlirLibs, $1;
push @mlirLibs, $1 unless $excludedLibs{$1};
} elsif (m#lib/lib(\w+)\.a#) {
push @rocmlirLibs, $1;
push @rocmlirLibs, $1 unless $excludedLibs{$1};
}
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

%excludedLibs won’t actually exclude conv-validation-wrappers here because the regexes only capture lib(\w+)\.a (\w doesn’t match -). As a result, hyphenated library names are skipped entirely and the exclusion list is misleading/ineffective. Consider widening the capture to include - (e.g. [\w-]+) so the dependency is parsed and then filtered intentionally (or adjust the parsing to match Ninja’s actual output format).

Copilot uses AI. Check for mistakes.
void appendCandidates(std::vector<std::string> &out, llvm::StringRef bare,
Joiner joiner) {
out.emplace_back(bare.str());
for (unsigned m = kMaxProbedRocmMajor; m >= 1; --m)
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

appendCandidates uses for (unsigned m = kMaxProbedRocmMajor; m >= 1; --m), which never terminates for an unsigned loop variable (it wraps from 0 to UINT_MAX and m >= 1 remains true). This will hang when building the candidate list. Use a termination-safe form like for (unsigned m = kMaxProbedRocmMajor; m > 0; --m) or a signed counter.

Suggested change
for (unsigned m = kMaxProbedRocmMajor; m >= 1; --m)
for (unsigned m = kMaxProbedRocmMajor; m > 0; --m)

Copilot uses AI. Check for mistakes.
Comment on lines +99 to +119
if (arch.consume_front("native")) {
static std::mutex m;
static llvm::StringMap<std::string> resolvedCache;
std::lock_guard<std::mutex> lock(m);
// Re-form the original key for the cache lookup.
std::string cacheKey = ("native" + arch).str();
auto [it, inserted] = resolvedCache.try_emplace(cacheKey);
if (inserted) {
unsigned deviceId = 0;
if (arch.consume_front(":")) {
// Parse must succeed and fit in `unsigned`. Silently treating
// `native:foo` or `native:` as `native:0` would mask user error
// and target the wrong GPU on multi-GPU systems.
unsigned long long parsed = 0;
if (arch.empty() || llvm::getAsUnsignedInteger(arch, 0, parsed) ||
parsed > std::numeric_limits<unsigned>::max())
llvm::report_fatal_error(
Twine("Invalid `") + cacheKey +
"`: the suffix after `native:` must be a non-negative integer "
"device id (got `" +
arch + "`).");
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

normalizeArch treats any string with prefix "native" as the special native-arch syntax because it uses arch.consume_front("native"). That means inputs like "nativefoo" would be interpreted as native (device 0) instead of being rejected, which is inconsistent with parseArchString in AmdArchDb and could mask user typos. Consider checking arch == "native" or arch.starts_with("native:") (and leaving other strings to the regular gfx... parsing).

Suggested change
if (arch.consume_front("native")) {
static std::mutex m;
static llvm::StringMap<std::string> resolvedCache;
std::lock_guard<std::mutex> lock(m);
// Re-form the original key for the cache lookup.
std::string cacheKey = ("native" + arch).str();
auto [it, inserted] = resolvedCache.try_emplace(cacheKey);
if (inserted) {
unsigned deviceId = 0;
if (arch.consume_front(":")) {
// Parse must succeed and fit in `unsigned`. Silently treating
// `native:foo` or `native:` as `native:0` would mask user error
// and target the wrong GPU on multi-GPU systems.
unsigned long long parsed = 0;
if (arch.empty() || llvm::getAsUnsignedInteger(arch, 0, parsed) ||
parsed > std::numeric_limits<unsigned>::max())
llvm::report_fatal_error(
Twine("Invalid `") + cacheKey +
"`: the suffix after `native:` must be a non-negative integer "
"device id (got `" +
arch + "`).");
if (arch == "native" || arch.starts_with("native:")) {
StringRef nativeSuffix = arch;
[[maybe_unused]] bool consumed = nativeSuffix.consume_front("native");
assert(consumed && "native arch prefix must be present");
static std::mutex m;
static llvm::StringMap<std::string> resolvedCache;
std::lock_guard<std::mutex> lock(m);
// Re-form the original key for the cache lookup.
std::string cacheKey = ("native" + nativeSuffix).str();
auto [it, inserted] = resolvedCache.try_emplace(cacheKey);
if (inserted) {
unsigned deviceId = 0;
if (nativeSuffix.consume_front(":")) {
// Parse must succeed and fit in `unsigned`. Silently treating
// `native:foo` or `native:` as `native:0` would mask user error
// and target the wrong GPU on multi-GPU systems.
unsigned long long parsed = 0;
if (nativeSuffix.empty() ||
llvm::getAsUnsignedInteger(nativeSuffix, 0, parsed) ||
parsed > std::numeric_limits<unsigned>::max())
llvm::report_fatal_error(
Twine("Invalid `") + cacheKey +
"`: the suffix after `native:` must be a non-negative integer "
"device id (got `" +
nativeSuffix + "`).");

Copilot uses AI. Check for mistakes.
Comment on lines +72 to +80
# Apply to both the object library (the actual compile target) and the
# aggregate shared library, for parity with how upstream ExecutionEngine
# wires include dirs for its HIP-touching TUs.
foreach(_t MLIRRockOps obj.MLIRRockOps)
if (TARGET ${_t})
target_include_directories(${_t} SYSTEM PRIVATE ${hip_INCLUDE_DIRS})
endif()
endforeach()
unset(_t)
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block uses ${hip_INCLUDE_DIRS}, but elsewhere in the repo (e.g. external/llvm-project/mlir/lib/ExecutionEngine/CMakeLists.txt) the HIP package is treated as providing ${hip_INCLUDE_DIR}. If ${hip_INCLUDE_DIRS} is unset for the ROCm HIP CMake package on some installations, this will fail to add the include path and break the build. Consider switching to ${hip_INCLUDE_DIR} (or guarding/falling back to whichever variable is defined).

Suggested change
# Apply to both the object library (the actual compile target) and the
# aggregate shared library, for parity with how upstream ExecutionEngine
# wires include dirs for its HIP-touching TUs.
foreach(_t MLIRRockOps obj.MLIRRockOps)
if (TARGET ${_t})
target_include_directories(${_t} SYSTEM PRIVATE ${hip_INCLUDE_DIRS})
endif()
endforeach()
unset(_t)
if (hip_INCLUDE_DIRS)
set(_hip_include_dirs ${hip_INCLUDE_DIRS})
elseif(hip_INCLUDE_DIR)
set(_hip_include_dirs ${hip_INCLUDE_DIR})
else()
message(FATAL_ERROR
"HIP CMake package was found, but neither `hip_INCLUDE_DIRS` nor "
"`hip_INCLUDE_DIR` is defined.")
endif()
# Apply to both the object library (the actual compile target) and the
# aggregate shared library, for parity with how upstream ExecutionEngine
# wires include dirs for its HIP-touching TUs.
foreach(_t MLIRRockOps obj.MLIRRockOps)
if (TARGET ${_t})
target_include_directories(${_t} SYSTEM PRIVATE ${_hip_include_dirs})
endif()
endforeach()
unset(_t)
unset(_hip_include_dirs)

Copilot uses AI. Check for mistakes.
Base automatically changed from upstream-merge-jan-26 to develop April 28, 2026 13:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants