Lazy load HIP library to avoid LLVM version incompatibilities #2357
umangyadav wants to merge 31 commits into develop
Conversation
`MLIRRockOps` linked `hip::host`/`hip::amdhip64` (and `hsa-runtime64`)
directly so it could query GPU properties for `--arch native:N`. On
recent ROCm, `libamdhip64` transitively depends on `libamd_comgr.so`,
which in turn depends on ROCm's `libLLVM.so.23`. Any binary that embeds
rocMLIR's own LLVM (e.g. `rocmlir-driver`, `rocmlir-opt`) therefore
ended up with two LLVMs in the same address space. Their `cl::opt`
static initializers fight over the global option registry and the
process aborts with `LLVM ERROR: Option '...' already exists!`.
Confine HIP/HSA to a tiny standalone shared library
(`mlir_rocm_arch_runtime`) and load it lazily from `AmdArchDb.cpp` via
`dlmopen(LM_ID_NEWLM, ...)` on glibc / `LoadLibraryW` on Windows. The
fresh dynamic-linker namespace keeps ROCm's `libLLVM.so` from snapping
onto rocMLIR's embedded LLVM, so duplicate `cl::opt` registration can
no longer occur. Tools that never request `--arch native:N` never
load the runtime and stay completely free of HIP/comgr/LLVM-23.
Highlights:
- New `mlir/lib/ExecutionEngine/RocmArchRuntime.{h,cpp}`: stable C ABI
(`mlirRocmArchRuntimeAbiVersion`, `mlirRocmArchRuntimeDeviceCount`,
`mlirRocmArchRuntimeGetProperties`) implemented on top of HIP+HSA.
Built with plain `add_library(... SHARED)` so it does NOT pull in
`LLVMSupport`.
- `MLIRRockOps` drops all HIP/HSA link-time deps. New runtime loader
in `AmdArchDb.cpp` searches `<exe-dir>`, `<exe-dir>/../lib/`,
`ROCMLIR_BUILD_RUNTIME_DIR` (compile-time path for nested test
binaries), then the system loader path.
- Public helpers `nativeDeviceCount()` and `nativeArchName()` added so
callers (unit tests, Python bindings) can use the runtime without
including HIP headers.
- `ParamLookupTable::normalizeArch` learns to canonicalise `native[:N]`
by routing through the runtime, so `rock.arch = "native"` flows
through the full backend pipeline end-to-end.
- `mlir/lib/CMakeLists.txt` wires `add_dependencies(MLIRRockOps
mlir_rocm_arch_runtime)` after both subdirectories are processed,
and exposes `ROCMLIR_BUILD_RUNTIME_DIR` as a compile def pointing at
`LIBRARY_OUTPUT_DIRECTORY` (POSIX `.so`) or `RUNTIME_OUTPUT_DIRECTORY`
(Windows `.dll`).
- Fat-build (`BUILD_FAT_LIBROCKCOMPILER=ON`): the runtime stays as a
separate shared library (intentionally NOT bundled into
`librockCompiler.a`) and is excluded from
`librockcompiler_deps.cmake` via an explicit allowlist filter in
`get_fat_library_deps_list.pl`.
Tests:
- `MLIRRockUnitTests::AmdArchDbTests` rewritten to use the public
AmdArchDb API; no longer includes `hip/hip_runtime_api.h` and is
built unconditionally on all platforms.
- New `mlir/test/Dialect/Rock/native_arch.mlir` lit test exercising
`rock.arch = "native"` through `-rock-affix-params`. Gated on a new
`amd-gpu-present` lit feature so it skips on no-GPU CI machines.
Validated:
- Original repro `rocmlir-driver --kernel-pipeline=migraphx,highlevel
--arch gfx942` produces output identical to baseline.
- `rock.arch = "native"` end-to-end through `--kernel-pipeline=full`
emits a `gpu.binary` with `chip = "gfx942"`.
- `ldd` of `MLIRRockOps.so`, `rocmlir-driver`, `rocmlir-opt`, and
`MLIRRockUnitTests` is clean: no `libamdhip64`, no `libamd_comgr`,
no `libLLVM.so.23`, no `libclang-cpp.so.23`.
- Unit tests: 111/111 pass; 8 NativeArch tests run against real MI300X
hardware and match presets.
- Rock lit suite (excl. `integration/`): 102/102 pass.
- Fat build configured with locally-installed rocm-cmake builds the
static `librockCompiler.a` plus the separate `mlir_rocm_arch_runtime.so`
and reproduces both the original `gfx942` case and the
`rock.arch = "native"` case.
Made-with: Cursor
When a process embeds rocMLIR's LLVM and also loads libmlir_rocm_runtime.so
(e.g. mlir-runner --shared-libs=libmlir_rocm_runtime.so), the runtime
wrapper brings ROCm's libamdhip64 into the address space. libamdhip64
transitively pulls /opt/rocm/lib/llvm/lib/libLLVM.so.23 via libamd_comgr,
and the dynamic linker unifies llvm::cl::SubCommand::getTopLevel() and the
rest of the cl::opt global registry across both LLVM instances. Static
initializers in ROCm's LLVM then race with rocMLIR's LLVM and the process
aborts with "LLVM ERROR: Option '...' already exists!".
Hide everything that is not mgpu* from libmlir_rocm_runtime.so's dynsym:
* CXX_VISIBILITY_PRESET=hidden and VISIBILITY_INLINES_HIDDEN=ON so
non-annotated symbols default to STV_HIDDEN at compile time.
* -Wl,--exclude-libs,ALL so symbols pulled in from any static archive
(hip device runtime, clang-rt, libcxx internals) are made STB_LOCAL
at link time even if they were compiled with default visibility.
* -Wl,--version-script=mlir_rocm_runtime.map to assert the contract
at link time: only mgpu* may be exported. Anything else fails the
link rather than silently re-exposing LLVM helpers.
Windows is already default-hidden for DLL symbols, so the guard is a
no-op there. Apple's ld does not support GNU version scripts, so it is
also skipped.
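The contract the version script asserts can be sketched as follows (an illustrative fragment in GNU ld version-script syntax; the in-tree `mlir_rocm_runtime.map` is authoritative):

```
/* Export only the mgpu* C ABI; everything else -- including llvm::*
   helpers pulled in from static archives -- is demoted to local. */
{
  global:
    mgpu*;
  local:
    *;
};
```

The `local: *;` catch-all is what demotes every non-mgpu symbol; the `global:` pattern is the entire exported surface.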
Made-with: Cursor
Now that libmlir_rocm_runtime.so hides its LLVM symbols at link time
(preceding commit), the `cl::opt` collision that motivated a separate,
dlmopen-isolated shim library no longer manifests. Fold the HIP/HSA
delay-loader back into MLIRRockOps itself using plain dlopen (POSIX) /
LoadLibraryW (Windows).
Why this is safe without dlmopen:
* MLIRRockOps does not link libamdhip64 / libhsa-runtime64; the shared
objects are resolved at run time by SONAME only, so the dynamic linker
only searches for them when `rock.arch = "native[:N]"` is asked for.
* When HIP is loaded this way, RTLD_LOCAL keeps its transitive deps
(including ROCm's libLLVM.so via libamd_comgr) in HIP's own scope:
they are not re-exposed to us, and they will not satisfy undefined
references from rocMLIR's embedded LLVM.
* rocMLIR binaries embedding LLVM do not re-export cl:: state through
their dynsym either (the LLVM libraries are built with default
-fvisibility-inlines-hidden), so even when HIP's libLLVM.so mmaps
into the process, its static initializers resolve their cl::opt
globals through HIP-local symbol resolution and never see rocMLIR's
own registry. The two LLVMs stay disjoint.
Why in-process delay-load is better than the shim:
* No separate install artifact for downstream consumers (MIOpen,
MIGraphX) to ship; a ROCm install visible on LD_LIBRARY_PATH / RPATH
is sufficient.
* No ABI boundary to version (the shim had MLIR_ROCM_ARCH_RUNTIME_ABI_VERSION
plus a POD properties struct); fewer moving parts.
* No dlmopen, which is a glibc-specific extension with known gotchas
(TLS, static destructors, fork()) and no counterpart on musl.
Details:
* AmdArchDb.cpp: two tiny lazy loaders (HipRuntime, HsaRuntime) resolve
the few symbols we need (hipGetDeviceCount,
hipGetDevicePropertiesR0600 with a fallback to hipGetDeviceProperties,
hsa_init, hsa_iterate_agents, hsa_agent_get_info). SONAME candidates
cover ROCm 6 and 7 and unversioned installs. The HIP and HSA headers
are included only for POD layouts; they are not link-time deps.
* CMakeLists.txt for MLIRRockOps now does `find_package(hip QUIET)` and
only picks up include dirs; no hip::host / hip::amdhip64 link. HSA
headers are likewise include-only.
* Navi WGP-as-CU correction for shared-memory per CU runs before
checkAndSetInfo() so the HSA-assisted value flows through the same
code path as before.
* The mlir_rocm_arch_runtime target, RocmArchRuntime.{h,cpp},
ROCMLIR_BUILD_RUNTIME_DIR compile def, and the Perl allowlist for
the fat-lib Perl filter are all removed.
* Unit test and lit test comments are updated to match the new loader
path. Their behaviour is unchanged.
Verified: readelf -d build/lib/libMLIRRockOps.so.2.0 shows no NEEDED
entries for libamdhip64, libhsa-runtime64, libamd_comgr, or
libLLVM.so.*. The full chain will be re-validated in the next commit.
Made-with: Cursor
Plain `dlopen(RTLD_LAZY | RTLD_LOCAL)` is insufficient to break the
`cl::opt` collision against ROCm's libLLVM.so.23 in practice. Even
though RTLD_LOCAL stops us from re-exporting symbols from the dlopened
module, it does not hide rocMLIR's own process-global LLVM symbols from
the dynamic linker when libLLVM.so.23 runs its static initializers.
On glibc, libLLVM.so.23 is pulled in transitively by libamd_comgr, and
its static-init cl::opt registrations resolve against rocMLIR's
unversioned cl::* symbols, which registers the same option twice in
rocMLIR's registry. `rocmlir-opt -rock-affix-params` on
mlir/test/Dialect/Rock/native_arch.mlir reproduces this as:
LLVM ERROR: Option 'default' already exists!
...
libLLVMCodeGen.so.22.0git llvm::report_fatal_error(...)
libLLVM.so.23.0git <static init>
ld-linux-x86-64.so.2 _dl_catch_exception
dlmopen(LM_ID_NEWLM, ...) places libamdhip64 and its transitive deps
into a fresh link-map namespace where rocMLIR's symbols are not
visible, so libLLVM.so.23's static init can only resolve its cl::*
references against its own copies. The two LLVM instances stay
completely disjoint.
The non-glibc POSIX fallback keeps plain dlopen + RTLD_LOCAL. That
branch only works if the process also hides its LLVM surface via
build-side discipline (`-Wl,--exclude-libs,ALL`); see the follow-up
commits that apply that to rocMLIR executables. Windows is unchanged:
LoadLibraryW already gives a per-DLL private scope.
Verified:
* mlir/test/Dialect/Rock/native_arch.mlir passes (was crashing).
* Original repro (rocmlir-driver --kernel-pipeline=migraphx,highlevel
--arch gfx942) still passes.
* `rock.arch = "native"` via MLIR attribute still produces a valid
gpu.binary.
Made-with: Cursor
Add -Wl,--exclude-libs,ALL to add_rocmlir_tool(). Any static archive
that gets pulled into a tool (e.g. libLLVMSupport.a in the fat-lib
build, or intermediates on a conventional static build) is turned into
STB_LOCAL by the linker rather than being re-exported from the
executable's dynsym.
This is load-bearing on the non-dlmopen branches of AmdArchDb.cpp and
on musl / FreeBSD where dlmopen is unavailable: if the tool re-exports
llvm::cl::* globals, a later dlopen of libamd_comgr / libLLVM.so.23
will interpose against them and crash with "Option '...' already
exists" at static-init time.
The flag is a no-op when every LLVM/MLIR dependency is already a shared
library (the BUILD_SHARED_LIBS=ON default), so landing it now does not
change the shape of the built tools on this developer machine. It still
measurably reduces their dynamic symbol surface: dynsym on rocmlir-opt
drops from ~34 template weak symbols to 1, making future regressions
easier to spot with `nm -D`.
Apple's linker does not accept --exclude-libs; the guard matches the
one used for the libmlir_rocm_runtime.so version script.
Made-with: Cursor
The preceding commit paired --version-script=mlir_rocm_runtime.map with
CXX_VISIBILITY_PRESET=hidden. The combination over-hides: the mgpu*
entry points in upstream RocmRuntimeWrappers.cpp are not annotated with
__attribute__((visibility("default"))), so the hidden preset marks them
STV_HIDDEN at compile time and the version script has nothing to export
(`global: mgpu*;` only overrides global/local partitioning, it does not
un-hide compile-time-hidden symbols).
The result is a libmlir_rocm_runtime.so with an empty dynsym, which
breaks any mlir-runner / xmir-runner invocation that tries to dlsym an
mgpu* entry point for JIT launches.
Fix: use the version script alone and rely on default visibility. The
script still hides everything that is not mgpu*, which is all we need
to stop llvm::* leakage. Upstream CudaRuntimeWrappers.cpp solves the
same problem with explicit MLIR_CUDA_WRAPPERS_EXPORT annotations on
each entry point; we avoid that here to keep the rocMLIR-side patch
minimal.
Verified: nm -D --defined-only on libmlir_rocm_runtime.so now shows
exactly the 25 mgpu* T entries and zero llvm::* symbols.
Made-with: Cursor
Set CMAKE_SHARED_LINKER_FLAGS / CMAKE_MODULE_LINKER_FLAGS to include
-Wl,-Bsymbolic-functions for the entire build on non-Windows, non-Apple
targets. This affects both rocMLIR's own libMLIR*.so and the embedded
LLVM's libLLVM*.so.22.0git.
Effect: intra-library cl::* function calls (e.g. within
libLLVMSupport.so.22.0git, libLLVMCodeGen.so.22.0git, ...) are resolved
to local definitions at link time rather than going through the PLT.
The dynamic linker cannot later interpose them with any other loaded
shared object's cl::* implementation, in particular ROCm's
libLLVM.so.23.0git (pulled in transitively via libamdhip64 ->
libamd_comgr when a runner dlopens libmlir_rocm_runtime.so). Verified
with `objdump -d --section=.plt` on libLLVMSupport.so.22.0git: zero
cl::* PLT entries after this change.
Scope caveat: -Bsymbolic-functions addresses function-symbol
interposition only. Data-symbol interposition (vtables, the
cl::SubCommand process-global singleton) is unaffected, so the
mlir-runner / xmir-runner JIT path still hits the "Option '...' already
exists!" collision during static-init. Those tests remain pre-existing
upstream MLIR failures; closing them requires either a version script
hiding llvm::cl::* from the dynsym of libLLVMSupport.so.* or a full
CXX_VISIBILITY_PRESET=hidden on the embedded LLVM build. Both are
tracked as follow-ups.
Regression guard: 102/102 rock lit tests (excl. integration/) and
111/111 MLIRRockUnitTests still pass with this flag active.
Made-with: Cursor
Hiding cl::* from libmlir_rocm_runtime.so's dynsym (commit 8010fdf) was
necessary but not sufficient: the runtime still pulled libamdhip64 in as
a direct NEEDED, which transitively dragged libamd_comgr and ROCm's
libLLVM.so.23 into mlir-runner's / xmir-runner's main namespace at
dlopen time. ROCm's libLLVM.so.23 then ran static initializers that
unified the cl::SubCommand singleton across the two LLVMs, aborting
JIT execution with "Option 'default' already exists!".
Apply the same isolation pattern that mlir/lib/Dialect/Rock/IR/AmdArchDb.cpp
already uses for the rocMLIR compiler side:
* Remove `hip::host hip::amdhip64` from `mlir_rocm_runtime`'s link
line. Keep the HIP headers reachable via `hip_INCLUDE_DIR` so we can
still see hipDeviceProp_t / hipMemcpyKind layouts.
* Resolve every hipXXX entry point through a function-pointer table
populated by dlmopen(LM_ID_NEWLM, "libamdhip64.so.7", RTLD_LAZY) on
glibc, plain dlopen elsewhere on POSIX, and LoadLibraryA on Windows.
* Bare hipXXX call sites are kept verbatim by routing them through
variadic function-like macros (`#define hipModuleLoadData(...)
(::getHip().moduleLoadData(__VA_ARGS__))` etc.); the rest of the
file stays byte-for-byte identical to upstream.
* mgpu* entry points pick up explicit MLIR_HIP_WRAPPERS_EXPORT
visibility annotations so the existing libmlir_rocm_runtime.so
version script keeps exporting them after the next merge from
upstream (where default-visibility was the only thing exporting
them).
Effect: `readelf -d build/.../libmlir_rocm_runtime.so` no longer lists
libamdhip64 as NEEDED. ROCm's libamd_comgr and libLLVM.so.23 land in
the private dlmopen namespace and are completely invisible to the host
process's LLVM.
Verified end-to-end:
* mlir/test/rocmlir-driver/conv2d_harness/conv2d_harness.mlir (and
siblings that previously crashed in mlir-runner): PASS.
* Rock integration suite: 111/115 pass, up from 102/115. The
remaining 4 failures (rock-reduce-max-case{1..4}.mlir) abort with
SmallPtrSet "Bucket < End", an unrelated downstream bug in the
reduce-max lowering -- they pre-date this branch.
* Original gfx942 repro and rock.arch="native" attribute path still
pass.
* 102/102 rock lit (excl. integration) and 111/111 MLIRRockUnitTests
still pass.
Made-with: Cursor
`xmir-runner` link-pulls libMLIRRocmExecutionEngineUtils.so for
RocmSystemDetect, which in turn link-pulled libamdhip64 -> libamd_comgr
-> ROCm's libLLVM.so.23. The chain was loaded at process start before
main() ran, and libLLVM.so.23's static initializers then unified
cl::SubCommand against rocMLIR's embedded libLLVMSupport.so.22, aborting
in `_dl_init` with a SmallPtrSet "Bucket < End" assertion (the iterator
walked off the end of a SmallPtrSet that the two LLVMs disagreed about).
Fix:
1. Drop the link-time `hip::host hip::amdhip64` dependency from
MLIRRocmExecutionEngineUtils. Only the HIP headers are kept, via
`hip_INCLUDE_DIR`. RocmSystemDetect.cpp now resolves the two HIP
entry points it actually uses (`hipGetDeviceCount`,
`hipGetDevicePropertiesR0600`) through a dlmopen(LM_ID_NEWLM, ...)
handle, mirroring the pattern in
mlir/lib/Dialect/Rock/IR/AmdArchDb.cpp and the preceding patch to
libmlir_rocm_runtime.so.
2. Coordinate with libmlir_rocm_runtime.so so we never dlmopen
libamdhip64 twice in the same process. KFD allows only one
user-space HSA session per process; a second dlmopen lands in a
fresh link-map namespace, gets its own HIP/HSA copy, and every
call from that copy returns hipErrorNoDevice.
RocmSystemDetect runs first (during JitRunner's mlirTransformer
step, before --shared-libs are dlopen'd) and is therefore the
canonical owner of the HIP namespace. It exports a small C ABI
`mlirRocmSystemDetectGetHipHandle()` returning its dlmopen'd
handle. RocmRuntimeWrappers.cpp dlsyms that symbol via RTLD_DEFAULT
when its `mgpu*` table is being built, and reuses the handle if
present. If the symbol is absent (host binary that does not link
RocmSystemDetect), it transparently falls back to its own dlmopen.
Verified end-to-end:
* `xmir-runner` ldd is now clean -- no libamdhip64, no libamd_comgr,
no libLLVM.so.23 in the transitive closure.
* mlir/test/Dialect/Rock integration suite: 115/115 pass (was
102/115; the 4 reduce_max + ~9 other JIT tests are now green).
* mlir/test/xmir suite: 59/61 pass, 1 XFAIL, 1 FAIL. The remaining
FAIL (bert_part_48) is "Error loading library: ocml.bc: Unknown
attribute kind (106) (Producer: 'LLVM23.0.0git' Reader: 'LLVM
22.0.0git')" -- a ROCm-vs-rocMLIR LLVM-IR version skew, not
related to this work.
* mlir/test/{Dialect,Conversion,CAPI,rocmlir-*,migraphx_models,
xmir,mlir-rock-lib} broader suite: 368/380 (96.84%) pass, was
300/380. The remaining failures are 1 bert_part_48 above and 1
rocmlir-tuning-driver/benchmark-config.mlir; the latter has the
same root cause but lives in a separate tool that direct-links
~30 HIP/HIPRTC entry points and needs its own architectural
refactor.
* 102/102 rock lit (excl. integration) and 111/111 MLIRRockUnitTests
still pass.
* Original repro (`rocmlir-driver --kernel-pipeline=migraphx,highlevel
--arch gfx942`) and `rock.arch="native"` both pass.
Made-with: Cursor
`rocmlir-tuning-driver` link-pulled `hip::host hip::amdhip64
hiprtc::hiprtc` directly. The transitive chain `libamdhip64 ->
libamd_comgr -> libLLVM.so.<major>` made the dynamic linker map ROCm's
monolithic libLLVM into the host process at startup. ROCm's libLLVM ran
its `cl::opt` static initializers before main(), unifying its
`cl::SubCommand` registry with rocMLIR's split-shared-libs
libLLVMSupport, and the resulting iterator inconsistency aborted the
process in `_dl_init` with "SmallPtrSet: Bucket < End". This affected
the only previously-failing lit test:
mlir/test/rocmlir-tuning-driver/benchmark-config.mlir.
Resolve every hipXXX/hiprtcXXX entry point at run time instead:
* Add `HipDelayLoad.{h,cpp}` (rocmlir::tuningdriver namespace) that
exposes `getHipSymbols()` / `getHiprtcSymbols()` accessors.
`HipDelayLoad.cpp` resolves libamdhip64 / libhiprtc via
`dlmopen(LM_ID_NEWLM, ...)` on glibc, plain `dlopen(RTLD_LAZY |
RTLD_LOCAL)` on other POSIX, and `LoadLibraryA` on Windows.
* Coordinate with libMLIRRocmExecutionEngineUtils via the existing
`mlirRocmSystemDetectGetHipHandle` cross-library back-channel: if
RocmSystemDetect has already loaded HIP into a private namespace,
reuse its handle. KFD permits exactly one user-space HSA session
per process, so a second dlmopen would otherwise return
`hipErrorNoDevice` for every call. HIPRTC is loaded into the same
link-map namespace as HIP via `dlinfo(RTLD_DI_LMID)` so the JITed
instruction-cache flush kernel sees the same device handles.
* Add `HipDelayLoadMacros.h`, included LAST in each consumer .cpp to
redirect bare `hipMalloc(...)` / `hiprtcCompileProgram(...)` etc.
to the function-pointer table dispatch. This keeps
rocmlir-tuning-driver.cpp and CacheFlush.cpp byte-equivalent to
their upstream-style HIP code; only the include block at the top
grows by two lines per file. The `hipExtModuleLaunchKernel` macro
injects the trailing `flags=0` argument that the HIP header would
otherwise supply as a default (default arguments do not apply when
calling through a function pointer).
* CMake: drop `target_link_libraries(... hip::host hip::amdhip64
hiprtc::hiprtc)`; keep the find_package calls and switch to
include-only wiring via `hip_INCLUDE_DIR` /
`hiprtc::hiprtc INTERFACE_INCLUDE_DIRECTORIES` plus the
`__HIP_PLATFORM_AMD__=1` define. Add `${CMAKE_DL_LIBS}` for the
dlmopen/dlsym calls in HipDelayLoad.cpp.
* CMake: drop the `benchmark-driver-utils` link-line entry. It was
dead code -- rocmlir-tuning-driver does not include
`benchmarkUtils.h` or call any `benchmark::*` symbol. Keeping it
pulled libamdhip64 in transitively (via `target_link_libraries(
benchmark-driver-utils PUBLIC hip::host hip::amdhip64)`), so its
removal was load-bearing for breaking the HIP transitive chain.
Verified end-to-end on the dlopenHip branch over LLVM 23:
* `readelf -d build/bin/rocmlir-tuning-driver` is now clean -- no
libamdhip64, libamd_comgr, libhiprtc, or libLLVM.so.23 in the
direct or transitive closure.
* mlir/test/rocmlir-tuning-driver/benchmark-config.mlir now PASSES.
* Full broad lit (Dialect, Conversion, CAPI, rocmlir-*, xmir,
migraphx_models, mlir-rock-lib): 371/381 pass, 2 XFAIL, 8
Unsupported, 0 failures (was 370/381 + 1 fail).
* rock integration suite: 115/115 pass.
* xmir suite: 60/61 pass + 1 XFAIL.
* MLIRRockUnitTests: 111/111 pass.
* Original repro
`rocmlir-driver --kernel-pipeline=migraphx,highlevel --arch gfx942`
still passes; `rock.arch="native"` still passes.
Made-with: Cursor
Header-only utility that consolidates the dlmopen / dlopen / LoadLibraryA
scaffolding that was independently re-implemented in four places on the
dlopenHip branch:
* external/.../ExecutionEngine/RocmRuntimeWrappers.cpp
* external/.../ExecutionEngine/RocmSystemDetect.cpp
* mlir/lib/Dialect/Rock/IR/AmdArchDb.cpp
* mlir/tools/rocmlir-tuning-driver/HipDelayLoad.cpp
Each had its own `kHipCandidates` SONAME list, `osOpen`/`osSym` shim,
and (in three of the four) its own copy of the
`mlirRocmSystemDetectGetHipHandle` RTLD_DEFAULT-coordination dance.
Provides:
* `loadRocmLibrary(Library, relatedHandle, CoordinationPolicy)`
- `Library::Hip` defaults to `CoordinationPolicy::Auto`, which
consults `mlirRocmSystemDetectGetHipHandle` via RTLD_DEFAULT
first so all callers share one HSA session per process (KFD only
permits one). RocmSystemDetect itself opts out via
`CoordinationPolicy::Owned`.
- `relatedHandle` is used for HIPRTC/HSA so they are dlmopen'd into
the *same* link-map namespace as HIP via dlinfo(RTLD_DI_LMID).
That preserves the single-HSA-session invariant for satellite
libraries.
* `resolveRocmSymbol(LoadedLibrary, name)` -- thin dlsym/GetProcAddress
wrapper.
* Platform candidate lists (`libamdhip64.so.{7,6}` / decorated
`amdhip64_*.dll`, etc.) declared once.
Header-only on purpose: no new CMake target, no new linkage edge, no
ABI to version. Each consumer TU gets its own static copy of the
inline functions; the only cross-TU contract is the
`mlirRocmSystemDetectGetHipHandle` extern-C symbol that
`RocmSystemDetect.cpp` exports for RTLD_DEFAULT lookup. That stays
unchanged.
Two correctness improvements vs the per-file copies:
1. AmdArchDb (in MLIRRockOps) previously did its own dlmopen,
bypassing the RocmSystemDetect coordination. In a process that
loads both libraries (e.g. a future xmir-runner that also lowers
`rock.arch = "native"`), KFD's one-session limit would have
caused the second dlmopen to return `hipErrorNoDevice` for every
call. The new helper makes the coordination automatic for every
consumer.
2. HSA was previously dlmopen'd into a fresh namespace by AmdArchDb,
even though HIP was already loaded into another namespace by the
same TU. With `relatedHandle`, HSA shares HIP's namespace via
dlinfo(RTLD_DI_LMID), keeping both on a single KFD session.
Made-with: Cursor
Replace the per-file dlmopen / dlopen / LoadLibraryA scaffolding in
RocmRuntimeWrappers.cpp and RocmSystemDetect.cpp with calls to the
shared `mlir::rocm_loader` helpers. The two files now share:
* One copy of the SONAME candidate lists.
* One copy of the `osOpen` / `osSym` / `_GNU_SOURCE` / `<dlfcn.h>`
boilerplate.
* One copy of the `mlirRocmSystemDetectGetHipHandle` RTLD_DEFAULT
coordination dance (RocmSystemDetect uses
`CoordinationPolicy::Owned` to opt out, since it *is* the
canonical owner of the handle).
Net delta per file:
* RocmRuntimeWrappers.cpp: -91 lines (was 75 lines of scaffolding,
now 1 call to `loadRocmLibrary`)
* RocmSystemDetect.cpp: -68 lines (was 60 lines of scaffolding +
handle-coordination comment, now 3 lines)
Drive-by cleanups in the same diff:
* Remove the unused `#include <mutex>` from RocmRuntimeWrappers.cpp.
* Guard `#pragma GCC diagnostic` in RocmSystemDetect.cpp behind
`__GNUC__ && !__clang__` so MSVC does not emit `C4068: unknown
pragma` on a Windows build.
Behaviour is unchanged on glibc Linux (same dlmopen LM_ID_NEWLM, same
symbol search order); test results are identical (`ninja check-rocmlir`
1337/1337 pass, fat-lib build 472/472 pass).
Made-with: Cursor
Replace the per-file dlmopen scaffolding in `MLIRRockOps`'s
`AmdArchDb.cpp` and `rocmlir-tuning-driver`'s `HipDelayLoad.{h,cpp}`
with calls to the shared `mlir::rocm_loader` helpers (added in the
preceding [EXTERNAL] commit). Both rocMLIR-side TUs now coordinate
automatically with `RocmSystemDetect` through
`mlirRocmSystemDetectGetHipHandle`, so a process that loads
`MLIRRockOps` (for `rock.arch = "native"` lowering) and
`MLIRRocmExecutionEngineUtils` (for runner target detection) shares a
single HSA session instead of racing against KFD's
one-session-per-process limit.
Behavioural changes:
1. AmdArchDb's HIP loader now consults
`mlirRocmSystemDetectGetHipHandle` via RTLD_DEFAULT before
dlmopen'ing its own libamdhip64. This was the latent bug
identified in review: with the prior code, a future xmir-runner
path that lowered `rock.arch = "native"` would have hit
`hipErrorNoDevice` for every HIP call from RockOps. The
coordination now matches what `libmlir_rocm_runtime.so` and the
tuning driver already do.
2. AmdArchDb's HSA loader now passes HIP's handle to
`loadRocmLibrary(Library::Hsa, hipHandle)`, so HSA dlmopens into
HIP's link-map namespace via dlinfo(RTLD_DI_LMID) and shares the
same KFD session. Previously HSA would dlmopen LM_ID_NEWLM and
`hsa_init()` could (silently) end up tied to a separate session.
Drive-by cleanups in the same diff:
* AmdArchDb.cpp: replace `std::unordered_map<std::string,...>` with
`llvm::StringMap` per LLVM coding standards.
* AmdArchDb.cpp: drop the `#include <unordered_map>` and `<cstring>`
that became dead after the refactor.
* AmdArchDb.cpp: fix the file-banner typo (`Dtabase` -> `Database`).
* HipDelayLoad.cpp: drop the redundant `#include <hip/hip_ext.h>`
(already pulled in via HipDelayLoad.h) and the now-unused
`<dlfcn.h>` / `<windows.h>` blocks.
* MLIRRockOps CMakeLists: guard `target_include_directories(
obj.MLIRRockOps ...)` with `if (TARGET ...)` for symmetry with the
upstream ExecutionEngine CMake; iterate over both target names so
HIP/HSA include dirs land on whichever is the actual compile
target.
Net delta:
* AmdArchDb.cpp: -116 lines (180 lines of dlmopen scaffolding +
HipRuntime/HsaRuntime ctors collapsed to 1 call each).
* HipDelayLoad.cpp: -118 lines (the entire SONAME + dlmopen +
dlinfo + osSym block deleted; only the function-pointer table
population remains).
* HipDelayLoad.h: switch `void *handle` to
`mlir::rocm_loader::LoadedLibrary lib` so the helper's opaque type
is the contract.
Verified: build/ (BUILD_SHARED_LIBS=ON) `check-rocmlir` 1337/1337 pass
+ 5 XFAIL; build_migx/ (BUILD_FAT_LIBROCKCOMPILER=ON) `check-rocmlir`
472/472 pass + 3 XFAIL. MLIRRockUnitTests 111/111. Original gfx942
repro and `rock.arch="native"` paths both still pass.
Made-with: Cursor
Splits the formerly header-only `RocmDynamicLoader.h` into a small
public header and an implementation file so it is fit for upstream
LLVM submission:
- Public header (`RocmRuntimeLoader.h`) is platform-agnostic. It no
longer pulls in `<windows.h>` (which leaks `min`/`max`/`ERROR`
macros) nor defines `_GNU_SOURCE` (a feature-test macro that has
no business in an installed MLIR header).
- Implementation (`RocmRuntimeLoader.cpp`) keeps `<windows.h>` and
`_GNU_SOURCE` confined and matches upstream LLVM's
`Support/Windows/DynamicLibrary.inc` pattern: UTF-8 SONAMEs are
converted to UTF-16 via `convertUTF8ToUTF16String` and passed to
`LoadLibraryW`. POSIX paths use `dlmopen(LM_ID_NEWLM, ...)` on
glibc and degrade to `dlopen(RTLD_LAZY|RTLD_LOCAL)` elsewhere
with a one-time `LLVM_DEBUG` advisory.
- Replaces a zero-element `kHsaCandidates[]` on Windows (ill-formed
in standard C++; rejected by MSVC) with a `(nullptr, 0)` sentinel.
- Adds an `llvm_unreachable` for the impossible `Library` enum
case.
- Builds the new TU as a small static-library target
(`MLIRRocmRuntimeLoader`) so consumers do not pay the cost of a
new dynamic dependency. `mlir_rocm_runtime` and
`MLIRRocmExecutionEngineUtils` link against it.
Documents the cross-library coordination ABI on the canonical
declaration: `mlirRocmSystemDetectGetHipHandle` is now declared in
`RocmSystemDetect.h` with full rationale (KFD's one-HSA-session-per-
process limit) and is used by the loader via `RTLD_DEFAULT` lookup so
no link-time dependency on `MLIRRocmExecutionEngineUtils` is needed.
Removes the now-stale `DISABLE_PCH_REUSE` "TODO: workaround" comment
on `mlir_rocm_runtime`. Symbol containment is enforced at link time
by the version script (`mlir_rocm_runtime.map`); the dynsym contract
is verified by a regression test added in a follow-up commit.
Made-with: Cursor
Updates the rocMLIR-side delay-load consumers (`AmdArchDb.cpp` for
`rock.arch = "native"`, and `rocmlir-tuning-driver`'s `HipDelayLoad`
helpers) to include the new header
`mlir/ExecutionEngine/RocmRuntimeLoader.h` and to link against the
new static library target `MLIRRocmRuntimeLoader`.
This is purely a rename + link-target switch; behaviour is identical.
The transitive `${CMAKE_DL_LIBS}` dependency that was previously
declared on each consumer is now exposed `PUBLIC`-ly by
`MLIRRocmRuntimeLoader` itself, so the redundant per-consumer link
lines are dropped.
Made-with: Cursor
Closes the test-coverage gap on three independent invariants of the
multi-LLVM isolation design:
1. `Dialect/Rock/Loader/no_rocm_neededs.test` -- asserts that the
shipped binaries (`rocmlir-driver`, `rocmlir-opt`, `rocmlir-gen`,
`rocmlir-tuning-driver`, `xmir-runner`, `mlir-runner`) and the
ROCm-aware shared libraries (`libMLIRRockOps`,
`libMLIRRocmRuntimeLoader`, `libmlir_rocm_runtime`,
`libMLIRRocmExecutionEngineUtils`) never grow a `DT_NEEDED` entry
for `libamdhip64`, `libhiprtc`, `libamd_comgr`, or ROCm's
monolithic `libLLVM.so`. Any such regression would defeat the
entire delay-load design by re-introducing the `cl::opt` /
SmallPtrSet collisions at process startup.
2. `Dialect/Rock/Loader/dynsym_only_mgpu.test` -- asserts that
`libmlir_rocm_runtime.so` exports nothing but the `mgpu*` C ABI
entry points required by upstream MLIR's GPU lowering. This is what
the `mlir_rocm_runtime.map` version script is supposed to
guarantee; the test pins that contract so an accidental new source
file or transitive LLVM dep does not silently widen the dynsym.
3. `RocmRuntimeLoaderTests` (gtest) -- pins the
`CoordinationPolicy::Auto` contract: the second consumer to call
`loadRocmLibrary(Hip)` MUST observe the same handle as
`RocmSystemDetect`, otherwise KFD's "one HSA session per process"
rule starts handing out `hipErrorNoDevice`. Also covers the `Owned`
policy (used by the canonical owner to break recursion) and the
soft-failure semantics of `resolveRocmSymbol` on a null handle.
Extends `native_arch.mlir` with a `rock.arch = "native:0"` case to
cover the explicit-device-id parsing path of the `--arch native:N`
runtime resolver.
Adds a `linux` lit feature (set when `platform.system() == "Linux"`)
for the two ELF-format tests; they `// REQUIRES: linux` so the suite
stays portable to macOS / Windows configurations.
Also adds a `%rocmlir_shlib_dir` lit substitution derived from the
`host_ldflags` rpath so the cleanliness scripts can locate the build's
library output directory without hard-coding `lib/` vs `lib64/`.
The new gtest sub-suite is conditionally compiled: it requires
`MLIRRocmExecutionEngineUtils` as a separate target, which only exists
in the standard shared-library configuration. Under
`BUILD_FAT_LIBROCKCOMPILER=ON` the symbols are not exposed separately
and the sub-suite is omitted; the contract is identical in both
configurations and is exercised end-to-end by the broader regression
suite.
Made-with: Cursor
Removes the hardcoded `libamdhip64.so.7` / `libamdhip64.so.6` /
`hiprtc0700.dll` etc. SONAME lists and replaces them with a generated
candidate list:
1. The bare unversioned SONAME (`libamdhip64.so` / `amdhip64.dll`)
comes first. This is what `find_package(hip)`, IREE's HIP HAL
and Triton's HIP loader use. Standard ROCm installs always ship
the unversioned symlink, so this is the fast path on every
supported configuration.
2. Numeric versioned SONAMEs (`libamdhip64.so.<MAJOR>` for MAJOR
descending from `kMaxProbedRocmMajor` (99) to 1) are tried as a
fallback for runtime-only deployments where the unversioned
alias has been stripped. A failing `dlopen` of a missing SONAME
costs ~5us on glibc, so the worst-case 99-element walk is sub-
millisecond and is paid only once at startup.
3. Windows HIPRTC is decorated as `hiprtc<MM>00.dll` (e.g.
`hiprtc0700.dll` for ROCm 7.0); we generate that pattern for
each candidate major instead of hardcoding two majors.
The same approach extends to HSA: `libhsa-runtime64.so` first, then
descending numeric majors. (HSA has stayed on `.so.1` for the entire
ROCm 4.x-7.x window, but iterating costs nothing if AMD ever bumps
it.)
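The enumeration order described above can be sketched in Python (the SONAMEs and the `kMaxProbedRocmMajor = 99` bound come from this commit; the helper name `soname_candidates` is a hypothetical stand-in for the loader's internal candidate builder):

```python
K_MAX_PROBED_ROCM_MAJOR = 99

def soname_candidates(bare, versioned):
    """Bare unversioned SONAME first (the fast path), then numeric
    majors descending as a fallback for runtime-only deployments."""
    out = [bare]
    out.extend(versioned(m) for m in range(K_MAX_PROBED_ROCM_MAJOR, 0, -1))
    return out

hip = soname_candidates("libamdhip64.so", lambda m: f"libamdhip64.so.{m}")
# Windows HIPRTC decoration: hiprtc<MM>00.dll, e.g. hiprtc0700.dll for 7.0.
hiprtc_win = [f"hiprtc{m:02d}00.dll" for m in range(K_MAX_PROBED_ROCM_MAJOR, 0, -1)]
```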
Result: this loader is built once and works against any ROCm major
version present at runtime -- ROCm 4.x through any future ROCm
release, without code changes (so long as AMD stays at or below
`kMaxProbedRocmMajor`, which is `static_assert`-ably easy to bump).
Documents the version-compatibility contract on the public header
under "ROCm version compatibility" so reviewers can verify the
forward/backward-compatibility story without reading the
implementation.
Made-with: Cursor
Removes hardcoded LLVM- and rocMLIR-version filename suffixes from
the cleanliness lit tests, so a future LLVM merge or rocMLIR version
bump does not silently turn the assertions into no-ops.
Test infrastructure:
- `check_no_rocm_neededs.sh` now globs library *base names* (e.g.
`libMLIRRockOps.so*`, `libmlir_rocm_runtime.so*`) and audits
every match. The script reports the artefact count it actually
checked, fails loudly if zero artefacts were found, and skips
bare `.so` developer symlinks to avoid double-reporting the same
NEEDED set. Tested against a synthesized "future LLVM 99 +
rocMLIR 42" tree to confirm it picks up arbitrary version
suffixes.
- `check_dynsym_only_mgpu.sh` likewise globs
`libmlir_rocm_runtime.so*` instead of pinning `.so.23.0git`,
falling back to the unversioned dev symlink as a last resort.
- `lit.site.cfg.py.in`'s `%rocmlir_shlib_dir` substitution now
parses the rpath out of `host_ldflags` and falls back through
four well-known build-tree library directories
(`<obj>/lib`, `<obj>/lib64`, `<obj>/external/llvm-project/lib`,
`<llvm_obj>/lib`) instead of hardcoding `lib`.
Unit test:
- Adds `RocmRuntimeLoaderVersionContract.FindsHipWithoutVersionHardcoding`,
which proves the loader resolves HIP without compile-time
knowledge of the ROCm major version. The test uses the `Auto`
policy (the path downstream consumers actually take) to avoid
KFD's "one HSA session per process" limit, which would otherwise
mask a true loader failure under a spurious `dlmopen` failure
from repeated `Owned` calls. Verifies that `hipGetDeviceCount`
-- a symbol present on every HIP major since ROCm 1.x --
resolves through the returned handle, pinning that the loaded
library is actually HIP and not some other library that happened
to match the SONAME pattern.
Made-with: Cursor
The header was renamed to `RocmRuntimeLoader.h` in commit e61ae4a, but three file-level docblocks still referenced the pre-rename name. They are comments only -- no code or include-path impact -- but the stale text confuses a reviewer who tries to follow the cited path. Updated in `RocmRuntimeWrappers.cpp`, `RocmSystemDetect.cpp` and the rocMLIR-side `AmdArchDb.cpp`. Made-with: Cursor
Strengthens the `CoordinationPolicy::Owned` doc comment to spell out that the policy is reserved for `RocmSystemDetect` and must NOT be called from elsewhere. Otherwise a downstream reader would see the parameter and assume it is the more direct, simpler policy to use -- when in fact a second `Owned` invocation in the same process opens HIP into a fresh `dlmopen(LM_ID_NEWLM, ...)` namespace, hits KFD's "one HSA session per process" limit, and starts handing out `hipErrorNoDevice` from every subsequent HIP call. This caveat surfaced while writing the version-agnosticism contract test: an `Owned`-using test happened to run after the canonical owner had already loaded HIP, and the loader silently produced a "working" handle that returned no devices. The test was switched to use `Auto` (the policy downstream consumers should use); the doc now blocks anyone else from making the same mistake. Made-with: Cursor
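The coordination contract this doc comment pins can be modelled as a toy Python state machine (class and method names are hypothetical illustrations, not the real C++ API; `_open_new_namespace` stands in for `dlmopen(LM_ID_NEWLM, ...)`):

```python
class RocmLoaderModel:
    """Toy model: `Owned` always opens a fresh link-map namespace (only
    the canonical owner may do this, once per process); `Auto` reuses
    the canonical owner's handle when one exists."""

    def __init__(self):
        self.canonical = None
        self.sessions = 0  # KFD allows exactly one HSA session per process

    def _open_new_namespace(self):
        self.sessions += 1
        return object()

    def load(self, policy):
        if policy == "Owned":
            self.canonical = self._open_new_namespace()
            return self.canonical
        # Auto: the path downstream consumers should take.
        if self.canonical is None:
            self.canonical = self._open_new_namespace()
        return self.canonical

loader = RocmLoaderModel()
owner = loader.load("Owned")    # canonical owner (RocmSystemDetect)
consumer = loader.load("Auto")  # downstream consumer sees the same handle
loader.load("Owned")            # a second Owned call opens a 2nd session
```

In the model, `consumer is owner` holds and the second `Owned` call pushes `sessions` to 2, which is exactly the state that trips KFD's limit in the real loader.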
Two upstream-style polish fixes to make the loader feel native to a
LLVM/MLIR reviewer:
- Replace the hand-rolled `MLIR_ROCM_SHARED_HIP_EXPORT` macro
(`__declspec(dllexport)` on Windows / `visibility("default")`
elsewhere) with the canonical `LLVM_ALWAYS_EXPORT` from
`llvm/Support/Compiler.h`. Bit-for-bit identical expansion, but
the upstream macro is what reviewers expect and is consistent
with how MLIR runtime wrappers (`CudaRuntimeWrappers.cpp`, etc.)
publish their entry points. Updated the public docstring on
`mlirRocmSystemDetectGetHipHandle` to reference the macro by name
so a reader can grep its definition.
- Document why `RocmRuntimeLoader.cpp` does NOT use the standard
`llvm::sys::DynamicLibrary` wrapper. `sys::DynamicLibrary` always
passes `RTLD_LAZY | RTLD_GLOBAL` to `dlopen` (see
`lib/Support/Unix/DynamicLibrary.inc`) and there is no public
knob to swap in `RTLD_LOCAL` or `dlmopen(LM_ID_NEWLM, ...)`.
`RTLD_GLOBAL` is exactly the thing the loader is built to avoid:
it lets the dynamic linker unify ROCm's `libLLVM.so` symbols
with the host's embedded LLVM, which is the original `cl::opt`
collision being fixed. The implementation below intentionally
drops to raw OS APIs with the flags we actually need
(`dlmopen(LM_ID_NEWLM, ..., RTLD_LAZY)` on glibc,
`dlopen(..., RTLD_LAZY | RTLD_LOCAL)` on other POSIX,
`LoadLibraryW` on Windows). The new file-level comment makes
this deliberate deviation explicit so a future reviewer does not
"fix" it by switching to `sys::DynamicLibrary`.
No behavioural change. Symbol table is unchanged
(`mlirRocmSystemDetectGetHipHandle` still exported with default
visibility), all 4 loader unit tests still pass, the dynsym /
NEEDED-cleanliness lit suite is unchanged.
Made-with: Cursor
Removes accumulated indirection that the file picked up across several
review-driven rewrites. No behavioural change.
- Collapse 6 helper functions (`windowsLoadLibrary`,
`posixOpenIsolated`, `posixOpenInSameNamespace`,
`warnIfWeakIsolation`, `openIsolatedImpl`,
`openInSameNamespaceImpl`) into 2 (`openIsolated`,
`openInRelatedNamespace`). The two `*Impl` wrappers were
dispatching to a single named impl per platform, adding a layer
of named indirection without value; the platform branches now
live directly inside `openIsolated`.
- Drop the unreachable defensive `if (existingHandle && ...)`
inside `posixOpenInSameNamespace`. The caller (now
`openInRelatedNamespace`) already routes the null case to
`openIsolated`, so the inner check could not fire.
- Inline `warnIfWeakIsolation` into `openIsolated`'s POSIX branch
-- the function existed only to guard a one-time advisory whose
body and call site together cost more than the advisory itself.
- Factor the bare + numeric-fallback enumeration into a small
`appendCandidates(out, bare, joiner)` helper. Each `Library`
case in `candidatesFor` shrinks from a 4-line `emplace_back +
for-loop` block to a 3-line lambda call, and the platform-
specific `joiner` lambda is the only place that knows how the
versioned SONAME is composed.
- Use `DEBUG_TYPE` string literal in `LLVM_DEBUG` messages instead
of repeating the literal `"rocm-runtime-loader: "` prefix. The
literal-concatenation idiom matches what `lib/Support/*.cpp`
do.
- `using namespace mlir::rocm_loader;` at file scope so the
enum / struct names don't need the prefix in every `case`.
Net: -28 lines, -3 helper functions, -1 redundant indirection layer,
-1 unreachable code path. All 4 loader unit tests + the dynsym /
NEEDED-cleanliness lit suite still pass on both build trees
(default shared and `BUILD_FAT_LIBROCKCOMPILER=ON`).
Made-with: Cursor
Both scripts collapse a triple-nested bash loop (artefact x forbidden-prefix x
readelf-line / artefact x export-symbol x case-pattern) into a single
`awk` / `grep -E` pipeline per artefact.
- `check_no_rocm_neededs.sh`: 111 -> 72 lines (-35%). Forbidden
SONAMEs are now a single anchored regex
(`^lib(amdhip64|hiprtc|amd_comgr|LLVM\.so)`) instead of an array
of prefixes consumed by an inner `case ${forb}*` (which only
worked because bash unquoted-expanded `${forb}*` as a glob, an
easy thing for the next reader to misunderstand). Tool/library
artefact lists become inline `for` arguments.
- `check_dynsym_only_mgpu.sh`: 70 -> 61 lines. The
bash `case mgpu* / *` partition becomes a single `awk` script
that emits an `ok=N` count and a multi-line `BAD\n sym1\n sym2`
block; the bash code then just greps for `^BAD`.
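The partition the awk script performs can be sketched as a pure function (symbol names below are illustrative):

```python
def partition_dynsym(symbols):
    """Split exported dynsym names into the allowed mgpu* C ABI set
    and everything else (which the test reports as a failure)."""
    ok = [s for s in symbols if s.startswith("mgpu")]
    bad = [s for s in symbols if not s.startswith("mgpu")]
    return ok, bad

ok, bad = partition_dynsym(
    ["mgpuStreamCreate", "mgpuMemAlloc", "_ZN4llvm2clE"])
```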
Verified end-to-end: success, skip-on-missing, and forbidden-NEEDED
detection all still work (the latter checked by linking
`/opt/rocm/lib/libamdhip64.so.7` -- which carries `libamd_comgr.so.3`
in its NEEDED set -- as `libMLIRRocmRuntimeLoader.so.99` and
confirming the script reports `FAIL: ... declares NEEDED
libamd_comgr.so.3` and exits non-zero).
Made-with: Cursor
A close re-read of the previous two simplification commits (68bb69e,
8195f8) surfaced five issues, three of them real bugs:
1. (REAL) `check_no_rocm_neededs.sh` silently passed when both directory
   arguments were bogus. The "no artefacts found" guard could not fire
   because the tool slots were always added to the `artefacts` array
   unconditionally, then later silently skipped by a
   `[ -e "${art}" ] || continue` check. A run with
   `tools_dir=/nonexistent shlib_dir=/nonexistent` reported "0 artefacts
   checked, all clean" and exited 0, masking any misconfigured lit
   substitution. Tools are now added only when they actually exist,
   mirroring the shlib-glob pattern, and the wrong-directory failure
   message names the lit substitutions to check.
2. (REAL) `check_no_rocm_neededs.sh` `forbidden_re` could
   false-positive-match SONAMEs that share a prefix but a different
   basename, e.g. a hypothetical `libamd_comgr_helper.so.1` would match
   because the previous `[._-]` separator class accepted `_` (which is
   part of the basename, not a version separator). Tightened the regex
   to anchor on `<name>.{so,dll}` followed by either end-of-string or a
   real version separator (`.` / `-`). Added a tabulation of
   true/false-positive cases to the comment so the discriminator is
   explicit.
3. (REAL, cosmetic) `check_dynsym_only_mgpu.sh` emitted a stray blank
   line in its failure output -- the `BAD\n sym1\n sym2` awk format,
   after `sed 's/^BAD//'`, left an empty first line. Reformatted the awk
   to emit `OK <sym>` and `BAD <sym>` per line, which removes the
   sentinel and lets the bash side use plain `grep -c` / `cut`.
4. (Code quality) Dropped the
   `static bool once = []() { LLVM_DEBUG(...); }();` advisory wrapper in
   `RocmRuntimeLoader.cpp`'s non-glibc POSIX path. `LLVM_DEBUG` is
   itself gated by `-debug-only=rocm-runtime-loader`, so the inner
   advisory already only fires when a developer has opted into the
   noise; suppressing it to once-per-process via a static bool is
   over-engineered. Replaced with a plain `LLVM_DEBUG(...)` that prints
   on each load attempt under `-debug-only=`, which is exactly what
   every other LLVM_DEBUG advisory does.
5. (Code quality) Replaced `#if defined(__GLIBC__) && !defined(_WIN32)`
   with the single-condition `#if defined(__GLIBC__)` -- glibc is never
   defined on Windows, so the second clause was dead. Comment updated to
   make the implicit assertion explicit.
Verification: 115/115 unit tests still pass on the default build,
111/111 on fat-lib; the cleanliness lit suite passes on both. The B1 fix
is exercised by
`check_no_rocm_neededs.sh /nonexistent_lib /nonexistent_tools` (now
correctly exits 1 with a diagnostic). The B2 fix is exercised by a
tabulated regex test covering 6 false positives and 10 true positives.
The B3 fix is exercised by injecting `/opt/rocm/lib/libamdhip64.so.7` as
a fake `libmlir_rocm_runtime.so.99` (now reports each non-mgpu symbol on
its own indented line without the empty-spacer artefact).
Made-with: Cursor
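A Python rendering of the tightened discriminator (the exact regex in the shell script may differ; this follows the rule stated above — `<name>.{so,dll}` must be followed by end-of-string or a real version separator):

```python
import re

# `_` after the basename is part of a *different* basename, not a
# version separator, so it must not match.
forbidden = re.compile(
    r"^(libamdhip64|libhiprtc|libamd_comgr|libLLVM)\.(so|dll)([.-]|$)")

true_positives = ["libamdhip64.so", "libamdhip64.so.7",
                  "libamd_comgr.so.3", "libLLVM.so.23"]
false_positives = ["libamd_comgr_helper.so.1", "libLLVMSupport.so.23"]
```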
Cleanups identified by a fresh code review pass against the LLVM
Coding Standards. No behavioural change.
- `RocmRuntimeLoader.h`: drop dead `<cstddef>` include (no `size_t`
or `ptrdiff_t` referenced).
- `RocmRuntimeLoader.cpp`: replace the file-scope `using namespace`
pair (`mlir`, `mlir::rocm_loader`) with a single
`namespace mlir::rocm_loader { ... }` block surrounding both the
anonymous-namespace helpers and the public API. This is the
convention used elsewhere in `mlir/lib/ExecutionEngine/`
(`RocmSystemDetect.cpp`, `JitRunner.cpp`, ...) and avoids the
second using-directive that no other TU in the directory uses.
- `RocmRuntimeLoader.cpp`: drop the `auto *fn = reinterpret_cast<
GetHandleFn>(...)` form for the `RTLD_DEFAULT` lookup. `auto *`
with a function pointer compiles correctly but is unusual --
readers expect `auto *` to mean "pointer to object". Replaced
with plain `auto fn = ...` followed by an explicit `fn ? fn() :
nullptr` ternary, which keeps the call-side null check obvious.
- `RocmSystemDetect.cpp`: drop `#include "llvm/Support/Error.h"`
(no `Expected`/`Error`/`cantFail`/etc. used) and the dead
`#define DEBUG_TYPE "execution-engine-rocm-system-detect"`
(no `LLVM_DEBUG()` calls in this TU).
- `external/llvm-project/mlir/lib/ExecutionEngine/CMakeLists.txt`:
correct the misleading "Static library so it adds no transitive
NEEDED entries" comment on `MLIRRocmRuntimeLoader`. With
`BUILD_SHARED_LIBS=ON`, `add_mlir_library(... STATIC ...)`
produces a `.so`; the `STATIC` keyword is honored only in fat-
library configurations. Reworded to explain that either form is
safe because the only transitive dependency is `LLVMSupport`.
- Same CMake file: update two version-coupled comments
(`libLLVM.so.23` -> `libLLVM.so.<MAJOR>`) so a future LLVM
version bump does not leave stale text in the file. Also fixed a
`LoadLibraryA` -> `LoadLibraryW` doc typo (the implementation
has always called `LoadLibraryW`).
Made-with: Cursor
Cleanups identified by a fresh code review pass. No behavioural
change to the build; the lit scripts gain a new diagnostic but pass
the same tests.
CMake comments
- Root `CMakeLists.txt`, `mlir/lib/Dialect/Rock/IR/CMakeLists.txt`:
update version-coupled doc strings (`libLLVM.so.23` ->
`libLLVM.so.<MAJOR>`, `libLLVMSupport.so.*` ->
`libLLVMSupport.so.<MAJOR>git`) so a future LLVM bump does not
leave stale references in comments.
- `mlir/test/lit.site.cfg.py.in`: the comment claimed
`_resolve_rocmlir_shlib_dir` probes "the two well-known
candidate directories" but the implementation tries four
(`<mlir_obj>/lib`, `<mlir_obj>/lib64`,
`<mlir_obj>/external/llvm-project/lib`, `<llvm_obj>/lib`).
Updated.
- `mlir/test/Dialect/Rock/Loader/no_rocm_neededs.test`: the doc
referred to `libMLIRRocmRuntimeLoader.so` as "our static delay-
load helper" -- only true in fat-lib mode (in the default shared
build it is a `.so`). Reworded to explain that either form is
safe because the loader's transitive closure is just LLVMSupport.
Also corrected "pure shell + grep" -> "awk + grep" and tightened
the false-positive-rejection note.
Lit scripts
- `check_no_rocm_neededs.sh`, `check_dynsym_only_mgpu.sh`: split
the previous `readelf | awk | grep` (resp. `nm | awk`) pipelines
so that a `readelf` / `nm` failure no longer feeds empty input
through the rest of the pipe and gets silently reported as
"all clean". Each tool is now invoked with its exit status
captured; a failure emits an explicit `FAIL: readelf -d <art>
failed` (or the `nm` equivalent) and the script exits non-zero.
Verified by feeding a non-ELF text file as a synthetic artefact:
the script now correctly reports `FAIL: readelf -d ... failed`
and exits 1, instead of the previous behaviour of silently
succeeding.
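The NEEDED extraction step itself can be sketched as a pure function over `readelf -d` output (the sample mirrors readelf's usual line format; checking the tool's exit status separately, as this commit describes, is what keeps a readelf failure from reading as "all clean"):

```python
def parse_needed(readelf_stdout):
    """Extract DT_NEEDED SONAMEs from `readelf -d` output lines."""
    needed = []
    for line in readelf_stdout.splitlines():
        if "(NEEDED)" in line and "[" in line:
            needed.append(line.split("[", 1)[1].rstrip("]"))
    return needed

sample = (
    " 0x0000000000000001 (NEEDED)  Shared library: [libamd_comgr.so.3]\n"
    " 0x0000000000000001 (NEEDED)  Shared library: [libc.so.6]\n"
    " 0x000000000000000e (SONAME)  Library soname: [libamdhip64.so.7]\n"
)
```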
Made-with: Cursor
Two latent bugs surfaced by a deeper review pass.
1. (Real) The artefact globs in `check_no_rocm_neededs.sh` and
`check_dynsym_only_mgpu.sh` matched `libfoo.so.*`, which
happily catches debug-info companions like
`libfoo.so.<version>.dwo` (split-DWARF, enabled by
`LLVM_USE_SPLIT_DWARF=ON`), `libfoo.so.<version>.debug`
(separate debug info) and `libfoo.so.<version>.dbg`. Feeding
such a file to `readelf -d` / `nm -D` produces unhelpful
output, and a `.dwo` is a stripped DWARF blob with no NEEDED
entries -- it would silently report "all clean" while
skipping the real shared object.
Tightened the glob to `libfoo.so.[0-9]*` (the version suffix
must start with a digit, which excludes `.dwo`/`.debug`/`.dbg`)
and added a defensive `case "${cand}" in *.dwo|*.debug|*.dbg)
continue ;; esac` for any unusual decoration that slips
through. The brackets must stay outside double-quotes for bash
to treat them as a character class -- documented in a comment
so the next reader does not "fix" the quoting and re-introduce
the bug.
Verified by creating a synthetic build directory containing
`libmlir_rocm_runtime.so.23.0git`, `.so.23.0git.dwo`, and
`.so.23.0git.debug`: both scripts now correctly select the
real `.so.23.0git` and report 25 mgpu* symbols / artefact-clean.
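Both layers of the fix can be sketched together in Python via `fnmatch` (the helper name is hypothetical; the glob and the defensive suffix rejection are the ones described above):

```python
import fnmatch

DEBUG_SUFFIXES = (".dwo", ".debug", ".dbg")

def pick_artefacts(names, base="libmlir_rocm_runtime"):
    picked = []
    for n in names:
        # The version suffix must start with a digit, excluding a bare
        # `libfoo.so.dwo`-style companion up front.
        if not fnmatch.fnmatch(n, base + ".so.[0-9]*"):
            continue
        # Defensive: a *versioned* companion like `.so.23.0git.dwo`
        # still matches the glob, so reject by suffix as well.
        if n.endswith(DEBUG_SUFFIXES):
            continue
        picked.append(n)
    return picked

names = ["libmlir_rocm_runtime.so.23.0git",
         "libmlir_rocm_runtime.so.23.0git.dwo",
         "libmlir_rocm_runtime.so.23.0git.debug"]
```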
2. (Portability) `_resolve_rocmlir_shlib_dir` in `lit.site.cfg.py.in`
parsed only the `-Wl,-rpath -Wl,/path` two-argument spelling of
rpath-on-link, which is what GCC/clang emit on Linux when CMake
materialises `INSTALL_RPATH`. macOS and some clang invocations
emit the comma-form `-Wl,-rpath,/path` instead, which the regex
missed. The function then fell through to the four-way directory
probe -- on macOS that happens to find `<mlir_obj>/lib` so the
test still works, but the implicit fallback is fragile.
Added a second pattern for the comma-form. The first match in
either form wins. Verified against four host-ldflags shapes:
Linux space-form, macOS comma-form, mixed (both present), and
no-rpath.
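The two rpath spellings can be handled with a small parser like this sketch (the function name is hypothetical; the real code lives in `lit.site.cfg.py.in`):

```python
import re

def parse_rpaths(host_ldflags):
    """Extract rpath directories from both linker-flag spellings:
    space-form `-Wl,-rpath -Wl,/path` and comma-form `-Wl,-rpath,/path`."""
    rpaths = []
    toks = host_ldflags.split()
    for i, tok in enumerate(toks):
        comma = re.fullmatch(r"-Wl,-rpath,(.+)", tok)
        if comma:
            rpaths.append(comma.group(1))
        elif tok == "-Wl,-rpath" and i + 1 < len(toks):
            nxt = re.fullmatch(r"-Wl,(.+)", toks[i + 1])
            if nxt:
                rpaths.append(nxt.group(1))
    return rpaths
```

The first entry of the returned list would be the "first match wins" directory; an empty list corresponds to the directory-probe fallback.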
Made-with: Cursor
Two correctness bugs in the `rock.arch = "native[:N]"` path, both caught
by a second-pair-of-eyes review of this branch.
Bug 1 (HIGH, multi-GPU correctness): `nativeArchInfo` cached queried
`AmdArchInfo` by `gcnArchName`. On a same-arch multi-GPU system (e.g.
two `gfx942` cards), the second device's call would hit the cache entry
populated by the FIRST device and silently return its CU count, XCC
count, and per-CU shared-memory -- masking real per-device variance from
binning, fused-die configurations, etc. The fix keys the cache by
`deviceId` (an `unsigned`, hence `llvm::DenseMap<unsigned, AmdArchInfo>`),
and uses an explicit lock-acquire / find / drop-lock / query /
re-acquire / `try_emplace` pattern so the expensive HIP/HSA query does
not block other threads.
Bug 4 (MEDIUM, parser): `parseArchString` (in `AmdArchDb.cpp`) and
`ParamLookupTable<...>::normalizeArch` (in `ParamLookupTable.cpp`) both
silently fell back to `deviceId = 0` on a malformed `native:<suffix>`.
Inputs like `native:foo`, `native:1abc`, `native:`, or `native:-1` were
treated as `native:0`, which on multi-GPU systems silently targeted the
wrong device. The fixed parser distinguishes `native` (no colon,
deviceId implicitly 0) from `native:` (separator with empty/invalid
suffix), validates that the suffix parses as an unsigned integer that
fits in `unsigned`, and `report_fatal_error`s with a precise diagnostic
otherwise.
Made-with: Cursor
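The parser contract for Bug 4 can be sketched in Python (the real implementation is C++ and additionally bounds-checks the value against `unsigned`; `SystemExit` stands in for `report_fatal_error`):

```python
def parse_native_arch(arch):
    """Return the device id for `native` / `native:N`, None for a
    non-native string; abort on a malformed suffix instead of silently
    falling back to device 0."""
    if arch == "native":
        return 0                       # no colon: implicitly device 0
    if not arch.startswith("native:"):
        return None                    # regular gfx... string
    suffix = arch[len("native:"):]
    if not suffix.isdigit():           # rejects "", "foo", "1abc", "-1"
        raise SystemExit(
            f"Invalid `{arch}`: the suffix after `native:` must be a "
            f"non-negative integer device id (got `{suffix}`).")
    return int(suffix)
```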
The previous implementation of `loadHipSymbols()` / `loadHiprtcSymbols()`
returned a partially-populated symbol table with `lib.handle = nullptr`
when libamdhip64 / libhiprtc was missing on the loader path. The macros
in `HipDelayLoadMacros.h` then dispatched call sites like
`hipMalloc(...)` / `hiprtcCompileProgram(...)` straight through the
(still null) function pointer, segfaulting at the call site instead of
giving the user a controlled diagnostic.
Tightened the contract to "either fully loaded or process aborts": both
loaders now invoke a `[[noreturn]] abortMissing*` helper at the first
sign of trouble (library missing, symbol missing). The helper writes a
clear diagnostic (which library, which symbol) to stderr and
`std::abort()`s. The tuning driver is fundamentally useless without
HIP/HIPRTC -- every benchmark needs to launch a kernel on an AMD GPU --
so converting "silent crash later" into "loud diagnostic up front" is
strictly an improvement.
The header docstrings on `HipSymbols` / `HiprtcSymbols` /
`getHipSymbols()` / `getHiprtcSymbols()` are updated to advertise the
new fail-fast contract: `lib.handle` is non-null and every function
pointer is non-null whenever these accessors return.
Made-with: Cursor
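The fail-fast shape can be sketched in Python (names are hypothetical; a dict resolver stands in for `dlsym` on a real handle, and `SystemExit` stands in for `std::abort()`):

```python
import sys

def load_symbol_table(lib_name, handle, symbol_names, resolve):
    """Either every symbol resolves, or the process aborts with a
    diagnostic naming the library and the missing symbol."""
    if handle is None:
        sys.stderr.write(f"fatal: could not load {lib_name}\n")
        raise SystemExit(1)
    table = {}
    for name in symbol_names:
        fn = resolve(handle, name)
        if fn is None:
            sys.stderr.write(f"fatal: {lib_name} has no symbol `{name}`\n")
            raise SystemExit(1)
        table[name] = fn
    return table

fake_lib = {"hipGetDeviceCount": lambda: 1}
table = load_symbol_table("libamdhip64", fake_lib, ["hipGetDeviceCount"],
                          lambda h, n: h.get(n))
```

On return, every entry in `table` is callable — the "either fully loaded or aborted" invariant the header docstrings now advertise.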
`MLIRRocmExecutionEngineUtils` was unconditionally built and
unconditionally referenced `${hip_INCLUDE_DIR}` -- but
`find_package(hip)` (which populates `hip_INCLUDE_DIR`) was gated
by `MLIR_ENABLE_ROCM_RUNNER`, which is forced to `0` when
`BUILD_FAT_LIBROCKCOMPILER=ON`. Result: in fat-lib configurations,
the target relied on an empty / leftover `hip_INCLUDE_DIR` and
either failed to build or silently picked up an ambient include
path from cached CMake state.
Move the entire `MLIRRocmExecutionEngineUtils` definition (target,
warning-flag wiring, include dirs, link libs) inside the
`if (LLVM_ENABLE_PIC) ... if (MLIR_ENABLE_ROCM_RUNNER) ...` block
where `find_package(hip)` already lives. The fat-lib build does
not need this target at all -- `xmir-runner` is also gated off in
that mode -- and `mlir/unittests/Dialect/Rock/CMakeLists.txt`
already conditions its `RocmRuntimeLoaderTests` source list on
`if (TARGET MLIRRocmExecutionEngineUtils)`, so the missing target
in fat-lib is handled gracefully.
Verified by reconfiguring + building both
`build/` (default shared, `MLIR_ENABLE_ROCM_RUNNER=ON`) -- target
present, libMLIRRocmExecutionEngineUtils.so produced -- and
`build_migx/` (`BUILD_FAT_LIBROCKCOMPILER=ON`) -- target absent,
build clean, `MLIRRockUnitTests` correctly omits
`RocmRuntimeLoaderTests` and runs the rest.
Made-with: Cursor
Pin the contracts behind the four bug fixes from the prior commits:
- `mlir/test/Dialect/Rock/native_arch_invalid.mlir`: lit test
that runs `rocmlir-opt -rock-affix-params` on a function with
`rock.arch = "native:foo"` and asserts a fatal-error diagnostic.
Uses `not --crash` (rather than plain `not`) because
`report_fatal_error` aborts via SIGABRT, which `not` only
accepts under `--crash`. The parser fails before any HIP query
so the test runs anywhere (just `REQUIRES: linux` for the
`not --crash` semantics).
- `mlir/unittests/Dialect/Rock/AmdArchDbTests.cpp`: three new
gtest cases.
* `NativeArchParseTest.MalformedSuffixAborts` uses
`EXPECT_DEATH` to check that `lookupArchInfo("native:foo")`,
`"native:1abc"`, `"native:"`, and `"native:-1"` each abort
with the precise diagnostic. Death tests do not need a real
GPU since the parser fires before HIP is touched.
* `NativeArchParseTest.BareNativeIsDeviceZero` confirms that
`lookupArchInfo("native")` (no colon) is well-formed and
does NOT abort. Skipped when no GPU is present so the call
actually completes without aborting elsewhere.
* `NativeArchCacheTest.SameArchMultiGpuDistinct` exercises
the per-device cache contract on systems with two same-arch
GPUs: two distinct device ids must each return a value from
an independently-queried cache entry, never share device 0's
copy. Skipped when fewer than 2 same-arch GPUs are visible
(typical CI).
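The per-device cache contract the last test pins — keyed by device id, with the expensive query run outside the lock — can be sketched as (CU counts below are illustrative, not real device data):

```python
import threading

_lock = threading.Lock()
_arch_cache = {}  # keyed by device id, NOT by gcnArchName

def native_arch_info(device_id, query_device):
    """Lock-acquire / find / drop-lock / query / re-acquire pattern:
    the expensive device query never runs under the lock, and a losing
    racer discards its result via setdefault."""
    with _lock:
        cached = _arch_cache.get(device_id)
    if cached is not None:
        return cached
    info = query_device(device_id)       # expensive HIP/HSA query
    with _lock:
        return _arch_cache.setdefault(device_id, info)

calls = []
def fake_query(dev):
    calls.append(dev)
    # Same arch, different binning: per-device variance must survive.
    return {"arch": "gfx942", "cu_count": 304 if dev == 0 else 228}

a = native_arch_info(0, fake_query)
b = native_arch_info(1, fake_query)  # must NOT reuse device 0's entry
```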
Made-with: Cursor
Pull request overview
This PR removes link-time dependencies on ROCm runtime libraries (notably libamdhip64 and transitively ROCm’s monolithic libLLVM.so) by introducing an isolated, delay-loading mechanism. This avoids the “two LLVMs in one process” static-initializer collision that crashes rocMLIR/MLIR tools at process startup on some ROCm installations.
Changes:
- Add `mlir::rocm_loader::loadRocmLibrary(...)` + `resolveRocmSymbol(...)` and wire ROCm-aware components to resolve HIP/HIPRTC/HSA symbols at runtime (with link-map isolation via `dlmopen` on glibc).
- Update build/linker settings (version script for `mlir_rocm_runtime`, `--exclude-libs,ALL` for tools, `-Bsymbolic-functions` for shared libs) to reduce symbol interposition/ODR collisions.
- Add lit + gtest coverage to enforce "no ROCm DT_NEEDED" and "only mgpu* exports" contracts, and improve native-arch parsing/caching behavior.
Reviewed changes
Copilot reviewed 32 out of 33 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| mlir/utils/jenkins/static-checks/get_fat_library_deps_list.pl | Exclude certain libs from fat-archive dependency list generation. |
| mlir/unittests/Dialect/Rock/RocmRuntimeLoaderTests.cpp | New gtests for loader coordination and version-agnostic behavior. |
| mlir/unittests/Dialect/Rock/CMakeLists.txt | Conditionally add loader tests + link required targets. |
| mlir/unittests/Dialect/Rock/AmdArchDbTests.cpp | Refactor tests to avoid link-pulling HIP; add native parsing/cache tests. |
| mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp | Switch HIP calls to delay-load dispatch via macros. |
| mlir/tools/rocmlir-tuning-driver/HipDelayLoadMacros.h | Macro redirects for hip*/hiprtc* callsites through function tables. |
| mlir/tools/rocmlir-tuning-driver/HipDelayLoad.h | Delay-load HIP/HIPRTC API surface for tuning driver. |
| mlir/tools/rocmlir-tuning-driver/HipDelayLoad.cpp | Implementation of symbol resolution + fail-fast diagnostics. |
| mlir/tools/rocmlir-tuning-driver/CacheFlush.cpp | Use delay-loaded HIP/HIPRTC entry points. |
| mlir/tools/rocmlir-tuning-driver/CMakeLists.txt | Stop linking HIP/HIPRTC; use includes/defs + link runtime loader helper. |
| mlir/tools/rocmlir-lib/CMakeLists.txt | Document fat-lib runtime expectations for native arch. |
| mlir/test/lit.site.cfg.py.in | Add features/substitutions to support new loader/ELF auditing tests. |
| mlir/test/lit.cfg.py | Add .test suffix and exclude helper shell scripts from test discovery. |
| mlir/test/Dialect/Rock/native_arch_invalid.mlir | New lit test for invalid native:* arch handling. |
| mlir/test/Dialect/Rock/native_arch.mlir | Lit coverage for rock.arch="native" and "native:0" pipeline flow. |
| mlir/test/Dialect/Rock/Loader/no_rocm_neededs.test | Lit shell test asserting no forbidden ROCm runtime DT_NEEDED deps. |
| mlir/test/Dialect/Rock/Loader/dynsym_only_mgpu.test | Lit shell test asserting only mgpu* exports from runtime wrapper. |
| mlir/test/Dialect/Rock/Loader/check_no_rocm_neededs.sh | Helper script to audit DT_NEEDED entries via readelf -d. |
| mlir/test/Dialect/Rock/Loader/check_dynsym_only_mgpu.sh | Helper script to audit exports via nm -D. |
| mlir/lib/ExecutionEngine/CMakeLists.txt | Mark conv-validation-wrappers as partial sources intended. |
| mlir/lib/Dialect/Rock/Tuning/ParamLookupTable.cpp | Normalize native[:N] by resolving to hardware arch and caching. |
| mlir/lib/Dialect/Rock/IR/CMakeLists.txt | Remove HIP/HSA link deps; add include-only wiring and loader dependency. |
| mlir/lib/Dialect/Rock/IR/AmdArchDb.cpp | Implement HIP/HSA delay-load, stricter parsing, and deviceId-keyed cache. |
| mlir/include/mlir/Dialect/Rock/IR/AmdArchDb.h | Expose nativeDeviceCount() and nativeArchName() APIs. |
| external/llvm-project/mlir/lib/ExecutionEngine/mlir_rocm_runtime.map | New linker version script to export only mgpu*. |
| external/llvm-project/mlir/lib/ExecutionEngine/RocmSystemDetect.cpp | Delay-load HIP and export shared HIP handle for coordination. |
| external/llvm-project/mlir/lib/ExecutionEngine/RocmRuntimeWrappers.cpp | Delay-load HIP inside runtime wrappers; explicitly export mgpu APIs. |
| external/llvm-project/mlir/lib/ExecutionEngine/RocmRuntimeLoader.cpp | New isolated loader implementation (dlmopen/dlopen/LoadLibraryW). |
| external/llvm-project/mlir/lib/ExecutionEngine/CMakeLists.txt | Add MLIRRocmRuntimeLoader; remove link-time HIP deps; apply version script. |
| external/llvm-project/mlir/include/mlir/ExecutionEngine/RocmSystemDetect.h | Declare exported shared-HIP-handle accessor for coordination. |
| external/llvm-project/mlir/include/mlir/ExecutionEngine/RocmRuntimeLoader.h | Public API for ROCm delay-loading and symbol resolution. |
| cmake/llvm-project.cmake | Add --exclude-libs,ALL to tools to avoid re-exporting LLVM symbols. |
| CMakeLists.txt | Apply -Bsymbolic-functions globally for shared/module libs (non-Apple/non-Win). |
```diff
 my %excludedLibs = map { $_ => 1 } qw(
     conv-validation-wrappers
 );

 my @rocmlirLibs;
 my @mlirLibs;

 my @deps = split /\n/,`ninja -t query lib/libMLIRRockThin.a`;
 not $? or die "failed to get target dependencies";

 foreach (@deps) {
     last if /outputs:/;
     if (m#external/llvm-project/llvm/lib/lib(\w+)\.a#) {
-        push @mlirLibs, $1;
+        push @mlirLibs, $1 unless $excludedLibs{$1};
     } elsif (m#lib/lib(\w+)\.a#) {
-        push @rocmlirLibs, $1;
+        push @rocmlirLibs, $1 unless $excludedLibs{$1};
     }
```
`%excludedLibs` won't actually exclude `conv-validation-wrappers` here because the regexes only capture `lib(\w+)\.a` (`\w` doesn't match `-`). As a result, hyphenated library names are skipped entirely and the exclusion list is misleading/ineffective. Consider widening the capture to include `-` (e.g. `[\w-]+`) so the dependency is parsed and then filtered intentionally (or adjust the parsing to match Ninja's actual output format).
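The reviewer's point is easy to demonstrate with Python's `re` module (the character-class semantics are the same as Perl's for this pattern):

```python
import re

narrow = re.compile(r"lib/lib(\w+)\.a")
wide = re.compile(r"lib/lib([\w-]+)\.a")

# Plain names parse fine with the narrow pattern...
assert narrow.search("lib/libMLIRRockOps.a").group(1) == "MLIRRockOps"
# ...but `\w` does not match `-`, so the hyphenated name is never
# captured at all (the exclusion filter therefore never sees it):
assert narrow.search("lib/libconv-validation-wrappers.a") is None
# Widening the class lets the name be parsed, then filtered on purpose:
assert wide.search("lib/libconv-validation-wrappers.a").group(1) == \
    "conv-validation-wrappers"
```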
```cpp
void appendCandidates(std::vector<std::string> &out, llvm::StringRef bare,
                      Joiner joiner) {
  out.emplace_back(bare.str());
  for (unsigned m = kMaxProbedRocmMajor; m >= 1; --m)
```
`appendCandidates` uses `for (unsigned m = kMaxProbedRocmMajor; m >= 1; --m)`, which never terminates for an unsigned loop variable (it wraps from 0 to UINT_MAX and `m >= 1` remains true). This will hang when building the candidate list. Use a termination-safe form like `for (unsigned m = kMaxProbedRocmMajor; m > 0; --m)` or a signed counter.
```diff
-  for (unsigned m = kMaxProbedRocmMajor; m >= 1; --m)
+  for (unsigned m = kMaxProbedRocmMajor; m > 0; --m)
```
```cpp
  if (arch.consume_front("native")) {
    static std::mutex m;
    static llvm::StringMap<std::string> resolvedCache;
    std::lock_guard<std::mutex> lock(m);
    // Re-form the original key for the cache lookup.
    std::string cacheKey = ("native" + arch).str();
    auto [it, inserted] = resolvedCache.try_emplace(cacheKey);
    if (inserted) {
      unsigned deviceId = 0;
      if (arch.consume_front(":")) {
        // Parse must succeed and fit in `unsigned`. Silently treating
        // `native:foo` or `native:` as `native:0` would mask user error
        // and target the wrong GPU on multi-GPU systems.
        unsigned long long parsed = 0;
        if (arch.empty() || llvm::getAsUnsignedInteger(arch, 0, parsed) ||
            parsed > std::numeric_limits<unsigned>::max())
          llvm::report_fatal_error(
              Twine("Invalid `") + cacheKey +
              "`: the suffix after `native:` must be a non-negative integer "
              "device id (got `" +
              arch + "`).");
```
`normalizeArch` treats any string with prefix `native` as the special native-arch syntax because it uses `arch.consume_front("native")`. That means inputs like `nativefoo` would be interpreted as native (device 0) instead of being rejected, which is inconsistent with `parseArchString` in `AmdArchDb` and could mask user typos. Consider checking `arch == "native"` or `arch.starts_with("native:")` (and leaving other strings to the regular `gfx...` parsing).
```diff
-  if (arch.consume_front("native")) {
-    static std::mutex m;
-    static llvm::StringMap<std::string> resolvedCache;
-    std::lock_guard<std::mutex> lock(m);
-    // Re-form the original key for the cache lookup.
-    std::string cacheKey = ("native" + arch).str();
-    auto [it, inserted] = resolvedCache.try_emplace(cacheKey);
-    if (inserted) {
-      unsigned deviceId = 0;
-      if (arch.consume_front(":")) {
-        // Parse must succeed and fit in `unsigned`. Silently treating
-        // `native:foo` or `native:` as `native:0` would mask user error
-        // and target the wrong GPU on multi-GPU systems.
-        unsigned long long parsed = 0;
-        if (arch.empty() || llvm::getAsUnsignedInteger(arch, 0, parsed) ||
-            parsed > std::numeric_limits<unsigned>::max())
-          llvm::report_fatal_error(
-              Twine("Invalid `") + cacheKey +
-              "`: the suffix after `native:` must be a non-negative integer "
-              "device id (got `" +
-              arch + "`).");
+  if (arch == "native" || arch.starts_with("native:")) {
+    StringRef nativeSuffix = arch;
+    [[maybe_unused]] bool consumed = nativeSuffix.consume_front("native");
+    assert(consumed && "native arch prefix must be present");
+    static std::mutex m;
+    static llvm::StringMap<std::string> resolvedCache;
+    std::lock_guard<std::mutex> lock(m);
+    // Re-form the original key for the cache lookup.
+    std::string cacheKey = ("native" + nativeSuffix).str();
+    auto [it, inserted] = resolvedCache.try_emplace(cacheKey);
+    if (inserted) {
+      unsigned deviceId = 0;
+      if (nativeSuffix.consume_front(":")) {
+        // Parse must succeed and fit in `unsigned`. Silently treating
+        // `native:foo` or `native:` as `native:0` would mask user error
+        // and target the wrong GPU on multi-GPU systems.
+        unsigned long long parsed = 0;
+        if (nativeSuffix.empty() ||
+            llvm::getAsUnsignedInteger(nativeSuffix, 0, parsed) ||
+            parsed > std::numeric_limits<unsigned>::max())
+          llvm::report_fatal_error(
+              Twine("Invalid `") + cacheKey +
+              "`: the suffix after `native:` must be a non-negative integer "
+              "device id (got `" +
+              nativeSuffix + "`).");
```
```cmake
# Apply to both the object library (the actual compile target) and the
# aggregate shared library, for parity with how upstream ExecutionEngine
# wires include dirs for its HIP-touching TUs.
foreach(_t MLIRRockOps obj.MLIRRockOps)
  if (TARGET ${_t})
    target_include_directories(${_t} SYSTEM PRIVATE ${hip_INCLUDE_DIRS})
  endif()
endforeach()
unset(_t)
```
This block uses `${hip_INCLUDE_DIRS}`, but elsewhere in the repo (e.g. `external/llvm-project/mlir/lib/ExecutionEngine/CMakeLists.txt`) the HIP package is treated as providing `${hip_INCLUDE_DIR}`. If `${hip_INCLUDE_DIRS}` is unset for the ROCm HIP CMake package on some installations, this will fail to add the include path and break the build. Consider switching to `${hip_INCLUDE_DIR}` (or guarding/falling back to whichever variable is defined).
```diff
-# Apply to both the object library (the actual compile target) and the
-# aggregate shared library, for parity with how upstream ExecutionEngine
-# wires include dirs for its HIP-touching TUs.
-foreach(_t MLIRRockOps obj.MLIRRockOps)
-  if (TARGET ${_t})
-    target_include_directories(${_t} SYSTEM PRIVATE ${hip_INCLUDE_DIRS})
-  endif()
-endforeach()
-unset(_t)
+if (hip_INCLUDE_DIRS)
+  set(_hip_include_dirs ${hip_INCLUDE_DIRS})
+elseif(hip_INCLUDE_DIR)
+  set(_hip_include_dirs ${hip_INCLUDE_DIR})
+else()
+  message(FATAL_ERROR
+    "HIP CMake package was found, but neither `hip_INCLUDE_DIRS` nor "
+    "`hip_INCLUDE_DIR` is defined.")
+endif()
+# Apply to both the object library (the actual compile target) and the
+# aggregate shared library, for parity with how upstream ExecutionEngine
+# wires include dirs for its HIP-touching TUs.
+foreach(_t MLIRRockOps obj.MLIRRockOps)
+  if (TARGET ${_t})
+    target_include_directories(${_t} SYSTEM PRIVATE ${_hip_include_dirs})
+  endif()
+endforeach()
+unset(_t)
+unset(_hip_include_dirs)
```
Motivation
Running any of rocMLIR's compiler tools (`rocmlir-driver`, `rocmlir-opt`, `xmir-runner`, `rocmlir-tuning-driver`, MLIR's `mlir-runner` JIT) against a system-installed ROCm (TheRock 7.12) crashes at process startup with `LLVM ERROR: Option '...' already exists!` (or the equivalent abort in release builds) before `main()` is even reached.

The root cause is a static-initializer collision between two LLVMs in the same address space:

- rocMLIR's own embedded LLVM, built as split libraries (`libLLVMSupport.so.<MAJOR>git`, etc.).
- `libamdhip64` transitively pulls in `libamd_comgr`, which in turn pulls in ROCm's monolithic `libLLVM.so.<MAJOR>`.

Both LLVMs run their `cl::opt` static initializers; the dynamic linker unifies the `cl::opt` global registry across the split-vs-monolithic libraries; the second registration trips the duplicate-option assertion. The same problem also manifests in `mlir-runner`/`xmir-runner` JIT paths that `dlopen` `mlir_rocm_runtime.so` (which itself linked `libamdhip64`).

This PR removes every link-time dependency on `libamdhip64`/`libhsa-runtime64`/`libhiprtc` from rocMLIR-side libraries and from upstream MLIR's `mlir_rocm_runtime` and `MLIRRocmExecutionEngineUtils`, replacing them with a delay-load helper that opens the ROCm runtime in a private link-map namespace via `dlmopen(LM_ID_NEWLM, ...)` on glibc (and the equivalent on other platforms). The two LLVMs no longer share a symbol scope, the `cl::opt` registry stays single-instance, and the tools start cleanly against any ROCm install on `LD_LIBRARY_PATH`.

The end-to-end command that reproduced the original crash (`./bin/rocmlir-driver --kernel-pipeline=migraphx,highlevel --arch gfx942 ../test.mlir`) now runs without aborting.

Technical Details
The branch is 33 commits, 15 of them tagged `[EXTERNAL]` for upstream LLVM submission. Net diff: +2618 / −561 lines across 46 files.

Architecture

A new public API, `mlir::rocm_loader::loadRocmLibrary(...)` (header `mlir/ExecutionEngine/RocmRuntimeLoader.h`, implementation in `lib/ExecutionEngine/RocmRuntimeLoader.cpp`, built as a small `MLIRRocmRuntimeLoader` library), opens HIP / HIPRTC / HSA without ever bringing them into the host's symbol scope:

- glibc: `dlmopen(LM_ID_NEWLM, ..., RTLD_LAZY)` — fresh link-map namespace; ROCm's libLLVM cannot interpose ours.
- other POSIX: `dlopen(..., RTLD_LAZY | RTLD_LOCAL)`, with a one-time `LLVM_DEBUG` advisory that isolation is incomplete and the host must hide its own LLVM exports.
- Windows: `LoadLibraryW` (DLLs have private scopes per-DLL natively); the SONAME is converted UTF-8 → UTF-16 to match `llvm/lib/Support/Windows/DynamicLibrary.inc`.

Cross-library coordination
KFD (the AMD kernel-fusion-driver) only allows one user-space HSA session per process. Multiple consumers (`AmdArchDb` for `rock.arch = "native"`, `mlir_rocm_runtime`'s JIT wrappers, `rocmlir-tuning-driver`'s benchmark loop) must therefore share one HIP handle. `RocmSystemDetect.cpp` is the canonical owner and exports `extern "C" LLVM_ALWAYS_EXPORT void *mlirRocmSystemDetectGetHipHandle()`. Other consumers' `loadRocmLibrary(Hip, Auto)` looks this symbol up via `RTLD_DEFAULT` and reuses the returned handle. HIPRTC and HSA load into HIP's link-map namespace via the loader's `relatedHandle` parameter, so all three share the same KFD session.

Defense-in-depth
- `libmlir_rocm_runtime.so` gets a version script (`mlir_rocm_runtime.map`) that exports only the `mgpu*` C ABI entry points. `nm -D` confirms 25 `mgpu*` symbols, nothing else.
- It is also linked with `-Wl,-Bsymbolic-functions` so intra-library `cl::*` calls bind to in-library definitions and cannot be interposed by a later-loaded libLLVM.
- The executables (`rocmlir-driver`, `rocmlir-opt`, ...) get `LINKER:--exclude-libs,ALL` so they do not re-export LLVM statics from `libLLVMSupport.a`.

Version compatibility
The loader is intentionally version-agnostic: it tries the unversioned SONAME first (`libamdhip64.so` / `amdhip64.dll`, the standard `find_package(hip)` resolution), then iterates `libamdhip64.so.<MAJOR>` for descending MAJOR up to a generous `kMaxProbedRocmMajor = 99`. There is no compile-time floor or ceiling on the ROCm version this code supports — ROCm 4.x through any future release works with no source changes.

Other improvements in this PR
While auditing the affected paths, the branch also fixes a handful of pre-existing latent bugs and missing test coverage:
- `AmdArchDb::nativeArchInfo`'s cache key was `gcnArchName`, which would silently return device 0's CU/XCC/per-CU-shared-mem data for every later device on a same-arch multi-GPU system. Changed to `deviceId` (`llvm::DenseMap<unsigned, AmdArchInfo>`).
- `parseArchString` / `ParamLookupTable::normalizeArch` silently treated `native:foo`, `native:1abc`, `native:` as `native:0`. Now `report_fatal_error`s with a precise diagnostic.
- `rocmlir-tuning-driver`'s `HipDelayLoad` previously returned a partially-initialized symbol table on missing libamdhip64 / libhiprtc; the macros in `HipDelayLoadMacros.h` then dispatched through null function pointers and segfaulted. Tightened the contract to "fully loaded or process aborts" via `[[noreturn]] abortMissing*` helpers.
- `MLIRRocmExecutionEngineUtils` CMake gating consumed `${hip_INCLUDE_DIR}` outside the `if(MLIR_ENABLE_ROCM_RUNNER)` gate that ran `find_package(hip)`. In `BUILD_FAT_LIBROCKCOMPILER=ON` builds (which force the runner off) this would silently fall through to ambient state. Moved the entire target definition inside the gate.

Notable touched files
- `external/llvm-project/mlir/include/mlir/ExecutionEngine/RocmRuntimeLoader.h`, `lib/ExecutionEngine/RocmRuntimeLoader.cpp`, `RocmSystemDetect.{h,cpp}`, `RocmRuntimeWrappers.cpp`, `mlir_rocm_runtime.map`, `lib/ExecutionEngine/CMakeLists.txt`
- `mlir/lib/Dialect/Rock/IR/AmdArchDb.cpp`, `mlir/lib/Dialect/Rock/IR/CMakeLists.txt`, `mlir/lib/Dialect/Rock/Tuning/ParamLookupTable.cpp`, `mlir/tools/rocmlir-tuning-driver/{HipDelayLoad.{h,cpp},HipDelayLoadMacros.h,CMakeLists.txt,CacheFlush.cpp,rocmlir-tuning-driver.cpp}`, `mlir/tools/rocmlir-lib/CMakeLists.txt`
- `CMakeLists.txt` (root), `cmake/llvm-project.cmake`, `mlir/utils/jenkins/static-checks/get_fat_library_deps_list.pl`
- `mlir/unittests/Dialect/Rock/{AmdArchDbTests.cpp,RocmRuntimeLoaderTests.cpp,CMakeLists.txt}`, `mlir/test/Dialect/Rock/{native_arch.mlir,native_arch_invalid.mlir,Loader/*}`, `mlir/test/{lit.cfg.py,lit.site.cfg.py.in}`

Test Plan
New regression tests (added in this PR)
- `Dialect/Rock/Loader/no_rocm_neededs.test`: no `libamdhip64`, `libhiprtc`, `libamd_comgr`, or ROCm's monolithic `libLLVM.so` in its `DT_NEEDED` set. Audits 8 artefacts (default build) / 4 (fat-lib).
- `Dialect/Rock/Loader/dynsym_only_mgpu.test`: `libmlir_rocm_runtime.so` exports nothing but `mgpu*` entry points (the version-script contract).
- `Dialect/Rock/native_arch.mlir` (extended): `rock.arch = "native"` and `"native:0"` both flow through the rock pipeline and produce the same lowered IR as a kernel pinned to the concrete arch.
- `Dialect/Rock/native_arch_invalid.mlir` (new): `rock.arch = "native:foo"` aborts with the precise diagnostic via `not --crash`.
- `RocmRuntimeLoaderTests` (4 cases): the `Auto` policy reuses `RocmSystemDetect`'s HIP handle (proves the `RTLD_DEFAULT` cross-library coordination); the `Owned` policy returns a usable handle; `resolveRocmSymbol` is null-safe; the loader resolves HIP without compile-time version knowledge.
- `NativeArchParseTest` (2 cases): malformed `native:*` inputs each abort with the precise diagnostic; bare `native` is well-formed.
- `NativeArchCacheTest.SameArchMultiGpuDistinct`

Manual / verification steps
- `bin/rocmlir-driver --kernel-pipeline=migraphx,highlevel --arch gfx942 ../test.mlir` (the original crash reproducer) runs without aborting.
- `bin/rocmlir-gen -p -t f32 --arch gfx942 -mfma=on | bin/rocmlir-driver --kernel-pipeline=full --arch gfx942` produces a valid `gpu.binary` on both build configurations.
- `nm -D libmlir_rocm_runtime.so | grep -v '^mgpu'` returns nothing.
- `readelf -d` on every shipped binary / shared library confirms no `libamdhip64`, `libhiprtc`, `libamd_comgr`, or ROCm's `libLLVM.so` in NEEDED.
- Audit failures surface as `readelf -d ... failed` / `nm -D ... failed` diagnostics.

Build configurations exercised
- `cmake -G Ninja .. -DCMAKE_BUILD_TYPE=RelWithDebInfo` (default `BUILD_SHARED_LIBS=ON`, `MLIR_ENABLE_ROCM_RUNNER=ON`).
- `cmake -G Ninja .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DBUILD_FAT_LIBROCKCOMPILER=On` (forces `BUILD_SHARED_LIBS=OFF`, `MLIR_ENABLE_ROCM_RUNNER=0`). Used by the AMD MIGraphX integration.

Test Result
- Default build: `ninja check-rocmlir` (1559 tests), `MLIRRockUnitTests` (118 cases)
- Fat-lib build: `ninja check-rocmlir` (933 tests), `MLIRRockUnitTests` (114 cases)
- `nm`/`readelf` cleanliness audit (default): only `mgpu*` exported
- `nm`/`readelf` cleanliness audit (fat-lib)
- `clang-format` on touched files
- Lints on touched files

Submission Checklist