## Summary

On a 2× Intel Data Center GPU Max 1100 host (PCIe-only, no XeLink), ishmem's distributed PUT/GET path fails in two ways depending on `ENABLE_GPU_IPC`:

- `ENABLE_GPU_IPC=true` (default): GPU-side `AtomicAccessViolation` during cross-PE synchronization (expected on platforms without PCIe AtomicOps, but ishmem's README doesn't warn about this).
- `ENABLE_GPU_IPC=false`: the host proxy thread silently SEGVs (`SEGV_MAPERR`) at a USM device address. No error message, just a crash.

Collectives backed by MPI (`reduce_*`, `barrier`, `sync`) work in both configurations. Device-initiated ops (`put`, `get`, `alltoall`, `broadcast`, `collect`, AMOs) fail.
## Environment

- Hardware: 2× Intel Data Center GPU Max 1100 (PVC, `intel_gpu_pvc`), PCIe-attached, single node
- Kernel: Linux 5.15.0 (i915 driver; `xe` not available on this kernel)
- Intel oneAPI: 2025.3 (DPC++/C++ Compiler 2025.3.2, MPI 2021.17 Build 20251215)
- Intel SHMEM: 1.5.0 (built from source on the `main` branch; also tested with the prebuilt `intel-shmem-1.5.0.224_offline.sh`, same behavior)
- NEO compute runtime: 25.18.33578.51
- ishmem build:

```shell
cmake .. -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DENABLE_MPI=ON -DMPI_DIR=/opt/intel/oneapi/mpi/2021.17
```
## Test results (2 PEs, one on each GPU)

| Test | `ENABLE_GPU_IPC=true` | `ENABLE_GPU_IPC=false` |
|------|-----------------------|------------------------|
| `1_helloworld` | ✅ pass | ✅ pass |
| `align`, `init_attr`, `timestamp` | ✅ pass | ✅ pass |
| `barrier`, `sync` | ✅ pass | ✅ pass |
| `reduce_sum/min/max/prod/and/or/xor` | ✅ pass | ✅ pass |
| `put`, `get`, `put_nbi`, `get_nbi`, `ibput` | ❌ `AtomicAccessViolation` | ❌ host SEGV |
| `alltoall`, `broadcast`, `collect`, `fcollect` | ❌ `AtomicAccessViolation` | ❌ host SEGV |
| `amo_*` (all variants) | ❌ `AtomicAccessViolation` | ❌ host SEGV |
The same pattern is observed in the `test/performance/*_bw` benchmarks.
## Bug 1: Silent SEGV with `ENABLE_GPU_IPC=false`

### Reproduction

```shell
source /opt/intel/oneapi/setvars.sh
export EnableImplicitScaling=0 NEOReadDebugKeys=1
export ISHMEM_ENABLE_GPU_IPC=false
export ISHMEM_DEBUG=1
cd $BUILD/test/unit
mpirun -n 2 bash -c 'export ZE_AFFINITY_MASK=$MPI_LOCALRANKID; exec ./put'
```

### Output

```
[0] Testing device with device memory
[1] Testing device with device memory
BAD TERMINATION ... RANK 1 ... KILLED BY SIGNAL: 11 (Segmentation fault)
```
`strace` shows the SEGV occurs in the host proxy thread (spawned in `proxy_init`), not the main thread, with `si_code=SEGV_MAPERR`, `si_addr=0xff000000006096c0`: an Intel GPU USM virtual address that is not mapped in the CPU process.
### Possible root cause

With `ENABLE_GPU_IPC=false`:

- `ipc_init()` is skipped (ishmem.cpp:326), so `local_pes[remote_pe] = 0`.
- Device-side `ishmem_put` sees `local_index == 0` and falls through to `ishmemi_proxy_blocking_request` (rma_impl.h:39).
- The host proxy thread dequeues the request and calls `ishmemi_upcall_funcs[PUT][UINT8]` → `ishmem_uint8_put` on the host (proxy_func.cpp:17-22).
- The host-side `ishmem_internal_put` calls `ishmemi_ipc_put` (rma_impl.h:40), which hits `get_ipc_buffer(pe, dst)` → returns `nullptr` because `local_pes[pe] == 0` (runtime_ipc.h:28).
- `ishmemi_ipc_put` returns non-zero, so the code falls through to `ishmemi_runtime->proxy_funcs[PUT][UINT8]`, the MPI backend (rma_impl.h:41).
- The MPI backend calls `MPI_Put(src, ..., dest_disp, ..., win)`, where `win` was created with `ishmemi_heap_base` (a GPU USM pointer) as the base (runtime_mpi.cpp:1384).
- Intel MPI 2021.17 (without `I_MPI_OFFLOAD=2`) does not handle GPU USM as a window base; it tries to dereference the address on the CPU in its RMA backend → SEGV at the USM address (see the standalone sketch below).
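To test that last step outside ishmem, here is a minimal sketch of the suspected failure mode: an MPI window whose base is device USM, then a cross-rank `MPI_Put` into it. This is my reconstruction, not ishmem code; the file name, buffer size, and build line are assumptions, so adjust to taste.

```cpp
// repro_put_usm.cpp -- hypothetical standalone reproducer (not ishmem code).
// Mirrors what the ishmem MPI backend does: the RMA window base is GPU USM.
// Build (assumed): mpiicpx -fsycl repro_put_usm.cpp -o repro_put_usm
// Run:             mpirun -n 2 ./repro_put_usm
#include <mpi.h>
#include <sycl/sycl.hpp>
#include <cstdio>
#include <cstring>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    sycl::queue q{sycl::gpu_selector_v};
    constexpr int N = 64;

    // Window base is device USM, like ishmemi_heap_base in runtime_mpi.cpp.
    char *heap = sycl::malloc_device<char>(N, q);
    q.memset(heap, 0, N).wait();

    MPI_Win win;
    MPI_Win_create(heap, N, /*disp_unit=*/1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_lock_all(0, win);

    if (rank == 0) {
        char src[N];
        memset(src, 0xab, sizeof src);
        // Without I_MPI_OFFLOAD=2, this is where I'd expect the RMA backend
        // to touch the device address from the CPU and SEGV.
        MPI_Put(src, N, MPI_CHAR, /*target=*/1, 0, N, MPI_CHAR, win);
        MPI_Win_flush(1, win);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 1) {
        // Copy the landed bytes back to the host and inspect them.
        char out[N];
        q.memcpy(out, heap, N).wait();
        printf("rank 1 first byte: 0x%02x (expect 0xab)\n", (unsigned char) out[0]);
    }

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    sycl::free(heap, q);
    MPI_Finalize();
    return 0;
}
```

The read-back at the end is also useful with the partial workaround below: if the put survives but the payload is wrong, the printed byte shows the stamping pattern directly.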
### Partial workaround

Setting `I_MPI_OFFLOAD=2 I_MPI_OFFLOAD_RMA=1` (Intel MPI 2021.17) changes the behavior:

- No SEGV.
- But data corruption: the received bytes are stamped with `0x8080808080808080` / `0x8181818181818181` patterns instead of the actual payload. This is probably a separate Intel MPI / libfabric bug with GPU RMA, but it surfaces here because ishmem has no way to validate the backend's GPU RMA support (a sketch of such a check follows).
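One way ishmem could catch this at startup rather than corrupting user data is an init-time self-test that round-trips a known pattern through the MPI RMA path into GPU USM and verifies it on the host. A sketch under the same assumptions as the reproducer above; `probe_gpu_rma_works` is my name, not an ishmem API:

```cpp
// Hypothetical init-time self-test (probe_gpu_rma_works is my name, not an
// ishmem symbol). Each rank puts a pattern into its neighbor's GPU-USM
// window and verifies the landed bytes on the host; the 0x80/0x81 stamping
// above would fail this check at startup.
#include <mpi.h>
#include <sycl/sycl.hpp>
#include <cstring>

bool probe_gpu_rma_works(sycl::queue &q, MPI_Comm comm) {
    int rank = 0, size = 1;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    constexpr int N = 64;
    char *dev = sycl::malloc_device<char>(N, q);
    q.memset(dev, 0, N).wait();

    MPI_Win win;
    MPI_Win_create(dev, N, 1, MPI_INFO_NULL, comm, &win);
    MPI_Win_lock_all(0, win);

    // Put a fixed pattern into the next rank's device-USM window.
    char pattern[N];
    memset(pattern, 0x5a, sizeof pattern);
    int peer = (rank + 1) % size;
    MPI_Put(pattern, N, MPI_CHAR, peer, 0, N, MPI_CHAR, win);
    MPI_Win_flush(peer, win);
    MPI_Barrier(comm);

    // Verify on the host what actually arrived in our own window.
    char host[N];
    q.memcpy(host, dev, N).wait();
    bool ok = memcmp(host, pattern, N) == 0;

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    sycl::free(dev, q);
    return ok;
}
```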
## Bug 2: No guidance when PCIe AtomicOps are unavailable

### Reproduction

Defaults (`ENABLE_GPU_IPC=true`), same hardware:

```shell
mpirun -n 2 bash -c 'export ZE_AFFINITY_MASK=$MPI_LOCALRANKID; exec ./alltoall-device'
```

### Output

```
Segmentation fault from GPU at 0xff00000020210000, ctx_id: 1 (CCS)
type: 2 (AtomicAccessViolation), level: 0 (PTE), access: 2 (Atomic),
banned: 1, aborting.
Abort was called at 288 line in file: ./shared/source/os_interface/linux/drm_neo.cpp
```
### Root cause

`ishmemi_team_sync` (collectives/sync_impl.h:56) does `atomic_psync += 1L` on the remote PE's psync counter, translating the address through `ISHMEMI_FAST_ADJUST`. That is a cross-device atomic fetch-add over PCIe. On PCIe-attached Max GPUs without PCIe AtomicOps (which need both a BIOS setting and the modern `xe` driver), the kernel rejects the atomic at the PTE level, as sketched below.
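For clarity, here is roughly what that operation amounts to in plain SYCL. This is my paraphrase of the failing pattern, not ishmem's actual code; the function and pointer names (`bump_remote_psync`, `peer_psync`) are illustrative only:

```cpp
// Hedged sketch: ishmemi_team_sync's psync bump is, in SYCL terms, a
// device-side atomic fetch-add on a pointer that (after ISHMEMI_FAST_ADJUST
// address translation) maps the *peer* GPU's memory. Names are mine.
#include <sycl/sycl.hpp>

void bump_remote_psync(sycl::queue &q, long *peer_psync) {
    q.single_task([=] {
        // Over XeLink this atomic is carried in fabric; over plain PCIe it
        // becomes a PCIe AtomicOp. On this platform the AtomicOp is
        // rejected, so i915 faults the PTE with access type "Atomic" and
        // NEO aborts (the AtomicAccessViolation above).
        sycl::atomic_ref<long, sycl::memory_order::relaxed,
                         sycl::memory_scope::system,
                         sycl::access::address_space::global_space>
            psync(*peer_psync);
        psync.fetch_add(1L);
    }).wait();
}
```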
### Things that do NOT fix it (tested)

- `ISHMEM_ENABLE_GPU_IPC_PIDFD=false` (forces socket-based IPC)
- `EnableConcurrentSharedCrossP2PDeviceAccess=1` (NEO debug key)
- `DisableScratchPages=0 EnableRecoverablePageFaults=1` (suppresses the abort, but then `UR_RESULT_ERROR_DEVICE_LOST`)
### Also affected

Same-GPU placement (`ZE_AFFINITY_MASK=0` on both PEs) hangs indefinitely on `barrier`/`reduce_sum`/`put`, presumably because the atomic-spin synchronization deadlocks when two contexts share a device. Probably won't be prioritized, but noting it.
## Things that work fine

- Single PE (`mpirun -n 1`): all tests pass
- Collectives backed by MPI (reductions, barriers, sync): work in both `ENABLE_GPU_IPC={true,false}` modes
- ishmem build, runtime, and local-device code paths are all healthy