
Silent SEGV in host proxy when ENABLE_GPU_IPC=false with MPI backend; no fail-fast when PCIe atomics unavailable #20

@SHREYASINGH29

Description


Summary

On a 2× Intel Data Center GPU Max 1100 host (PCIe-only, no XeLink), ishmem's distributed PUT/GET path fails in two ways depending on ENABLE_GPU_IPC:

  1. ENABLE_GPU_IPC=true (default): GPU-side AtomicAccessViolation during cross-PE synchronization. This is expected on platforms without PCIe AtomicOps, but ishmem's README doesn't warn about it.
  2. ENABLE_GPU_IPC=false: the host proxy thread silently SEGVs (SEGV_MAPERR) at a USM device address. No error message, just a crash.

Collectives backed by MPI (reduce_*, barrier, sync) work in both configurations. Device-initiated data movement (put, get, alltoall, broadcast, collect, AMOs) fails.
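For reference, the failing pattern reduces to a few lines against the public ishmem API. This is my own minimal sketch (buffer handling and values are illustrative; it is not the unit test itself):

// Minimal device-initiated put between 2 PEs -- a sketch of the failing path.
#include <ishmem.h>
#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
    ishmem_init();
    int me = ishmem_my_pe();
    int npes = ishmem_n_pes();

    sycl::queue q;
    long *src = (long *) ishmem_malloc(sizeof(long));  // symmetric heap = device USM
    long *dst = (long *) ishmem_malloc(sizeof(long));

    q.single_task([=]() {
        *src = 1000 + me;
        ishmem_long_put(dst, src, 1, (me + 1) % npes);  // device-initiated RMA
    }).wait();

    // Per the results table below, the run dies in this region on 2 PEs:
    // AtomicAccessViolation with IPC enabled, host proxy SEGV with IPC disabled.
    ishmem_barrier_all();

    long out = 0;
    q.memcpy(&out, dst, sizeof(long)).wait();
    printf("[PE %d] received %ld\n", me, out);

    ishmem_free(src);
    ishmem_free(dst);
    ishmem_finalize();
    return 0;
}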


Environment

  • Hardware: 2× Intel Data Center GPU Max 1100 (PVC, intel_gpu_pvc), PCIe-attached, single node
  • Kernel: Linux 5.15.0 (i915 driver; xe not available on this kernel)
  • Intel oneAPI: 2025.3 (DPC++/C++ Compiler 2025.3.2, MPI 2021.17 Build 20251215)
  • Intel SHMEM: 1.5.0 (built from source on the main branch; also tested the prebuilt intel-shmem-1.5.0.224_offline.sh with the same behavior)
  • NEO compute runtime: 25.18.33578.51
  • ishmem build: cmake .. -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DENABLE_MPI=ON -DMPI_DIR=/opt/intel/oneapi/mpi/2021.17

Test results (2 PEs, one on each GPU)

Test                                       ENABLE_GPU_IPC=true        ENABLE_GPU_IPC=false
1_helloworld                               ✅ pass                     ✅ pass
align, init_attr, timestamp                ✅ pass                     ✅ pass
barrier, sync                              ✅ pass                     ✅ pass
reduce_sum/min/max/prod/and/or/xor         ✅ pass                     ✅ pass
put, get, put_nbi, get_nbi, ibput          ❌ AtomicAccessViolation    ❌ Host SEGV
alltoall, broadcast, collect, fcollect     ❌ AtomicAccessViolation    ❌ Host SEGV
amo_* (all variants)                       ❌ AtomicAccessViolation    ❌ Host SEGV

Same pattern observed in test/performance/*_bw benchmarks.


Bug 1: Silent SEGV with ENABLE_GPU_IPC=false

Reproduction

source /opt/intel/oneapi/setvars.sh
export EnableImplicitScaling=0 NEOReadDebugKeys=1
export ISHMEM_ENABLE_GPU_IPC=false
export ISHMEM_DEBUG=1
cd $BUILD/test/unit
mpirun -n 2 bash -c 'export ZE_AFFINITY_MASK=$MPI_LOCALRANKID; exec ./put'

Output

[0] Testing device with device memory
[1] Testing device with device memory
BAD TERMINATION ... RANK 1 ... KILLED BY SIGNAL: 11 (Segmentation fault)

strace shows the SEGV is in the host proxy thread (spawned in proxy_init), not the main thread, with si_code=SEGV_MAPERR and si_addr=0xff000000006096c0, an Intel GPU USM virtual address that is not mapped into the CPU process.

Possible root cause

With ENABLE_GPU_IPC=false:

  • ipc_init() is skipped (ishmem.cpp:326), so local_pes[remote_pe] stays 0.
  • Device ishmem_put sees local_index == 0 and falls through to ishmemi_proxy_blocking_request (rma_impl.h:39).
  • The host proxy thread dequeues the request and calls ishmemi_upcall_funcs[PUT][UINT8], i.e. ishmem_uint8_put, on the host (proxy_func.cpp:17-22).
  • The host-side ishmem_internal_put calls ishmemi_ipc_put (rma_impl.h:40), which hits get_ipc_buffer(pe, dst); that returns nullptr because local_pes[pe] == 0 (runtime_ipc.h:28).
  • ishmemi_ipc_put therefore returns non-zero, so the code falls through to ishmemi_runtime->proxy_funcs[PUT][UINT8], the MPI backend (rma_impl.h:41).
  • The MPI backend calls MPI_Put(src, ..., dest_disp, ..., win), where win was created with ishmemi_heap_base (a GPU USM pointer) as the base (runtime_mpi.cpp:1384).
  • Intel MPI 2021.17 (without I_MPI_OFFLOAD=2) does not handle GPU USM as a window base; its RMA path CPU-dereferences the address → SEGV at the USM address. (A standalone sketch of this pattern follows the list.)
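
To isolate that last step, the same failure pattern should be reproducible without ishmem at all: create an MPI window whose base is SYCL device USM and target it with MPI_Put. A minimal sketch (my own reduction of the call chain above, assuming Intel MPI + DPC++; this is not ishmem code):

// Standalone: MPI_Put into a window whose base is device USM,
// mirroring what runtime_mpi.cpp does with ishmemi_heap_base.
#include <mpi.h>
#include <sycl/sycl.hpp>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, npes;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);

    sycl::queue q{sycl::gpu_selector_v};
    long *heap = sycl::malloc_device<long>(1, q);   // GPU USM window base

    MPI_Win win;
    MPI_Win_create(heap, sizeof(long), sizeof(long), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    int peer = (rank + 1) % npes;
    long payload = 1000 + rank;
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, peer, 0, win);
    // Without GPU-aware RMA the implementation may CPU-dereference the USM
    // window base -> SEGV_MAPERR at the device address, as seen in the proxy.
    MPI_Put(&payload, 1, MPI_LONG, peer, 0, 1, MPI_LONG, win);
    MPI_Win_unlock(peer, win);

    MPI_Win_free(&win);
    sycl::free(heap, q);
    MPI_Finalize();
    return 0;
}

If this standalone program also SEGVs without I_MPI_OFFLOAD=2, the bug reduces to ishmem handing a GPU USM window to an MPI backend whose GPU RMA support was never validated, and window creation would be the natural place to fail fast.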

Partial workaround

Setting I_MPI_OFFLOAD=2 I_MPI_OFFLOAD_RMA=1 (Intel MPI 2021.17) changes the behavior:

  • No SEGV.
  • But data is corrupted: received bytes are stamped with 0x8080808080808080 / 0x8181818181818181 patterns instead of the actual payload. This is probably a separate Intel MPI / libfabric bug with GPU RMA, but it surfaces here because ishmem has no way to validate the backend's GPU RMA support.

Bug 2: No guidance when PCIe AtomicOps are unavailable

Reproduction

Defaults (ENABLE_GPU_IPC=true), same hardware:

mpirun -n 2 bash -c 'export ZE_AFFINITY_MASK=$MPI_LOCALRANKID; exec ./alltoall-device'

Output

Segmentation fault from GPU at 0xff00000020210000, ctx_id: 1 (CCS)
  type: 2 (AtomicAccessViolation), level: 0 (PTE), access: 2 (Atomic),
  banned: 1, aborting.
Abort was called at 288 line in file: ./shared/source/os_interface/linux/drm_neo.cpp

Root cause

ishmemi_team_sync (collectives/sync_impl.h:56) does atomic_psync += 1L on the remote PE's psync counter, with the address translated through ISHMEMI_FAST_ADJUST. That is a cross-device atomic fetch-add over PCIe. On PCIe-attached Max GPUs without PCIe AtomicOps support (which needs both a BIOS setting and the newer xe driver), the kernel rejects the atomic at the PTE level.

Things that do NOT fix it (tested)

  • ISHMEM_ENABLE_GPU_IPC_PIDFD=false (forces socket-based IPC)
  • EnableConcurrentSharedCrossP2PDeviceAccess=1 (NEO debug key)
  • DisableScratchPages=0 EnableRecoverablePageFaults=1 (suppresses the abort, but the run then fails with UR_RESULT_ERROR_DEVICE_LOST)
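
Since no runtime knob avoids the fault, a fail-fast at ishmem_init with a clear message would be much friendlier than the GPU-side abort. The condition looks queryable through the public Level Zero API via zeDeviceGetP2PProperties; here is a standalone probe sketch (my code, not existing ishmem functionality):

// Probe P2P atomic capability between every GPU pair. If
// ZE_DEVICE_P2P_PROPERTY_FLAG_ATOMICS is absent, cross-device atomics
// (the psync fetch-add above) cannot work and ishmem could error out early.
#include <level_zero/ze_api.h>
#include <cstdio>
#include <vector>

int main() {
    zeInit(ZE_INIT_FLAG_GPU_ONLY);

    uint32_t ndrv = 1;
    ze_driver_handle_t drv;
    zeDriverGet(&ndrv, &drv);

    uint32_t ndev = 0;
    zeDeviceGet(drv, &ndev, nullptr);
    std::vector<ze_device_handle_t> dev(ndev);
    zeDeviceGet(drv, &ndev, dev.data());

    for (uint32_t i = 0; i < ndev; i++) {
        for (uint32_t j = 0; j < ndev; j++) {
            if (i == j) continue;
            ze_device_p2p_properties_t p2p = {};
            p2p.stype = ZE_STRUCTURE_TYPE_DEVICE_P2P_PROPERTIES;
            zeDeviceGetP2PProperties(dev[i], dev[j], &p2p);
            printf("GPU %u -> GPU %u: access=%d atomics=%d\n", i, j,
                   !!(p2p.flags & ZE_DEVICE_P2P_PROPERTY_FLAG_ACCESS),
                   !!(p2p.flags & ZE_DEVICE_P2P_PROPERTY_FLAG_ATOMICS));
        }
    }
    return 0;
}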

Also affecting

Same-GPU placement (ZE_AFFINITY_MASK=0 on both PEs) hangs indefinitely on barrier/reduce_sum/put; presumably the atomic-spin synchronization deadlocks when two contexts share a device. Probably won't be prioritized, but noting it.


Things that work fine

  • Single PE (mpirun -n 1): all tests pass
  • Collectives backed by MPI (reductions, barriers, sync): work in both ENABLE_GPU_IPC={true,false} modes
  • ishmem build, runtime, and local-device code paths are all healthy
