
fix(preflight): restrict amd-smi/rocm-smi calls to LOCAL_RANK 0 #719

Draft

yeandy wants to merge 1 commit into dev/preflight-direct-test from yeandy/preflight-amdsmi-local-rank-gate

Conversation


yeandy commented May 8, 2026

Summary

  • Restricts all amd-smi and rocm-smi subprocess invocations in the GPU preflight to LOCAL_RANK == 0 only, reducing spawns from 8×N_nodes to 1×N_nodes
  • Adds module-level caching to probe_gpus() to prevent redundant probes within a single rank (see the sketch after this list)
  • Suppresses spurious warnings on non-zero ranks that legitimately skip SMI tooling
  • Addresses simultaneous timeout failures across all nodes in large-scale jobs (observed on a 128-node / 1024-GPU run)
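
A minimal sketch of the gate-plus-cache pattern, assuming a simplified probe_gpus() (the real function in gpu/gpu_probe.py returns richer data; everything beyond the probe_gpus and _PROBE_CACHE names is illustrative):

```python
import json
import os
import subprocess
from typing import Optional

# Module-level cache: repeated probe_gpus() calls within one rank reuse
# the first result instead of re-spawning amd-smi.
_PROBE_CACHE: Optional[dict] = None

def probe_gpus() -> dict:
    global _PROBE_CACHE
    if _PROBE_CACHE is not None:
        return _PROBE_CACHE
    if int(os.environ.get("LOCAL_RANK", "0")) != 0:
        # amd-smi output is node-level, so rank 0's probe covers the whole
        # host; other ranks skip SMI tooling without emitting warnings.
        _PROBE_CACHE = {}
        return _PROBE_CACHE
    result = subprocess.run(
        ["amd-smi", "list", "--json"],
        capture_output=True, text=True, timeout=10, check=True,
    )
    _PROBE_CACHE = {"gpus": json.loads(result.stdout)}
    return _PROBE_CACHE
```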

Motivation

A job failed on all 128 nodes simultaneously with:

subprocess.TimeoutExpired: Command '['amd-smi', 'list', '--json']' timed out after 10 seconds

Root cause: With --nproc_per_node 8, every rank spawns amd-smi subprocesses independently. This triggers two compounding issues:

  1. /dev/shm mutex contention: rocm_smi_lib uses a shared POSIX mutex in /dev/shm. When 8 processes per node contend on it simultaneously, pthread_mutex_timedlock() can exceed 10 seconds (known upstream bug: ROCm/rocm_smi_lib#88, "Initialization sometimes fails on multi-GPU nodes due to race condition", still unfixed as of ROCm 6.x).
  2. Shared filesystem subprocess overhead: Spawning Python subprocesses from a shared NFS/Lustre conda environment causes a metadata thundering herd across 1024 ranks.

Since amd-smi output is node-level (identical for all ranks on the same host), only one rank per node needs to invoke it.

Changes

| File | Change |
| --- | --- |
| gpu/gpu_probe.py | Gate _probe_amd_smi() and _probe_rocm_smi() behind LOCAL_RANK == 0; add _PROBE_CACHE |
| gpu/gpu_topology.py | Gate _numa_mapping_best_effort() and _xgmi_presence_best_effort() behind LOCAL_RANK == 0; downgrade warnings to info on non-zero ranks |
| gpu/gpu_basic.py | Suppress spurious "ROCm tooling not found" warnings on non-zero ranks (expected behavior, not an error) |
| host/host_probe.py | Extract get_gpu_count_rocm_fallback() (HIP_VISIBLE_DEVICES / sysfs, no subprocess) from get_gpu_count_rocm() |
| host/info.py | Gate get_gpu_count_rocm() (rocm-smi --showid) to LOCAL_RANK == 0; non-zero ranks use the subprocess-free fallback |
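
A sketch of how that subprocess-free fallback could work; the HIP_VISIBLE_DEVICES handling and the sysfs vendor check are assumptions about the implementation, not the actual code:

```python
import glob
import os

def get_gpu_count_rocm_fallback() -> int:
    """Count AMD GPUs without spawning rocm-smi (sketch)."""
    # If the launcher already restricted visibility, trust it.
    visible = os.environ.get("HIP_VISIBLE_DEVICES")
    if visible:
        return len([d for d in visible.split(",") if d.strip()])
    # Otherwise count AMD devices (PCI vendor 0x1002) via sysfs.
    count = 0
    for vendor_file in glob.glob("/sys/class/drm/card[0-9]*/device/vendor"):
        try:
            with open(vendor_file) as f:
                if f.read().strip() == "0x1002":
                    count += 1
        except OSError:
            continue
    return count
```

Reading environment variables and sysfs avoids both the subprocess spawn and the /dev/shm mutex, which is what makes the fallback safe for non-zero ranks.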

Impact

| Metric | Before | After |
| --- | --- | --- |
| amd-smi spawns per job | 8 × N_nodes | 1 × N_nodes |
| rocm-smi spawns per job | 8–16 × N_nodes | 1 × N_nodes |
| Mutex contenders per node | 8+ | 1 |
| FS metadata ops (subprocess) | 8–16 × N_nodes | 1 × N_nodes |

For a 128-node job, this cuts 1,024+ subprocess spawns to 128 (an 87.5%+ reduction).

Validation (MI325X, strace-verified)

All validation used strace -f -e trace=execve to count every amd-smi/rocm-smi subprocess call from Python workers:

| Test mode | Nodes | GPUs | SMI calls from Python workers (per node) | Result |
| --- | --- | --- | --- | --- |
| --perf-test only | 1 | 8 | 0 (probe_gpus not invoked) | Pass |
| --gpu only | 1 | 8 | All from LOCAL_RANK 0 only (amd-smi list --json, rocm-smi -a, topology calls) | Pass |
| --gpu --network --perf-test | 2 | 16 | 0 from Python workers; bash runner 2×/node (pre-torchrun, expected) | Pass |
| --gpu --network --perf-test | 8 | 64 | 0 from Python workers; bash runner 2×/node (pre-torchrun, expected) | Pass |

Stress test: 512 concurrent amd-smi calls with a 2-second timeout reproduced a 100% timeout rate, confirming the contention mechanism this PR eliminates (see the repro sketch below).

All tests exited with code 0, with zero spurious warnings on non-zero ranks.
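
A repro sketch of that stress test, using only the standard library (the worker count and timeout match the numbers above; assumes amd-smi is on PATH):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def one_call(_: int) -> str:
    """Invoke amd-smi once; report whether it hit the 2 s timeout."""
    try:
        subprocess.run(
            ["amd-smi", "list", "--json"],
            capture_output=True, timeout=2, check=True,
        )
        return "ok"
    except subprocess.TimeoutExpired:
        return "timeout"

# 512 concurrent invocations contend on rocm_smi_lib's shared /dev/shm
# mutex; under that contention every call can exceed its timeout.
with ThreadPoolExecutor(max_workers=512) as pool:
    results = list(pool.map(one_call, range(512)))
print(f"timeouts: {results.count('timeout')}/512")
```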

yeandy force-pushed the yeandy/preflight-amdsmi-local-rank-gate branch from aa1c5a3 to ac536fe on May 8, 2026 at 21:41
yeandy force-pushed the yeandy/preflight-amdsmi-local-rank-gate branch from ac536fe to 6ba346b on May 8, 2026 at 23:20