
fix(preflight): restrict amd-smi/rocm-smi calls to LOCAL_RANK 0 #719

Draft

yeandy wants to merge 1 commit into dev/preflight-direct-test from yeandy/preflight-amdsmi-local-rank-gate

Conversation


yeandy commented May 8, 2026

Summary

  • Restricts all amd-smi and rocm-smi subprocess invocations in the GPU preflight to LOCAL_RANK == 0 only, reducing spawns from 8×N_nodes to 1×N_nodes
  • Adds module-level caching to probe_gpus() to prevent redundant probes within a single rank (see the sketch after this list)
  • Suppresses spurious warnings on non-zero ranks that legitimately skip SMI tooling
  • Addresses simultaneous timeout failures across all nodes in large-scale jobs (observed on a 128-node / 1024-GPU run)
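
A minimal sketch of the gate-plus-cache pattern, assuming a simplified probe_gpus() (the real function in gpu/gpu_probe.py returns richer data; everything beyond the probe_gpus and _PROBE_CACHE names is illustrative):

```python
import json
import os
import subprocess
from typing import Optional

# Module-level cache: repeated probe_gpus() calls within one rank reuse
# the first result instead of re-spawning amd-smi.
_PROBE_CACHE: Optional[dict] = None

def probe_gpus() -> dict:
    global _PROBE_CACHE
    if _PROBE_CACHE is not None:
        return _PROBE_CACHE
    if int(os.environ.get("LOCAL_RANK", "0")) != 0:
        # amd-smi output is node-level, so rank 0's probe covers the whole
        # host; other ranks skip SMI tooling without emitting warnings.
        _PROBE_CACHE = {}
        return _PROBE_CACHE
    result = subprocess.run(
        ["amd-smi", "list", "--json"],
        capture_output=True, text=True, timeout=10, check=True,
    )
    _PROBE_CACHE = {"gpus": json.loads(result.stdout)}
    return _PROBE_CACHE
```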

Motivation

A job failed on all 128 nodes simultaneously with:

subprocess.TimeoutExpired: Command '['amd-smi', 'list', '--json']' timed out after 10 seconds

Root cause: With --nproc_per_node 8, every rank spawns amd-smi subprocesses independently. This triggers two compounding issues:

  1. /dev/shm mutex contention: rocm_smi_lib uses a shared POSIX mutex in /dev/shm. When 8 processes per node contend on it simultaneously, pthread_mutex_timedlock() can exceed 10 seconds (known upstream bug: ROCm/rocm_smi_lib#88, "Initialization sometimes fails on multi-GPU nodes due to race condition", still unfixed as of ROCm 6.x).
  2. Shared filesystem subprocess overhead: Spawning Python subprocesses from a shared NFS/Lustre conda environment causes a metadata thundering herd across 1024 ranks.

Since amd-smi output is node-level (identical for all ranks on the same host), only one rank per node needs to invoke it.

Changes

| File | Change |
| --- | --- |
| gpu/gpu_probe.py | Gate _probe_amd_smi() and _probe_rocm_smi() behind LOCAL_RANK == 0; add _PROBE_CACHE |
| gpu/gpu_topology.py | Gate _numa_mapping_best_effort() and _xgmi_presence_best_effort() behind LOCAL_RANK == 0; downgrade warnings to info on non-zero ranks |
| gpu/gpu_basic.py | Suppress spurious "ROCm tooling not found" warnings on non-zero ranks (expected behavior, not an error) |
| host/host_probe.py | Extract get_gpu_count_rocm_fallback() (HIP_VISIBLE_DEVICES / sysfs, no subprocess) from get_gpu_count_rocm() |
| host/info.py | Gate get_gpu_count_rocm() (rocm-smi --showid) to LOCAL_RANK == 0; non-zero ranks use the subprocess-free fallback |
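
A sketch of how that subprocess-free fallback could work; the HIP_VISIBLE_DEVICES handling and the sysfs vendor check are assumptions about the implementation, not the actual code:

```python
import glob
import os

def get_gpu_count_rocm_fallback() -> int:
    """Count AMD GPUs without spawning rocm-smi (sketch)."""
    # If the launcher already restricted visibility, trust it.
    visible = os.environ.get("HIP_VISIBLE_DEVICES")
    if visible:
        return len([d for d in visible.split(",") if d.strip()])
    # Otherwise count AMD devices (PCI vendor 0x1002) via sysfs.
    count = 0
    for vendor_file in glob.glob("/sys/class/drm/card[0-9]*/device/vendor"):
        try:
            with open(vendor_file) as f:
                if f.read().strip() == "0x1002":
                    count += 1
        except OSError:
            continue
    return count
```

Reading environment variables and sysfs avoids both the subprocess spawn and the /dev/shm mutex, which is what makes the fallback safe for non-zero ranks.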

Impact

| Metric | Before | After |
| --- | --- | --- |
| amd-smi spawns per job | 8 × N_nodes | 1 × N_nodes |
| rocm-smi spawns per job | 8–16 × N_nodes | 1 × N_nodes |
| Mutex contenders per node | 8+ | 1 |
| FS metadata ops (subprocess) | 8–16 × N_nodes | 1 × N_nodes |

For a 128-node job, this cuts 1,024+ subprocess spawns to 128 (an 87.5%+ reduction).

Validation (MI325X, strace-verified)

All validation used strace -f -e trace=execve to count every amd-smi/rocm-smi subprocess call from Python workers:

| Test mode | Nodes | GPUs | SMI calls from Python workers (per node) | Result |
| --- | --- | --- | --- | --- |
| --perf-test only | 1 | 8 | 0 (probe_gpus not invoked) | Pass |
| --gpu only | 1 | 8 | All from LOCAL_RANK 0 only (amd-smi list --json, rocm-smi -a, topology calls) | Pass |
| --gpu --network --perf-test | 2 | 16 | 0 from Python workers; bash runner 2×/node (pre-torchrun, expected) | Pass |
| --gpu --network --perf-test | 8 | 64 | 0 from Python workers; bash runner 2×/node (pre-torchrun, expected) | Pass |

Stress test: 512 concurrent amd-smi calls with a 2-second timeout reproduced a 100% timeout rate, confirming the contention mechanism this PR eliminates (see the repro sketch below).

All tests exited with code 0, with zero spurious warnings on non-zero ranks.
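
A repro sketch of that stress test, using only the standard library (the worker count and timeout match the numbers above; assumes amd-smi is on PATH):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def one_call(_: int) -> str:
    """Invoke amd-smi once; report whether it hit the 2 s timeout."""
    try:
        subprocess.run(
            ["amd-smi", "list", "--json"],
            capture_output=True, timeout=2, check=True,
        )
        return "ok"
    except subprocess.TimeoutExpired:
        return "timeout"

# 512 concurrent invocations contend on rocm_smi_lib's shared /dev/shm
# mutex; under that contention every call can exceed its timeout.
with ThreadPoolExecutor(max_workers=512) as pool:
    results = list(pool.map(one_call, range(512)))
print(f"timeouts: {results.count('timeout')}/512")
```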

yeandy force-pushed the yeandy/preflight-amdsmi-local-rank-gate branch from aa1c5a3 to ac536fe on May 8, 2026 at 21:41
yeandy force-pushed the yeandy/preflight-amdsmi-local-rank-gate branch from ac536fe to 6ba346b on May 8, 2026 at 23:20