fix(preflight): restrict amd-smi/rocm-smi calls to LOCAL_RANK 0 (#719)
Draft
yeandy wants to merge 1 commit into dev/preflight-direct-test from
## Summary

- Restrict `amd-smi` and `rocm-smi` subprocess invocations in the GPU preflight to `LOCAL_RANK == 0` only, reducing spawns from 8×N_nodes to 1×N_nodes
- Cache the result of `probe_gpus()` to prevent redundant probes within a single rank
## Motivation

A job failed on all 128 nodes simultaneously with:
Root cause: With `--nproc_per_node 8`, every rank spawns `amd-smi` subprocesses independently. This triggers two compounding issues:

- `/dev/shm` mutex contention: `rocm_smi_lib` uses a shared POSIX mutex in `/dev/shm`. When 8 processes per node contend on it simultaneously, `pthread_mutex_timedlock()` can exceed 10s (known upstream bug: "Initialization sometimes fails on multi-GPU nodes due to race condition", ROCm/rocm_smi_lib#88, still unfixed as of ROCm 6.x).

Since `amd-smi` output is node-level (identical for all ranks on the same host), only one rank per node needs to invoke it.

## Changes
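The pile-up on the shared mutex can be illustrated with a toy shared lock. This is a pure-Python stand-in, not `rocm_smi_lib`'s actual `/dev/shm` mutex: eight "ranks" contend for one lock with a short acquire timeout (playing the role of `pthread_mutex_timedlock`'s deadline), and only whoever wins the lock first succeeds.

```python
import multiprocessing as mp
import time

HOLD_SECONDS = 2.0      # how long the winner holds the lock (a slow probe)
ACQUIRE_TIMEOUT = 0.25  # each waiter's deadline, like pthread_mutex_timedlock

def rank_probe(lock, results, idx):
    # Each "rank" tries to take the shared lock, the way every process on a
    # node contends on rocm_smi_lib's POSIX mutex.
    got_it = lock.acquire(timeout=ACQUIRE_TIMEOUT)
    if got_it:
        try:
            time.sleep(HOLD_SECONDS)  # simulate the probe holding the mutex
        finally:
            lock.release()
    results[idx] = 1 if got_it else 0

def run_demo(n_ranks=8):
    lock = mp.Lock()
    results = mp.Array("i", n_ranks)
    procs = [mp.Process(target=rank_probe, args=(lock, results, i))
             for i in range(n_ranks)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return list(results)

if __name__ == "__main__":
    outcome = run_demo()
    # One rank wins the lock; the rest exceed their timeout -- analogous to
    # amd-smi probes blowing past their subprocess deadline under contention.
    print(f"{sum(outcome)}/{len(outcome)} ranks acquired the lock")
```

Gating the probe to one rank per node removes the contention entirely rather than just widening the timeout.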
- `gpu/gpu_probe.py`: gate `_probe_amd_smi()` and `_probe_rocm_smi()` behind `LOCAL_RANK == 0`; add `_PROBE_CACHE` to `probe_gpus()` to prevent redundant probes within a single rank
- `gpu/gpu_topology.py`: gate `_numa_mapping_best_effort()` and `_xgmi_presence_best_effort()` behind `LOCAL_RANK == 0`; downgrade warnings to info on non-zero ranks
- `gpu/gpu_basic.py`, `host/host_probe.py`: split `get_gpu_count_rocm_fallback()` (HIP_VISIBLE_DEVICES / sysfs, no subprocess) out of `get_gpu_count_rocm()`
- `host/info.py`: restrict `get_gpu_count_rocm()` (`rocm-smi --showid`) to `LOCAL_RANK == 0`; non-zero ranks use the subprocess-free fallback
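A minimal sketch of the gating-plus-cache pattern and the subprocess-free fallback. Function names, the `lru_cache` stand-in for `_PROBE_CACHE`, and the DRM-node sysfs heuristic are illustrative assumptions, not the actual preflight code:

```python
import functools
import os
import re
import subprocess
from typing import Optional

def _local_rank() -> int:
    # torchrun exports LOCAL_RANK; default to 0 for single-process runs.
    return int(os.environ.get("LOCAL_RANK", "0"))

@functools.lru_cache(maxsize=None)  # stands in for _PROBE_CACHE: probe at most once per process
def probe_amd_smi() -> Optional[str]:
    """Run `amd-smi list --json` on local rank 0 only.

    The output is node-level, so non-zero ranks skip the subprocess entirely.
    """
    if _local_rank() != 0:
        return None
    try:
        out = subprocess.run(
            ["amd-smi", "list", "--json"],
            capture_output=True, text=True, timeout=10, check=True,
        )
        return out.stdout
    except (OSError, subprocess.SubprocessError):
        return None  # missing binary, non-zero exit, or timeout: no probe data

def get_gpu_count_rocm_fallback() -> int:
    """Subprocess-free GPU count for non-zero ranks: prefer
    HIP_VISIBLE_DEVICES, else count DRM card nodes in sysfs (one
    possible heuristic -- an assumption here)."""
    visible = os.environ.get("HIP_VISIBLE_DEVICES")
    if visible:
        return len([d for d in visible.split(",") if d.strip()])
    drm = "/sys/class/drm"
    if os.path.isdir(drm):
        return len([e for e in os.listdir(drm) if re.fullmatch(r"card\d+", e)])
    return 0
```

The cache also covers the case where multiple preflight checks call the probe within one rank: only the first call pays for the subprocess.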
## Impact

| Spawns per job | Before | After |
| --- | --- | --- |
| `amd-smi` | 8 × N_nodes | 1 × N_nodes |
| `rocm-smi` | 8 × N_nodes | 1 × N_nodes |
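Where those counts come from, using the incident's job shape (128 nodes × 8 ranks per node); `probe_spawns` is a hypothetical helper for illustration, not preflight code:

```python
def probe_spawns(n_nodes: int, ranks_per_node: int, gated: bool) -> int:
    """Subprocess spawns per probe call site across the whole job:
    ungated, every rank probes; gated, only LOCAL_RANK 0 on each node does."""
    ranks_probing = 1 if gated else ranks_per_node
    return n_nodes * ranks_probing

print(probe_spawns(128, 8, gated=False))  # 1024 spawns before this PR
print(probe_spawns(128, 8, gated=True))   # 128 spawns after
```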
## Validation (MI325X, strace-verified)

All validation used `strace -f -e trace=execve` to count every `amd-smi`/`rocm-smi` subprocess call from the Python workers:

- `--perf-test` only (`probe_gpus` not invoked)
- `--gpu` only (`amd-smi list --json`, `rocm-smi -a`, topology calls)
- `--gpu --network --perf-test`
- `--gpu --network --perf-test`

Concurrent `amd-smi` calls with a 2s timeout reproduced a 100% timeout rate, confirming the contention mechanism that this PR eliminates.