feat(runner): add run_preflight_direct.sh for non-container preflight#705
Draft
feat(runner): add run_preflight_direct.sh for non-container preflight#705
Conversation
SLURM-aware wrapper around `primus-cli direct -- preflight` that exports the distributed env (NNODES, NODE_RANK, MASTER_ADDR, ...) which `primus-cli direct` does not derive from SLURM itself, activates a shared Python venv via VENV_ACTIVATE, supports a wrapper-only --silent flag (preserving the report path via a saved fd), and auto-generates unique timestamped report names to avoid colliding with stale leftovers. Also add docs/preflight-direct.md with a full walkthrough (venv setup, SLURM invocation, NCCL config for Broadcom / Pensando AINIC, troubleshooting), and cross-link it from docs/README.md and docs/preflight.md.
### Summary
Preflight inter-node benchmarks assumed power-of-2 node counts, causing
two bugs:
1. **Silent exclusion** -- remainder nodes never benchmarked for
2-node/4-node adjacency tests (e.g., on 6 nodes with 4-node adjacency,
nodes 4-5 were silently dropped -- 33% of nodes never tested)
2. **P2P crash** -- odd node counts triggered `AssertionError: peer_rank
>= WORLD_SIZE` because the unpaired trailing node tried to compute a
peer beyond world size
This PR fixes both so preflight runs correctly on **any** node count.
### What changed
| File | Change |
|---|---|
| `inter_node_comm.py` | Create remainder group when `num_nodes %
adjacent_nodes >= 2`. Use `dist.get_world_size()` for actual group size.
Report via explicit `group_leaders` list instead of modular arithmetic.
|
| `inter_node_comm_p2p.py` | Guard `peer_rank` with `RANK <
num_paired_ranks`. Unpaired ranks get `peer_rank = -1` and skip
gracefully. Same guard in reporting loop. |
### How it works
```
remainder = num_nodes % adjacent_nodes
remainder = 0 -> all nodes in equal groups (unchanged)
remainder = 1 -> excluded (correct -- can't do inter-node comm with 1 node)
remainder >= 2 -> OLD: silently excluded. NEW: remainder group created & benchmarked
```
Example -- 6 nodes, `allreduce-4nodes`:
- **Old:** only nodes 0-3 benchmarked (1 group of 32 ranks). Nodes 4-5
silently excluded.
- **New:** nodes 0-3 form the main group (32 ranks), nodes 4-5 form a
remainder group (16 ranks). Both reported.
### Test plan
Ran preflight on **1-9 nodes** with `NCCL_DEBUG=INFO` enabled. All
passed cleanly.
| Nodes | Verification |
|---|---|
| 2, 4, 8 | POW2 regression -- identical to main |
| 3, 5, 9 | Odd counts -- P2P no crash, unpaired node skipped gracefully
|
| 6 | `allreduce-4nodes` reports 2 group leaders: rank 0 (4-node main) +
rank 32 (2-node remainder) |
| 7 | `allreduce-4nodes` reports 2 group leaders: rank 0 (4-node main) +
rank 32 (3-node remainder) |
| 1 | All inter-node tests skip cleanly |
#### NCCL-level verification (6N example)
Confirmed via `ncclCommSplit_impl` logs that two separate
sub-communicators are created:
```
splitCount 87: g52+g53 (nodes 4-5) -> nranks=16, color 1721561492 (remainder group)
splitCount 88: g2+g14+g50+g51 (nodes 0-3) -> nranks=32, color 2008404567 (main group)
```
Different colors = separate sub-communicators. Remainder nodes do NOT
appear in the main group's split.
#### Caveat
When `remainder = 1` (e.g., 5N with 4-node adjacency), the single
leftover node is excluded from that sub-group test -- inter-node
communication requires >= 2 nodes. It's still covered by the full N-node
allreduce/alltoall and ring P2P.
### Test results
All jobs ran on partition `amd-tw` with `NCCL_DEBUG=INFO`. Node g57
excluded due to faulty RDMA NIC.
| Nodes | Ranks | Slurm Job | Nodes Used | Exit |
|---|---|---|---|---|
| 1 | 8 | 477 | g59 | 0 |
| 2 | 16 | 478 | g2, g14 | 0 |
| 3 | 24 | 479 | g25, g26, g27 | 0 |
| 4 | 32 | 480 | g50, g51, g52, g53 | 0 |
| 5 | 40 | 481 | g15, g29, g54, g55, g59 | 0 |
| 6 | 48 | 482 | g2, g14, g50, g51, g52, g53 | 0 |
| 7 | 56 | 483 | g15, g25, g26, g27, g29, g54, g55 | 0 |
| 8 | 64 | 484 | g2, g14, g15, g25, g26, g27, g29, g50 | 0 |
| 9 | 72 | 485 | g2, g14, g15, g50, g51, g52, g53, g54, g55 | 0 |
#### Remainder group output (6N, `slurm-482.out` line 54476)
```
=======InterNodeComm - allreduce-4nodes (GB/s)=======
Hostname Node Rank 2MB 4MB 8MB 16MB 32MB 64MB 128MB 256MB 512MB 1024MB
tus1-p3-g2 0 0 3.73 5.31 5.03 5.84 10.96 7.55 7.41 8.01 15.01 29.27
tus1-p3-g52 4 32 32.46 53.65 77.70 96.48 135.12 175.10 215.54 324.20 336.77 344.32
```
Two group leaders reported: rank 0 (main 4-node group, nodes 0-3) and
rank 32 (remainder 2-node group, nodes 4-5).
#### Remainder group output (7N, `slurm-483.out` line 68237)
```
=======InterNodeComm - allreduce-4nodes (GB/s)=======
Hostname Node Rank 2MB 4MB 8MB 16MB 32MB 64MB 128MB 256MB 512MB 1024MB
tus1-p3-g15 0 0 4.28 5.64 5.49 6.03 11.37 7.55 7.40 8.11 18.00 30.02
tus1-p3-g29 4 32 9.94 14.09 14.52 15.25 30.92 20.28 19.93 20.25 40.16 77.46
```
Two group leaders: rank 0 (main 4-node group, nodes 0-3) and rank 32
(remainder 3-node group, nodes 4-6).
#### P2P on odd counts (no crash)
```
=======InterNodeComm - p2p-2nodes (GB/s)======= [3N, slurm-479.out]
Hostname Node Rank 2MB ... 1024MB
tus1-p3-g25 0 0 12.25 ... 20.40 <- paired (node 0 <-> node 1)
node 2 absent (unpaired, skipped)
=======InterNodeComm - p2p-2nodes (GB/s)======= [5N, slurm-481.out]
Hostname Node Rank 2MB ... 1024MB
tus1-p3-g15 0 0 10.83 ... 10.12 <- pair 1 (node 0 <-> node 1)
tus1-p3-g15 0 1 11.00 ... 9.98
...
node 4 absent (unpaired, skipped)
```
#### Ring P2P includes all nodes (odd counts)
```
=======InterNodeRingP2P bidirectional bandwidth - ringsize=5 - (GB/s)=======
ring 10MB 20MB 40MB 80MB 160MB
0 49.79 52.22 53.95 54.93 54.87
1 49.57 53.02 53.80 54.93 55.19
2 49.71 52.39 53.84 54.78 54.98
3 49.37 52.21 53.94 54.69 55.21
4 48.15 53.20 53.84 54.94 55.24
5 47.59 52.54 53.73 54.70 55.07
```
All 5 nodes participate in the ring (including node 4 which was excluded
from P2P pairing).
#### NCCL warnings
All NCCL WARN messages across all jobs are benign IB port-state
notifications:
```
NCCL WARN NET/IB : rdmaN:1 Got async error event: port active
```
No `NCCL ERROR`, `ncclSystemError`, or `SIGTERM` in any completed job.
### Out of scope
This PR addresses preflight benchmark correctness only. Early-stopping /
RCCL hang diagnosis is in progress in #689.
…-node tests (#706) Allows skipping 2-node and 4-node subgroup inter-node comm tests, running only the all-node allreduce/alltoall and ring p2p.
Move optional dependencies to lazy imports so preflight can run with only torch installed. matplotlib is now imported only when --plot is used; markdown2 and weasyprint are imported only inside md_to_pdf() (skipped when --disable-pdf is passed). --------- Co-authored-by: Andrew Ma <andrew.ma@amd.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
SLURM-aware wrapper around
primus-cli direct -- preflightthat exports the distributed env (NNODES, NODE_RANK, MASTER_ADDR, ...) whichprimus-cli directdoes not derive from SLURM itself, activates a shared Python venv via VENV_ACTIVATE, supports a wrapper-only --silent flag (preserving the report path via a saved fd), and auto-generates unique timestamped report names to avoid colliding with stale leftovers.Also add docs/preflight-direct.md with a full walkthrough (venv setup, SLURM invocation, NCCL config for Broadcom / Pensando AINIC, troubleshooting), and cross-link it from docs/README.md and docs/preflight.md.