Skip to content

feat(runner): add run_preflight_direct.sh for non-container preflight#705

Draft
yeandy wants to merge 4 commits intomainfrom
dev/preflight-direct-test
Draft

feat(runner): add run_preflight_direct.sh for non-container preflight#705
yeandy wants to merge 4 commits intomainfrom
dev/preflight-direct-test

Conversation

@yeandy
Copy link
Copy Markdown

@yeandy yeandy commented Apr 29, 2026

SLURM-aware wrapper around primus-cli direct -- preflight that exports the distributed env (NNODES, NODE_RANK, MASTER_ADDR, ...) which primus-cli direct does not derive from SLURM itself, activates a shared Python venv via VENV_ACTIVATE, supports a wrapper-only --silent flag (preserving the report path via a saved fd), and auto-generates unique timestamped report names to avoid colliding with stale leftovers.

Also add docs/preflight-direct.md with a full walkthrough (venv setup, SLURM invocation, NCCL config for Broadcom / Pensando AINIC, troubleshooting), and cross-link it from docs/README.md and docs/preflight.md.

amd-fuyuajin and others added 4 commits April 29, 2026 16:31
SLURM-aware wrapper around `primus-cli direct -- preflight` that
exports the distributed env (NNODES, NODE_RANK, MASTER_ADDR, ...)
which `primus-cli direct` does not derive from SLURM itself,
activates a shared Python venv via VENV_ACTIVATE, supports a
wrapper-only --silent flag (preserving the report path via a
saved fd), and auto-generates unique timestamped report names
to avoid colliding with stale leftovers.

Also add docs/preflight-direct.md with a full walkthrough (venv
setup, SLURM invocation, NCCL config for Broadcom / Pensando AINIC,
troubleshooting), and cross-link it from docs/README.md and
docs/preflight.md.
### Summary

Preflight inter-node benchmarks assumed power-of-2 node counts, causing
two bugs:

1. **Silent exclusion** -- remainder nodes never benchmarked for
2-node/4-node adjacency tests (e.g., on 6 nodes with 4-node adjacency,
nodes 4-5 were silently dropped -- 33% of nodes never tested)
2. **P2P crash** -- odd node counts triggered `AssertionError: peer_rank
>= WORLD_SIZE` because the unpaired trailing node tried to compute a
peer beyond world size

This PR fixes both so preflight runs correctly on **any** node count.

### What changed

| File | Change |
|---|---|
| `inter_node_comm.py` | Create remainder group when `num_nodes %
adjacent_nodes >= 2`. Use `dist.get_world_size()` for actual group size.
Report via explicit `group_leaders` list instead of modular arithmetic.
|
| `inter_node_comm_p2p.py` | Guard `peer_rank` with `RANK <
num_paired_ranks`. Unpaired ranks get `peer_rank = -1` and skip
gracefully. Same guard in reporting loop. |

### How it works

```
remainder = num_nodes % adjacent_nodes

remainder = 0  -> all nodes in equal groups (unchanged)
remainder = 1  -> excluded (correct -- can't do inter-node comm with 1 node)
remainder >= 2 -> OLD: silently excluded. NEW: remainder group created & benchmarked
```

Example -- 6 nodes, `allreduce-4nodes`:
- **Old:** only nodes 0-3 benchmarked (1 group of 32 ranks). Nodes 4-5
silently excluded.
- **New:** nodes 0-3 form the main group (32 ranks), nodes 4-5 form a
remainder group (16 ranks). Both reported.

### Test plan

Ran preflight on **1-9 nodes** with `NCCL_DEBUG=INFO` enabled. All
passed cleanly.

| Nodes | Verification |
|---|---|
| 2, 4, 8 | POW2 regression -- identical to main |
| 3, 5, 9 | Odd counts -- P2P no crash, unpaired node skipped gracefully
|
| 6 | `allreduce-4nodes` reports 2 group leaders: rank 0 (4-node main) +
rank 32 (2-node remainder) |
| 7 | `allreduce-4nodes` reports 2 group leaders: rank 0 (4-node main) +
rank 32 (3-node remainder) |
| 1 | All inter-node tests skip cleanly |

#### NCCL-level verification (6N example)

Confirmed via `ncclCommSplit_impl` logs that two separate
sub-communicators are created:
```
splitCount 87: g52+g53 (nodes 4-5) -> nranks=16, color 1721561492  (remainder group)
splitCount 88: g2+g14+g50+g51 (nodes 0-3) -> nranks=32, color 2008404567  (main group)
```
Different colors = separate sub-communicators. Remainder nodes do NOT
appear in the main group's split.

#### Caveat

When `remainder = 1` (e.g., 5N with 4-node adjacency), the single
leftover node is excluded from that sub-group test -- inter-node
communication requires >= 2 nodes. It's still covered by the full N-node
allreduce/alltoall and ring P2P.

### Test results

All jobs ran on partition `amd-tw` with `NCCL_DEBUG=INFO`. Node g57
excluded due to faulty RDMA NIC.

| Nodes | Ranks | Slurm Job | Nodes Used | Exit |
|---|---|---|---|---|
| 1 | 8 | 477 | g59 | 0 |
| 2 | 16 | 478 | g2, g14 | 0 |
| 3 | 24 | 479 | g25, g26, g27 | 0 |
| 4 | 32 | 480 | g50, g51, g52, g53 | 0 |
| 5 | 40 | 481 | g15, g29, g54, g55, g59 | 0 |
| 6 | 48 | 482 | g2, g14, g50, g51, g52, g53 | 0 |
| 7 | 56 | 483 | g15, g25, g26, g27, g29, g54, g55 | 0 |
| 8 | 64 | 484 | g2, g14, g15, g25, g26, g27, g29, g50 | 0 |
| 9 | 72 | 485 | g2, g14, g15, g50, g51, g52, g53, g54, g55 | 0 |

#### Remainder group output (6N, `slurm-482.out` line 54476)

```
=======InterNodeComm - allreduce-4nodes (GB/s)=======
Hostname      Node  Rank  2MB    4MB    8MB    16MB   32MB   64MB   128MB  256MB  512MB  1024MB
tus1-p3-g2    0     0     3.73   5.31   5.03   5.84   10.96  7.55   7.41   8.01   15.01  29.27
tus1-p3-g52   4     32    32.46  53.65  77.70  96.48  135.12 175.10 215.54 324.20 336.77 344.32
```

Two group leaders reported: rank 0 (main 4-node group, nodes 0-3) and
rank 32 (remainder 2-node group, nodes 4-5).

#### Remainder group output (7N, `slurm-483.out` line 68237)

```
=======InterNodeComm - allreduce-4nodes (GB/s)=======
Hostname      Node  Rank  2MB    4MB    8MB    16MB   32MB   64MB   128MB  256MB  512MB  1024MB
tus1-p3-g15   0     0     4.28   5.64   5.49   6.03   11.37  7.55   7.40   8.11   18.00  30.02
tus1-p3-g29   4     32    9.94   14.09  14.52  15.25  30.92  20.28  19.93  20.25  40.16  77.46
```

Two group leaders: rank 0 (main 4-node group, nodes 0-3) and rank 32
(remainder 3-node group, nodes 4-6).

#### P2P on odd counts (no crash)

```
=======InterNodeComm - p2p-2nodes (GB/s)=======  [3N, slurm-479.out]
Hostname      Node  Rank  2MB    ...    1024MB
tus1-p3-g25   0     0     12.25  ...    20.40    <- paired (node 0 <-> node 1)
                                                    node 2 absent (unpaired, skipped)

=======InterNodeComm - p2p-2nodes (GB/s)=======  [5N, slurm-481.out]
Hostname      Node  Rank  2MB    ...    1024MB
tus1-p3-g15   0     0     10.83  ...    10.12    <- pair 1 (node 0 <-> node 1)
tus1-p3-g15   0     1     11.00  ...    9.98
...
                                                    node 4 absent (unpaired, skipped)
```

#### Ring P2P includes all nodes (odd counts)

```
=======InterNodeRingP2P bidirectional bandwidth - ringsize=5 - (GB/s)=======
ring 10MB   20MB   40MB   80MB   160MB
0    49.79  52.22  53.95  54.93  54.87
1    49.57  53.02  53.80  54.93  55.19
2    49.71  52.39  53.84  54.78  54.98
3    49.37  52.21  53.94  54.69  55.21
4    48.15  53.20  53.84  54.94  55.24
5    47.59  52.54  53.73  54.70  55.07
```

All 5 nodes participate in the ring (including node 4 which was excluded
from P2P pairing).

#### NCCL warnings

All NCCL WARN messages across all jobs are benign IB port-state
notifications:
```
NCCL WARN NET/IB : rdmaN:1 Got async error event: port active
```
No `NCCL ERROR`, `ncclSystemError`, or `SIGTERM` in any completed job.

### Out of scope

This PR addresses preflight benchmark correctness only. Early-stopping /
RCCL hang diagnosis is in progress in #689.
…-node tests (#706)

Allows skipping 2-node and 4-node subgroup inter-node comm tests,
running only the all-node allreduce/alltoall and ring p2p.
Move optional dependencies to lazy imports so preflight can run with
only torch installed. matplotlib is now imported only when --plot is
used; markdown2 and weasyprint are imported only inside md_to_pdf()
(skipped when --disable-pdf is passed).

---------

Co-authored-by: Andrew Ma <andrew.ma@amd.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants