feat(preflight): configurable perf tests + new node-local smoke test #712
amd-fuyuajin wants to merge 24 commits into dev/preflight-direct-test
Conversation
…nd reports
Introduce CLI flags to select which perf tests to run, override message and
group sizes, and apply a fast pre-launch preset. Reorganize the report so
large clusters stay readable, and harden the dispatcher with clear mode
precedence and fail-fast validation.
Test selection
- Add --tests CSV with tokens: gemm, intra-allreduce, intra-alltoall,
inter-allreduce, inter-alltoall, inter-p2p, inter-ring-p2p, all.
- Replace the all-or-nothing run with a token-driven dispatch loop that
logs per-test wall-clock time.
- Add --quick preset (gemm + intra-allreduce + inter-allreduce, sizes
64,1024 MB, full intra/inter groups, lowered warmup/iters) for a fast
pre-launch sanity check.
Configurable sizes / groups
- Add --comm-sizes-mb (global) and per-scope overrides
--intra-comm-sizes-mb, --inter-comm-sizes-mb, --ring-p2p-sizes-mb.
- Add --intra-group-sizes and --inter-group-sizes (supports 'all' token).
- All knobs default to None so user-supplied values are detectable.
Runtime warmup/iteration overrides
- Add set_warmup/set_iteration/get_warmup/get_iteration in global_vars.
- Update square_gemm, intra_node_comm, inter_node_comm, inter_node_comm_p2p,
and inter_node_ring_p2p to read via the getters so --quick takes effect.
Report readability
- Add Node -> Hostname legend at the top of the perf report (also mirrored
to console).
- Rename the per-row Hostname column to "Leader hostname" and show only the
first node of each group.
- Use compact ranges for Node and Rank columns (e.g. 0-3, 0-15) via a new
format_int_range helper.
- Restore missing column headers in the console output for all perf tables.
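The compact-range rendering can be sketched as a small pure helper. This is illustrative only; the real format_int_range in the PR may differ in signature, but the collapsing logic is the same idea:

```python
def format_int_range(values):
    """Collapse ints into compact ranges, e.g. [0, 1, 2, 3, 5] -> "0-3,5"."""
    nums = sorted(set(values))
    if not nums:
        return ""
    parts = []
    start = prev = nums[0]
    for n in nums[1:]:
        if n == prev + 1:  # still inside a consecutive run
            prev = n
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = n
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)
```

A 16-rank group then renders as "0-15" instead of sixteen comma-separated ranks, which is what keeps large-cluster tables readable.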
Mode precedence and safety
- --tests and --quick auto-imply --perf-test.
- Perf intent wins over info selectors: when --perf-test/--tests/--quick is
mixed with --host/--gpu/--network, info selectors are dropped with an
explicit WARN (stderr + markdown).
- Tuning knobs set without any perf intent are inert and emit a quieter
WARN ("knob X has no effect without --perf-test/--tests/--quick").
- Centralize PERF_INTENT_FLAGS and PERF_TEST_TOKENS in preflight_args.
- Keep --no-split-nodes-subgroup as a deprecated alias.
Validation hardening
- _resolve_perf_config is now side-effect free; it returns warmup/iteration
for the caller to apply instead of mutating module globals.
- Validate intra/inter/ring knobs only when the corresponding tests are
selected (e.g. --tests gemm --intra-group-sizes 3 no longer aborts).
- Resolve and validate perf config BEFORE init_distributed so typos and bad
sizes fail in milliseconds instead of after a 120s NCCL rendezvous.
- Reject --tests values that yield zero valid tokens (e.g. ",,,") with a
clear error instead of silently running no perf tests.
- Drop unused format_host_range/_split_host_suffix and stale imports.
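The fail-fast token validation amounts to roughly the following sketch. Only the token set is from the PR; parse_tests_csv and the exact error wording are illustrative:

```python
# Token set from the PR description; "all" expands to every test.
PERF_TEST_TOKENS = {
    "gemm", "intra-allreduce", "intra-alltoall",
    "inter-allreduce", "inter-alltoall", "inter-p2p", "inter-ring-p2p",
}

def parse_tests_csv(raw):
    """Illustrative sketch of --tests validation: input yielding zero
    valid tokens (e.g. ",,,") must fail loudly instead of silently
    selecting no perf tests."""
    tokens = [t.strip() for t in raw.split(",") if t.strip()]
    if not tokens:
        raise ValueError(f"--tests={raw!r} yields zero valid test tokens")
    if "all" in tokens:
        return sorted(PERF_TEST_TOKENS)
    unknown = sorted(set(tokens) - PERF_TEST_TOKENS)
    if unknown:
        raise ValueError(f"unknown --tests tokens: {', '.join(unknown)}")
    return tokens
```

Because this runs before init_distributed, a typo costs milliseconds rather than a full NCCL rendezvous.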
Made-with: Cursor
…creening
Add a lightweight, distributed-rendezvous-free smoke test that runs on every
node in parallel under SLURM and quickly identifies broken nodes before a
large-scale training job commits to a global rendezvous. Designed for the
common case where we own full nodes and care which *node* is sick, not which
GPU within an otherwise-healthy node.
Architecture
- primus/tools/preflight/node_smoke.py: per-node Python entry with three
argparse subcommands (run, aggregate, _per_gpu).
- runner/run_node_smoke_direct.sh: SLURM/bash wrapper, modeled after
run_preflight_direct.sh. No MASTER_ADDR / no torch.distributed rendezvous;
every node runs independently, NODE_RANK==0 aggregates.
Tier 1 (mandatory, ~5 s/GPU)
- For each GPU, spawn an isolated Python subprocess with a hard timeout that
performs torch.cuda.set_device, a 256 MB allocation, and a tiny GEMM with
an isfinite() check. Catches stale / hung GPUs that pass enumeration but
fail the first real op.
- Reuse existing collect_gpu_info / collect_host_info /
collect_network_info(expect_distributed=False) and add a recent-dmesg scan
for known hardware error patterns.
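The subprocess-isolation pattern for Tier 1 can be sketched as below. The real per-GPU body does the torch.cuda work (set_device, 256 MB allocation, tiny GEMM + isfinite); here the payload is stubbed out, and PER_GPU_SNIPPET / run_one_gpu are hypothetical names, to show only the hard-timeout harness:

```python
import json
import subprocess
import sys

# Stub payload standing in for the real per-GPU body, which would call
# torch.cuda.set_device(gpu), allocate 256 MB, and run a small GEMM with
# an isfinite() check.
PER_GPU_SNIPPET = r"""
import json, sys
gpu = int(sys.argv[1])
print(json.dumps({"gpu": gpu, "ok": True}))
"""

def run_one_gpu(gpu_index, timeout_sec=5.0):
    """Run one GPU check in an isolated subprocess with a hard timeout,
    so a wedged GPU hangs only its own subprocess, never the node run."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", PER_GPU_SNIPPET, str(gpu_index)],
            capture_output=True, text=True, timeout=timeout_sec,
        )
    except subprocess.TimeoutExpired:
        return {"gpu": gpu_index, "ok": False,
                "reason": f"hard timeout after {timeout_sec}s"}
    if proc.returncode != 0:
        return {"gpu": gpu_index, "ok": False,
                "reason": proc.stderr.strip()[-500:]}
    return json.loads(proc.stdout)
```

This is why a stale or hung GPU cannot take down the whole node run: the worst case is one timed-out subprocess and a per-GPU failure record.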
Tier 2 (optional perf sanity, --tier2 / --tier2-rccl)
- Per-GPU steady-state GEMM TFLOPS (8192^3 bf16) and HBM bandwidth measured
via device-to-device torch.Tensor.copy_ (counts read+write), gated by
thresholds.
- Node-local 8-GPU all-reduce via torch.multiprocessing.spawn over a
127.0.0.1 process group, with a hard timeout. Measures algorithmic
bandwidth at 64 MB. Iteration counts (warmup=5, iters=20 for GEMM and
RCCL; warmup=10, iters=20 for HBM) are aligned with the preflight --quick
preset so smoke and preflight report comparable steady-state numbers.
Per-node JSON + cluster aggregation
- Each node writes <dump>/smoke/<host>.json with status, fail_reasons,
duration, tier1 (per-GPU details + system probes), and tier2 sections.
- Aggregator on NODE_RANK==0 polls for the expected number of JSONs, then
emits:
* smoke_report.md with a status table, a Tier 2 perf summary
(per-node GEMM / HBM min/median/max + local RCCL GB/s), and a
"Failing nodes" detail section.
* passing_nodes.txt / failing_nodes.txt suitable for piping straight
into srun --nodelist / srun --exclude. Synthetic <missing-N>
placeholders for nodes that never reported are kept in the markdown
report but excluded from the txt files.
- Aggregator returns non-zero if any node FAILs or the expected count is
not met, so the wrapper script propagates a meaningful exit code.
Verified at scale
- Successful 6-node run on tus1-p3-g[14,15,25,26,27,29] with --tier2
--tier2-rccl: all nodes PASS, ~58 s wall clock per node, GEMM 702-733
TFLOPS, HBM 3.7-4.2 TB/s, local RCCL 197-201 GB/s.
- Cross-checked formulas against square_gemm.py and intra_node_comm.py:
identical AlgBW (2*S*(P-1)/P) and TFLOPS (2*N^3/t) definitions, so smoke
and preflight numbers are directly comparable.
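The two formulas cross-checked above are a direct transcription (function names are illustrative):

```python
def allreduce_algbw_gb_s(size_bytes, world_size, elapsed_sec):
    """Bandwidth per the formula quoted above: 2 * S * (P - 1) / P bytes
    moved, over the measured time, reported in GB/s."""
    moved_bytes = 2 * size_bytes * (world_size - 1) / world_size
    return moved_bytes / elapsed_sec / 1e9

def gemm_tflops(n, elapsed_sec):
    """TFLOPS for an N x N x N GEMM: 2 * N^3 floating-point operations."""
    return 2 * n**3 / elapsed_sec / 1e12
```

As a sanity check against the reported numbers: at ~700 TFLOPS a single 8192^3 GEMM iteration takes roughly 1.6 ms, and at 64 MB across 8 GPUs the all-reduce formula counts 2 * 64 * (7/8) = 112 MB moved per iteration.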
Made-with: Cursor
…limit checks
The node-local smoke test previously caught GPU-level failures (Tier 1) and
optional perf regressions (Tier 2). It missed three of the most common
"job dies at minute 3" causes at scale: software-stack drift between nodes,
silently degraded RDMA NICs, and host limits that block RDMA pin / NCCL
shared-memory under load. This commit adds all three in Tier 1 with no
extra runtime (millisecond-scale sysfs reads).
A. Software-stack fingerprint + cluster drift detection
- New _collect_node_fingerprint() captures kernel, OS, Python, ROCm
(/opt/rocm/.info/version), amdgpu driver (/sys/module/amdgpu/version),
PyTorch + torch.version.hip, RCCL version (torch.cuda.nccl.version()),
librccl.so path, plus per-IB-device firmware (fw_ver) and HCA model.
- Aggregator computes the cluster-majority value for every scalar
fingerprint key and emits a "Stack drift across cluster" section
listing only outliers (e.g. one node on RCCL 2.21 while the rest are
on 2.22). NIC firmware drift is reported per-IB-device in its own
"NIC firmware drift" section so a flashed-differently NIC is named.
- Healthy clusters render *All nodes match.* placeholders so the report
stays short.
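The majority-vote outlier detection can be sketched as follows. stack_drift_rows is an illustrative stand-in for _stack_drift_rows; it also includes the scalar-only guard that a later commit in this PR adds after Counter crashed on an unhashable dict:

```python
from collections import Counter

def stack_drift_rows(fingerprints):
    """Sketch of the majority-vote drift check. fingerprints maps
    host -> {key: value}. Only keys that at least one node reports as a
    real scalar are compared; None / dict values are skipped. Returns
    (key, host, value, majority) rows for every outlier."""
    scalar = (str, int, float, bool)
    keys = {k for fp in fingerprints.values()
            for k, v in fp.items() if isinstance(v, scalar)}
    rows = []
    for key in sorted(keys):
        values = {host: fp[key] for host, fp in fingerprints.items()
                  if isinstance(fp.get(key), scalar)}
        majority, _count = Counter(values.values()).most_common(1)[0]
        rows.extend((key, host, v, majority)
                    for host, v in sorted(values.items()) if v != majority)
    return rows
```

On a healthy cluster this returns an empty list, which is what lets the report render the short "All nodes match." placeholder.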
B. NIC / RDMA roll-call (per-port, from sysfs only)
- New _collect_nic_status() inventories every port under
/sys/class/infiniband (no ibv_devinfo / ibstat dependency, works
inside containers). Per port we capture state, phys_state, link rate,
netdev + MTU, total non-zero GIDs, and the RoCE v2 GID subset.
- Hard-fail rules (cause node FAIL): any port not ACTIVE / not LinkUp,
any active port with zero RoCE v2 GIDs, or NIC count != the optional
--expected-rdma-nics N.
- Aggregator's "NIC / RDMA roll-call issues" table pinpoints the
offending node + port + reason.
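The sysfs-only inventory can be sketched like this. collect_nic_status and port_fail_reasons are illustrative names, the field set is a subset of what the PR records, and sysfs_root is parameterized purely so the sketch is testable:

```python
from pathlib import Path

def collect_nic_status(sysfs_root="/sys/class/infiniband"):
    """Sketch of a sysfs-only IB port inventory (no ibv_devinfo / ibstat,
    so it works inside containers)."""
    ports = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return ports

    def read(path):
        return path.read_text().strip() if path.exists() else None

    for dev in sorted(root.iterdir()):
        for port in sorted((dev / "ports").glob("*")):
            ports[f"{dev.name}:{port.name}"] = {
                "state": read(port / "state"),            # e.g. "4: ACTIVE"
                "phys_state": read(port / "phys_state"),  # e.g. "5: LinkUp"
                "rate": read(port / "rate"),              # e.g. "400 Gb/sec (4X NDR)"
            }
    return ports

def port_fail_reasons(ports):
    # Hard-fail rule from the PR: any port not ACTIVE / not LinkUp fails the node.
    return [f"nic: {name} {p['state']} / {p['phys_state']}"
            for name, p in sorted(ports.items())
            if not (p["state"] and "ACTIVE" in p["state"]
                    and p["phys_state"] and "LinkUp" in p["phys_state"])]
```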
C. Host limits / system tunables
- New _collect_host_limits() captures RLIMIT_MEMLOCK, RLIMIT_NOFILE,
RLIMIT_NPROC, /dev/shm size + free, NUMA node count, CPU count, and
cpu0 scaling_governor.
- Hard-fail rules: RLIMIT_MEMLOCK finite and below --ulimit-l-min-gb
(default 32 GiB) -> "RDMA pin will fail under load"; /dev/shm size
below --shm-min-gb (default 8 GiB) -> "NCCL shared-mem may fail".
- Aggregator's "Host limits issues" section lists violators with the
exact value and required threshold.
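A minimal sketch of the limits capture and the memlock hard-fail rule (names are illustrative; the field set is a subset of what the PR collects):

```python
import os
import resource
import shutil

def collect_host_limits():
    """Sketch of the described host-limits capture: millisecond-scale
    rlimit / statvfs reads only, no subprocesses."""
    soft_memlock, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    limits = {
        "memlock_bytes": None if soft_memlock == resource.RLIM_INFINITY else soft_memlock,
        "nofile": resource.getrlimit(resource.RLIMIT_NOFILE)[0],
        "cpu_count": os.cpu_count(),
    }
    if os.path.isdir("/dev/shm"):
        usage = shutil.disk_usage("/dev/shm")
        limits["shm_total_bytes"] = usage.total
        limits["shm_free_bytes"] = usage.free
    return limits

def memlock_fail_reason(limits, min_gib=32.0):
    """Hard-fail rule: a finite RLIMIT_MEMLOCK below the threshold means
    RDMA page pinning will fail under load. min_gib=0 disables the check."""
    memlock = limits.get("memlock_bytes")
    if memlock is not None and min_gib > 0 and memlock < min_gib * 2**30:
        return f"host_limits: memlock {memlock / 2**30:.2f} GiB < required {min_gib:g} GiB"
    return None
```

An unlimited memlock (RLIM_INFINITY) is recorded as None and never fails, matching the "finite and below threshold" wording above.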
Wiring + CLI
- Collectors are invoked unconditionally in _cmd_run after the existing
reused info collectors, stored under tier1.fingerprint / tier1.nics /
tier1.host_limits in the per-node JSON.
- _node_status_from() now adds nic: and host_limits: prefixed reasons
so per-node fail_reasons remain self-describing.
- New `run` flags:
--expected-rdma-nics N FAIL on count mismatch (default: report only)
--ulimit-l-min-gb GB FAIL threshold (default 32; 0 disables)
--shm-min-gb GB FAIL threshold (default 8; 0 disables)
- Wrapper script needs no changes; unknown flags are forwarded as-is.
Verified
- Live single-node run on tus1-p3-g25: fingerprint populated (ROCm
6.4.2, amdgpu 6.12.12, RCCL 2.28.9, NIC fw 231.2.63.0 across all 8
rdma devices); NIC roll-call shows 8/8 ports ACTIVE/LinkUp at 400 Gb/s,
MTU 9000, >=1 RoCE v2 GID each, 0 issues; host limits show memlock
405 GiB, /dev/shm 1.6 TiB, governor=performance, 0 fail_reasons. All
four new report sections render the *empty* placeholders cleanly.
- Synthetic two-node drift test (one real + one edited copy): outlier
node correctly surfaces in Stack drift (rccl, amdgpu_driver), NIC
firmware drift (rdma3 only), NIC issues (rdma2:1 DOWN), and Host
limits (memlock 64 MiB violation). Per-node fail_reasons and exit
code propagate as expected.
Made-with: Cursor
…dd port-count outlier section
In _stack_drift_rows() the comparison-key set was populated whenever any
node reported a key as a scalar OR as None. On a heterogeneous cluster
(some nodes with an IB stack, some without) "nic_fw" is None on one node
and a dict on the others. The dict then reached collections.Counter and
crashed the aggregator with `TypeError: unhashable type: 'dict'`. The
crash happened mid-write, so smoke_report.md was truncated and
passing_nodes.txt / failing_nodes.txt were never produced -- so an
18-node SLURM run that successfully wrote 17 per-node JSONs ended up
with no usable cluster verdict.
Changes
- _stack_drift_rows: only collect a key when at least one node reports
it as a real scalar (drop the "None counts as scalar" path); plus a
defense-in-depth isinstance check inside the per-host loop so the same
crash is impossible if a future schema mixes scalar and dict for the
same key.
- Wrap each report section (Stack drift, NIC firmware drift, NIC issues,
Host limits) in its own try/except. A bug in one section now records
"*Section X failed to render: ...*" inline and the rest of the report
still gets written.
- Add a "NIC port-count summary" section that always renders, computes
the cluster-majority port count, and lists every node that disagrees.
This catches partial-NIC-degradation cases (e.g. one node enumerating
0 or 7 of 8 RDMA NICs) without requiring --expected-rdma-nics.
Wrapped in try/except like the others.
Verified
- Local repro of the original failure (one node nic_fw=None, one node
nic_fw=dict): aggregator now exits 0 and writes a complete report,
with the port-count outlier surfaced in the new summary section.
- Existing single-node and synthetic-drift smoke flows still produce
the same output, including the empty-state placeholders on a
homogeneous cluster.
… reported
- Normalize host -> short name in _cmd_run (JSON filename + host field)
and defensively in _cmd_aggregate so legacy FQDN JSONs produce
SLURM-ready passing/failing txt files without re-running the smoke.
- New `aggregate --expected-nodelist-file FILE`: missing nodes are named
by their real short hostname (instead of <missing-N>) and written
directly to failing_nodes.txt.
- runner/run_node_smoke_direct.sh: rank 0 auto-populates the file from
`scontrol show hostnames "$SLURM_JOB_NODELIST"`. Best-effort.
…ocm-smi self-latency
Adds four new Tier 1 collectors and matching aggregator sections so the
smoke test catches a broader class of "node will silently degrade
training" failures before launch.
Per-node collectors (one call per node, results in tier1.<key>):
- gpu_low_level: amd-smi metric --json (text fallback) -> per-GPU power,
GFX clock, edge temp, ECC counters, throttle status. Schema-tolerant.
- xgmi: amd-smi topology -> parses the LINK TYPE TABLE into a
BDF-indexed square matrix; records every non-XGMI pair.
- clock: time.time(), monotonic, and systemctl is-active for
chronyd/ntp/ntpd/systemd-timesyncd.
- tooling: times rocm-smi --version against a hard timeout
(default 5 s) -- a wedging amdgpu driver typically hangs rocm-smi
for 30-60 s before the GPU itself stops responding.
Per-GPU subprocess gain (D-1 light, sysfs + torch only, no shell-out):
- details.low_level: pci_bdf, pcie_link_speed_{raw,gts}, pcie_link_width
(from /sys/bus/pci/devices/<bdf>) plus hbm_total_bytes/hbm_free_bytes
(torch.cuda.mem_get_info).
New hard fails in _node_status_from:
- per-GPU ecc_uncorrectable_total > 0 -> node FAIL.
- any non-XGMI pair in the topology matrix -> node FAIL (intra-node
collectives silently fall back to PCIe and lose 5-10x bandwidth).
- rocm-smi --version timeout -> node FAIL (driver wedging signal).
Throttle reasons and time-daemon health are recorded but not failed-on
(schema is too vendor-specific / cluster-culture-specific for a default).
New CLI flags:
- run --rocm-smi-timeout-sec (default 5.0)
- aggregate --rocm-smi-warn-sec (default 1.0)
- aggregate --clock-skew-warn-sec (default 30.0; loose because the
spread also includes srun launch jitter)
New aggregator sections in smoke_report.md (each wrapped in its own
try/except so a single bug can never truncate the report):
- GPU low-level outliers (PCIe link / HBM): per-GPU values that diverge
from the cluster majority, listed as host:gpu = value.
- XGMI link issues: per-node, with up to 6 sample non-XGMI pairs each.
- Cluster clock + time daemons: wall-clock spread (with earliest/latest
hosts) plus a sub-table of any nodes with no active time-sync daemon.
- Tooling self-latency: any node that hit the rocm-smi timeout (FAIL)
or exceeded --rocm-smi-warn-sec.
Verified locally: amd-smi metric/topology/rocm-smi calls all complete;
XGMI parser handles the real multi-section BDF-labelled output (8x8 SELF
on diagonal, all XGMI off-diagonal); end-to-end run + aggregate produces
a clean smoke_report.md with all four new sections rendering cleanly.
…PU / stale-driver nodes can't PASS
Previously, a node where torch.cuda.device_count() resolved to 0 could
silently PASS smoke if `_collect_reused_info()` failed to surface the
"No GPUs detected" finding -- e.g. when collect_gpu_info() raises and
the wrapper downgrades the failure to level="warn". That's exactly the
class of failure (stale ROCm install, wedged amdgpu driver) the smoke
test exists to catch, so the FAIL must not depend on any other
collector's behavior.
Adds a self-contained guard in _cmd_run that captures every independent
GPU-count source -- the --expected-gpus flag, LOCAL_WORLD_SIZE,
GPUS_PER_NODE, torch.cuda.is_available(), torch.cuda.device_count(),
and (after _collect_amd_smi_metrics) amd-smi -- into a new
tier1.gpu_visibility block, with two hard-fail rules:
1. expected_gpus < 1 -> hard fail with full diagnostic context.
2. amd-smi sees more GPUs than torch -> hard fail. This is the
high-signal stale-ROCm / wedged-driver signature.
_node_status_from now prepends gpu_visibility:* reasons before any
other check, so the visibility verdict is independent of the reused
gpu_info collector. The aggregator gets a dedicated "## GPU visibility
issues" section that surfaces expected / torch / amd-smi counts side by
side per node.
Verified locally: on a host where torch can't see GPUs but amd-smi sees
8, both reasons land in fail_reasons ahead of any reused-collector
finding and the node correctly FAILs.
…ubstring
The dmesg scanner used `p in ll` (substring match) over a list that
included regex-looking patterns like "amdgpu.*error". As a result the
amdgpu pattern essentially never fired against real kernel lines:
amdgpu 0000:05:00.0: amdgpu_device_resume failed: -19
amdgpu: [drm] *ERROR* ring sdma0 timeout, signaled seq=12345
both slipped past the scan, defeating the dmesg check for the most
common amdgpu failure modes.
Switch to compiled regex matching with re.IGNORECASE. Patterns are
documented as regex-by-contract; a malformed pattern is recorded into
the dmesg block's `pattern_errors` field and never aborts the scan.
Pattern changes:
- "xid" -> r"\bxid\b" (word boundaries stop matches inside longer
words, e.g. "oxide")
- "amdgpu.*error" -> r"amdgpu.*(error|fail|timeout)" (real formats)
- added r"\*error\*" (catches "[drm] *ERROR*")
All previously-literal patterns ("hardware error", "gpu reset",
"hung_task", "soft lockup", ...) work unchanged because they contain
no regex metacharacters.
Verified against real amdgpu / NVRM Xid / MCE / soft-lockup /
hung_task / page-allocation lines (all match) and benign systemd /
audit lines (none match).
…mp cleanup
Several real-cluster paper-cuts uncovered while running node-smoke on
4-8 node SLURM jobs. None of these change the diagnostic content of the
report -- they fix surprising / wrong behaviour around the edges.
1. Per-node `run` now always exits 0 when the JSON was written.
Previously it returned 1 whenever a node was diagnosed FAIL, so srun
printed one "error: ... task N: Exited with exit code 1" line per bad
node. That conflates "this node is broken" (a successful diagnosis)
with "this tool crashed" (a real failure) and made it look like the
smoke test itself was broken whenever it correctly identified a
problem. The cluster-health verdict still flows out via the
aggregator's exit code on rank 0 (single CI-friendly signal) and via
failing_nodes.txt; tool-internal failures still propagate non-zero
through Python's default exception handling.
2. Replace --tier2 + --tier2-rccl with a single --tier2-perf flag. The
old pair allowed --tier2-rccl on its own to silently skip RCCL (because
runtime required both flags), and --tier2 alone silently skipped RCCL
on single-GPU nodes. Both gave false coverage confidence. --tier2-perf
now turns on GEMM + HBM + node-local RCCL together. The `run` subparser
uses allow_abbrev=False so an old `--tier2` left in a script errors out
loudly instead of being silently prefix-matched to --tier2-perf. A warn
is emitted up front if --tier2-perf is requested on a node with < 2
visible GPUs so the RCCL skip is never silent.
3. Robust PCIe BDF resolution (_resolve_gpu_bdf).
torch.cuda.get_device_properties(i).pci_bus_id is polymorphic across
PyTorch + ROCm versions: sometimes a canonical string, sometimes just
the bus byte as int. The old code called .lower() on it and crashed
inside a broad try/except, silently losing PCIe link width / speed and
HBM totals from the report. The new helper handles both string and int
forms, verifies sysfs paths, and the per-GPU low-level capture splits
PCIe and HBM into independent try blocks with dedicated error keys so
one missing piece never costs us the other.
4. Auto-clean stale artifacts in --dump-path on rank 0 at startup
(_clean_dump_path), with --no-clean-dump-path to opt out. Without this,
a re-run on a smaller nodelist would leave per-node JSONs from removed
nodes in <dump>/smoke/ and the aggregator would happily count them as
PASS, contaminating the report. Cleanup is rank-0 only and runs before
any rank can have written its current-run JSON, so it is race-safe.
5. runner/run_node_smoke_direct.sh: docstring updated to mention
--tier2-perf instead of the removed --tier2 / --tier2-rccl.
…GPUs
The single most common reason a "healthy" cluster fails to launch a large
training job is that a previous job's Python ranks are still attached to
the GPUs (held HBM, half-torn-down NCCL communicators, or just stuck in
__del__). Symptoms in the new job: torch.cuda.OutOfMemoryError at model
init with a misleading "free=Y" message, NCCL/RCCL bootstrap hang, or
random ranks failing the first all-reduce due to compute contention.
This commit adds three Tier 1 checks (all node-level, all run before any
per-GPU subprocess attaches to the device, so we only see foreign work):
1. Foreign / leaked process enumeration -- _collect_gpu_processes()
Tries `amd-smi process --json` -> `amd-smi process` (text) -> `lsof
/dev/kfd /dev/dri/renderD*` and records {pid, name, hbm_bytes,
is_self, is_allowed, is_foreign} per GPU. A PID is treated as ours
(and excluded) if its pgid matches our own; everything else is
foreign unless its name is in --allowed-procs (e.g.
"rocm-smi-daemon,amd-smi,dcgm-exporter"). Hard FAIL by default;
--allow-foreign-procs downgrades to report-only.
2. Pre-touch HBM-busy check -- in _per_gpu_body
torch.cuda.mem_get_info is now called BEFORE we allocate anything on
the GPU, so the "used" reading reflects only foreign occupancy. Hard
FAIL if any GPU has > --hbm-busy-threshold-gib (default 2.0) used at
that point. The previous post-test reading is biased by PyTorch's
caching allocator (which doesn't truly release pages on
empty_cache()) and was therefore not safe to threshold-check.
3. GPU compute-activity warn -- gfx_activity_pct in _flatten_amd_smi_metric_json
Surfaces GPUs reporting >= --gpu-activity-warn-pct (default 20%) at
smoke start. Warn-only because short bursts are normal, but a
sustained pegged-100% across multiple GPUs strongly indicates a
leaked rank still running compute.
Aggregator output (smoke_report.md):
## Busy GPUs / leaked processes
| Node | Hostname | GPU | PID | Process | HBM held (GiB) |
## GPU pre-touch HBM usage outliers
| Node | Hostname | GPU | HBM used pre-touch (GiB) |
## GPU compute-activity outliers
| Node | Hostname | GPU | Activity % |
failing_nodes.txt now includes any node with a foreign GPU process or
excessive pre-touch HBM, so the operator can `srun --exclude=` them or
`pkill -9 -f train.py` and retry.
New CLI flags (run):
--hbm-busy-threshold-gib N FAIL if pre-touch HBM used > N GiB. Default 2.0.
--allow-foreign-procs Downgrade foreign-process FAIL to report-only.
--allowed-procs name1,name2 Whitelist known agents.
--gpu-activity-warn-pct N Aggregator warn threshold. Default 20.
The same threshold flags are mirrored on `aggregate` so the report
labels its sections with the numbers each node was configured with, and
on the internal _per_gpu subcommand so the spawned subprocess receives
--hbm-busy-threshold-gib.
Verified:
- Real 8-GPU node, no foreign processes -> sections render with
reassuring "no issues" text; gpu_processes.tool == "amd-smi process
--json", foreign_count == 0.
- Synthetic JSON with 2 foreign PIDs + 1 pre-touch outlier + 2 active
GPUs -> all three tables populate; idle/clean GPUs filtered out.
- _node_status_from default -> precise FAIL message with PID/name/HBM;
--allow-foreign-procs -> no FAIL (still in report).
…d amd-smi process JSON parser
Two related gaps in busy-GPU / leaked-process detection: (1) checks
silently no-op'd when amd-smi was missing, and (2) on nodes where
amd-smi is present, the modern (>=6.x) `amd-smi process --json` schema
broke our parser so the operator-facing "who is holding the GPU" table
came back empty -- even though pre-touch HBM had correctly flagged the
node as FAIL.
Tooling availability + rocm-smi fallbacks
-----------------------------------------
- Inventory amd-smi / rocm-smi / lsof at runtime; emit a loud WARN on
rank 0 listing exactly which checks lose coverage.
- Always-on "Tooling availability" section in the aggregator report,
with per-tool presence and per-check fallback status.
- `run --require-tools <csv>` promotes missing required tools to a hard
node FAIL for strict CI environments.
- Add four rocm-smi fallback parsers producing the same per-GPU schema
as their amd-smi counterparts:
* `_rocm_smi_ras_info_text` -> ECC counters
* `_rocm_smi_topotype_json` -> XGMI link matrix
* `_rocm_smi_processes` -> foreign processes (--showpids)
* `_rocm_smi_use_json` -> gfx_activity_pct (--showuse)
Wired into `_collect_amd_smi_metrics`, `_collect_xgmi_topology`, and
`_collect_gpu_processes` so coverage stays close to full when only
rocm-smi is installed.
- Default `--allowed-procs` now includes node-resident agents
(`gpuagent`, `rocm-smi-daemon`, `amd-smi`, `dcgm-exporter`).
amd-smi process JSON parser fix
-------------------------------
Real `amd-smi process --json` output (verified on a busy MI300X) is
double-nested in two ways the old parser didn't handle:
[{"gpu": 0, "process_list": [
{"process_info": { <-- extra wrapper
"pid": 2669301,
"memory_usage": {
"vram_mem": {"value": 23044481024, "unit": "B"} <-- dict
}
}}
]}]
The old code did `p.get("pid")` directly on the `{"process_info": ...}`
wrapper -> got None -> silently dropped every process. Even if it had
reached `_hbm_of`, the dict-with-unit memory shape wasn't recognised.
Net effect: `gpu_processes.foreign_count == 0` on nodes that visibly
had 8x ~23 GB leaked python ranks holding HBM.
- New `_unwrap_proc()` peels off `process_info` if present, so modern
and older amd-smi shapes flow through one path.
- New `_value_unit_to_bytes()` resolves int / formatted string /
`{"value": N, "unit": "..."}` uniformly via `_parse_size_with_unit`.
- Updated docstring to record all three real-world shapes (modern A,
older flat A', per-process B).
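The shape normalization can be sketched as follows. value_unit_to_bytes mirrors the described `_value_unit_to_bytes`; the unit table and string fallbacks here are assumptions:

```python
def value_unit_to_bytes(v):
    """Sketch: accept a raw int/float, a formatted string like "23.04 GB",
    or amd-smi's {"value": N, "unit": "B"} dict, and return bytes."""
    units = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9,
             "KIB": 2**10, "MIB": 2**20, "GIB": 2**30}
    if isinstance(v, dict):
        unit = str(v.get("unit", "B")).upper()
        return int(round(float(v["value"]) * units.get(unit, 1)))
    if isinstance(v, bool):
        return None
    if isinstance(v, (int, float)):
        return int(round(v))
    if isinstance(v, str):
        parts = v.split()
        if len(parts) == 2 and parts[1].upper() in units:
            return int(round(float(parts[0]) * units[parts[1].upper()]))
        try:
            return int(round(float(v)))
        except ValueError:
            return None  # e.g. "N/A"
    return None
```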
rocm-smi --showpids parser also extracts VRAM bytes
---------------------------------------------------
The documented field order is `name, num_gpus, vram_bytes, sdma,
cu_occupancy`. We were only taking field 0 (name) and passing None for
hbm_bytes, so even when the rocm-smi fallback fired the operator could
see which PIDs were leaked but not how much VRAM each was holding.
Now also takes field 2 as VRAM bytes (best-effort; tolerates older
shapes and dict/list values).
Verified against real captures from a busy node:
amd-smi process --json: 0 PIDs (before) -> 8 PIDs flagged foreign,
~23.04 GB / 21.46 GiB each
rocm-smi --showpids : 10 PIDs, no HBM -> 10 PIDs, 8 python3.11
foreign ~22.8 GiB each,
2 gpuagent allowed
amd-smi process, rocm-smi --showpids, and lsof /dev/kfd report PIDs in
the **root (host) PID namespace** -- KFD knows nothing about user
namespaces. os.getpid() returns the PID *as our own namespace sees it*.
On bare metal or SLURM + pyxis/enroot (shared host PID ns by default)
the two are equal and the naive `reported_pid == os.getpid()` test in
`_collect_gpu_processes` works. Inside Docker (default) or any k8s pod
the two differ -- causing our own training rank to be flagged
`is_foreign=True` and (with the default policy) failing the node.
Fix: new `_resolve_self_pid_view()` parses the NSpid line in
/proc/self/status to recover our root-namespace PID. The matcher in
`_collect_gpu_processes` now uses that host-side PID directly. The
pgid-match path is preserved on bare metal but skipped inside a private
PID namespace (os.getpgid on host PIDs we cannot see would always
ESRCH). Output JSON gains `self_host_pid`, `pid_namespaced`, and the
full `ns_pid_chain` for forensics across container boundaries.
Verified: bare metal, private PID ns w/ own rank + leak, private PID ns
w/ leak + allowed agent -- all classify correctly. The
private-PID-ns-w/-own-rank case was the bug (previously foreign=2, now
foreign=1). Net effect: zero behavior change on SLURM + pyxis/enroot;
own rank no longer false-flagged on Docker / k8s.
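A sketch of the NSpid parsing. resolve_self_pid_view mirrors the described `_resolve_self_pid_view`; it takes the status text as a parameter purely so the sketch is testable (the real call site would read /proc/self/status):

```python
def resolve_self_pid_view(status_text):
    """Parse the NSpid line of /proc/self/status. The last entry is the
    PID as our own namespace sees it; the first is the outermost
    (host-side, when /proc is the host procfs) PID."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            chain = [int(tok) for tok in line.split()[1:]]
            return {
                "self_host_pid": chain[0],
                "pid_namespaced": len(chain) > 1,
                "ns_pid_chain": chain,
            }
    # Very old kernels have no NSpid line; report nothing known.
    return {"self_host_pid": None, "pid_namespaced": False, "ns_pid_chain": []}
```

On bare metal the chain has a single entry, so the host-side PID equals os.getpid() and behavior is unchanged; in a private PID namespace the chain has two or more entries and only the first is meaningful to KFD-reported PIDs.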
The 4,487-line node_smoke.py is now a node_smoke/ package with one
module per responsibility, while preserving:
- the `python -m primus.tools.preflight.node_smoke` entry point
- CLI flags, help text, and exit-code semantics for run/_per_gpu/aggregate
- JSON schema/keys and markdown report section order
- runner/run_node_smoke_direct.sh wrapper behavior
Layout:
types.py, logging_utils.py, shell_utils.py leaf helpers
per_gpu.py, rccl_local.py in-process workloads
collectors/ per-area data gatherers
(dmesg, fingerprint, nics,
host_limits, gpu_low_level,
gpu_processes, xgmi, clock,
rocm_smi, tooling, reused_info)
orchestrator.py spawn _per_gpu + status roll-up
aggregator/summarizers.py row/summary helpers
aggregator/report.py markdown writer, one helper per
## section (was ~700-line block)
cli.py argparse + run/_per_gpu/aggregate
tests/test_node_smoke.py 22 unit + parity tests
Tier 2 perf summary and "Failing nodes -- full reasons" keep their existing
error-handling exactly (no new try/except). Verified end-to-end with the
entrypoint matrix and a JSON/markdown diff against a pre-refactor baseline
(time-variant fields allow-listed).
Docs: docs/node-smoke.md updated with the new module layout, dependency
diagram, refreshed flag tables, and a design-overview entry in History.
…mi reports 'N/A'
Some amd-smi / rocm-smi builds emit `name="N/A"` (or "", "none", "-",
"unknown", ...) for kernel/system-owned PIDs like `gpuagent` because they
cannot read /proc/<pid>/comm themselves. The allowlist check is purely
name-based, so these placeholders never matched any whitelisted name and
every healthy node with a running gpuagent was incorrectly FAILed with
`gpu_processes: foreign process(es) holding GPU(s) ... name='N/A'`.
Fall back to /proc/<pid>/comm (then /proc/<pid>/status `Name:`) inside
`_annotate` whenever the upstream name is a known placeholder, then
re-evaluate `is_allowed` against the resolved name. Original placeholder
is preserved on the per-process record under `name_raw` plus a boolean
`name_resolved_from_proc` so the JSON keeps the audit trail.
Side benefit: the report's "Busy GPUs / leaked processes" table now shows
real names ('python', 'gpuagent', ...) instead of 'N/A', so operators can
finally see what to pkill on nodes with actual leaked ranks.
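The fallback can be sketched like this. resolve_proc_name is an illustrative name; proc_root is parameterized only so the sketch is testable (the real code reads /proc directly, and also falls back to the status Name: field):

```python
# Placeholder strings some amd-smi / rocm-smi builds emit instead of a name.
PLACEHOLDER_NAMES = {"", "n/a", "none", "-", "unknown"}

def resolve_proc_name(pid, reported_name, proc_root="/proc"):
    """Keep tool-reported names, but resolve known placeholders from
    /proc/<pid>/comm. Returns (name, resolved_from_proc)."""
    if reported_name is not None and reported_name.strip().lower() not in PLACEHOLDER_NAMES:
        return reported_name, False
    try:
        with open(f"{proc_root}/{pid}/comm") as f:
            return f.read().strip(), True
    except OSError:
        # Process already gone (or unreadable): keep the placeholder.
        return reported_name, False
```

Re-evaluating the allowlist against the resolved name is what stops a healthy gpuagent from failing the node while still flagging a genuinely leaked python rank.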
Co-authored-by: Cursor <cursoragent@cursor.com>
…art, recommend smoke-then-preflight workflow
- preflight.md: rewrite as the comprehensive reference for the
now-configurable preflight tool. Cover info-only / perf-only / default
mode precedence; --tests token list; --quick preset substitutions;
per-knob tuning of message sizes, group sizes, ring-P2P sizes, and
plotting; reliability knobs (--comm-cleanup-delay-sec,
--dist-timeout-sec); validation behavior (fail-fast before NCCL init);
reporting flags; backward-compat aliases; comparison with node-smoke;
recommended pre-launch sequence.
- preflight-direct.md: turn into a quick-start guide. Add a top-of-file
"Which test should I run?" comparison plus a 3-step smoke-then-preflight
workflow snippet. Replace the monolithic example block with 10 labeled
subsections (A-J) covering every configurable knob. Document the
minimum-dependency install matrix per tool/mode (torch is the only hard
requirement; markdown2/weasyprint only for PDFs; matplotlib only for
--plot); flag the existing requirements.txt path as 'full Primus
runtime' and not necessary for preflight/smoke alone. Add a one-line
callout before the multi-node example to verify NCCL_IB_HCA /
NCCL_IB_GID_INDEX / NCCL_SOCKET_IFNAME / GLOO_SOCKET_IFNAME before
launching.
- node-smoke-test-instruction.md (new): short get-started guide for
node-smoke, organized as quick-start + 10 example subsections per
configurable knob + outputs + cheat sheet + troubleshooting. Links back
to node-smoke.md for the full reference.
- node-smoke.md: refine the opening paragraph to drop the awkward "GPU
vs node" framing (training jobs allocate whole nodes anyway, so
node-granularity verdicts are the right unit). Add a callout pointing
newcomers at node-smoke-test-instruction.md.
amd-ama10002-2 left a comment:
LGTM overall — approving. Nice work on the smoke test, and the new unit tests are a great addition. Two follow-up requests below; happy for these to land in a separate PR if it's easier.
1. I think the new unit tests aren't actually wired into CI?
The test file lives at primus/tools/preflight/node_smoke/tests/test_node_smoke.py (in-source), but our CI only points pytest at ./tests/, doesn't it?
2. I'm not sure the tests cover all of the new and updated features
I didn't manually test the new features, but going forward I think it would be good to make our features easy to test so that we avoid introducing new bugs.
Also, a nit: I tend to prefer 1 new or updated feature == 1 PR, so that we have small, frequent PRs. That makes PR reviews easier and lets us merge features iteratively instead of in a large batch. In this case, I probably would've liked this to be multiple smaller PRs 🤷♂️
Just something to consider for the future🙂. We don't need to spend time breaking up this PR at all 🙂👍
Thanks again for the work!
…s / HBM-busy checks
Two related hardenings to the per-node smoke test so a sick node can no longer slip through as clean:
- gpu_processes.py: previously, when `amd-smi process --json` returned rc=0 with valid JSON but an unknown / future schema, the parser returned [] and the caller still set ok=True, foreign_count=0 -- the rocm-smi / lsof fallbacks never ran. The empty-result case was indistinguishable from "schema matched, no processes" because `_flatten_amd_smi_process_json` only registered a per-GPU bucket when it pushed at least one process. Fix: register the per-GPU bucket as soon as a Shape A / A' entry is recognized (even with an empty `process_list`). Now [] unambiguously means "schema didn't match", and `_collect_gpu_processes` gates ok=True on a non-empty parsed result, falling through to text / rocm-smi / lsof on schema mismatch and recording a clear json_parse_error.
- per_gpu.py: tighten the pre-touch HBM-busy check from `used_b > hbm_busy_threshold_bytes` to `>=`, so a GPU sitting exactly at the threshold is treated as busy (likely leaked from a previous job) instead of squeaking through.
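A minimal sketch of the distinction this fix introduces (function names and the JSON keys are illustrative, not the actual Primus code): returning `None` on schema mismatch lets the caller fall back to text / rocm-smi / lsof, while an empty per-GPU list is now a trustworthy "no processes" answer.

```python
from typing import Optional


def flatten_process_json(payload: list) -> Optional[dict]:
    """Flatten an amd-smi-style process payload into {gpu_index: [procs]}.

    Returns None when no entry matched the recognized shape, so the caller
    can fall through to other collectors instead of reporting ok=True.
    """
    buckets: dict = {}
    for entry in payload:
        if isinstance(entry, dict) and "gpu" in entry and "process_list" in entry:
            # Register the bucket as soon as the shape is recognized,
            # even when process_list is empty.
            buckets.setdefault(entry["gpu"], [])
            for proc in entry["process_list"]:
                buckets[entry["gpu"]].append(proc)
    return buckets if buckets else None  # None == schema mismatch


def hbm_busy(used_b: int, threshold_b: int) -> bool:
    # '>=' so a GPU sitting exactly at the threshold counts as busy.
    return used_b >= threshold_b
```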
@amd-ama10002-2 Thanks for your suggestions.
Here is an example of the node smoke test summary report. The output also includes per-node JSON files with detailed info, plus a passing-node list and a failing-node list that can be fed to the exclude list of a following srun command; those are not shown here.
Node-Local Smoke Test Report
Stack drift across cluster
NIC firmware drift across cluster
All NIC firmwares match (or no NICs reported).
NIC / RDMA roll-call issues
No NIC issues.
NIC port-count summary
Cluster-majority port count: 8 (seen on 16/16 nodes). Every node reports the majority count.
Host limits issues
No host-limit issues.
GPU visibility issues
Every node resolved expected_gpus >= 1 and torch + amd-smi agree on the GPU count.
GPU low-level outliers (PCIe link / HBM)
All GPUs match the cluster majority on PCIe link and HBM total.
XGMI link issues
All GPU pairs report XGMI on every node (or amd-smi topology was unavailable).
Cluster clock + time daemons
Every node has at least one active time-sync daemon. Tooling self-latency (
| Node | Hostname | GPU | PID | Process | HBM held (GiB) |
|---|---|---|---|---|---|
| 3 | tus1-p3-g25 | 0 | 987322 | python3 | ? |
| 3 | tus1-p3-g25 | 0 | 987790 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 0 | 987791 | sglang::schedul | 211.64 |
| 3 | tus1-p3-g25 | 0 | 987792 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 0 | 987793 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 0 | 987794 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 0 | 987795 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 0 | 987796 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 0 | 987797 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 0 | 987798 | sglang::detoken | ? |
| 3 | tus1-p3-g25 | 1 | 987322 | python3 | ? |
| 3 | tus1-p3-g25 | 1 | 987790 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 1 | 987791 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 1 | 987792 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 1 | 987793 | sglang::schedul | 212.48 |
| 3 | tus1-p3-g25 | 1 | 987794 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 1 | 987795 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 1 | 987796 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 1 | 987797 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 1 | 987798 | sglang::detoken | ? |
| 3 | tus1-p3-g25 | 2 | 987322 | python3 | ? |
| 3 | tus1-p3-g25 | 2 | 987790 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 2 | 987791 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 2 | 987792 | sglang::schedul | 212.47 |
| 3 | tus1-p3-g25 | 2 | 987793 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 2 | 987794 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 2 | 987795 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 2 | 987796 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 2 | 987797 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 2 | 987798 | sglang::detoken | ? |
| 3 | tus1-p3-g25 | 3 | 987322 | python3 | ? |
| 3 | tus1-p3-g25 | 3 | 987790 | sglang::schedul | 212.44 |
| 3 | tus1-p3-g25 | 3 | 987791 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 3 | 987792 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 3 | 987793 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 3 | 987794 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 3 | 987795 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 3 | 987796 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 3 | 987797 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 3 | 987798 | sglang::detoken | ? |
| 3 | tus1-p3-g25 | 4 | 987322 | python3 | ? |
| 3 | tus1-p3-g25 | 4 | 987790 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 4 | 987791 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 4 | 987792 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 4 | 987793 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 4 | 987794 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 4 | 987795 | sglang::schedul | 212.35 |
| 3 | tus1-p3-g25 | 4 | 987796 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 4 | 987797 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 4 | 987798 | sglang::detoken | ? |
| 3 | tus1-p3-g25 | 5 | 987322 | python3 | ? |
| 3 | tus1-p3-g25 | 5 | 987790 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 5 | 987791 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 5 | 987792 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 5 | 987793 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 5 | 987794 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 5 | 987795 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 5 | 987796 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 5 | 987797 | sglang::schedul | 212.2 |
| 3 | tus1-p3-g25 | 5 | 987798 | sglang::detoken | ? |
| 3 | tus1-p3-g25 | 6 | 987322 | python3 | ? |
| 3 | tus1-p3-g25 | 6 | 987790 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 6 | 987791 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 6 | 987792 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 6 | 987793 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 6 | 987794 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 6 | 987795 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 6 | 987796 | sglang::schedul | 212.23 |
| 3 | tus1-p3-g25 | 6 | 987797 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 6 | 987798 | sglang::detoken | ? |
| 3 | tus1-p3-g25 | 7 | 987322 | python3 | ? |
| 3 | tus1-p3-g25 | 7 | 987790 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 7 | 987791 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 7 | 987792 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 7 | 987793 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 7 | 987794 | sglang::schedul | 212.19 |
| 3 | tus1-p3-g25 | 7 | 987795 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 7 | 987796 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 7 | 987797 | sglang::schedul | ? |
| 3 | tus1-p3-g25 | 7 | 987798 | sglang::detoken | ? |
GPU pre-touch HBM usage outliers
GPUs with more than 2.0 GiB of HBM already in use BEFORE smoke touched the device. This number is not polluted by our own caching allocator (it's measured before any allocation), so it directly reflects foreign or leaked occupancy.
| Node | Hostname | GPU | HBM used pre-touch (GiB) |
|---|---|---|---|
| 3 | tus1-p3-g25 | 0 | 213.14 |
| 3 | tus1-p3-g25 | 1 | 212.34 |
| 3 | tus1-p3-g25 | 2 | 213.17 |
| 3 | tus1-p3-g25 | 3 | 213.18 |
| 3 | tus1-p3-g25 | 4 | 212.89 |
| 3 | tus1-p3-g25 | 5 | 213.04 |
| 3 | tus1-p3-g25 | 6 | 212.92 |
| 3 | tus1-p3-g25 | 7 | 212.9 |
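The "pre-touch" number above is derived from device free/total memory before any allocation is made (e.g. what `torch.cuda.mem_get_info()` returns). A minimal sketch of the arithmetic and the hardened threshold check (helper names are illustrative, not the actual Primus code):

```python
GIB = 1024 ** 3  # GiB in bytes


def pre_touch_used_gib(free_b: int, total_b: int) -> float:
    """Used HBM before we allocate anything, from (free, total) bytes,
    e.g. as reported by torch.cuda.mem_get_info(device)."""
    return (total_b - free_b) / GIB


def is_pre_touch_outlier(used_gib: float, threshold_gib: float = 2.0) -> bool:
    # '>=' mirrors the hardened check: a GPU sitting exactly at the
    # threshold is treated as busy, not clean.
    return used_gib >= threshold_gib
```

Because this is measured before the smoke test's own caching allocator touches the device, any value above the threshold reflects foreign or leaked occupancy, which is exactly why every GPU on tus1-p3-g25 fails above.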
GPU compute-activity outliers
No GPU exceeded gfx_activity_pct >= 20.0% at smoke start (or amd-smi did not report activity).
Tier 2 perf summary
Per-node GEMM TFLOPS (8192^3 bf16) and HBM GB/s shown as min / median / max across the node's GPUs. RCCL GB/s is the node-local 8-GPU all-reduce algorithmic bandwidth at 64 MB.
| Node | Hostname | GEMM TFLOPS (min/med/max) | HBM GB/s (min/med/max) | Local RCCL GB/s |
|---|---|---|---|---|
| 0 | tus1-p3-g2 | 756.5 / 760.0 / 762.1 | 4406.3 / 4421.6 / 4449.7 | 268.6 |
| 1 | tus1-p3-g14 | 748.5 / 762.7 / 764.7 | 4399.8 / 4437.2 / 4534.7 | 271.1 |
| 2 | tus1-p3-g15 | 750.4 / 757.9 / 764.4 | 4359.1 / 4417.6 / 4423.9 | 269.9 |
| 3 | tus1-p3-g25 | | | 269.8 |
| 4 | tus1-p3-g26 | 761.3 / 764.6 / 768.2 | 4388.6 / 4425.3 / 4435.8 | 269.2 |
| 5 | tus1-p3-g27 | 745.9 / 756.4 / 764.4 | 4409.3 / 4419.7 / 4429.6 | 269.2 |
| 6 | tus1-p3-g29 | 756.2 / 764.6 / 769.5 | 4395.3 / 4419.5 / 4442.0 | 268.8 |
| 7 | tus1-p3-g32 | 752.1 / 763.7 / 769.6 | 4402.3 / 4411.7 / 4430.1 | 270.0 |
| 8 | tus1-p3-g50 | 754.2 / 762.5 / 767.3 | 4359.9 / 4424.3 / 4477.9 | 268.8 |
| 9 | tus1-p3-g51 | 749.4 / 765.8 / 768.5 | 4404.1 / 4417.2 / 4445.4 | 268.8 |
| 10 | tus1-p3-g52 | 747.0 / 762.8 / 771.7 | 4401.7 / 4418.7 / 4453.4 | 270.3 |
| 11 | tus1-p3-g53 | 748.9 / 761.9 / 768.2 | 4399.0 / 4419.0 / 4446.0 | 267.6 |
| 12 | tus1-p3-g54 | 758.9 / 762.3 / 764.8 | 4403.9 / 4428.4 / 4444.6 | 270.4 |
| 13 | tus1-p3-g55 | 751.6 / 762.5 / 769.5 | 4408.9 / 4422.0 / 4432.9 | 269.9 |
| 14 | tus1-p3-g57 | 748.3 / 759.0 / 768.0 | 4399.3 / 4414.6 / 4427.6 | 270.1 |
| 15 | tus1-p3-g59 | 750.0 / 760.5 / 762.9 | 4407.6 / 4422.7 / 4431.4 | 269.8 |
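The Tier 2 columns reduce to simple formulas over measured wall time; a hedged sketch of that arithmetic (pure helpers for illustration, not the actual Primus code):

```python
def gemm_tflops(n: int, seconds: float) -> float:
    """An n x n x n matmul costs 2*n^3 FLOPs (one multiply + one add
    per output element per inner-dimension step)."""
    return (2 * n ** 3) / seconds / 1e12


def allreduce_algbw_gbs(message_bytes: int, seconds: float) -> float:
    """Algorithmic bandwidth: message size over time. (Bus-bandwidth
    corrections such as 2*(k-1)/k are deliberately not applied here.)"""
    return message_bytes / seconds / 1e9


# e.g. an 8192^3 bf16 GEMM finishing in ~1.45 ms lands near the
# ~758 TFLOPS median reported per node above:
print(round(gemm_tflops(8192, 1.45e-3), 1))  # -> 758.3
```

Similarly, a 64 MB all-reduce completing in ~0.25 ms corresponds to roughly 268 GB/s of algorithmic bandwidth, matching the Local RCCL column.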
Failing nodes -- full reasons
tus1-p3-g25
- gpu0: FAIL: pre-touch HBM busy: 213.14 GiB already in use (threshold 2.0 GiB) -- likely leaked process from a previous job; see node-level gpu_processes section to identify the PID
- gpu1: FAIL: pre-touch HBM busy: 212.34 GiB already in use (threshold 2.0 GiB) -- likely leaked process from a previous job; see node-level gpu_processes section to identify the PID
- gpu2: FAIL: pre-touch HBM busy: 213.17 GiB already in use (threshold 2.0 GiB) -- likely leaked process from a previous job; see node-level gpu_processes section to identify the PID
- gpu3: FAIL: pre-touch HBM busy: 213.18 GiB already in use (threshold 2.0 GiB) -- likely leaked process from a previous job; see node-level gpu_processes section to identify the PID
- gpu4: FAIL: pre-touch HBM busy: 212.89 GiB already in use (threshold 2.0 GiB) -- likely leaked process from a previous job; see node-level gpu_processes section to identify the PID
- gpu5: FAIL: pre-touch HBM busy: 213.04 GiB already in use (threshold 2.0 GiB) -- likely leaked process from a previous job; see node-level gpu_processes section to identify the PID
- gpu6: FAIL: pre-touch HBM busy: 212.92 GiB already in use (threshold 2.0 GiB) -- likely leaked process from a previous job; see node-level gpu_processes section to identify the PID
- gpu7: FAIL: pre-touch HBM busy: 212.9 GiB already in use (threshold 2.0 GiB) -- likely leaked process from a previous job; see node-level gpu_processes section to identify the PID
- gpu_processes: 80 foreign process(es) holding GPU(s) (e.g. gpu0: pid=987322 name='python3'; gpu0: pid=987790 name='sglang::schedul'; gpu0: pid=987791 name='sglang::schedul' hbm=211.64GiB) -- likely leaked rank(s) from a previous job. Clean up with `pkill -9 -f train.py` (or similar) or pass --allow-foreign-procs.
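The failing-node list pairs naturally with srun's exclude mechanism; a minimal sketch of the handoff (the file name comes from the report outputs, the srun line itself is illustrative):

```shell
# failing_nodes.txt is written by the aggregator, one hostname per line;
# the content below stands in for a real run.
printf 'tus1-p3-g25\n' > failing_nodes.txt

# srun wants a comma-separated node list; paste joins the lines.
EXCLUDE=$(paste -sd, failing_nodes.txt)
echo "$EXCLUDE"

# Then launch training only on clean nodes, e.g.:
# srun --exclude="$EXCLUDE" torchrun ... train.py
```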
Summary
This PR adds two complementary cluster-diagnostic capabilities on top of the existing
`preflight` tool, plus a comprehensive doc rewrite. The recommended pre-launch workflow becomes smoke first → preflight second:
- `preflight` — the existing global-rendezvous tool gains per-test selection (`--tests`), tuning knobs (message sizes, group sizes, ring-P2P sizes), a `--quick` preset, and reliability flags (`--dist-timeout-sec`, `--comm-cleanup-delay-sec`).
- `node-smoke` — a distributed-rendezvous-free per-node screen that runs Tier 1 (always) + optional Tier 2 perf checks on every node in parallel under SLURM, returns one PASS/FAIL verdict per node, and writes SLURM-ready `passing_nodes.txt` / `failing_nodes.txt`. Implemented as a new sub-package `primus/tools/preflight/node_smoke/` with its own wrapper `runner/run_node_smoke_direct.sh`.

What's changed
1. Configurable preflight perf tests
- New flags: `--tests`, `--comm-sizes-mb`, `--intra-comm-sizes-mb`, `--inter-comm-sizes-mb`, `--intra-group-sizes`, `--inter-group-sizes`, `--ring-p2p-sizes-mb`, `--quick`, `--dist-timeout-sec`, `--comm-cleanup-delay-sec`.
- Perf intent (`--perf-test` / `--tests` / `--quick`) wins over info selectors (`--host` / `--gpu` / `--network`); info-only mode never initializes `torch.distributed`.
- Per-test wall-clock logging: `[Primus:Preflight] <test> done in <T>s`.
- Backward compat: `--check-host` / `--check-gpu` / `--check-network` and `--no-split-nodes-subgroup` still work.
- Tier 1 (always): `set_device` + 256 MB alloc + tiny GEMM with `isfinite()` check, plus reused info collectors, dmesg recent-error scan, software-stack fingerprint, NIC/RDMA roll-call, host limits, GPU low-level (PCIe link / HBM / ECC / throttle), XGMI link matrix, clock skew + time-daemon health, foreign-process detection, tooling self-latency canary.
- Tier 2 perf (`--tier2-perf`): GEMM TFLOPS, HBM GB/s, local 8-GPU RCCL all-reduce GB/s with configurable thresholds.
- Aggregation on `NODE_RANK==0`: cluster Markdown report with stable section ordering, per-node JSON, drift detection, and pass/fail txt outputs.
- Robustness: per-GPU checks run in subprocesses (a hung `set_device` is SIGKILL'd without affecting peers); short hostnames; PID-namespace-aware self-detection; `/proc/<pid>/comm` fallback when `amd-smi process` returns `name="N/A"` for kernel/system PIDs like `gpuagent`. Each report section is `try/except`-wrapped so a bug in one section can't truncate the rest.
- Modular layout (`collectors/`, `aggregator/`, `orchestrator.py`, `per_gpu.py`, `rccl_local.py`, `cli.py`, ...). Single public entry: `python -m primus.tools.preflight.node_smoke run|aggregate|_per_gpu`.
- `docs/preflight.md` — rewritten as the comprehensive reference for the configurable preflight tool.
- `docs/preflight-direct.md` — quick-start guide for `runner/run_preflight_direct.sh`. Adds a minimum-dependency matrix (`torch` is the only hard requirement; `markdown2`/`weasyprint` only for PDFs; `matplotlib` only for `--plot`); explicitly notes `requirements.txt` is not necessary for these tools alone. Callout to verify `NCCL_IB_HCA` / `NCCL_IB_GID_INDEX` / `NCCL_SOCKET_IFNAME` / `GLOO_SOCKET_IFNAME` before launching multi-node runs.
- `docs/node-smoke.md` — full reference for the new smoke test (architecture, every report section, every flag, design history).
- `docs/node-smoke-test-instruction.md` (new) — short quick-start guide for node-smoke.

How to use