feat(preflight): configurable perf tests + new node-local smoke test#712

Open
amd-fuyuajin wants to merge 24 commits into dev/preflight-direct-test from dev/preflight-configurable-test

Conversation

@amd-fuyuajin
Collaborator

Summary

This PR adds two complementary cluster-diagnostic capabilities on top of the existing preflight tool, plus a comprehensive doc rewrite. The recommended pre-launch workflow becomes smoke first → preflight second:

  1. Configurable preflight — the existing global-rendezvous tool gains per-test selection (--tests), tuning knobs (message sizes, group sizes, ring-P2P sizes), a --quick preset, and reliability flags (--dist-timeout-sec, --comm-cleanup-delay-sec).
  2. New node-smoke — a distributed-rendezvous-free per-node screen that runs Tier 1 (always) + optional Tier 2 perf checks on every node in parallel under SLURM, returns one PASS/FAIL verdict per node, and writes SLURM-ready passing_nodes.txt / failing_nodes.txt. Implemented as a new sub-package primus/tools/preflight/node_smoke/ with its own wrapper runner/run_node_smoke_direct.sh.

What's changed

1. Configurable preflight perf tests

  • New flags: --tests, --comm-sizes-mb, --intra-comm-sizes-mb, --inter-comm-sizes-mb, --intra-group-sizes, --inter-group-sizes, --ring-p2p-sizes-mb, --quick, --dist-timeout-sec, --comm-cleanup-delay-sec.
  • Mode precedence (single rule): perf intent (--perf-test/--tests/--quick) wins over info selectors (--host/--gpu/--network); info-only mode never initializes torch.distributed.
  • Per-test wall-clock logging: [Primus:Preflight] <test> done in <T>s.
  • Validation runs before any rendezvous, so typos and bad sizes/group-sizes fail in seconds (not after a 120s NCCL init hang).
  • Report improvements: Node→Hostname legend at the top, compressed Node/Rank ranges, "Leader hostname" per group.
  • Backward-compat preserved: --check-host/--check-gpu/--check-network and --no-split-nodes-subgroup still work.

2. Node-local smoke test (new)

  • Tier 1 (always, ~5 s/GPU): per-GPU set_device + 256 MB alloc + tiny GEMM with isfinite() check, plus reused info collectors, dmesg recent-error scan, software-stack fingerprint, NIC/RDMA roll-call, host limits, GPU low-level (PCIe link / HBM / ECC / throttle), XGMI link matrix, clock skew + time-daemon health, foreign-process detection, tooling self-latency canary.
  • Tier 2 (optional, --tier2-perf): GEMM TFLOPS, HBM GB/s, local 8-GPU RCCL all-reduce GB/s with configurable thresholds.
  • Aggregator on NODE_RANK==0: cluster Markdown report with stable section ordering, per-node JSON, drift detection, and pass/fail txt outputs.
  • Resilience: per-GPU subprocesses with hard timeout (a stuck set_device is SIGKILL'd without affecting peers); short hostnames; PID-namespace-aware self-detection; /proc/<pid>/comm fallback when amd-smi process returns name="N/A" for kernel/system PIDs like gpuagent.
  • Aggregator report sections are individually try/except-wrapped so a bug in one section can't truncate the rest.
  • Implementation is a Python sub-package (collectors/, aggregator/, orchestrator.py, per_gpu.py, rccl_local.py, cli.py, ...). Single public entry: python -m primus.tools.preflight.node_smoke run|aggregate|_per_gpu.

3. Documentation

  • docs/preflight.md — rewritten as the comprehensive reference for the configurable preflight tool.
  • docs/preflight-direct.md — quick-start guide for runner/run_preflight_direct.sh. Adds:
    • Top-of-file "Which test should I run?" comparison + 3-step smoke-then-preflight workflow.
    • Per-tool minimum dependency install matrix (torch is the only hard requirement; markdown2/weasyprint only for PDFs; matplotlib only for --plot); explicitly notes requirements.txt is not necessary for these tools alone.
    • 10 labeled example subsections (A-J) covering every configurable knob.
    • One-line callout to verify NCCL_IB_HCA / NCCL_IB_GID_INDEX / NCCL_SOCKET_IFNAME / GLOO_SOCKET_IFNAME before launching multi-node runs.
  • docs/node-smoke.md — full reference for the new smoke test (architecture, every report section, every flag, design history).
  • docs/node-smoke-test-instruction.md (new) — short quick-start guide for node-smoke.

How to use

# 1) Prune broken nodes with node-smoke (fast, no rendezvous).
srun -N "$SLURM_NNODES" --ntasks-per-node=1 \
    bash runner/run_node_smoke_direct.sh --tier2-perf

# 2) Run preflight --quick on the survivors for cross-node sanity.
srun -N <good-nnodes> -c 128 --gpus-per-node=8 --ntasks-per-node=1 \
    --exclude=$(paste -sd, output/preflight/failing_nodes.txt) \
    runner/run_preflight_direct.sh --quick

amd-fuyuajin and others added 20 commits May 4, 2026 18:40
…nd reports

Introduce CLI flags to select which perf tests to run, override message and
group sizes, and apply a fast pre-launch preset. Reorganize the report so
large clusters stay readable, and harden the dispatcher with clear mode
precedence and fail-fast validation.

Test selection
- Add --tests CSV with tokens: gemm, intra-allreduce, intra-alltoall,
  inter-allreduce, inter-alltoall, inter-p2p, inter-ring-p2p, all.
- Replace the all-or-nothing run with a token-driven dispatch loop that
  logs per-test wall-clock time.
- Add --quick preset (gemm + intra-allreduce + inter-allreduce, sizes
  64,1024 MB, full intra/inter groups, lowered warmup/iters) for a fast
  pre-launch sanity check.

Configurable sizes / groups
- Add --comm-sizes-mb (global) and per-scope overrides
  --intra-comm-sizes-mb, --inter-comm-sizes-mb, --ring-p2p-sizes-mb.
- Add --intra-group-sizes and --inter-group-sizes (supports 'all' token).
- All knobs default to None so user-supplied values are detectable.

Runtime warmup/iteration overrides
- Add set_warmup/set_iteration/get_warmup/get_iteration in global_vars.
- Update square_gemm, intra_node_comm, inter_node_comm, inter_node_comm_p2p,
  and inter_node_ring_p2p to read via the getters so --quick takes effect.

Report readability
- Add Node -> Hostname legend at the top of the perf report (also mirrored
  to console).
- Rename the per-row Hostname column to "Leader hostname" and show only the
  first node of each group.
- Use compact ranges for Node and Rank columns (e.g. 0-3, 0-15) via a new
  format_int_range helper.
- Restore missing column headers in the console output for all perf tables.
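
As a rough sketch of the compaction described above (the in-tree format_int_range helper may differ in details), contiguous integer runs collapse into `lo-hi` spans:

```python
def format_int_range(values):
    """Collapse ints into compact 'lo-hi' spans, e.g. [0,1,2,3,8] -> '0-3,8'.

    Minimal sketch of the behaviour described above; the real helper may differ.
    """
    spans = []
    for v in sorted(set(values)):
        if spans and v == spans[-1][1] + 1:
            spans[-1][1] = v          # extend the current run
        else:
            spans.append([v, v])      # start a new run
    return ",".join(str(lo) if lo == hi else f"{lo}-{hi}" for lo, hi in spans)


# Example: ranks 0..15 of one group render as "0-15".
print(format_int_range(range(16)))
```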

Mode precedence and safety
- --tests and --quick auto-imply --perf-test.
- Perf intent wins over info selectors: when --perf-test/--tests/--quick is
  mixed with --host/--gpu/--network, info selectors are dropped with an
  explicit WARN (stderr + markdown).
- Tuning knobs set without any perf intent are inert and emit a quieter
  WARN ("knob X has no effect without --perf-test/--tests/--quick").
- Centralize PERF_INTENT_FLAGS and PERF_TEST_TOKENS in preflight_args.
- Keep --no-split-nodes-subgroup as a deprecated alias.
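
A hypothetical sketch of the precedence rule above (attribute names are guessed from the flag names; the actual resolution lives in preflight_args):

```python
PERF_INTENT_FLAGS = ("perf_test", "tests", "quick")   # assumed argparse dests
INFO_SELECTORS = ("host", "gpu", "network")

def resolve_mode(args):
    """Perf intent wins; info selectors are dropped with an explicit WARN."""
    perf_intent = any(getattr(args, f, None) for f in PERF_INTENT_FLAGS)
    info = [s for s in INFO_SELECTORS if getattr(args, s, False)]
    if perf_intent and info:
        print(f"WARN: info selectors {info} ignored because a perf test was requested")
        info = []
    return ("perf" if perf_intent else "info"), info
```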

Validation hardening
- _resolve_perf_config is now side-effect free; it returns warmup/iteration
  for the caller to apply instead of mutating module globals.
- Validate intra/inter/ring knobs only when the corresponding tests are
  selected (e.g. --tests gemm --intra-group-sizes 3 no longer aborts).
- Resolve and validate perf config BEFORE init_distributed so typos and bad
  sizes fail in milliseconds instead of after a 120s NCCL rendezvous.
- Reject --tests values that yield zero valid tokens (e.g. ",,,") with a
  clear error instead of silently running no perf tests.
- Drop unused format_host_range/_split_host_suffix and stale imports.

Made-with: Cursor
…creening

Add a lightweight, distributed-rendezvous-free smoke test that runs on every
node in parallel under SLURM and quickly identifies broken nodes before a
large-scale training job commits to a global rendezvous. Designed for the
common case where we own full nodes and care which *node* is sick, not which
GPU within an otherwise-healthy node.

Architecture
- primus/tools/preflight/node_smoke.py: per-node Python entry with three
  argparse subcommands (run, aggregate, _per_gpu).
- runner/run_node_smoke_direct.sh: SLURM/bash wrapper, modeled after
  run_preflight_direct.sh. No MASTER_ADDR / no torch.distributed rendezvous;
  every node runs independently, NODE_RANK==0 aggregates.

Tier 1 (mandatory, ~5 s/GPU)
- For each GPU, spawn an isolated Python subprocess with a hard timeout that
  performs torch.cuda.set_device, a 256 MB allocation, and a tiny GEMM with
  an isfinite() check. Catches stale / hung GPUs that pass enumeration but
  fail the first real op.
- Reuse existing collect_gpu_info / collect_host_info /
  collect_network_info(expect_distributed=False) and add a recent-dmesg scan
  for known hardware error patterns.
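
A minimal sketch of the per-GPU Tier 1 probe described above (allocation size and GEMM shape follow the description; the real per_gpu.py records much more):

```python
import torch

def tier1_probe(gpu_index: int, alloc_mb: int = 256, gemm_n: int = 1024) -> None:
    """set_device + a small allocation + a tiny GEMM whose output must be finite."""
    torch.cuda.set_device(gpu_index)
    buf = torch.empty(alloc_mb * 1024 * 1024, dtype=torch.uint8, device="cuda")
    a = torch.randn(gemm_n, gemm_n, device="cuda", dtype=torch.bfloat16)
    c = a @ a
    torch.cuda.synchronize()
    if not torch.isfinite(c.float()).all():
        raise RuntimeError(f"gpu{gpu_index}: GEMM produced non-finite values")
    del buf, a, c

# In the real tool this body runs in an isolated subprocess with a hard timeout,
# so a hung set_device is killed without affecting the other GPUs' probes.
```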

Tier 2 (optional perf sanity, --tier2 / --tier2-rccl)
- Per-GPU steady-state GEMM TFLOPS (8192^3 bf16) and HBM bandwidth measured
  via device-to-device torch.Tensor.copy_ (counts read+write), gated by
  thresholds.
- Node-local 8-GPU all-reduce via torch.multiprocessing.spawn over a
  127.0.0.1 process group, with a hard timeout. Measures algorithmic
  bandwidth at 64 MB. Iteration counts (warmup=5, iters=20 for GEMM and
  RCCL; warmup=10, iters=20 for HBM) are aligned with the preflight --quick
  preset so smoke and preflight report comparable steady-state numbers.
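
Illustrative sketch of the Tier 2 measurements (shapes and iteration counts follow the description above; the read+write accounting for copy_ is as stated):

```python
import time
import torch

def gemm_tflops(n: int = 8192, warmup: int = 5, iters: int = 20) -> float:
    a = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
    for _ in range(warmup):
        a @ b
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    dt = (time.perf_counter() - t0) / iters
    return 2 * n**3 / dt / 1e12            # 2*N^3 flops per GEMM

def hbm_gbps(size_mb: int = 1024, warmup: int = 10, iters: int = 20) -> float:
    src = torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8, device="cuda")
    dst = torch.empty_like(src)
    for _ in range(warmup):
        dst.copy_(src)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        dst.copy_(src)
    torch.cuda.synchronize()
    dt = (time.perf_counter() - t0) / iters
    return 2 * src.numel() / dt / 1e9      # device-to-device copy counts read + write
```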

Per-node JSON + cluster aggregation
- Each node writes <dump>/smoke/<host>.json with status, fail_reasons,
  duration, tier1 (per-GPU details + system probes), and tier2 sections.
- Aggregator on NODE_RANK==0 polls for the expected number of JSONs, then
  emits:
    * smoke_report.md with a status table, a Tier 2 perf summary
      (per-node GEMM / HBM min/median/max + local RCCL GB/s), and a
      "Failing nodes" detail section.
    * passing_nodes.txt / failing_nodes.txt suitable for piping straight
      into srun --nodelist / srun --exclude. Synthetic <missing-N>
      placeholders for nodes that never reported are kept in the markdown
      report but excluded from the txt files.
- Aggregator returns non-zero if any node FAILs or the expected count is
  not met, so the wrapper script propagates a meaningful exit code.

Verified at scale
- Successful 6-node run on tus1-p3-g[14,15,25,26,27,29] with --tier2
  --tier2-rccl: all nodes PASS, ~58 s wall clock per node, GEMM 702-733
  TFLOPS, HBM 3.7-4.2 TB/s, local RCCL 197-201 GB/s.
- Cross-checked formulas against square_gemm.py and intra_node_comm.py:
  identical AlgBW (2*S*(P-1)/P) and TFLOPS (2*N^3/t) definitions, so smoke
  and preflight numbers are directly comparable.
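
For concreteness, the shared definitions work out as follows (timings here are illustrative placeholders, not measurements from the run above):

```python
# All-reduce algorithmic bandwidth: AlgBW = 2*S*(P-1)/P / t
S = 64 * 1024**2            # 64 MB message
P = 8                       # GPUs in the node-local group
t = 0.0006                  # hypothetical time per all-reduce, seconds
print(f"AlgBW = {2 * S * (P - 1) / P / t / 1e9:.1f} GB/s")   # ~195.7 GB/s

# GEMM throughput: TFLOPS = 2*N^3 / t
N = 8192
t_gemm = 0.0015             # hypothetical time per GEMM, seconds
print(f"GEMM = {2 * N**3 / t_gemm / 1e12:.0f} TFLOPS")       # ~733 TFLOPS
```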

Made-with: Cursor
…limit checks

The node-local smoke test previously caught GPU-level failures (Tier 1) and
optional perf regressions (Tier 2). It missed three of the most common
"job dies at minute 3" causes at scale: software-stack drift between nodes,
silently degraded RDMA NICs, and host limits that block RDMA pin / NCCL
shared-memory under load. This commit adds all three in Tier 1 with no
extra runtime (millisecond-scale sysfs reads).

A. Software-stack fingerprint + cluster drift detection
- New _collect_node_fingerprint() captures kernel, OS, Python, ROCm
  (/opt/rocm/.info/version), amdgpu driver (/sys/module/amdgpu/version),
  PyTorch + torch.version.hip, RCCL version (torch.cuda.nccl.version()),
  librccl.so path, plus per-IB-device firmware (fw_ver) and HCA model.
- Aggregator computes the cluster-majority value for every scalar
  fingerprint key and emits a "Stack drift across cluster" section
  listing only outliers (e.g. one node on RCCL 2.21 while the rest are
  on 2.22). NIC firmware drift is reported per-IB-device in its own
  "NIC firmware drift" section so a flashed-differently NIC is named.
- Healthy clusters render *All nodes match.* placeholders so the report
  stays short.

B. NIC / RDMA roll-call (per-port, from sysfs only)
- New _collect_nic_status() inventories every port under
  /sys/class/infiniband (no ibv_devinfo / ibstat dependency, works
  inside containers). Per port we capture state, phys_state, link rate,
  netdev + MTU, total non-zero GIDs, and the RoCE v2 GID subset.
- Hard-fail rules (cause node FAIL): any port not ACTIVE / not LinkUp,
  any active port with zero RoCE v2 GIDs, or NIC count != the optional
  --expected-rdma-nics N.
- Aggregator's "NIC / RDMA roll-call issues" table pinpoints the
  offending node + port + reason.
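
A rough sketch of the sysfs-only port scan described above (standard /sys/class/infiniband layout; the real collector records more fields per port):

```python
import pathlib

def _read(p: pathlib.Path):
    return p.read_text().strip() if p.exists() else None

def scan_ib_ports():
    """Yield (device, port, state, phys_state, rate) from sysfs only."""
    for dev in sorted(pathlib.Path("/sys/class/infiniband").glob("*")):
        for port in sorted((dev / "ports").glob("*")):
            yield (dev.name, port.name, _read(port / "state"),
                   _read(port / "phys_state"), _read(port / "rate"))

# Hard-fail rule from the description: any port not ACTIVE / not LinkUp fails the node.
for dev, port, state, phys, rate in scan_ib_ports():
    ok = state and "ACTIVE" in state and phys and "LinkUp" in phys
    print(f"{dev} port {port}: {state} / {phys} @ {rate} -> {'OK' if ok else 'FAIL'}")
```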

C. Host limits / system tunables
- New _collect_host_limits() captures RLIMIT_MEMLOCK, RLIMIT_NOFILE,
  RLIMIT_NPROC, /dev/shm size + free, NUMA node count, CPU count, and
  cpu0 scaling_governor.
- Hard-fail rules: RLIMIT_MEMLOCK finite and below --ulimit-l-min-gb
  (default 32 GiB) -> "RDMA pin will fail under load"; /dev/shm size
  below --shm-min-gb (default 8 GiB) -> "NCCL shared-mem may fail".
- Aggregator's "Host limits issues" section lists violators with the
  exact value and required threshold.
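
Sketch of the host-limit probes (stdlib only; thresholds follow the defaults described above, and the real collector gathers more fields):

```python
import os
import resource

def check_host_limits(memlock_min_gib: float = 32.0, shm_min_gib: float = 8.0):
    reasons = []
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    if soft != resource.RLIM_INFINITY and soft < memlock_min_gib * 1024**3:
        reasons.append(f"RLIMIT_MEMLOCK {soft / 1024**3:.1f} GiB < {memlock_min_gib} GiB: "
                       "RDMA pin will fail under load")
    st = os.statvfs("/dev/shm")
    shm_total = st.f_frsize * st.f_blocks
    if shm_total < shm_min_gib * 1024**3:
        reasons.append(f"/dev/shm {shm_total / 1024**3:.1f} GiB < {shm_min_gib} GiB: "
                       "NCCL shared-mem may fail")
    return reasons
```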

Wiring + CLI
- Collectors are invoked unconditionally in _cmd_run after the existing
  reused info collectors, stored under tier1.fingerprint / tier1.nics /
  tier1.host_limits in the per-node JSON.
- _node_status_from() now adds nic: and host_limits: prefixed reasons
  so per-node fail_reasons remain self-describing.
- New `run` flags:
    --expected-rdma-nics N      FAIL on count mismatch (default: report only)
    --ulimit-l-min-gb GB        FAIL threshold (default 32; 0 disables)
    --shm-min-gb GB             FAIL threshold (default 8;  0 disables)
- Wrapper script needs no changes; unknown flags are forwarded as-is.

Verified
- Live single-node run on tus1-p3-g25: fingerprint populated (ROCm
  6.4.2, amdgpu 6.12.12, RCCL 2.28.9, NIC fw 231.2.63.0 across all 8
  rdma devices); NIC roll-call shows 8/8 ports ACTIVE/LinkUp at 400 Gb/s,
  MTU 9000, >=1 RoCE v2 GID each, 0 issues; host limits show memlock
  405 GiB, /dev/shm 1.6 TiB, governor=performance, 0 fail_reasons. All
  four new report sections render the *empty* placeholders cleanly.
- Synthetic two-node drift test (one real + one edited copy): outlier
  node correctly surfaces in Stack drift (rccl, amdgpu_driver), NIC
  firmware drift (rdma3 only), NIC issues (rdma2:1 DOWN), and Host
  limits (memlock 64 MiB violation). Per-node fail_reasons and exit
  code propagate as expected.

Made-with: Cursor
…dd port-count outlier section

In _stack_drift_rows() the comparison-key set was populated whenever any
node reported a key as a scalar OR as None. On a heterogeneous cluster
(some nodes with an IB stack, some without) "nic_fw" is None on one node
and a dict on the others. The dict then reached collections.Counter and
crashed the aggregator with `TypeError: unhashable type: 'dict'`. The
crash happened mid-write, so smoke_report.md was truncated and
passing_nodes.txt / failing_nodes.txt were never produced -- so an 18-node
SLURM run that successfully wrote 17 per-node JSONs ended up with no
usable cluster verdict.

Changes
- _stack_drift_rows: only collect a key when at least one node reports
  it as a real scalar (drop the "None counts as scalar" path); plus a
  defense-in-depth isinstance check inside the per-host loop so the same
  crash is impossible if a future schema mixes scalar and dict for the
  same key.
- Wrap each report section (Stack drift, NIC firmware drift, NIC issues,
  Host limits) in its own try/except. A bug in one section now records
  "*Section X failed to render: ...*" inline and the rest of the report
  still gets written.
- Add a "NIC port-count summary" section that always renders, computes
  the cluster-majority port count, and lists every node that disagrees.
  This catches partial-NIC-degradation cases (e.g. one node enumerating
  0 or 7 of 8 RDMA NICs) without requiring --expected-rdma-nics. Wrapped
  in try/except like the others.
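
The per-section isolation pattern, roughly (section names and the line-list interface are illustrative, not the actual report.py API):

```python
def render_section(out_lines, title, render_fn):
    """Render one report section; a bug records an inline note instead of truncating the file."""
    out_lines.append(f"## {title}\n")
    try:
        out_lines.extend(render_fn())
    except Exception as exc:   # broad on purpose: keep the rest of the report alive
        out_lines.append(f"*Section {title!r} failed to render: {exc}*\n")
```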

Verified
- Local repro of the original failure (one node nic_fw=None, one node
  nic_fw=dict): aggregator now exits 0 and writes a complete report,
  with the port-count outlier surfaced in the new summary section.
- Existing single-node and synthetic-drift smoke flows still produce the
  same output, including the empty-state placeholders on a homogeneous
  cluster.
… reported

- Normalize host -> short name in _cmd_run (JSON filename + host field)
  and defensively in _cmd_aggregate so legacy FQDN JSONs produce
  SLURM-ready passing/failing txt files without re-running the smoke.
- New `aggregate --expected-nodelist-file FILE`: missing nodes are
  named by their real short hostname (instead of <missing-N>) and
  written directly to failing_nodes.txt.
- runner/run_node_smoke_direct.sh: rank 0 auto-populates the file from
  `scontrol show hostnames "$SLURM_JOB_NODELIST"`. Best-effort.
…ocm-smi self-latency

Adds four new Tier 1 collectors and matching aggregator sections so the
smoke test catches a broader class of "node will silently degrade
training" failures before launch.

Per-node collectors (one call per node, results in tier1.<key>):
- gpu_low_level: amd-smi metric --json (text fallback) -> per-GPU power,
  GFX clock, edge temp, ECC counters, throttle status. Schema-tolerant.
- xgmi:          amd-smi topology -> parses the LINK TYPE TABLE into a
  BDF-indexed square matrix; records every non-XGMI pair.
- clock:         time.time(), monotonic, and systemctl is-active for
  chronyd/ntp/ntpd/systemd-timesyncd.
- tooling:       times rocm-smi --version against a hard timeout
  (default 5 s) -- a wedging amdgpu driver typically hangs rocm-smi
  for 30-60 s before the GPU itself stops responding.
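
Sketch of the tooling self-latency canary (the 5 s timeout matches the default above; the real collector records additional context):

```python
import subprocess
import time

def rocm_smi_latency(timeout_sec: float = 5.0):
    """Time `rocm-smi --version`; hitting the timeout is a strong wedged-driver signal."""
    t0 = time.perf_counter()
    try:
        subprocess.run(["rocm-smi", "--version"], capture_output=True,
                       timeout=timeout_sec, check=False)
        return {"latency_sec": time.perf_counter() - t0, "timed_out": False}
    except subprocess.TimeoutExpired:
        return {"latency_sec": timeout_sec, "timed_out": True}
    except FileNotFoundError:
        return {"latency_sec": None, "timed_out": False, "missing": True}
```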

Per-GPU subprocess gain (D-1 light, sysfs + torch only, no shell-out):
- details.low_level: pci_bdf, pcie_link_speed_{raw,gts}, pcie_link_width
  (from /sys/bus/pci/devices/<bdf>) plus hbm_total_bytes/hbm_free_bytes
  (torch.cuda.mem_get_info).

New hard fails in _node_status_from:
- per-GPU ecc_uncorrectable_total > 0 -> node FAIL.
- any non-XGMI pair in the topology matrix -> node FAIL (intra-node
  collectives silently fall back to PCIe and lose 5-10x bandwidth).
- rocm-smi --version timeout -> node FAIL (driver wedging signal).
Throttle reasons and time-daemon health are recorded but not failed-on
(schema is too vendor-specific / cluster-culture-specific for a default).

New CLI flags:
- run --rocm-smi-timeout-sec   (default 5.0)
- aggregate --rocm-smi-warn-sec  (default 1.0)
- aggregate --clock-skew-warn-sec (default 30.0; loose because the
  spread also includes srun launch jitter)

New aggregator sections in smoke_report.md (each wrapped in its own
try/except so a single bug can never truncate the report):
- GPU low-level outliers (PCIe link / HBM): per-GPU values that diverge
  from the cluster majority, listed as host:gpu = value.
- XGMI link issues: per-node, with up to 6 sample non-XGMI pairs each.
- Cluster clock + time daemons: wall-clock spread (with earliest/latest
  hosts) plus a sub-table of any nodes with no active time-sync daemon.
- Tooling self-latency: any node that hit the rocm-smi timeout (FAIL)
  or exceeded --rocm-smi-warn-sec.

Verified locally: amd-smi metric/topology/rocm-smi calls all complete;
XGMI parser handles the real multi-section BDF-labelled output (8x8 SELF
on diagonal, all XGMI off-diagonal); end-to-end run + aggregate produces
a clean smoke_report.md with all four new sections rendering cleanly.
…PU / stale-driver nodes can't PASS

Previously, a node where torch.cuda.device_count() resolved to 0 could
silently PASS smoke if `_collect_reused_info()` failed to surface the
"No GPUs detected" finding -- e.g. when collect_gpu_info() raises and
the wrapper downgrades the failure to level="warn". That's exactly the
class of failure (stale ROCm install, wedged amdgpu driver) the smoke
test exists to catch, so the FAIL must not depend on any other
collector's behavior.

Adds a self-contained guard in _cmd_run that captures every independent
GPU-count source -- the --expected-gpus flag, LOCAL_WORLD_SIZE,
GPUS_PER_NODE, torch.cuda.is_available(), torch.cuda.device_count(),
and (after _collect_amd_smi_metrics) amd-smi -- into a new
tier1.gpu_visibility block, with two hard-fail rules:

1. expected_gpus < 1 -> hard fail with full diagnostic context.
2. amd-smi sees more GPUs than torch -> hard fail. This is the
   high-signal stale-ROCm / wedged-driver signature.
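
A minimal sketch of the two hard-fail rules (simplified: the real guard also folds LOCAL_WORLD_SIZE / GPUS_PER_NODE / --expected-gpus into expected_gpus):

```python
from typing import List, Optional

import torch

def gpu_visibility_reasons(expected_gpus: int, amd_smi_count: Optional[int]) -> List[str]:
    reasons = []
    torch_count = torch.cuda.device_count() if torch.cuda.is_available() else 0
    if expected_gpus < 1:
        reasons.append(f"gpu_visibility: resolved expected_gpus={expected_gpus} (< 1)")
    if amd_smi_count is not None and amd_smi_count > torch_count:
        reasons.append(f"gpu_visibility: amd-smi sees {amd_smi_count} GPUs but torch sees "
                       f"{torch_count} -- stale ROCm / wedged driver signature")
    return reasons
```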

_node_status_from now prepends gpu_visibility:* reasons before any
other check, so the visibility verdict is independent of the reused
gpu_info collector. The aggregator gets a dedicated
"## GPU visibility issues" section that surfaces expected / torch /
amd-smi counts side by side per node.

Verified locally: on a host where torch can't see GPUs but amd-smi
sees 8, both reasons land in fail_reasons ahead of any reused-collector
finding and the node correctly FAILs.
…ubstring

The dmesg scanner used `p in ll` (substring match) over a list that
included regex-looking patterns like "amdgpu.*error". As a result the
amdgpu pattern essentially never fired against real kernel lines:

  amdgpu 0000:05:00.0: amdgpu_device_resume failed: -19
  amdgpu: [drm] *ERROR* ring sdma0 timeout, signaled seq=12345

both slipped past the scan, defeating the dmesg check for the most
common amdgpu failure modes.

Switch to compiled regex matching with re.IGNORECASE. Patterns are
documented as regex-by-contract; a malformed pattern is recorded into
the dmesg block's `pattern_errors` field and never aborts the scan.

Pattern changes:
- "xid"            -> r"\bxid\b"   (avoid matching auxiliary etc.)
- "amdgpu.*error"  -> r"amdgpu.*(error|fail|timeout)"  (real formats)
- added            r"\*error\*"    (catches "[drm] *ERROR*")

All previously-literal patterns ("hardware error", "gpu reset",
"hung_task", "soft lockup", ...) work unchanged because they contain
no regex metacharacters.
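
Sketch of the regex-by-contract scan (pattern list abbreviated to the ones named above):

```python
import re

PATTERNS = [r"\bxid\b", r"amdgpu.*(error|fail|timeout)", r"\*error\*",
            "hardware error", "gpu reset", "hung_task", "soft lockup"]

def scan_dmesg_lines(lines):
    hits, pattern_errors, compiled = [], [], []
    for p in PATTERNS:
        try:
            compiled.append(re.compile(p, re.IGNORECASE))
        except re.error as exc:          # a malformed pattern never aborts the scan
            pattern_errors.append(f"{p}: {exc}")
    for line in lines:
        if any(c.search(line) for c in compiled):
            hits.append(line)
    return {"hits": hits, "pattern_errors": pattern_errors}

# Both real-world lines quoted above now match:
print(scan_dmesg_lines([
    "amdgpu 0000:05:00.0: amdgpu_device_resume failed: -19",
    "amdgpu: [drm] *ERROR* ring sdma0 timeout, signaled seq=12345",
])["hits"])
```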

Verified against real amdgpu / NVRM Xid / MCE / soft-lockup /
hung_task / page-allocation lines (all match) and benign systemd /
audit lines (none match).
…mp cleanup

Several real-cluster paper-cuts uncovered while running node-smoke on
4-8 node SLURM jobs. None of these change the diagnostic content of the
report -- they fix surprising / wrong behaviour around the edges.

1. Per-node `run` now always exits 0 when the JSON was written.
   Previously it returned 1 whenever a node was diagnosed FAIL, so srun
   printed one "error: ... task N: Exited with exit code 1" line per
   bad node. That conflates "this node is broken" (a successful
   diagnosis) with "this tool crashed" (a real failure) and made it
   look like the smoke test itself was broken whenever it correctly
   identified a problem. The cluster-health verdict still flows out
   via the aggregator's exit code on rank 0 (single CI-friendly
   signal) and via failing_nodes.txt; tool-internal failures still
   propagate non-zero through Python's default exception handling.

2. Replace --tier2 + --tier2-rccl with a single --tier2-perf flag.
   The old pair allowed --tier2-rccl on its own to silently skip RCCL
   (because runtime required both flags), and --tier2 alone silently
   skipped RCCL on single-GPU nodes. Both gave false coverage
   confidence. --tier2-perf now turns on GEMM + HBM + node-local RCCL
   together. The `run` subparser uses allow_abbrev=False so an old
   `--tier2` left in a script errors out loudly instead of being
   silently prefix-matched to --tier2-perf. A warn is emitted up front
   if --tier2-perf is requested on a node with < 2 visible GPUs so the
   RCCL skip is never silent.

3. Robust PCIe BDF resolution (_resolve_gpu_bdf).
   torch.cuda.get_device_properties(i).pci_bus_id is polymorphic across
   PyTorch + ROCm versions: sometimes a canonical string, sometimes
   just the bus byte as int. The old code called .lower() on it and
   crashed inside a broad try/except, silently losing PCIe link width /
   speed and HBM totals from the report. The new helper handles both
   string and int forms, verifies sysfs paths, and the per-GPU low-
   level capture splits PCIe and HBM into independent try blocks with
   dedicated error keys so one missing piece never costs us the other.

4. Auto-clean stale artifacts in --dump-path on rank 0 at startup
   (_clean_dump_path), with --no-clean-dump-path to opt out. Without
   this, a re-run on a smaller nodelist would leave per-node JSONs
   from removed nodes in <dump>/smoke/ and the aggregator would happily
   count them as PASS, contaminating the report. Cleanup is rank-0
   only and runs before any rank can have written its current-run
   JSON, so it is race-safe.

5. runner/run_node_smoke_direct.sh: docstring updated to mention
   --tier2-perf instead of the removed --tier2 / --tier2-rccl.
…GPUs

The single most common reason a "healthy" cluster fails to launch a large
training job is that a previous job's Python ranks are still attached to
the GPUs (held HBM, half-torn-down NCCL communicators, or just stuck in
__del__). Symptoms in the new job: torch.cuda.OutOfMemoryError at model
init with a misleading "free=Y" message, NCCL/RCCL bootstrap hang, or
random ranks failing the first all-reduce due to compute contention.

This commit adds three Tier 1 checks (all node-level, all run before any
per-GPU subprocess attaches to the device, so we only see foreign work):

1. Foreign / leaked process enumeration -- _collect_gpu_processes()
   Tries `amd-smi process --json` -> `amd-smi process` (text) -> `lsof
   /dev/kfd /dev/dri/renderD*` and records {pid, name, hbm_bytes,
   is_self, is_allowed, is_foreign} per GPU. A PID is treated as ours
   (and excluded) if its pgid matches our own; everything else is
   foreign unless its name is in --allowed-procs (e.g.
   "rocm-smi-daemon,amd-smi,dcgm-exporter"). Hard FAIL by default;
   --allow-foreign-procs downgrades to report-only.

2. Pre-touch HBM-busy check -- in _per_gpu_body
   torch.cuda.mem_get_info is now called BEFORE we allocate anything on
   the GPU, so the "used" reading reflects only foreign occupancy. Hard
   FAIL if any GPU has > --hbm-busy-threshold-gib (default 2.0) used at
   that point. The previous post-test reading is biased by PyTorch's
   caching allocator (which doesn't truly release pages on
   empty_cache()) and was therefore not safe to threshold-check.

3. GPU compute-activity warn -- gfx_activity_pct in _flatten_amd_smi_metric_json
   Surfaces gpus reporting >= --gpu-activity-warn-pct (default 20%) at
   smoke start. Warn-only because short bursts are normal, but a
   sustained pegged-100% across multiple GPUs strongly indicates a
   leaked rank still running compute.

Aggregator output (smoke_report.md):

  ## Busy GPUs / leaked processes
  | Node | Hostname | GPU | PID | Process | HBM held (GiB) |

  ## GPU pre-touch HBM usage outliers
  | Node | Hostname | GPU | HBM used pre-touch (GiB) |

  ## GPU compute-activity outliers
  | Node | Hostname | GPU | Activity % |

failing_nodes.txt now includes any node with a foreign GPU process or
excessive pre-touch HBM, so the operator can `srun --exclude=` them or
`pkill -9 -f train.py` and retry.

New CLI flags (run):
  --hbm-busy-threshold-gib N   FAIL if pre-touch HBM used > N GiB. Default 2.0.
  --allow-foreign-procs        Downgrade foreign-process FAIL to report-only.
  --allowed-procs name1,name2  Whitelist known agents.
  --gpu-activity-warn-pct N    Aggregator warn threshold. Default 20.

The same threshold flags are mirrored on `aggregate` so the report
labels its sections with the numbers each node was configured with, and
on the internal _per_gpu subcommand so the spawned subprocess receives
--hbm-busy-threshold-gib.

Verified:
- Real 8-GPU node, no foreign processes -> sections render with
  reassuring "no issues" text; gpu_processes.tool == "amd-smi process
  --json", foreign_count == 0.
- Synthetic JSON with 2 foreign PIDs + 1 pre-touch outlier + 2 active
  GPUs -> all three tables populate; idle/clean GPUs filtered out.
- _node_status_from default -> precise FAIL message with PID/name/HBM;
  --allow-foreign-procs -> no FAIL (still in report).
…d amd-smi process JSON parser

Two related gaps in busy-GPU / leaked-process detection: (1) checks
silently no-op'd when amd-smi was missing, and (2) on nodes where
amd-smi is present, the modern (>=6.x) `amd-smi process --json` schema
broke our parser so the operator-facing "who is holding the GPU" table
came back empty -- even though pre-touch HBM had correctly flagged the
node as FAIL.

Tooling availability + rocm-smi fallbacks
-----------------------------------------
- Inventory amd-smi / rocm-smi / lsof at runtime; emit a loud WARN on
  rank 0 listing exactly which checks lose coverage.
- Always-on "Tooling availability" section in the aggregator report,
  with per-tool presence and per-check fallback status.
- `run --require-tools <csv>` promotes missing required tools to a hard
  node FAIL for strict CI environments.
- Add four rocm-smi fallback parsers producing the same per-GPU schema
  as their amd-smi counterparts:
    * `_rocm_smi_ras_info_text`   -> ECC counters
    * `_rocm_smi_topotype_json`   -> XGMI link matrix
    * `_rocm_smi_processes`       -> foreign processes (--showpids)
    * `_rocm_smi_use_json`        -> gfx_activity_pct (--showuse)
  Wired into `_collect_amd_smi_metrics`, `_collect_xgmi_topology`, and
  `_collect_gpu_processes` so coverage stays close to full when only
  rocm-smi is installed.
- Default `--allowed-procs` now includes node-resident agents
  (`gpuagent`, `rocm-smi-daemon`, `amd-smi`, `dcgm-exporter`).

amd-smi process JSON parser fix
-------------------------------
Real `amd-smi process --json` output (verified on a busy MI300X) is
double-nested in two ways the old parser didn't handle:

    [{"gpu": 0, "process_list": [
        {"process_info": {                            <-- extra wrapper
           "pid": 2669301,
           "memory_usage": {
             "vram_mem": {"value": 23044481024, "unit": "B"}  <-- dict
           }
        }}
    ]}]

The old code did `p.get("pid")` directly on the `{"process_info": ...}`
wrapper -> got None -> silently dropped every process. Even if it had
reached `_hbm_of`, the dict-with-unit memory shape wasn't recognised.
Net effect: `gpu_processes.foreign_count == 0` on nodes that visibly
had 8x ~23 GB leaked python ranks holding HBM.

  - New `_unwrap_proc()` peels off `process_info` if present, so modern
    and older amd-smi shapes flow through one path.
  - New `_value_unit_to_bytes()` resolves int / formatted string /
    `{"value": N, "unit": "..."}` uniformly via `_parse_size_with_unit`.
  - Updated docstring to record all three real-world shapes (modern A,
    older flat A', per-process B).
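
Rough sketch of the normalization (field names taken from the captured shape above; helper names mirror the commit but the bodies here are deliberately simplified, e.g. the unit handling ignores formatted strings):

```python
def _unwrap_proc(entry):
    """Peel off the modern {'process_info': {...}} wrapper if present."""
    return entry.get("process_info", entry) if isinstance(entry, dict) else entry

def _value_unit_to_bytes(value):
    """Resolve int / {'value': N, 'unit': 'B'} memory shapes to bytes (simplified)."""
    if isinstance(value, dict):
        unit = str(value.get("unit", "B")).upper()
        scale = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}.get(unit, 1)
        return int(value.get("value", 0)) * scale
    return int(value) if value is not None else None

entry = {"process_info": {"pid": 2669301,
                          "memory_usage": {"vram_mem": {"value": 23044481024, "unit": "B"}}}}
proc = _unwrap_proc(entry)
print(proc["pid"], _value_unit_to_bytes(proc["memory_usage"]["vram_mem"]))
```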

rocm-smi --showpids parser also extracts VRAM bytes
---------------------------------------------------
The documented field order is `name, num_gpus, vram_bytes, sdma,
cu_occupancy`. We were only taking field 0 (name) and passing None for
hbm_bytes, so even when the rocm-smi fallback fired the operator could
see which PIDs were leaked but not how much VRAM each was holding.
Now also takes field 2 as VRAM bytes (best-effort; tolerates older
shapes and dict/list values).

Verified against real captures from a busy node:
    amd-smi process --json:  0 PIDs (before)  -> 8 PIDs flagged foreign,
                                                 ~23.04 GB / 21.46 GiB each
    rocm-smi --showpids   : 10 PIDs, no HBM   -> 10 PIDs, 8 python3.11
                                                 foreign ~22.8 GiB each,
                                                 2 gpuagent allowed
amd-smi process, rocm-smi --showpids, and lsof /dev/kfd report PIDs in
the **root (host) PID namespace** -- KFD knows nothing about user
namespaces. os.getpid() returns the PID *as our own namespace sees it*.

On bare metal or SLURM + pyxis/enroot (shared host PID ns by default)
the two are equal and the naive `reported_pid == os.getpid()` test in
`_collect_gpu_processes` works. Inside Docker (default) or any k8s pod
the two differ -- causing our own training rank to be flagged
`is_foreign=True` and (with the default policy) failing the node.

Fix: new `_resolve_self_pid_view()` parses the NSpid line in
/proc/self/status to recover our root-namespace PID. The matcher in
`_collect_gpu_processes` now uses that host-side PID directly. The
pgid-match path is preserved on bare metal but skipped inside a
private PID namespace (os.getpgid on host PIDs we cannot see would
always ESRCH).
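
Sketch of the NSpid-based resolution as described above (stdlib only; the real helper also keeps the full chain for forensics):

```python
import os

def resolve_self_pid_view():
    """Return (host_pid, in_private_pid_ns) using the NSpid line of /proc/self/status."""
    host_pid = ns_pid = os.getpid()
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("NSpid:"):
                chain = [int(x) for x in line.split()[1:]]
                host_pid, ns_pid = chain[0], chain[-1]   # outermost view first, our own view last
                break
    return host_pid, host_pid != ns_pid
```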

Output JSON gains `self_host_pid`, `pid_namespaced`, and the full
`ns_pid_chain` for forensics across container boundaries.

Verified: bare metal, private PID ns w/ own rank + leak, private PID
ns w/ leak + allowed agent -- all classify correctly. The
private-PID-ns-w/-own-rank case was the bug (previously foreign=2,
now foreign=1).

Net effect: zero behavior change on SLURM + pyxis/enroot; own rank no
longer false-flagged on Docker / k8s.
The 4,487-line node_smoke.py is now a node_smoke/ package with one
module per responsibility, while preserving:
  - the `python -m primus.tools.preflight.node_smoke` entry point
  - CLI flags, help text, and exit-code semantics for run/_per_gpu/aggregate
  - JSON schema/keys and markdown report section order
  - runner/run_node_smoke_direct.sh wrapper behavior

Layout:
  types.py, logging_utils.py, shell_utils.py    leaf helpers
  per_gpu.py, rccl_local.py                     in-process workloads
  collectors/                                   per-area data gatherers
                                                (dmesg, fingerprint, nics,
                                                host_limits, gpu_low_level,
                                                gpu_processes, xgmi, clock,
                                                rocm_smi, tooling, reused_info)
  orchestrator.py                               spawn _per_gpu + status roll-up
  aggregator/summarizers.py                     row/summary helpers
  aggregator/report.py                          markdown writer, one helper per
                                                ## section (was ~700-line block)
  cli.py                                        argparse + run/_per_gpu/aggregate
  tests/test_node_smoke.py                      22 unit + parity tests

Tier 2 perf summary and "Failing nodes -- full reasons" keep their existing
error-handling exactly (no new try/except). Verified end-to-end with the
entrypoint matrix and a JSON/markdown diff against a pre-refactor baseline
(time-variant fields allow-listed).

Docs: docs/node-smoke.md updated with the new module layout, dependency
diagram, refreshed flag tables, and a design-overview entry in History.
…mi reports 'N/A'

Some amd-smi / rocm-smi builds emit `name="N/A"` (or "", "none", "-",
"unknown", ...) for kernel/system-owned PIDs like `gpuagent` because they
cannot read /proc/<pid>/comm themselves. The allowlist check is purely
name-based, so these placeholders never matched any whitelisted name and
every healthy node with a running gpuagent was incorrectly FAILed with
`gpu_processes: foreign process(es) holding GPU(s) ... name='N/A'`.

Fall back to /proc/<pid>/comm (then /proc/<pid>/status `Name:`) inside
`_annotate` whenever the upstream name is a known placeholder, then
re-evaluate `is_allowed` against the resolved name. Original placeholder
is preserved on the per-process record under `name_raw` plus a boolean
`name_resolved_from_proc` so the JSON keeps the audit trail.
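
Sketch of the name-resolution fallback (placeholder set and lookup order follow the description; simplified):

```python
PLACEHOLDERS = {"n/a", "", "none", "-", "unknown"}

def resolve_proc_name(pid: int, reported_name: str) -> str:
    """Fall back to /proc/<pid>/comm, then the Name: line of /proc/<pid>/status."""
    if reported_name and reported_name.strip().lower() not in PLACEHOLDERS:
        return reported_name
    try:
        return open(f"/proc/{pid}/comm").read().strip()
    except OSError:
        pass
    try:
        for line in open(f"/proc/{pid}/status"):
            if line.startswith("Name:"):
                return line.split(":", 1)[1].strip()
    except OSError:
        pass
    return reported_name
```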

Side benefit: the report's "Busy GPUs / leaked processes" table now shows
real names ('python', 'gpuagent', ...) instead of 'N/A', so operators can
finally see what to pkill on nodes with actual leaked ranks.

Co-authored-by: Cursor <cursoragent@cursor.com>
…art, recommend smoke-then-preflight workflow

- preflight.md: rewrite as the comprehensive reference for the now-configurable
  preflight tool. Cover info-only / perf-only / default mode precedence;
  --tests token list; --quick preset substitutions; per-knob tuning of message
  sizes, group sizes, ring-P2P sizes, and plotting; reliability knobs
  (--comm-cleanup-delay-sec, --dist-timeout-sec); validation behavior
  (fail-fast before NCCL init); reporting flags; backward-compat aliases;
  comparison with node-smoke; recommended pre-launch sequence.
- preflight-direct.md: turn into a quick-start guide. Add a top-of-file
  "Which test should I run?" comparison plus a 3-step smoke-then-preflight
  workflow snippet. Replace the monolithic example block with 10 labeled
  subsections (A-J) covering every configurable knob. Document the
  minimum-dependency install matrix per tool/mode (torch is the only hard
  requirement; markdown2/weasyprint only for PDFs; matplotlib only for
  --plot); flag the existing requirements.txt path as 'full Primus runtime'
  and not necessary for preflight/smoke alone. Add a one-line callout
  before the multi-node example to verify NCCL_IB_HCA / NCCL_IB_GID_INDEX /
  NCCL_SOCKET_IFNAME / GLOO_SOCKET_IFNAME before launching.
- node-smoke-test-instruction.md (new): short get-started guide for
  node-smoke, organized as quick-start + 10 example subsections per
  configurable knob + outputs + cheat sheet + troubleshooting. Links
  back to node-smoke.md for the full reference.
- node-smoke.md: refine the opening paragraph to drop the awkward
  "GPU vs node" framing (training jobs allocate whole nodes anyway,
  so node-granularity verdicts are the right unit). Add a callout
  pointing newcomers at node-smoke-test-instruction.md.
@amd-ama10002-2 (Collaborator) left a comment


LGTM overall — approving. Nice work on the smoke test, and the new unit tests are a great addition. Two follow-up requests below; happy for these to land in a separate PR if it's easier.

1. I think the new unit tests aren't actually wired into CI?

The test file lives at primus/tools/preflight/node_smoke/tests/test_node_smoke.py (in-source), but our CI only points pytest at ./tests/?

2. I'm not sure if the tests cover all of the new and updated features

I didn't manually test the new features, but going forward it would be good to make features easy to test so that we avoid introducing new bugs.

Also, a nit: I tend to prefer 1 new or updated feature == 1 PR, so that we have small, frequent PRs. That makes reviews easier and lets us merge features iteratively instead of in a large batch. In this case, I probably would've preferred multiple smaller PRs 🤷‍♂️

Just something to consider for the future🙂. We don't need to spend time breaking up this PR at all 🙂👍

Thanks again for the work!

…s / HBM-busy checks

Two related hardenings to the per-node smoke test so a sick node can no
longer slip through as clean:
- gpu_processes.py: previously, when `amd-smi process --json` returned
  rc=0 with valid JSON but an unknown / future schema, the parser
  returned [] and the caller still set ok=True, foreign_count=0 -- the
  rocm-smi / lsof fallbacks never ran. The empty-result case was
  indistinguishable from "schema matched, no processes" because
  `_flatten_amd_smi_process_json` only registered a per-GPU bucket when
  it pushed at least one process.
  Fix: register the per-GPU bucket as soon as a Shape A / A' entry is
  recognized (even with an empty `process_list`). Now [] unambiguously
  means "schema didn't match", and `_collect_gpu_processes` gates
  ok=True on a non-empty parsed result, falling through to text /
  rocm-smi / lsof on schema mismatch and recording a clear
  json_parse_error.
- per_gpu.py: tighten the pre-touch HBM-busy check from
  `used_b > hbm_busy_threshold_bytes` to `>=`, so a GPU sitting exactly
  at the threshold is treated as busy (likely leaked from a previous
  job) instead of squeaking through.
@amd-fuyuajin
Collaborator Author

@amd-ama10002-2 Thanks for your suggestions.

  1. Regarding CI: I did not wire the unit tests into CI because this is still a feature branch. Maybe we can add that when we merge the branch into main.
  2. Regarding smaller PRs: your suggestion is definitely better practice; I will follow it in the future.

@amd-fuyuajin
Collaborator Author

Here is an example of the node-smoke summary report. The output also includes per-node JSON files with detailed info, plus a passing-node list and a failing-node list that can be fed to --exclude in the follow-up srun command; those are not shown here.

Node-Local Smoke Test Report

  • Expected nodes: 16
  • Reported nodes: 16
  • PASS: 15 FAIL: 1
| Node | Hostname | Status | Duration | Top fail reason |
| --- | --- | --- | --- | --- |
| 0 | tus1-p3-g2 | PASS | 24.787s | |
| 1 | tus1-p3-g14 | PASS | 23.053s | |
| 2 | tus1-p3-g15 | PASS | 24.229s | |
| 3 | tus1-p3-g25 | FAIL | 23.038s | gpu0: FAIL: pre-touch HBM busy: 213.14 GiB already in use (threshold 2.0 GiB) -- likely leaked process from a previou... |
| 4 | tus1-p3-g26 | PASS | 28.08s | |
| 5 | tus1-p3-g27 | PASS | 23.741s | |
| 6 | tus1-p3-g29 | PASS | 25.238s | |
| 7 | tus1-p3-g32 | PASS | 23.82s | |
| 8 | tus1-p3-g50 | PASS | 23.867s | |
| 9 | tus1-p3-g51 | PASS | 23.707s | |
| 10 | tus1-p3-g52 | PASS | 23.531s | |
| 11 | tus1-p3-g53 | PASS | 24.463s | |
| 12 | tus1-p3-g54 | PASS | 24.733s | |
| 13 | tus1-p3-g55 | PASS | 24.305s | |
| 14 | tus1-p3-g57 | PASS | 24.552s | |
| 15 | tus1-p3-g59 | PASS | 24.727s | |

Stack drift across cluster

| Key | Majority (count/total) | Outlier nodes |
| --- | --- | --- |
| rocm | 6.4.2-120 (15/16) | tus1-p3-g57 = 7.2.0 |

NIC firmware drift across cluster

All NIC firmwares match (or no NICs reported).

NIC / RDMA roll-call issues

No NIC issues.

NIC port-count summary

Cluster-majority port count: 8 (seen on 16/16 nodes).

Every node reports the majority count.

Host limits issues

No host-limit issues.

GPU visibility issues

Every node resolved expected_gpus >= 1 and torch + amd-smi agree on the GPU count.

GPU low-level outliers (PCIe link / HBM)

All GPUs match the cluster majority on PCIe link and HBM total.

XGMI link issues

All GPU pairs report XGMI on every node (or amd-smi topology was unavailable).

Cluster clock + time daemons

  • Wall-clock spread across 16 nodes: 1.555s.
  • Earliest: tus1-p3-g25, latest: tus1-p3-g54.
  • (Spread is an upper bound on real clock skew -- it also includes srun launch jitter.)

Every node has at least one active time-sync daemon.

Tooling self-latency (rocm-smi --version)

No nodes exceeded the warn threshold (1.0s) and no timeouts.

Tooling availability

Every tracked tool (amd-smi, rocm-smi, lsof) was present in PATH on every node.

Busy GPUs / leaked processes

Foreign PIDs found holding GPUs at smoke start. The most common cause is leaked Python ranks from a previous training job (look for python / torchrun / train.py). Clean up with pkill -9 -f train.py (or similar) on the listed nodes BEFORE launching the next job.

Node Hostname GPU PID Process HBM held (GiB)
3 tus1-p3-g25 0 987322 python3 ?
3 tus1-p3-g25 0 987790 sglang::schedul ?
3 tus1-p3-g25 0 987791 sglang::schedul 211.64
3 tus1-p3-g25 0 987792 sglang::schedul ?
3 tus1-p3-g25 0 987793 sglang::schedul ?
3 tus1-p3-g25 0 987794 sglang::schedul ?
3 tus1-p3-g25 0 987795 sglang::schedul ?
3 tus1-p3-g25 0 987796 sglang::schedul ?
3 tus1-p3-g25 0 987797 sglang::schedul ?
3 tus1-p3-g25 0 987798 sglang::detoken ?
3 tus1-p3-g25 1 987322 python3 ?
3 tus1-p3-g25 1 987790 sglang::schedul ?
3 tus1-p3-g25 1 987791 sglang::schedul ?
3 tus1-p3-g25 1 987792 sglang::schedul ?
3 tus1-p3-g25 1 987793 sglang::schedul 212.48
3 tus1-p3-g25 1 987794 sglang::schedul ?
3 tus1-p3-g25 1 987795 sglang::schedul ?
3 tus1-p3-g25 1 987796 sglang::schedul ?
3 tus1-p3-g25 1 987797 sglang::schedul ?
3 tus1-p3-g25 1 987798 sglang::detoken ?
3 tus1-p3-g25 2 987322 python3 ?
3 tus1-p3-g25 2 987790 sglang::schedul ?
3 tus1-p3-g25 2 987791 sglang::schedul ?
3 tus1-p3-g25 2 987792 sglang::schedul 212.47
3 tus1-p3-g25 2 987793 sglang::schedul ?
3 tus1-p3-g25 2 987794 sglang::schedul ?
3 tus1-p3-g25 2 987795 sglang::schedul ?
3 tus1-p3-g25 2 987796 sglang::schedul ?
3 tus1-p3-g25 2 987797 sglang::schedul ?
3 tus1-p3-g25 2 987798 sglang::detoken ?
3 tus1-p3-g25 3 987322 python3 ?
3 tus1-p3-g25 3 987790 sglang::schedul 212.44
3 tus1-p3-g25 3 987791 sglang::schedul ?
3 tus1-p3-g25 3 987792 sglang::schedul ?
3 tus1-p3-g25 3 987793 sglang::schedul ?
3 tus1-p3-g25 3 987794 sglang::schedul ?
3 tus1-p3-g25 3 987795 sglang::schedul ?
3 tus1-p3-g25 3 987796 sglang::schedul ?
3 tus1-p3-g25 3 987797 sglang::schedul ?
3 tus1-p3-g25 3 987798 sglang::detoken ?
3 tus1-p3-g25 4 987322 python3 ?
3 tus1-p3-g25 4 987790 sglang::schedul ?
3 tus1-p3-g25 4 987791 sglang::schedul ?
3 tus1-p3-g25 4 987792 sglang::schedul ?
3 tus1-p3-g25 4 987793 sglang::schedul ?
3 tus1-p3-g25 4 987794 sglang::schedul ?
3 tus1-p3-g25 4 987795 sglang::schedul 212.35
3 tus1-p3-g25 4 987796 sglang::schedul ?
3 tus1-p3-g25 4 987797 sglang::schedul ?
3 tus1-p3-g25 4 987798 sglang::detoken ?
3 tus1-p3-g25 5 987322 python3 ?
3 tus1-p3-g25 5 987790 sglang::schedul ?
3 tus1-p3-g25 5 987791 sglang::schedul ?
3 tus1-p3-g25 5 987792 sglang::schedul ?
3 tus1-p3-g25 5 987793 sglang::schedul ?
3 tus1-p3-g25 5 987794 sglang::schedul ?
3 tus1-p3-g25 5 987795 sglang::schedul ?
3 tus1-p3-g25 5 987796 sglang::schedul ?
3 tus1-p3-g25 5 987797 sglang::schedul 212.2
3 tus1-p3-g25 5 987798 sglang::detoken ?
3 tus1-p3-g25 6 987322 python3 ?
3 tus1-p3-g25 6 987790 sglang::schedul ?
3 tus1-p3-g25 6 987791 sglang::schedul ?
3 tus1-p3-g25 6 987792 sglang::schedul ?
3 tus1-p3-g25 6 987793 sglang::schedul ?
3 tus1-p3-g25 6 987794 sglang::schedul ?
3 tus1-p3-g25 6 987795 sglang::schedul ?
3 tus1-p3-g25 6 987796 sglang::schedul 212.23
3 tus1-p3-g25 6 987797 sglang::schedul ?
3 tus1-p3-g25 6 987798 sglang::detoken ?
3 tus1-p3-g25 7 987322 python3 ?
3 tus1-p3-g25 7 987790 sglang::schedul ?
3 tus1-p3-g25 7 987791 sglang::schedul ?
3 tus1-p3-g25 7 987792 sglang::schedul ?
3 tus1-p3-g25 7 987793 sglang::schedul ?
3 tus1-p3-g25 7 987794 sglang::schedul 212.19
3 tus1-p3-g25 7 987795 sglang::schedul ?
3 tus1-p3-g25 7 987796 sglang::schedul ?
3 tus1-p3-g25 7 987797 sglang::schedul ?
3 tus1-p3-g25 7 987798 sglang::detoken ?

GPU pre-touch HBM usage outliers

GPUs with more than 2.0 GiB of HBM already in use BEFORE smoke touched the device. This number is not polluted by our own caching allocator (it's measured before any allocation), so it directly reflects foreign or leaked occupancy.

Node Hostname GPU HBM used pre-touch (GiB)
3 tus1-p3-g25 0 213.14
3 tus1-p3-g25 1 212.34
3 tus1-p3-g25 2 213.17
3 tus1-p3-g25 3 213.18
3 tus1-p3-g25 4 212.89
3 tus1-p3-g25 5 213.04
3 tus1-p3-g25 6 212.92
3 tus1-p3-g25 7 212.9

GPU compute-activity outliers

No GPU exceeded gfx_activity_pct >= 20.0% at smoke start (or amd-smi did not report activity).

Tier 2 perf summary

Per-node GEMM TFLOPS (8192^3 bf16) and HBM GB/s shown as min / median / max across the node's GPUs. RCCL GB/s is the node-local 8-GPU all-reduce algorithmic bandwidth at 64 MB.

| Node | Hostname | GEMM TFLOPS (min/med/max) | HBM GB/s (min/med/max) | Local RCCL GB/s |
| --- | --- | --- | --- | --- |
| 0 | tus1-p3-g2 | 756.5 / 760.0 / 762.1 | 4406.3 / 4421.6 / 4449.7 | 268.6 |
| 1 | tus1-p3-g14 | 748.5 / 762.7 / 764.7 | 4399.8 / 4437.2 / 4534.7 | 271.1 |
| 2 | tus1-p3-g15 | 750.4 / 757.9 / 764.4 | 4359.1 / 4417.6 / 4423.9 | 269.9 |
| 3 | tus1-p3-g25 | | | 269.8 |
| 4 | tus1-p3-g26 | 761.3 / 764.6 / 768.2 | 4388.6 / 4425.3 / 4435.8 | 269.2 |
| 5 | tus1-p3-g27 | 745.9 / 756.4 / 764.4 | 4409.3 / 4419.7 / 4429.6 | 269.2 |
| 6 | tus1-p3-g29 | 756.2 / 764.6 / 769.5 | 4395.3 / 4419.5 / 4442.0 | 268.8 |
| 7 | tus1-p3-g32 | 752.1 / 763.7 / 769.6 | 4402.3 / 4411.7 / 4430.1 | 270.0 |
| 8 | tus1-p3-g50 | 754.2 / 762.5 / 767.3 | 4359.9 / 4424.3 / 4477.9 | 268.8 |
| 9 | tus1-p3-g51 | 749.4 / 765.8 / 768.5 | 4404.1 / 4417.2 / 4445.4 | 268.8 |
| 10 | tus1-p3-g52 | 747.0 / 762.8 / 771.7 | 4401.7 / 4418.7 / 4453.4 | 270.3 |
| 11 | tus1-p3-g53 | 748.9 / 761.9 / 768.2 | 4399.0 / 4419.0 / 4446.0 | 267.6 |
| 12 | tus1-p3-g54 | 758.9 / 762.3 / 764.8 | 4403.9 / 4428.4 / 4444.6 | 270.4 |
| 13 | tus1-p3-g55 | 751.6 / 762.5 / 769.5 | 4408.9 / 4422.0 / 4432.9 | 269.9 |
| 14 | tus1-p3-g57 | 748.3 / 759.0 / 768.0 | 4399.3 / 4414.6 / 4427.6 | 270.1 |
| 15 | tus1-p3-g59 | 750.0 / 760.5 / 762.9 | 4407.6 / 4422.7 / 4431.4 | 269.8 |

Failing nodes -- full reasons

tus1-p3-g25

  • gpu0: FAIL: pre-touch HBM busy: 213.14 GiB already in use (threshold 2.0 GiB) -- likely leaked process from a previous job; see node-level gpu_processes section to identify the PID
  • gpu1: FAIL: pre-touch HBM busy: 212.34 GiB already in use (threshold 2.0 GiB) -- likely leaked process from a previous job; see node-level gpu_processes section to identify the PID
  • gpu2: FAIL: pre-touch HBM busy: 213.17 GiB already in use (threshold 2.0 GiB) -- likely leaked process from a previous job; see node-level gpu_processes section to identify the PID
  • gpu3: FAIL: pre-touch HBM busy: 213.18 GiB already in use (threshold 2.0 GiB) -- likely leaked process from a previous job; see node-level gpu_processes section to identify the PID
  • gpu4: FAIL: pre-touch HBM busy: 212.89 GiB already in use (threshold 2.0 GiB) -- likely leaked process from a previous job; see node-level gpu_processes section to identify the PID
  • gpu5: FAIL: pre-touch HBM busy: 213.04 GiB already in use (threshold 2.0 GiB) -- likely leaked process from a previous job; see node-level gpu_processes section to identify the PID
  • gpu6: FAIL: pre-touch HBM busy: 212.92 GiB already in use (threshold 2.0 GiB) -- likely leaked process from a previous job; see node-level gpu_processes section to identify the PID
  • gpu7: FAIL: pre-touch HBM busy: 212.9 GiB already in use (threshold 2.0 GiB) -- likely leaked process from a previous job; see node-level gpu_processes section to identify the PID
  • gpu_processes: 80 foreign process(es) holding GPU(s) (e.g. gpu0: pid=987322 name='python3'; gpu0: pid=987790 name='sglang::schedul'; gpu0: pid=987791 name='sglang::schedul' hbm=211.64GiB) -- likely leaked rank(s) from a previous job. Clean up with pkill -9 -f train.py (or similar) or pass --allow-foreign-procs.
