feat(preflight): configurable perf tests + new node-local smoke test#712

Open
amd-fuyuajin wants to merge 24 commits into dev/preflight-direct-test from dev/preflight-configurable-test

Conversation

@amd-fuyuajin
Collaborator

Summary

This PR adds two complementary cluster-diagnostic capabilities on top of the existing preflight tool, plus a comprehensive doc rewrite. The recommended pre-launch workflow becomes smoke first → preflight second:

  1. Configurable preflight — the existing global-rendezvous tool gains per-test selection (--tests), tuning knobs (message sizes, group sizes, ring-P2P sizes), a --quick preset, and reliability flags (--dist-timeout-sec, --comm-cleanup-delay-sec).
  2. New node-smoke — a distributed-rendezvous-free per-node screen that runs Tier 1 (always) + optional Tier 2 perf checks on every node in parallel under SLURM, returns one PASS/FAIL verdict per node, and writes SLURM-ready passing_nodes.txt / failing_nodes.txt. Implemented as a new sub-package primus/tools/preflight/node_smoke/ with its own wrapper runner/run_node_smoke_direct.sh.

What's changed

1. Configurable preflight perf tests

  • New flags: --tests, --comm-sizes-mb, --intra-comm-sizes-mb, --inter-comm-sizes-mb, --intra-group-sizes, --inter-group-sizes, --ring-p2p-sizes-mb, --quick, --dist-timeout-sec, --comm-cleanup-delay-sec.
  • Mode precedence (single rule): perf intent (--perf-test/--tests/--quick) wins over info selectors (--host/--gpu/--network); info-only mode never initializes torch.distributed.
  • Per-test wall-clock logging: [Primus:Preflight] <test> done in <T>s.
  • Validation runs before any rendezvous, so typos and bad sizes/group-sizes fail in seconds (not after a 120s NCCL init hang).
  • Report improvements: Node→Hostname legend at the top, compressed Node/Rank ranges, "Leader hostname" per group.
  • Backward-compat preserved: --check-host/--check-gpu/--check-network and --no-split-nodes-subgroup still work.

2. Node-local smoke test (new)

  • Tier 1 (always, ~5 s/GPU): per-GPU set_device + 256 MB alloc + tiny GEMM with isfinite() check, plus reused info collectors, dmesg recent-error scan, software-stack fingerprint, NIC/RDMA roll-call, host limits, GPU low-level (PCIe link / HBM / ECC / throttle), XGMI link matrix, clock skew + time-daemon health, foreign-process detection, tooling self-latency canary.
  • Tier 2 (optional, --tier2-perf): GEMM TFLOPS, HBM GB/s, local 8-GPU RCCL all-reduce GB/s with configurable thresholds.
  • Aggregator on NODE_RANK==0: cluster Markdown report with stable section ordering, per-node JSON, drift detection, and pass/fail txt outputs.
  • Resilience: per-GPU subprocesses with hard timeout (a stuck set_device is SIGKILL'd without affecting peers); short hostnames; PID-namespace-aware self-detection; /proc/<pid>/comm fallback when amd-smi process returns name="N/A" for kernel/system PIDs like gpuagent.
  • Aggregator report sections are individually try/except-wrapped so a bug in one section can't truncate the rest.
  • Implementation is a Python sub-package (collectors/, aggregator/, orchestrator.py, per_gpu.py, rccl_local.py, cli.py, ...). Single public entry: python -m primus.tools.preflight.node_smoke run|aggregate|_per_gpu.

3. Documentation

  • docs/preflight.md — rewritten as the comprehensive reference for the configurable preflight tool.
  • docs/preflight-direct.md — quick-start guide for runner/run_preflight_direct.sh. Adds:
    • Top-of-file "Which test should I run?" comparison + 3-step smoke-then-preflight workflow.
    • Per-tool minimum dependency install matrix (torch is the only hard requirement; markdown2/weasyprint only for PDFs; matplotlib only for --plot); explicitly notes requirements.txt is not necessary for these tools alone.
    • 10 labeled example subsections (A-J) covering every configurable knob.
    • One-line callout to verify NCCL_IB_HCA / NCCL_IB_GID_INDEX / NCCL_SOCKET_IFNAME / GLOO_SOCKET_IFNAME before launching multi-node runs.
  • docs/node-smoke.md — full reference for the new smoke test (architecture, every report section, every flag, design history).
  • docs/node-smoke-test-instruction.md (new) — short quick-start guide for node-smoke.

How to use

# 1) Prune broken nodes with node-smoke (fast, no rendezvous).
srun -N "$SLURM_NNODES" --ntasks-per-node=1 \
    bash runner/run_node_smoke_direct.sh --tier2-perf

# 2) Run preflight --quick on the survivors for cross-node sanity.
srun -N <good-nnodes> -c 128 --gpus-per-node=8 --ntasks-per-node=1 \
    --exclude=$(paste -sd, output/preflight/failing_nodes.txt) \
    runner/run_preflight_direct.sh --quick

amd-fuyuajin and others added 20 commits May 4, 2026 18:40
…nd reports

Introduce CLI flags to select which perf tests to run, override message and
group sizes, and apply a fast pre-launch preset. Reorganize the report so
large clusters stay readable, and harden the dispatcher with clear mode
precedence and fail-fast validation.

Test selection
- Add --tests CSV with tokens: gemm, intra-allreduce, intra-alltoall,
  inter-allreduce, inter-alltoall, inter-p2p, inter-ring-p2p, all.
- Replace the all-or-nothing run with a token-driven dispatch loop that
  logs per-test wall-clock time.
- Add --quick preset (gemm + intra-allreduce + inter-allreduce, sizes
  64,1024 MB, full intra/inter groups, lowered warmup/iters) for a fast
  pre-launch sanity check.

Configurable sizes / groups
- Add --comm-sizes-mb (global) and per-scope overrides
  --intra-comm-sizes-mb, --inter-comm-sizes-mb, --ring-p2p-sizes-mb.
- Add --intra-group-sizes and --inter-group-sizes (supports 'all' token).
- All knobs default to None so user-supplied values are detectable.

Runtime warmup/iteration overrides
- Add set_warmup/set_iteration/get_warmup/get_iteration in global_vars.
- Update square_gemm, intra_node_comm, inter_node_comm, inter_node_comm_p2p,
  and inter_node_ring_p2p to read via the getters so --quick takes effect.

Report readability
- Add Node -> Hostname legend at the top of the perf report (also mirrored
  to console).
- Rename the per-row Hostname column to "Leader hostname" and show only the
  first node of each group.
- Use compact ranges for Node and Rank columns (e.g. 0-3, 0-15) via a new
  format_int_range helper.
- Restore missing column headers in the console output for all perf tables.
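
As a rough sketch of the compaction described above (the in-tree format_int_range helper may differ in details), contiguous integer runs collapse into `lo-hi` spans:

```python
def format_int_range(values):
    """Collapse ints into compact 'lo-hi' spans, e.g. [0,1,2,3,8] -> '0-3,8'.

    Minimal sketch of the behaviour described above; the real helper may differ.
    """
    spans = []
    for v in sorted(set(values)):
        if spans and v == spans[-1][1] + 1:
            spans[-1][1] = v          # extend the current run
        else:
            spans.append([v, v])      # start a new run
    return ",".join(str(lo) if lo == hi else f"{lo}-{hi}" for lo, hi in spans)


# Example: ranks 0..15 of one group render as "0-15".
print(format_int_range(range(16)))
```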

Mode precedence and safety
- --tests and --quick auto-imply --perf-test.
- Perf intent wins over info selectors: when --perf-test/--tests/--quick is
  mixed with --host/--gpu/--network, info selectors are dropped with an
  explicit WARN (stderr + markdown).
- Tuning knobs set without any perf intent are inert and emit a quieter
  WARN ("knob X has no effect without --perf-test/--tests/--quick").
- Centralize PERF_INTENT_FLAGS and PERF_TEST_TOKENS in preflight_args.
- Keep --no-split-nodes-subgroup as a deprecated alias.
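
A hypothetical sketch of the precedence rule above (attribute names are guessed from the flag names; the actual resolution lives in preflight_args):

```python
PERF_INTENT_FLAGS = ("perf_test", "tests", "quick")   # assumed argparse dests
INFO_SELECTORS = ("host", "gpu", "network")

def resolve_mode(args):
    """Perf intent wins; info selectors are dropped with an explicit WARN."""
    perf_intent = any(getattr(args, f, None) for f in PERF_INTENT_FLAGS)
    info = [s for s in INFO_SELECTORS if getattr(args, s, False)]
    if perf_intent and info:
        print(f"WARN: info selectors {info} ignored because a perf test was requested")
        info = []
    return ("perf" if perf_intent else "info"), info
```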

Validation hardening
- _resolve_perf_config is now side-effect free; it returns warmup/iteration
  for the caller to apply instead of mutating module globals.
- Validate intra/inter/ring knobs only when the corresponding tests are
  selected (e.g. --tests gemm --intra-group-sizes 3 no longer aborts).
- Resolve and validate perf config BEFORE init_distributed so typos and bad
  sizes fail in milliseconds instead of after a 120s NCCL rendezvous.
- Reject --tests values that yield zero valid tokens (e.g. ",,,") with a
  clear error instead of silently running no perf tests.
- Drop unused format_host_range/_split_host_suffix and stale imports.

Made-with: Cursor
…creening

Add a lightweight, distributed-rendezvous-free smoke test that runs on every
node in parallel under SLURM and quickly identifies broken nodes before a
large-scale training job commits to a global rendezvous. Designed for the
common case where we own full nodes and care which *node* is sick, not which
GPU within an otherwise-healthy node.

Architecture
- primus/tools/preflight/node_smoke.py: per-node Python entry with three
  argparse subcommands (run, aggregate, _per_gpu).
- runner/run_node_smoke_direct.sh: SLURM/bash wrapper, modeled after
  run_preflight_direct.sh. No MASTER_ADDR / no torch.distributed rendezvous;
  every node runs independently, NODE_RANK==0 aggregates.

Tier 1 (mandatory, ~5 s/GPU)
- For each GPU, spawn an isolated Python subprocess with a hard timeout that
  performs torch.cuda.set_device, a 256 MB allocation, and a tiny GEMM with
  an isfinite() check. Catches stale / hung GPUs that pass enumeration but
  fail the first real op.
- Reuse existing collect_gpu_info / collect_host_info /
  collect_network_info(expect_distributed=False) and add a recent-dmesg scan
  for known hardware error patterns.
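
A minimal sketch of the per-GPU Tier 1 probe described above (allocation size and GEMM shape follow the description; the real per_gpu.py records much more):

```python
import torch

def tier1_probe(gpu_index: int, alloc_mb: int = 256, gemm_n: int = 1024) -> None:
    """set_device + a small allocation + a tiny GEMM whose output must be finite."""
    torch.cuda.set_device(gpu_index)
    buf = torch.empty(alloc_mb * 1024 * 1024, dtype=torch.uint8, device="cuda")
    a = torch.randn(gemm_n, gemm_n, device="cuda", dtype=torch.bfloat16)
    c = a @ a
    torch.cuda.synchronize()
    if not torch.isfinite(c.float()).all():
        raise RuntimeError(f"gpu{gpu_index}: GEMM produced non-finite values")
    del buf, a, c

# In the real tool this body runs in an isolated subprocess with a hard timeout,
# so a hung set_device is killed without affecting the other GPUs' probes.
```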

Tier 2 (optional perf sanity, --tier2 / --tier2-rccl)
- Per-GPU steady-state GEMM TFLOPS (8192^3 bf16) and HBM bandwidth measured
  via device-to-device torch.Tensor.copy_ (counts read+write), gated by
  thresholds.
- Node-local 8-GPU all-reduce via torch.multiprocessing.spawn over a
  127.0.0.1 process group, with a hard timeout. Measures algorithmic
  bandwidth at 64 MB. Iteration counts (warmup=5, iters=20 for GEMM and
  RCCL; warmup=10, iters=20 for HBM) are aligned with the preflight --quick
  preset so smoke and preflight report comparable steady-state numbers.
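
Illustrative sketch of the Tier 2 measurements (shapes and iteration counts follow the description above; the read+write accounting for copy_ is as stated):

```python
import time
import torch

def gemm_tflops(n: int = 8192, warmup: int = 5, iters: int = 20) -> float:
    a = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
    for _ in range(warmup):
        a @ b
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    dt = (time.perf_counter() - t0) / iters
    return 2 * n**3 / dt / 1e12            # 2*N^3 flops per GEMM

def hbm_gbps(size_mb: int = 1024, warmup: int = 10, iters: int = 20) -> float:
    src = torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8, device="cuda")
    dst = torch.empty_like(src)
    for _ in range(warmup):
        dst.copy_(src)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        dst.copy_(src)
    torch.cuda.synchronize()
    dt = (time.perf_counter() - t0) / iters
    return 2 * src.numel() / dt / 1e9      # device-to-device copy counts read + write
```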

Per-node JSON + cluster aggregation
- Each node writes <dump>/smoke/<host>.json with status, fail_reasons,
  duration, tier1 (per-GPU details + system probes), and tier2 sections.
- Aggregator on NODE_RANK==0 polls for the expected number of JSONs, then
  emits:
    * smoke_report.md with a status table, a Tier 2 perf summary
      (per-node GEMM / HBM min/median/max + local RCCL GB/s), and a
      "Failing nodes" detail section.
    * passing_nodes.txt / failing_nodes.txt suitable for piping straight
      into srun --nodelist / srun --exclude. Synthetic <missing-N>
      placeholders for nodes that never reported are kept in the markdown
      report but excluded from the txt files.
- Aggregator returns non-zero if any node FAILs or the expected count is
  not met, so the wrapper script propagates a meaningful exit code.

Verified at scale
- Successful 6-node run on tus1-p3-g[14,15,25,26,27,29] with --tier2
  --tier2-rccl: all nodes PASS, ~58 s wall clock per node, GEMM 702-733
  TFLOPS, HBM 3.7-4.2 TB/s, local RCCL 197-201 GB/s.
- Cross-checked formulas against square_gemm.py and intra_node_comm.py:
  identical AlgBW (2*S*(P-1)/P) and TFLOPS (2*N^3/t) definitions, so smoke
  and preflight numbers are directly comparable.
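
For concreteness, the shared definitions work out as follows (timings here are illustrative placeholders, not measurements from the run above):

```python
# All-reduce algorithmic bandwidth: AlgBW = 2*S*(P-1)/P / t
S = 64 * 1024**2            # 64 MB message
P = 8                       # GPUs in the node-local group
t = 0.0006                  # hypothetical time per all-reduce, seconds
print(f"AlgBW = {2 * S * (P - 1) / P / t / 1e9:.1f} GB/s")   # ~195.7 GB/s

# GEMM throughput: TFLOPS = 2*N^3 / t
N = 8192
t_gemm = 0.0015             # hypothetical time per GEMM, seconds
print(f"GEMM = {2 * N**3 / t_gemm / 1e12:.0f} TFLOPS")       # ~733 TFLOPS
```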

Made-with: Cursor
…limit checks

The node-local smoke test previously caught GPU-level failures (Tier 1) and
optional perf regressions (Tier 2). It missed three of the most common
"job dies at minute 3" causes at scale: software-stack drift between nodes,
silently degraded RDMA NICs, and host limits that block RDMA pin / NCCL
shared-memory under load. This commit adds all three in Tier 1 with no
extra runtime (millisecond-scale sysfs reads).

A. Software-stack fingerprint + cluster drift detection
- New _collect_node_fingerprint() captures kernel, OS, Python, ROCm
  (/opt/rocm/.info/version), amdgpu driver (/sys/module/amdgpu/version),
  PyTorch + torch.version.hip, RCCL version (torch.cuda.nccl.version()),
  librccl.so path, plus per-IB-device firmware (fw_ver) and HCA model.
- Aggregator computes the cluster-majority value for every scalar
  fingerprint key and emits a "Stack drift across cluster" section
  listing only outliers (e.g. one node on RCCL 2.21 while the rest are
  on 2.22). NIC firmware drift is reported per-IB-device in its own
  "NIC firmware drift" section so a flashed-differently NIC is named.
- Healthy clusters render *All nodes match.* placeholders so the report
  stays short.

B. NIC / RDMA roll-call (per-port, from sysfs only)
- New _collect_nic_status() inventories every port under
  /sys/class/infiniband (no ibv_devinfo / ibstat dependency, works
  inside containers). Per port we capture state, phys_state, link rate,
  netdev + MTU, total non-zero GIDs, and the RoCE v2 GID subset.
- Hard-fail rules (cause node FAIL): any port not ACTIVE / not LinkUp,
  any active port with zero RoCE v2 GIDs, or NIC count != the optional
  --expected-rdma-nics N.
- Aggregator's "NIC / RDMA roll-call issues" table pinpoints the
  offending node + port + reason.
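
A rough sketch of the sysfs-only port scan described above (standard /sys/class/infiniband layout; the real collector records more fields per port):

```python
import pathlib

def _read(p: pathlib.Path):
    return p.read_text().strip() if p.exists() else None

def scan_ib_ports():
    """Yield (device, port, state, phys_state, rate) from sysfs only."""
    for dev in sorted(pathlib.Path("/sys/class/infiniband").glob("*")):
        for port in sorted((dev / "ports").glob("*")):
            yield (dev.name, port.name, _read(port / "state"),
                   _read(port / "phys_state"), _read(port / "rate"))

# Hard-fail rule from the description: any port not ACTIVE / not LinkUp fails the node.
for dev, port, state, phys, rate in scan_ib_ports():
    ok = state and "ACTIVE" in state and phys and "LinkUp" in phys
    print(f"{dev} port {port}: {state} / {phys} @ {rate} -> {'OK' if ok else 'FAIL'}")
```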

C. Host limits / system tunables
- New _collect_host_limits() captures RLIMIT_MEMLOCK, RLIMIT_NOFILE,
  RLIMIT_NPROC, /dev/shm size + free, NUMA node count, CPU count, and
  cpu0 scaling_governor.
- Hard-fail rules: RLIMIT_MEMLOCK finite and below --ulimit-l-min-gb
  (default 32 GiB) -> "RDMA pin will fail under load"; /dev/shm size
  below --shm-min-gb (default 8 GiB) -> "NCCL shared-mem may fail".
- Aggregator's "Host limits issues" section lists violators with the
  exact value and required threshold.
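
Sketch of the host-limit probes (stdlib only; thresholds follow the defaults described above, and the real collector gathers more fields):

```python
import os
import resource

def check_host_limits(memlock_min_gib: float = 32.0, shm_min_gib: float = 8.0):
    reasons = []
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    if soft != resource.RLIM_INFINITY and soft < memlock_min_gib * 1024**3:
        reasons.append(f"RLIMIT_MEMLOCK {soft / 1024**3:.1f} GiB < {memlock_min_gib} GiB: "
                       "RDMA pin will fail under load")
    st = os.statvfs("/dev/shm")
    shm_total = st.f_frsize * st.f_blocks
    if shm_total < shm_min_gib * 1024**3:
        reasons.append(f"/dev/shm {shm_total / 1024**3:.1f} GiB < {shm_min_gib} GiB: "
                       "NCCL shared-mem may fail")
    return reasons
```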

Wiring + CLI
- Collectors are invoked unconditionally in _cmd_run after the existing
  reused info collectors, stored under tier1.fingerprint / tier1.nics /
  tier1.host_limits in the per-node JSON.
- _node_status_from() now adds nic: and host_limits: prefixed reasons
  so per-node fail_reasons remain self-describing.
- New `run` flags:
    --expected-rdma-nics N      FAIL on count mismatch (default: report only)
    --ulimit-l-min-gb GB        FAIL threshold (default 32; 0 disables)
    --shm-min-gb GB             FAIL threshold (default 8;  0 disables)
- Wrapper script needs no changes; unknown flags are forwarded as-is.

Verified
- Live single-node run on tus1-p3-g25: fingerprint populated (ROCm
  6.4.2, amdgpu 6.12.12, RCCL 2.28.9, NIC fw 231.2.63.0 across all 8
  rdma devices); NIC roll-call shows 8/8 ports ACTIVE/LinkUp at 400 Gb/s,
  MTU 9000, >=1 RoCE v2 GID each, 0 issues; host limits show memlock
  405 GiB, /dev/shm 1.6 TiB, governor=performance, 0 fail_reasons. All
  four new report sections render the *empty* placeholders cleanly.
- Synthetic two-node drift test (one real + one edited copy): outlier
  node correctly surfaces in Stack drift (rccl, amdgpu_driver), NIC
  firmware drift (rdma3 only), NIC issues (rdma2:1 DOWN), and Host
  limits (memlock 64 MiB violation). Per-node fail_reasons and exit
  code propagate as expected.

Made-with: Cursor
…dd port-count outlier section

In _stack_drift_rows() the comparison-key set was populated whenever any
node reported a key as a scalar OR as None. On a heterogeneous cluster
(some nodes with an IB stack, some without) "nic_fw" is None on one node
and a dict on the others. The dict then reached collections.Counter and
crashed the aggregator with `TypeError: unhashable type: 'dict'`. The
crash happened mid-write, so smoke_report.md was truncated and
passing_nodes.txt / failing_nodes.txt were never produced -- so an 18-node
SLURM run that successfully wrote 17 per-node JSONs ended up with no
usable cluster verdict.

Changes
- _stack_drift_rows: only collect a key when at least one node reports
  it as a real scalar (drop the "None counts as scalar" path); plus a
  defense-in-depth isinstance check inside the per-host loop so the same
  crash is impossible if a future schema mixes scalar and dict for the
  same key.
- Wrap each report section (Stack drift, NIC firmware drift, NIC issues,
  Host limits) in its own try/except. A bug in one section now records
  "*Section X failed to render: ...*" inline and the rest of the report
  still gets written.
- Add a "NIC port-count summary" section that always renders, computes
  the cluster-majority port count, and lists every node that disagrees.
  This catches partial-NIC-degradation cases (e.g. one node enumerating
  0 or 7 of 8 RDMA NICs) without requiring --expected-rdma-nics. Wrapped
  in try/except like the others.
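
The per-section isolation pattern, roughly (section names and the line-list interface are illustrative, not the actual report.py API):

```python
def render_section(out_lines, title, render_fn):
    """Render one report section; a bug records an inline note instead of truncating the file."""
    out_lines.append(f"## {title}\n")
    try:
        out_lines.extend(render_fn())
    except Exception as exc:   # broad on purpose: keep the rest of the report alive
        out_lines.append(f"*Section {title!r} failed to render: {exc}*\n")
```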

Verified
- Local repro of the original failure (one node nic_fw=None, one node
  nic_fw=dict): aggregator now exits 0 and writes a complete report,
  with the port-count outlier surfaced in the new summary section.
- Existing single-node and synthetic-drift smoke flows still produce the
  same output, including the empty-state placeholders on a homogeneous
  cluster.
… reported

- Normalize host -> short name in _cmd_run (JSON filename + host field)
  and defensively in _cmd_aggregate so legacy FQDN JSONs produce
  SLURM-ready passing/failing txt files without re-running the smoke.
- New `aggregate --expected-nodelist-file FILE`: missing nodes are
  named by their real short hostname (instead of <missing-N>) and
  written directly to failing_nodes.txt.
- runner/run_node_smoke_direct.sh: rank 0 auto-populates the file from
  `scontrol show hostnames "$SLURM_JOB_NODELIST"`. Best-effort.
…ocm-smi self-latency

Adds four new Tier 1 collectors and matching aggregator sections so the
smoke test catches a broader class of "node will silently degrade
training" failures before launch.

Per-node collectors (one call per node, results in tier1.<key>):
- gpu_low_level: amd-smi metric --json (text fallback) -> per-GPU power,
  GFX clock, edge temp, ECC counters, throttle status. Schema-tolerant.
- xgmi:          amd-smi topology -> parses the LINK TYPE TABLE into a
  BDF-indexed square matrix; records every non-XGMI pair.
- clock:         time.time(), monotonic, and systemctl is-active for
  chronyd/ntp/ntpd/systemd-timesyncd.
- tooling:       times rocm-smi --version against a hard timeout
  (default 5 s) -- a wedging amdgpu driver typically hangs rocm-smi
  for 30-60 s before the GPU itself stops responding.
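
Sketch of the tooling self-latency canary (the 5 s timeout matches the default above; the real collector records additional context):

```python
import subprocess
import time

def rocm_smi_latency(timeout_sec: float = 5.0):
    """Time `rocm-smi --version`; hitting the timeout is a strong wedged-driver signal."""
    t0 = time.perf_counter()
    try:
        subprocess.run(["rocm-smi", "--version"], capture_output=True,
                       timeout=timeout_sec, check=False)
        return {"latency_sec": time.perf_counter() - t0, "timed_out": False}
    except subprocess.TimeoutExpired:
        return {"latency_sec": timeout_sec, "timed_out": True}
    except FileNotFoundError:
        return {"latency_sec": None, "timed_out": False, "missing": True}
```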

Per-GPU subprocess gain (D-1 light, sysfs + torch only, no shell-out):
- details.low_level: pci_bdf, pcie_link_speed_{raw,gts}, pcie_link_width
  (from /sys/bus/pci/devices/<bdf>) plus hbm_total_bytes/hbm_free_bytes
  (torch.cuda.mem_get_info).

New hard fails in _node_status_from:
- per-GPU ecc_uncorrectable_total > 0 -> node FAIL.
- any non-XGMI pair in the topology matrix -> node FAIL (intra-node
  collectives silently fall back to PCIe and lose 5-10x bandwidth).
- rocm-smi --version timeout -> node FAIL (driver wedging signal).
Throttle reasons and time-daemon health are recorded but not failed-on
(schema is too vendor-specific / cluster-culture-specific for a default).

New CLI flags:
- run --rocm-smi-timeout-sec   (default 5.0)
- aggregate --rocm-smi-warn-sec  (default 1.0)
- aggregate --clock-skew-warn-sec (default 30.0; loose because the
  spread also includes srun launch jitter)

New aggregator sections in smoke_report.md (each wrapped in its own
try/except so a single bug can never truncate the report):
- GPU low-level outliers (PCIe link / HBM): per-GPU values that diverge
  from the cluster majority, listed as host:gpu = value.
- XGMI link issues: per-node, with up to 6 sample non-XGMI pairs each.
- Cluster clock + time daemons: wall-clock spread (with earliest/latest
  hosts) plus a sub-table of any nodes with no active time-sync daemon.
- Tooling self-latency: any node that hit the rocm-smi timeout (FAIL)
  or exceeded --rocm-smi-warn-sec.

Verified locally: amd-smi metric/topology/rocm-smi calls all complete;
XGMI parser handles the real multi-section BDF-labelled output (8x8 SELF
on diagonal, all XGMI off-diagonal); end-to-end run + aggregate produces
a clean smoke_report.md with all four new sections rendering cleanly.
…PU / stale-driver nodes can't PASS

Previously, a node where torch.cuda.device_count() resolved to 0 could
silently PASS smoke if `_collect_reused_info()` failed to surface the
"No GPUs detected" finding -- e.g. when collect_gpu_info() raises and
the wrapper downgrades the failure to level="warn". That's exactly the
class of failure (stale ROCm install, wedged amdgpu driver) the smoke
test exists to catch, so the FAIL must not depend on any other
collector's behavior.

Adds a self-contained guard in _cmd_run that captures every independent
GPU-count source -- the --expected-gpus flag, LOCAL_WORLD_SIZE,
GPUS_PER_NODE, torch.cuda.is_available(), torch.cuda.device_count(),
and (after _collect_amd_smi_metrics) amd-smi -- into a new
tier1.gpu_visibility block, with two hard-fail rules:

1. expected_gpus < 1 -> hard fail with full diagnostic context.
2. amd-smi sees more GPUs than torch -> hard fail. This is the
   high-signal stale-ROCm / wedged-driver signature.
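
A minimal sketch of the two hard-fail rules (simplified: the real guard also folds LOCAL_WORLD_SIZE / GPUS_PER_NODE / --expected-gpus into expected_gpus):

```python
from typing import List, Optional

import torch

def gpu_visibility_reasons(expected_gpus: int, amd_smi_count: Optional[int]) -> List[str]:
    reasons = []
    torch_count = torch.cuda.device_count() if torch.cuda.is_available() else 0
    if expected_gpus < 1:
        reasons.append(f"gpu_visibility: resolved expected_gpus={expected_gpus} (< 1)")
    if amd_smi_count is not None and amd_smi_count > torch_count:
        reasons.append(f"gpu_visibility: amd-smi sees {amd_smi_count} GPUs but torch sees "
                       f"{torch_count} -- stale ROCm / wedged driver signature")
    return reasons
```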

_node_status_from now prepends gpu_visibility:* reasons before any
other check, so the visibility verdict is independent of the reused
gpu_info collector. The aggregator gets a dedicated
"## GPU visibility issues" section that surfaces expected / torch /
amd-smi counts side by side per node.

Verified locally: on a host where torch can't see GPUs but amd-smi
sees 8, both reasons land in fail_reasons ahead of any reused-collector
finding and the node correctly FAILs.
…ubstring

The dmesg scanner used `p in ll` (substring match) over a list that
included regex-looking patterns like "amdgpu.*error". As a result the
amdgpu pattern essentially never fired against real kernel lines:

  amdgpu 0000:05:00.0: amdgpu_device_resume failed: -19
  amdgpu: [drm] *ERROR* ring sdma0 timeout, signaled seq=12345

both slipped past the scan, defeating the dmesg check for the most
common amdgpu failure modes.

Switch to compiled regex matching with re.IGNORECASE. Patterns are
documented as regex-by-contract; a malformed pattern is recorded into
the dmesg block's `pattern_errors` field and never aborts the scan.

Pattern changes:
- "xid"            -> r"\bxid\b"   (avoid matching auxiliary etc.)
- "amdgpu.*error"  -> r"amdgpu.*(error|fail|timeout)"  (real formats)
- added            r"\*error\*"    (catches "[drm] *ERROR*")

All previously-literal patterns ("hardware error", "gpu reset",
"hung_task", "soft lockup", ...) work unchanged because they contain
no regex metacharacters.
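
Sketch of the regex-by-contract scan (pattern list abbreviated to the ones named above):

```python
import re

PATTERNS = [r"\bxid\b", r"amdgpu.*(error|fail|timeout)", r"\*error\*",
            "hardware error", "gpu reset", "hung_task", "soft lockup"]

def scan_dmesg_lines(lines):
    hits, pattern_errors, compiled = [], [], []
    for p in PATTERNS:
        try:
            compiled.append(re.compile(p, re.IGNORECASE))
        except re.error as exc:          # a malformed pattern never aborts the scan
            pattern_errors.append(f"{p}: {exc}")
    for line in lines:
        if any(c.search(line) for c in compiled):
            hits.append(line)
    return {"hits": hits, "pattern_errors": pattern_errors}

# Both real-world lines quoted above now match:
print(scan_dmesg_lines([
    "amdgpu 0000:05:00.0: amdgpu_device_resume failed: -19",
    "amdgpu: [drm] *ERROR* ring sdma0 timeout, signaled seq=12345",
])["hits"])
```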

Verified against real amdgpu / NVRM Xid / MCE / soft-lockup /
hung_task / page-allocation lines (all match) and benign systemd /
audit lines (none match).
…mp cleanup

Several real-cluster paper-cuts uncovered while running node-smoke on
4-8 node SLURM jobs. None of these change the diagnostic content of the
report -- they fix surprising / wrong behaviour around the edges.

1. Per-node `run` now always exits 0 when the JSON was written.
   Previously it returned 1 whenever a node was diagnosed FAIL, so srun
   printed one "error: ... task N: Exited with exit code 1" line per
   bad node. That conflates "this node is broken" (a successful
   diagnosis) with "this tool crashed" (a real failure) and made it
   look like the smoke test itself was broken whenever it correctly
   identified a problem. The cluster-health verdict still flows out
   via the aggregator's exit code on rank 0 (single CI-friendly
   signal) and via failing_nodes.txt; tool-internal failures still
   propagate non-zero through Python's default exception handling.

2. Replace --tier2 + --tier2-rccl with a single --tier2-perf flag.
   The old pair allowed --tier2-rccl on its own to silently skip RCCL
   (because runtime required both flags), and --tier2 alone silently
   skipped RCCL on single-GPU nodes. Both gave false coverage
   confidence. --tier2-perf now turns on GEMM + HBM + node-local RCCL
   together. The `run` subparser uses allow_abbrev=False so an old
   `--tier2` left in a script errors out loudly instead of being
   silently prefix-matched to --tier2-perf. A warn is emitted up front
   if --tier2-perf is requested on a node with < 2 visible GPUs so the
   RCCL skip is never silent.

3. Robust PCIe BDF resolution (_resolve_gpu_bdf).
   torch.cuda.get_device_properties(i).pci_bus_id is polymorphic across
   PyTorch + ROCm versions: sometimes a canonical string, sometimes
   just the bus byte as int. The old code called .lower() on it and
   crashed inside a broad try/except, silently losing PCIe link width /
   speed and HBM totals from the report. The new helper handles both
   string and int forms, verifies sysfs paths, and the per-GPU low-
   level capture splits PCIe and HBM into independent try blocks with
   dedicated error keys so one missing piece never costs us the other.

4. Auto-clean stale artifacts in --dump-path on rank 0 at startup
   (_clean_dump_path), with --no-clean-dump-path to opt out. Without
   this, a re-run on a smaller nodelist would leave per-node JSONs
   from removed nodes in <dump>/smoke/ and the aggregator would happily
   count them as PASS, contaminating the report. Cleanup is rank-0
   only and runs before any rank can have written its current-run
   JSON, so it is race-safe.

5. runner/run_node_smoke_direct.sh: docstring updated to mention
   --tier2-perf instead of the removed --tier2 / --tier2-rccl.
…GPUs

The single most common reason a "healthy" cluster fails to launch a large
training job is that a previous job's Python ranks are still attached to
the GPUs (held HBM, half-torn-down NCCL communicators, or just stuck in
__del__). Symptoms in the new job: torch.cuda.OutOfMemoryError at model
init with a misleading "free=Y" message, NCCL/RCCL bootstrap hang, or
random ranks failing the first all-reduce due to compute contention.

This commit adds three Tier 1 checks (all node-level, all run before any
per-GPU subprocess attaches to the device, so we only see foreign work):

1. Foreign / leaked process enumeration -- _collect_gpu_processes()
   Tries `amd-smi process --json` -> `amd-smi process` (text) -> `lsof
   /dev/kfd /dev/dri/renderD*` and records {pid, name, hbm_bytes,
   is_self, is_allowed, is_foreign} per GPU. A PID is treated as ours
   (and excluded) if its pgid matches our own; everything else is
   foreign unless its name is in --allowed-procs (e.g.
   "rocm-smi-daemon,amd-smi,dcgm-exporter"). Hard FAIL by default;
   --allow-foreign-procs downgrades to report-only.

2. Pre-touch HBM-busy check -- in _per_gpu_body
   torch.cuda.mem_get_info is now called BEFORE we allocate anything on
   the GPU, so the "used" reading reflects only foreign occupancy. Hard
   FAIL if any GPU has > --hbm-busy-threshold-gib (default 2.0) used at
   that point. The previous post-test reading is biased by PyTorch's
   caching allocator (which doesn't truly release pages on
   empty_cache()) and was therefore not safe to threshold-check.

3. GPU compute-activity warn -- gfx_activity_pct in _flatten_amd_smi_metric_json
   Surfaces gpus reporting >= --gpu-activity-warn-pct (default 20%) at
   smoke start. Warn-only because short bursts are normal, but a
   sustained pegged-100% across multiple GPUs strongly indicates a
   leaked rank still running compute.

Aggregator output (smoke_report.md):

  ## Busy GPUs / leaked processes
  | Node | Hostname | GPU | PID | Process | HBM held (GiB) |

  ## GPU pre-touch HBM usage outliers
  | Node | Hostname | GPU | HBM used pre-touch (GiB) |

  ## GPU compute-activity outliers
  | Node | Hostname | GPU | Activity % |

failing_nodes.txt now includes any node with a foreign GPU process or
excessive pre-touch HBM, so the operator can `srun --exclude=` them or
`pkill -9 -f train.py` and retry.

New CLI flags (run):
  --hbm-busy-threshold-gib N   FAIL if pre-touch HBM used > N GiB. Default 2.0.
  --allow-foreign-procs        Downgrade foreign-process FAIL to report-only.
  --allowed-procs name1,name2  Whitelist known agents.
  --gpu-activity-warn-pct N    Aggregator warn threshold. Default 20.

The same threshold flags are mirrored on `aggregate` so the report
labels its sections with the numbers each node was configured with, and
on the internal _per_gpu subcommand so the spawned subprocess receives
--hbm-busy-threshold-gib.

Verified:
- Real 8-GPU node, no foreign processes -> sections render with
  reassuring "no issues" text; gpu_processes.tool == "amd-smi process
  --json", foreign_count == 0.
- Synthetic JSON with 2 foreign PIDs + 1 pre-touch outlier + 2 active
  GPUs -> all three tables populate; idle/clean GPUs filtered out.
- _node_status_from default -> precise FAIL message with PID/name/HBM;
  --allow-foreign-procs -> no FAIL (still in report).
…d amd-smi process JSON parser

Two related gaps in busy-GPU / leaked-process detection: (1) checks
silently no-op'd when amd-smi was missing, and (2) on nodes where
amd-smi is present, the modern (>=6.x) `amd-smi process --json` schema
broke our parser so the operator-facing "who is holding the GPU" table
came back empty -- even though pre-touch HBM had correctly flagged the
node as FAIL.

Tooling availability + rocm-smi fallbacks
-----------------------------------------
- Inventory amd-smi / rocm-smi / lsof at runtime; emit a loud WARN on
  rank 0 listing exactly which checks lose coverage.
- Always-on "Tooling availability" section in the aggregator report,
  with per-tool presence and per-check fallback status.
- `run --require-tools <csv>` promotes missing required tools to a hard
  node FAIL for strict CI environments.
- Add four rocm-smi fallback parsers producing the same per-GPU schema
  as their amd-smi counterparts:
    * `_rocm_smi_ras_info_text`   -> ECC counters
    * `_rocm_smi_topotype_json`   -> XGMI link matrix
    * `_rocm_smi_processes`       -> foreign processes (--showpids)
    * `_rocm_smi_use_json`        -> gfx_activity_pct (--showuse)
  Wired into `_collect_amd_smi_metrics`, `_collect_xgmi_topology`, and
  `_collect_gpu_processes` so coverage stays close to full when only
  rocm-smi is installed.
- Default `--allowed-procs` now includes node-resident agents
  (`gpuagent`, `rocm-smi-daemon`, `amd-smi`, `dcgm-exporter`).

amd-smi process JSON parser fix
-------------------------------
Real `amd-smi process --json` output (verified on a busy MI300X) is
double-nested in two ways the old parser didn't handle:

    [{"gpu": 0, "process_list": [
        {"process_info": {                            <-- extra wrapper
           "pid": 2669301,
           "memory_usage": {
             "vram_mem": {"value": 23044481024, "unit": "B"}  <-- dict
           }
        }}
    ]}]

The old code did `p.get("pid")` directly on the `{"process_info": ...}`
wrapper -> got None -> silently dropped every process. Even if it had
reached `_hbm_of`, the dict-with-unit memory shape wasn't recognised.
Net effect: `gpu_processes.foreign_count == 0` on nodes that visibly
had 8x ~23 GB leaked python ranks holding HBM.

  - New `_unwrap_proc()` peels off `process_info` if present, so modern
    and older amd-smi shapes flow through one path.
  - New `_value_unit_to_bytes()` resolves int / formatted string /
    `{"value": N, "unit": "..."}` uniformly via `_parse_size_with_unit`.
  - Updated docstring to record all three real-world shapes (modern A,
    older flat A', per-process B).
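
Rough sketch of the normalization (field names taken from the captured shape above; helper names mirror the commit but the bodies here are deliberately simplified, e.g. the unit handling ignores formatted strings):

```python
def _unwrap_proc(entry):
    """Peel off the modern {'process_info': {...}} wrapper if present."""
    return entry.get("process_info", entry) if isinstance(entry, dict) else entry

def _value_unit_to_bytes(value):
    """Resolve int / {'value': N, 'unit': 'B'} memory shapes to bytes (simplified)."""
    if isinstance(value, dict):
        unit = str(value.get("unit", "B")).upper()
        scale = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}.get(unit, 1)
        return int(value.get("value", 0)) * scale
    return int(value) if value is not None else None

entry = {"process_info": {"pid": 2669301,
                          "memory_usage": {"vram_mem": {"value": 23044481024, "unit": "B"}}}}
proc = _unwrap_proc(entry)
print(proc["pid"], _value_unit_to_bytes(proc["memory_usage"]["vram_mem"]))
```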

rocm-smi --showpids parser also extracts VRAM bytes
---------------------------------------------------
The documented field order is `name, num_gpus, vram_bytes, sdma,
cu_occupancy`. We were only taking field 0 (name) and passing None for
hbm_bytes, so even when the rocm-smi fallback fired the operator could
see which PIDs were leaked but not how much VRAM each was holding.
Now also takes field 2 as VRAM bytes (best-effort; tolerates older
shapes and dict/list values).

Verified against real captures from a busy node:
    amd-smi process --json:  0 PIDs (before)  -> 8 PIDs flagged foreign,
                                                 ~23.04 GB / 21.46 GiB each
    rocm-smi --showpids   : 10 PIDs, no HBM   -> 10 PIDs, 8 python3.11
                                                 foreign ~22.8 GiB each,
                                                 2 gpuagent allowed
amd-smi process, rocm-smi --showpids, and lsof /dev/kfd report PIDs in
the **root (host) PID namespace** -- KFD knows nothing about user
namespaces. os.getpid() returns the PID *as our own namespace sees it*.

On bare metal or SLURM + pyxis/enroot (shared host PID ns by default)
the two are equal and the naive `reported_pid == os.getpid()` test in
`_collect_gpu_processes` works. Inside Docker (default) or any k8s pod
the two differ -- causing our own training rank to be flagged
`is_foreign=True` and (with the default policy) failing the node.

Fix: new `_resolve_self_pid_view()` parses the NSpid line in
/proc/self/status to recover our root-namespace PID. The matcher in
`_collect_gpu_processes` now uses that host-side PID directly. The
pgid-match path is preserved on bare metal but skipped inside a
private PID namespace (os.getpgid on host PIDs we cannot see would
always ESRCH).
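
Sketch of the NSpid-based resolution as described above (stdlib only; the real helper also keeps the full chain for forensics):

```python
import os

def resolve_self_pid_view():
    """Return (host_pid, in_private_pid_ns) using the NSpid line of /proc/self/status."""
    host_pid = ns_pid = os.getpid()
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("NSpid:"):
                chain = [int(x) for x in line.split()[1:]]
                host_pid, ns_pid = chain[0], chain[-1]   # outermost view first, our own view last
                break
    return host_pid, host_pid != ns_pid
```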

Output JSON gains `self_host_pid`, `pid_namespaced`, and the full
`ns_pid_chain` for forensics across container boundaries.

Verified: bare metal, private PID ns w/ own rank + leak, private PID
ns w/ leak + allowed agent -- all classify correctly. The
private-PID-ns-w/-own-rank case was the bug (previously foreign=2,
now foreign=1).

Net effect: zero behavior change on SLURM + pyxis/enroot; own rank no
longer false-flagged on Docker / k8s.
The 4,487-line node_smoke.py is now a node_smoke/ package with one
module per responsibility, while preserving:
  - the `python -m primus.tools.preflight.node_smoke` entry point
  - CLI flags, help text, and exit-code semantics for run/_per_gpu/aggregate
  - JSON schema/keys and markdown report section order
  - runner/run_node_smoke_direct.sh wrapper behavior

Layout:
  types.py, logging_utils.py, shell_utils.py    leaf helpers
  per_gpu.py, rccl_local.py                     in-process workloads
  collectors/                                   per-area data gatherers
                                                (dmesg, fingerprint, nics,
                                                host_limits, gpu_low_level,
                                                gpu_processes, xgmi, clock,
                                                rocm_smi, tooling, reused_info)
  orchestrator.py                               spawn _per_gpu + status roll-up
  aggregator/summarizers.py                     row/summary helpers
  aggregator/report.py                          markdown writer, one helper per
                                                ## section (was ~700-line block)
  cli.py                                        argparse + run/_per_gpu/aggregate
  tests/test_node_smoke.py                      22 unit + parity tests

Tier 2 perf summary and "Failing nodes -- full reasons" keep their existing
error-handling exactly (no new try/except). Verified end-to-end with the
entrypoint matrix and a JSON/markdown diff against a pre-refactor baseline
(time-variant fields allow-listed).

Docs: docs/node-smoke.md updated with the new module layout, dependency
diagram, refreshed flag tables, and a design-overview entry in History.
…mi reports 'N/A'

Some amd-smi / rocm-smi builds emit `name="N/A"` (or "", "none", "-",
"unknown", ...) for kernel/system-owned PIDs like `gpuagent` because they
cannot read /proc/<pid>/comm themselves. The allowlist check is purely
name-based, so these placeholders never matched any whitelisted name and
every healthy node with a running gpuagent was incorrectly FAILed with
`gpu_processes: foreign process(es) holding GPU(s) ... name='N/A'`.

Fall back to /proc/<pid>/comm (then /proc/<pid>/status `Name:`) inside
`_annotate` whenever the upstream name is a known placeholder, then
re-evaluate `is_allowed` against the resolved name. Original placeholder
is preserved on the per-process record under `name_raw` plus a boolean
`name_resolved_from_proc` so the JSON keeps the audit trail.
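
Sketch of the name-resolution fallback (placeholder set and lookup order follow the description; simplified):

```python
PLACEHOLDERS = {"n/a", "", "none", "-", "unknown"}

def resolve_proc_name(pid: int, reported_name: str) -> str:
    """Fall back to /proc/<pid>/comm, then the Name: line of /proc/<pid>/status."""
    if reported_name and reported_name.strip().lower() not in PLACEHOLDERS:
        return reported_name
    try:
        return open(f"/proc/{pid}/comm").read().strip()
    except OSError:
        pass
    try:
        for line in open(f"/proc/{pid}/status"):
            if line.startswith("Name:"):
                return line.split(":", 1)[1].strip()
    except OSError:
        pass
    return reported_name
```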

Side benefit: the report's "Busy GPUs / leaked processes" table now shows
real names ('python', 'gpuagent', ...) instead of 'N/A', so operators can
finally see what to pkill on nodes with actual leaked ranks.

Co-authored-by: Cursor <cursoragent@cursor.com>
…art, recommend smoke-then-preflight workflow

- preflight.md: rewrite as the comprehensive reference for the now-configurable
  preflight tool. Cover info-only / perf-only / default mode precedence;
  --tests token list; --quick preset substitutions; per-knob tuning of message
  sizes, group sizes, ring-P2P sizes, and plotting; reliability knobs
  (--comm-cleanup-delay-sec, --dist-timeout-sec); validation behavior
  (fail-fast before NCCL init); reporting flags; backward-compat aliases;
  comparison with node-smoke; recommended pre-launch sequence.
- preflight-direct.md: turn into a quick-start guide. Add a top-of-file
  "Which test should I run?" comparison plus a 3-step smoke-then-preflight
  workflow snippet. Replace the monolithic example block with 10 labeled
  subsections (A-J) covering every configurable knob. Document the
  minimum-dependency install matrix per tool/mode (torch is the only hard
  requirement; markdown2/weasyprint only for PDFs; matplotlib only for
  --plot); flag the existing requirements.txt path as 'full Primus runtime'
  and not necessary for preflight/smoke alone. Add a one-line callout
  before the multi-node example to verify NCCL_IB_HCA / NCCL_IB_GID_INDEX /
  NCCL_SOCKET_IFNAME / GLOO_SOCKET_IFNAME before launching.
- node-smoke-test-instruction.md (new): short get-started guide for
  node-smoke, organized as quick-start + 10 example subsections per
  configurable knob + outputs + cheat sheet + troubleshooting. Links
  back to node-smoke.md for the full reference.
- node-smoke.md: refine the opening paragraph to drop the awkward
  "GPU vs node" framing (training jobs allocate whole nodes anyway,
  so node-granularity verdicts are the right unit). Add a callout
  pointing newcomers at node-smoke-test-instruction.md.
@amd-ama10002-2 (Collaborator) left a comment


LGTM overall — approving. Nice work on the smoke test, and the new unit tests are a great addition. Two follow-up requests below; happy for these to land in a separate PR if it's easier.

1. I think the new unit tests aren't actually wired into CI?

The test file lives at primus/tools/preflight/node_smoke/tests/test_node_smoke.py (in-source), but our CI only points pytest at ./tests/?

2. I'm not sure if the tests cover all of the new and updated features

I didn't manually test the new features, but going forward it would be good to make features easy to test so that we avoid introducing new bugs.

Also, a nit: I tend to prefer 1 new or updated feature == 1 PR, so that we have small, frequent PRs. That makes reviews easier and lets us merge features iteratively instead of in a large batch. In this case, I probably would've preferred multiple smaller PRs 🤷‍♂️

Just something to consider for the future🙂. We don't need to spend time breaking up this PR at all 🙂👍

Thanks again for the work!

…s / HBM-busy checks

Two related hardenings to the per-node smoke test so a sick node can no
longer slip through as clean:
- gpu_processes.py: previously, when `amd-smi process --json` returned
  rc=0 with valid JSON but an unknown / future schema, the parser
  returned [] and the caller still set ok=True, foreign_count=0 -- the
  rocm-smi / lsof fallbacks never ran. The empty-result case was
  indistinguishable from "schema matched, no processes" because
  `_flatten_amd_smi_process_json` only registered a per-GPU bucket when
  it pushed at least one process.
  Fix: register the per-GPU bucket as soon as a Shape A / A' entry is
  recognized (even with an empty `process_list`). Now [] unambiguously
  means "schema didn't match", and `_collect_gpu_processes` gates
  ok=True on a non-empty parsed result, falling through to text /
  rocm-smi / lsof on schema mismatch and recording a clear
  json_parse_error.
- per_gpu.py: tighten the pre-touch HBM-busy check from
  `used_b > hbm_busy_threshold_bytes` to `>=`, so a GPU sitting exactly
  at the threshold is treated as busy (likely leaked from a previous
  job) instead of squeaking through.
@amd-fuyuajin
Collaborator Author

@amd-ama10002-2 Thanks for your suggestions.

  1. Regarding CI: I did not wire the unit tests into CI because this is still a feature branch. Maybe we can add that when we merge the branch into main.
  2. Regarding smaller PRs: your suggestion is definitely better practice; I will follow it in the future.

@amd-fuyuajin
Collaborator Author

Here is an example of the node-smoke summary report. The output also includes per-node JSON files with detailed info, plus a passing-node list and a failing-node list that can be fed to --exclude in the follow-up srun command; those are not shown here.

Node-Local Smoke Test Report

  • Expected nodes: 16
  • Reported nodes: 16
  • PASS: 15 FAIL: 1
| Node | Hostname | Status | Duration | Top fail reason |
| --- | --- | --- | --- | --- |
| 0 | tus1-p3-g2 | PASS | 24.787s | |
| 1 | tus1-p3-g14 | PASS | 23.053s | |
| 2 | tus1-p3-g15 | PASS | 24.229s | |
| 3 | tus1-p3-g25 | FAIL | 23.038s | gpu0: FAIL: pre-touch HBM busy: 213.14 GiB already in use (threshold 2.0 GiB) -- likely leaked process from a previou... |
| 4 | tus1-p3-g26 | PASS | 28.08s | |
| 5 | tus1-p3-g27 | PASS | 23.741s | |
| 6 | tus1-p3-g29 | PASS | 25.238s | |
| 7 | tus1-p3-g32 | PASS | 23.82s | |
| 8 | tus1-p3-g50 | PASS | 23.867s | |
| 9 | tus1-p3-g51 | PASS | 23.707s | |
| 10 | tus1-p3-g52 | PASS | 23.531s | |
| 11 | tus1-p3-g53 | PASS | 24.463s | |
| 12 | tus1-p3-g54 | PASS | 24.733s | |
| 13 | tus1-p3-g55 | PASS | 24.305s | |
| 14 | tus1-p3-g57 | PASS | 24.552s | |
| 15 | tus1-p3-g59 | PASS | 24.727s | |

Stack drift across cluster

| Key | Majority (count/total) | Outlier nodes |
| --- | --- | --- |
| rocm | 6.4.2-120 (15/16) | tus1-p3-g57 = 7.2.0 |

NIC firmware drift across cluster

All NIC firmwares match (or no NICs reported).

NIC / RDMA roll-call issues

No NIC issues.

NIC port-count summary

Cluster-majority port count: 8 (seen on 16/16 nodes).

Every node reports the majority count.

Host limits issues

No host-limit issues.

GPU visibility issues

Every node resolved expected_gpus >= 1 and torch + amd-smi agree on the GPU count.

GPU low-level outliers (PCIe link / HBM)

All GPUs match the cluster majority on PCIe link and HBM total.

XGMI link issues

All GPU pairs report XGMI on every node (or amd-smi topology was unavailable).

Cluster clock + time daemons

  • Wall-clock spread across 16 nodes: 1.555s.
  • Earliest: tus1-p3-g25, latest: tus1-p3-g54.
  • (Spread is an upper bound on real clock skew -- it also includes srun launch jitter.)

Every node has at least one active time-sync daemon.

Tooling self-latency (rocm-smi --version)

No nodes exceeded the warn threshold (1.0s) and no timeouts.

Tooling availability

Every tracked tool (amd-smi, rocm-smi, lsof) was present in PATH on every node.

Busy GPUs / leaked processes

Foreign PIDs found holding GPUs at smoke start. The most common cause is leaked Python ranks from a previous training job (look for python / torchrun / train.py). Clean up with pkill -9 -f train.py (or similar) on the listed nodes BEFORE launching the next job.

Node Hostname GPU PID Process HBM held (GiB)
3 tus1-p3-g25 0 987322 python3 ?
3 tus1-p3-g25 0 987790 sglang::schedul ?
3 tus1-p3-g25 0 987791 sglang::schedul 211.64
3 tus1-p3-g25 0 987792 sglang::schedul ?
3 tus1-p3-g25 0 987793 sglang::schedul ?
3 tus1-p3-g25 0 987794 sglang::schedul ?
3 tus1-p3-g25 0 987795 sglang::schedul ?
3 tus1-p3-g25 0 987796 sglang::schedul ?
3 tus1-p3-g25 0 987797 sglang::schedul ?
3 tus1-p3-g25 0 987798 sglang::detoken ?
3 tus1-p3-g25 1 987322 python3 ?
3 tus1-p3-g25 1 987790 sglang::schedul ?
3 tus1-p3-g25 1 987791 sglang::schedul ?
3 tus1-p3-g25 1 987792 sglang::schedul ?
3 tus1-p3-g25 1 987793 sglang::schedul 212.48
3 tus1-p3-g25 1 987794 sglang::schedul ?
3 tus1-p3-g25 1 987795 sglang::schedul ?
3 tus1-p3-g25 1 987796 sglang::schedul ?
3 tus1-p3-g25 1 987797 sglang::schedul ?
3 tus1-p3-g25 1 987798 sglang::detoken ?
3 tus1-p3-g25 2 987322 python3 ?
3 tus1-p3-g25 2 987790 sglang::schedul ?
3 tus1-p3-g25 2 987791 sglang::schedul ?
3 tus1-p3-g25 2 987792 sglang::schedul 212.47
3 tus1-p3-g25 2 987793 sglang::schedul ?
3 tus1-p3-g25 2 987794 sglang::schedul ?
3 tus1-p3-g25 2 987795 sglang::schedul ?
3 tus1-p3-g25 2 987796 sglang::schedul ?
3 tus1-p3-g25 2 987797 sglang::schedul ?
3 tus1-p3-g25 2 987798 sglang::detoken ?
3 tus1-p3-g25 3 987322 python3 ?
3 tus1-p3-g25 3 987790 sglang::schedul 212.44
3 tus1-p3-g25 3 987791 sglang::schedul ?
3 tus1-p3-g25 3 987792 sglang::schedul ?
3 tus1-p3-g25 3 987793 sglang::schedul ?
3 tus1-p3-g25 3 987794 sglang::schedul ?
3 tus1-p3-g25 3 987795 sglang::schedul ?
3 tus1-p3-g25 3 987796 sglang::schedul ?
3 tus1-p3-g25 3 987797 sglang::schedul ?
3 tus1-p3-g25 3 987798 sglang::detoken ?
3 tus1-p3-g25 4 987322 python3 ?
3 tus1-p3-g25 4 987790 sglang::schedul ?
3 tus1-p3-g25 4 987791 sglang::schedul ?
3 tus1-p3-g25 4 987792 sglang::schedul ?
3 tus1-p3-g25 4 987793 sglang::schedul ?
3 tus1-p3-g25 4 987794 sglang::schedul ?
3 tus1-p3-g25 4 987795 sglang::schedul 212.35
3 tus1-p3-g25 4 987796 sglang::schedul ?
3 tus1-p3-g25 4 987797 sglang::schedul ?
3 tus1-p3-g25 4 987798 sglang::detoken ?
3 tus1-p3-g25 5 987322 python3 ?
3 tus1-p3-g25 5 987790 sglang::schedul ?
3 tus1-p3-g25 5 987791 sglang::schedul ?
3 tus1-p3-g25 5 987792 sglang::schedul ?
3 tus1-p3-g25 5 987793 sglang::schedul ?
3 tus1-p3-g25 5 987794 sglang::schedul ?
3 tus1-p3-g25 5 987795 sglang::schedul ?
3 tus1-p3-g25 5 987796 sglang::schedul ?
3 tus1-p3-g25 5 987797 sglang::schedul 212.2
3 tus1-p3-g25 5 987798 sglang::detoken ?
3 tus1-p3-g25 6 987322 python3 ?
3 tus1-p3-g25 6 987790 sglang::schedul ?
3 tus1-p3-g25 6 987791 sglang::schedul ?
3 tus1-p3-g25 6 987792 sglang::schedul ?
3 tus1-p3-g25 6 987793 sglang::schedul ?
3 tus1-p3-g25 6 987794 sglang::schedul ?
3 tus1-p3-g25 6 987795 sglang::schedul ?
3 tus1-p3-g25 6 987796 sglang::schedul 212.23
3 tus1-p3-g25 6 987797 sglang::schedul ?
3 tus1-p3-g25 6 987798 sglang::detoken ?
3 tus1-p3-g25 7 987322 python3 ?
3 tus1-p3-g25 7 987790 sglang::schedul ?
3 tus1-p3-g25 7 987791 sglang::schedul ?
3 tus1-p3-g25 7 987792 sglang::schedul ?
3 tus1-p3-g25 7 987793 sglang::schedul ?
3 tus1-p3-g25 7 987794 sglang::schedul 212.19
3 tus1-p3-g25 7 987795 sglang::schedul ?
3 tus1-p3-g25 7 987796 sglang::schedul ?
3 tus1-p3-g25 7 987797 sglang::schedul ?
3 tus1-p3-g25 7 987798 sglang::detoken ?

GPU pre-touch HBM usage outliers

GPUs with more than 2.0 GiB of HBM already in use BEFORE smoke touched the device. This number is not polluted by our own caching allocator (it's measured before any allocation), so it directly reflects foreign or leaked occupancy.

Node Hostname GPU HBM used pre-touch (GiB)
3 tus1-p3-g25 0 213.14
3 tus1-p3-g25 1 212.34
3 tus1-p3-g25 2 213.17
3 tus1-p3-g25 3 213.18
3 tus1-p3-g25 4 212.89
3 tus1-p3-g25 5 213.04
3 tus1-p3-g25 6 212.92
3 tus1-p3-g25 7 212.9

GPU compute-activity outliers

No GPU exceeded gfx_activity_pct >= 20.0% at smoke start (or amd-smi did not report activity).

Tier 2 perf summary

Per-node GEMM TFLOPS (8192^3 bf16) and HBM GB/s shown as min / median / max across the node's GPUs. RCCL GB/s is the node-local 8-GPU all-reduce algorithmic bandwidth at 64 MB.

| Node | Hostname | GEMM TFLOPS (min/med/max) | HBM GB/s (min/med/max) | Local RCCL GB/s |
| --- | --- | --- | --- | --- |
| 0 | tus1-p3-g2 | 756.5 / 760.0 / 762.1 | 4406.3 / 4421.6 / 4449.7 | 268.6 |
| 1 | tus1-p3-g14 | 748.5 / 762.7 / 764.7 | 4399.8 / 4437.2 / 4534.7 | 271.1 |
| 2 | tus1-p3-g15 | 750.4 / 757.9 / 764.4 | 4359.1 / 4417.6 / 4423.9 | 269.9 |
| 3 | tus1-p3-g25 | | | 269.8 |
| 4 | tus1-p3-g26 | 761.3 / 764.6 / 768.2 | 4388.6 / 4425.3 / 4435.8 | 269.2 |
| 5 | tus1-p3-g27 | 745.9 / 756.4 / 764.4 | 4409.3 / 4419.7 / 4429.6 | 269.2 |
| 6 | tus1-p3-g29 | 756.2 / 764.6 / 769.5 | 4395.3 / 4419.5 / 4442.0 | 268.8 |
| 7 | tus1-p3-g32 | 752.1 / 763.7 / 769.6 | 4402.3 / 4411.7 / 4430.1 | 270.0 |
| 8 | tus1-p3-g50 | 754.2 / 762.5 / 767.3 | 4359.9 / 4424.3 / 4477.9 | 268.8 |
| 9 | tus1-p3-g51 | 749.4 / 765.8 / 768.5 | 4404.1 / 4417.2 / 4445.4 | 268.8 |
| 10 | tus1-p3-g52 | 747.0 / 762.8 / 771.7 | 4401.7 / 4418.7 / 4453.4 | 270.3 |
| 11 | tus1-p3-g53 | 748.9 / 761.9 / 768.2 | 4399.0 / 4419.0 / 4446.0 | 267.6 |
| 12 | tus1-p3-g54 | 758.9 / 762.3 / 764.8 | 4403.9 / 4428.4 / 4444.6 | 270.4 |
| 13 | tus1-p3-g55 | 751.6 / 762.5 / 769.5 | 4408.9 / 4422.0 / 4432.9 | 269.9 |
| 14 | tus1-p3-g57 | 748.3 / 759.0 / 768.0 | 4399.3 / 4414.6 / 4427.6 | 270.1 |
| 15 | tus1-p3-g59 | 750.0 / 760.5 / 762.9 | 4407.6 / 4422.7 / 4431.4 | 269.8 |

Failing nodes -- full reasons

tus1-p3-g25

  • gpu0: FAIL: pre-touch HBM busy: 213.14 GiB already in use (threshold 2.0 GiB) -- likely leaked process from a previous job; see node-level gpu_processes section to identify the PID
  • gpu1: FAIL: pre-touch HBM busy: 212.34 GiB already in use (threshold 2.0 GiB) -- likely leaked process from a previous job; see node-level gpu_processes section to identify the PID
  • gpu2: FAIL: pre-touch HBM busy: 213.17 GiB already in use (threshold 2.0 GiB) -- likely leaked process from a previous job; see node-level gpu_processes section to identify the PID
  • gpu3: FAIL: pre-touch HBM busy: 213.18 GiB already in use (threshold 2.0 GiB) -- likely leaked process from a previous job; see node-level gpu_processes section to identify the PID
  • gpu4: FAIL: pre-touch HBM busy: 212.89 GiB already in use (threshold 2.0 GiB) -- likely leaked process from a previous job; see node-level gpu_processes section to identify the PID
  • gpu5: FAIL: pre-touch HBM busy: 213.04 GiB already in use (threshold 2.0 GiB) -- likely leaked process from a previous job; see node-level gpu_processes section to identify the PID
  • gpu6: FAIL: pre-touch HBM busy: 212.92 GiB already in use (threshold 2.0 GiB) -- likely leaked process from a previous job; see node-level gpu_processes section to identify the PID
  • gpu7: FAIL: pre-touch HBM busy: 212.9 GiB already in use (threshold 2.0 GiB) -- likely leaked process from a previous job; see node-level gpu_processes section to identify the PID
  • gpu_processes: 80 foreign process(es) holding GPU(s) (e.g. gpu0: pid=987322 name='python3'; gpu0: pid=987790 name='sglang::schedul'; gpu0: pid=987791 name='sglang::schedul' hbm=211.64GiB) -- likely leaked rank(s) from a previous job. Clean up with pkill -9 -f train.py (or similar) or pass --allow-foreign-procs.
