Add Slurm preflight bisection helper by amd-ama10002-2 · Pull Request #720 · AMD-AGI/Primus

amd-ama10002-2 · 2026-05-09T00:51:29Z

Summary

This PR adds a Slurm-based preflight bisection helper script for isolating suspect nodes when multi-node preflight perf tests fail or hang. The new workflow runs preflight on recursively smaller node subsets, records per-trial logs, and writes a final summary of suspected faulty nodes.

It also adds two pytest integration tests that validate the bisection against a real preflight check inside a Slurm allocation:

Test that the bisection passes on a set of healthy nodes in a Slurm allocation.
Test that the bisection correctly identifies an unhealthy node (by temporarily creating a "bad node" during execution).

In collaboration with @akasharidas.

Motivation

This reduces the manual effort of debugging NCCL/RCCL hangs or node-specific failures surfaced by our preflight checks.

Technical Details

Added tools/preflight_bisect/bisect.py, which expands a Slurm nodelist, runs runner/run_preflight_direct.sh --perf-test on subsets via srun, and recursively bisects failing or hanging subsets until suspect singleton nodes are identified.
Added tools/preflight_bisect/fake_runner.sh, a drop-in test runner that simulates one bad node on a live Slurm allocation without requiring an actually broken host.
Added tests/unit_tests/tools/test_preflight_bisect_slurm.py for opt-in live-Slurm integration coverage. The tests can derive BISECT_NODELIST from SLURM_NODELIST, omit --partition when no partition is known, and default the fake bad node to the last resolved hostname.
Updated docs/preflight-direct.md with the automated node-bisection workflow and usage example.
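The recursive bisection described above can be sketched as follows. This is a minimal illustration, not the code in tools/preflight_bisect/bisect.py: the function name `bisect_nodes` and the `trial_passes` callback are hypothetical, and the callback stands in for launching the preflight runner on a subset via srun and reporting whether it passed (a hang would count as a failure).

```python
from typing import Callable, List, Sequence


def bisect_nodes(
    nodes: Sequence[str],
    trial_passes: Callable[[Sequence[str]], bool],
) -> List[str]:
    """Return suspect singleton nodes by recursively halving failing subsets.

    `trial_passes` is a hypothetical stand-in for one preflight trial:
    it returns True when the subset passes, False on failure or hang.
    """
    nodes = list(nodes)
    if not nodes or trial_passes(nodes):
        return []  # empty or healthy subset: nothing to report
    if len(nodes) == 1:
        return nodes  # a failing singleton is a suspect
    mid = len(nodes) // 2
    # Recurse into both halves; more than one bad node may be present.
    # If both halves pass on their own, the result is empty, which means
    # the failure only shows up at the larger scale and is inconclusive.
    return bisect_nodes(nodes[:mid], trial_passes) + bisect_nodes(
        nodes[mid:], trial_passes
    )


# Simulated cluster in which node04 always fails its trial.
demo_nodes = ["node01", "node02", "node03", "node04"]
print(bisect_nodes(demo_nodes, lambda subset: "node04" not in subset))
# ['node04']
```

Injecting the trial as a callback keeps the search logic testable without a Slurm allocation, which is the same idea the fake runner serves for the live tests.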

Test Plan

Run the pytest integration test file in a real slurm cluster:

(preflight) andrew.ma@amd.com@tus1-vm-amd-prj2-k8s-010:~/Primus$ python3 -m pytest \
  tests/unit_tests/tools/test_preflight_bisect_slurm.py \
  -v \
  --basetemp=/tmp/preflight-bisect-pytest

Test Result

Both tests pass and the results are as expected.
The healthy-nodes test takes about 3 minutes on 4 nodes and can take longer on larger node sets.

Screenshot of passing test

The output files match expectations: one test in which all nodes pass, and one test that identifies the last node as the bad node.

…oting steps and clarify GPU allocation details

Implement parallel execution for bisect trials in bisect.py and update documentation. The script now launches sibling subsets concurrently by default, enhancing performance. Added support for a `--max-concurrent-trials` argument to revert to sequential execution if needed. Updated README and run-preflight-without-container.md to reflect these changes and clarify usage.
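As a rough sketch of what launching sibling subsets concurrently with a cap might look like, the example below uses a thread pool. Only the `--max-concurrent-trials` flag name comes from the commit message; the function `run_sibling_trials`, its parameters, and the choice of threads are assumptions for illustration. Threads are a reasonable fit if each trial mostly waits on an external srun process.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def run_sibling_trials(
    subsets: Sequence[Sequence[str]],
    trial_passes: Callable[[Sequence[str]], bool],
    max_concurrent_trials: int = 2,
) -> List[bool]:
    """Run one trial per subset, at most `max_concurrent_trials` at a time.

    Results are returned in the same order as `subsets`.
    A cap of 1 degenerates to sequential execution.
    """
    workers = max(1, min(max_concurrent_trials, len(subsets)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map preserves input order regardless of completion order.
        return list(pool.map(trial_passes, subsets))
```

Sibling subsets are disjoint by construction, so their trials do not compete for the same nodes; the cap exists to limit load on the scheduler or to fall back to one trial at a time.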

…oduce detailed instructions in preflight-direct.md, including prerequisites and example usage. Update run-preflight-without-container.md to reference the new bisection tool and remove outdated content. Modify bisect.py to align with the new wrapper and clarify environment variable handling.

…or bisect validation

Add fake_runner.sh as a drop-in bisect.py --runner replacement that simulates a bad node via BAD_NODE env var without running real preflight. Add two opt-in integration tests (test_preflight_bisect_slurm.py) that run bisect.py end-to-end against a live Slurm cluster: one with the real runner to confirm a clean nodeset reports no suspects, and one with the fake runner to confirm a seeded bad node is correctly identified. Tests are skipped unless BISECT_NODELIST/BISECT_PARTITION/etc. are set. Register the slurm pytest mark in conftest.py.
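The decision the fake runner has to make on each node can be illustrated in Python, although fake_runner.sh itself is a shell script whose contents are not shown here. The helper names below are hypothetical, and comparing short hostnames (dropping any domain suffix) is an assumption about how a node would be matched against BAD_NODE.

```python
from typing import Optional


def short_hostname(name: str) -> str:
    """Drop any domain suffix so 'node04.example' matches 'node04'."""
    return name.split(".", 1)[0]


def fake_preflight_exit_code(hostname: str, bad_node: Optional[str]) -> int:
    """Return 1 on the designated bad node, 0 everywhere else.

    No real preflight work is done; the exit code alone drives bisection.
    """
    if bad_node and short_hostname(hostname) == short_hostname(bad_node):
        return 1
    return 0
```

A per-node entry point could pass `socket.gethostname()` and `os.environ.get("BAD_NODE")` to this function and exit with the result, so that any subset containing the bad node fails its trial.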

…le handling and documentation. Introduce helper functions for resolving nodelist, partition, and bad node, allowing for more flexible test execution. Update test requirements to clarify optional overrides and streamline usage instructions.
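The resolution helpers mentioned here might look roughly like the sketch below. The environment variable names BISECT_NODELIST, BISECT_PARTITION, SLURM_NODELIST and BAD_NODE appear in this PR; the function names, the use of BAD_NODE as the test-side override, and the exact srun argument layout are assumptions. Hostname expansion is taken as already done (for example with `scontrol show hostnames`).

```python
from typing import List, Mapping, Optional, Sequence


def resolve_nodelist(env: Mapping[str, str]) -> Optional[str]:
    """Prefer an explicit BISECT_NODELIST, else fall back to SLURM_NODELIST."""
    return env.get("BISECT_NODELIST") or env.get("SLURM_NODELIST") or None


def resolve_partition(env: Mapping[str, str]) -> Optional[str]:
    """Return the partition if one is known, otherwise None."""
    return env.get("BISECT_PARTITION") or None


def resolve_bad_node(
    env: Mapping[str, str], hostnames: Sequence[str]
) -> Optional[str]:
    """Use an explicit override if set, else the last resolved hostname."""
    return env.get("BAD_NODE") or (hostnames[-1] if hostnames else None)


def srun_prefix(nodelist: str, partition: Optional[str]) -> List[str]:
    """Build the leading srun arguments, omitting --partition when unknown."""
    args = ["srun", f"--nodelist={nodelist}"]
    if partition:
        args.append(f"--partition={partition}")
    return args
```

Taking the environment as a mapping argument rather than reading `os.environ` directly makes each helper easy to exercise in a unit test; a caller inside an allocation would pass `os.environ`.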

…nce documentation and streamline usage. Revise environment variable instructions, clarify defaults, and provide clearer examples for running bisect tests in Slurm. Ensure consistency in the handling of nodelists and output directories across documentation.

yeandy and others added 14 commits May 8, 2026 14:50

Merge main

4b55e8c

Add automated node bisection instructions to run-preflight-without-co…

abc8407

…ntainer.md

Update README and bisect.py to clarify GPU allocation in srun command

8a2d023

Refactor environment variable handling in bisect.py to use space-sepa…

44a82ea

…rated format for runner compatibility

Revise run-preflight-without-container.md to include detailed prerequ…

84a9c94

…isites and an example template for node bisection. Remove outdated README for preflight_bisect as its content is now integrated into the documentation. Update example commands to enhance clarity and usability.

this is a test commit

528f5b6

Remove run-preflight-without-container.md as it contained outdated in…

f9b5c2a

…structions and has been superseded by updated documentation. This cleanup aligns with recent enhancements and clarifications made in the preflight documentation.

fixup! Remove run-preflight-without-container.md as it contained outd…

0f2038d

…ated instructions and has been superseded by updated documentation. This cleanup aligns with recent enhancements and clarifications made in the preflight documentation.

Remove test_preflight_bisect.py and related Slurm marker from conftes…

c1724e4

…t.py. The test file was deleted as part of a cleanup effort, and the Slurm marker registration was also removed to streamline the test configuration.

amd-ama10002-2 marked this pull request as ready for review May 9, 2026 00:54

amd-ama10002-2 requested review from Xiaoming-AMD, limou102 and wenxie-amd as code owners May 9, 2026 00:54

amd-ama10002-2 self-assigned this May 9, 2026

amd-ama10002-2 added the enhancement New feature or request label May 9, 2026

amd-ama10002-2 wants to merge 14 commits into dev/preflight-direct-test from feat/andrewma/binary_search_nccl_hang_issues-v2

3 participants
