Add Slurm preflight bisection helper #720

Open
amd-ama10002-2 wants to merge 14 commits into dev/preflight-direct-test from feat/andrewma/binary_search_nccl_hang_issues-v2
@amd-ama10002-2 amd-ama10002-2 commented May 9, 2026

Summary

This PR adds a Slurm-based preflight bisection helper script for isolating suspect nodes when multi-node preflight perf tests fail or hang. The new workflow runs preflight on recursively smaller node subsets, records per-trial logs, and writes a final summary of suspected faulty nodes.

It also adds two pytest integration tests that validate the bisection by running a real preflight check inside a Slurm allocation:

  1. Test that the bisection reports no suspects on a set of healthy nodes in a Slurm allocation.
  2. Test that the bisection correctly identifies an unhealthy node (by temporarily simulating a "bad node" during execution).

In collaboration with: @akasharidas

Motivation

Reduce the manual effort of debugging NCCL/RCCL hangs and node-specific failures that surface when running our preflight checks.

Technical Details

  • Added tools/preflight_bisect/bisect.py, which expands a Slurm nodelist, runs runner/run_preflight_direct.sh --perf-test on subsets via srun, and recursively bisects failing or hanging subsets until suspect singleton nodes are identified.
  • Added tools/preflight_bisect/fake_runner.sh, a drop-in test runner that simulates one bad node on a live Slurm allocation without requiring an actually broken host.
  • Added tests/unit_tests/tools/test_preflight_bisect_slurm.py for opt-in live-Slurm integration coverage. The tests can derive BISECT_NODELIST from SLURM_NODELIST, omit --partition when no partition is known, and default the fake bad node to the last resolved hostname.
  • Updated docs/preflight-direct.md with the automated node-bisection workflow and usage example.
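
The recursive bisection described above can be sketched roughly as follows. This is a minimal illustration, not the code in tools/preflight_bisect/bisect.py: the function names (`bisect_nodes`, `run_trial_with_srun`, `make_fake_trial`), the 600-second timeout, and the exact `srun` arguments are assumptions made for the example. A trial that times out is treated as a failure, which is how a hang would surface in this sketch.

```python
import subprocess
from typing import Callable, List, Sequence


def run_trial_with_srun(subset: Sequence[str],
                        runner: str = "runner/run_preflight_direct.sh",
                        timeout_s: int = 600) -> bool:
    """Hypothetical trial: run the preflight perf test on `subset` via srun.

    The srun arguments and the timeout are assumptions for illustration;
    the real script may invoke the runner differently.
    """
    cmd = [
        "srun",
        f"--nodes={len(subset)}",
        f"--nodelist={','.join(subset)}",
        runner,
        "--perf-test",
    ]
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode == 0
    except subprocess.TimeoutExpired:
        # A hang is treated the same as a failing trial.
        return False


def bisect_nodes(nodes: Sequence[str],
                 trial_passes: Callable[[Sequence[str]], bool]) -> List[str]:
    """Return the singleton nodes whose trials fail, found by halving."""
    nodes = list(nodes)
    if not nodes or trial_passes(nodes):
        return []          # nothing to blame in a passing subset
    if len(nodes) == 1:
        return nodes       # a failing singleton is a suspect
    mid = len(nodes) // 2
    # Both halves are examined, so several bad nodes can be reported.
    # Note: a fault that only appears when specific nodes are combined
    # would pass in both halves and yield no suspects in this sketch.
    return (bisect_nodes(nodes[:mid], trial_passes)
            + bisect_nodes(nodes[mid:], trial_passes))


def make_fake_trial(bad_node: str) -> Callable[[Sequence[str]], bool]:
    """Simulated trial: any subset containing `bad_node` fails."""
    return lambda subset: bad_node not in subset


if __name__ == "__main__":
    hosts = ["node01", "node02", "node03", "node04"]
    print(bisect_nodes(hosts, make_fake_trial("node04")))  # ['node04']
```

Passing the trial as a callable keeps the recursion testable without a cluster, which mirrors the role fake_runner.sh plays for the real script.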

Test Plan

  • Run the pytest integration test file on a real Slurm cluster:
(preflight) andrew.ma@amd.com@tus1-vm-amd-prj2-k8s-010:~/Primus$ python3 -m pytest \
  tests/unit_tests/tools/test_preflight_bisect_slurm.py \
  -v \
  --basetemp=/tmp/preflight-bisect-pytest
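
Because the tests are opt-in and resolve their inputs from the environment, the resolution logic might look something like the sketch below. It is an assumption-laden illustration rather than the contents of test_preflight_bisect_slurm.py: the helper names and the `BISECT_BAD_NODE` variable are hypothetical, while `BISECT_NODELIST`, `BISECT_PARTITION`, `SLURM_NODELIST`, and `--partition` are taken from this PR's description. Hostname expansion uses `scontrol show hostnames`, a standard Slurm command.

```python
import subprocess
from typing import List, Mapping, Optional


def resolve_nodelist(env: Mapping[str, str]) -> Optional[str]:
    """Prefer an explicit BISECT_NODELIST, else fall back to SLURM_NODELIST."""
    return env.get("BISECT_NODELIST") or env.get("SLURM_NODELIST") or None


def expand_hostnames(nodelist: str) -> List[str]:
    """Expand a compact Slurm nodelist (e.g. 'n[1-4]') into hostnames."""
    result = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.split()


def resolve_bad_node(env: Mapping[str, str], hostnames: List[str]) -> str:
    """Use an override if set (hypothetical variable), else the last host."""
    return env.get("BISECT_BAD_NODE") or hostnames[-1]


def partition_args(env: Mapping[str, str]) -> List[str]:
    """Omit --partition entirely when no partition is known."""
    partition = env.get("BISECT_PARTITION")
    return ["--partition", partition] if partition else []
```

A test module could call `resolve_nodelist(os.environ)` and skip itself when the result is `None`, so the suite stays inert outside a Slurm allocation.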

Test Result

  • Both tests pass and the results are as expected.
  • The healthy-node test takes 3 min on 4 nodes and can take longer on larger node sets.

[Screenshot of passing test: Screenshot 2026-05-08 190640]

The file output matches expectations: one test in which all nodes pass, and one test that identifies the last node as the bad node.

[Screenshot 2026-05-08 190609]

yeandy and others added 14 commits May 8, 2026 14:50
…oting steps and clarify GPU allocation details

Implement parallel execution for bisect trials in bisect.py and update documentation. The script now launches sibling subsets concurrently by default, enhancing performance. Added support for a `--max-concurrent-trials` argument to revert to sequential execution if needed. Updated README and run-preflight-without-container.md to reflect these changes and clarify usage.
…isites and an example template for node bisection. Remove outdated README for preflight_bisect as its content is now integrated into the documentation. Update example commands to enhance clarity and usability.
…oduce detailed instructions in preflight-direct.md, including prerequisites and example usage. Update run-preflight-without-container.md to reference the new bisection tool and remove outdated content. Modify bisect.py to align with the new wrapper and clarify environment variable handling.
…structions and has been superseded by updated documentation. This cleanup aligns with recent enhancements and clarifications made in the preflight documentation.
…ated instructions and has been superseded by updated documentation. This cleanup aligns with recent enhancements and clarifications made in the preflight documentation.
…or bisect validation

Add fake_runner.sh as a drop-in bisect.py --runner replacement that
simulates a bad node via BAD_NODE env var without running real preflight.
Add two opt-in integration tests (test_preflight_bisect_slurm.py) that
run bisect.py end-to-end against a live Slurm cluster: one with the real
runner to confirm a clean nodeset reports no suspects, and one with the
fake runner to confirm a seeded bad node is correctly identified.
Tests are skipped unless BISECT_NODELIST/BISECT_PARTITION/etc. are set.
Register the slurm pytest mark in conftest.py.
…t.py. The test file was deleted as part of a cleanup effort, and the Slurm marker registration was also removed to streamline the test configuration.
…le handling and documentation. Introduce helper functions for resolving nodelist, partition, and bad node, allowing for more flexible test execution. Update test requirements to clarify optional overrides and streamline usage instructions.
…nce documentation and streamline usage. Revise environment variable instructions, clarify defaults, and provide clearer examples for running bisect tests in Slurm. Ensure consistency in the handling of nodelists and output directories across documentation.
@amd-ama10002-2 amd-ama10002-2 marked this pull request as ready for review May 9, 2026 00:54
@amd-ama10002-2 amd-ama10002-2 self-assigned this May 9, 2026
@amd-ama10002-2 amd-ama10002-2 added the enhancement New feature or request label May 9, 2026
3 participants