Add Slurm preflight bisection helper #720
Open
amd-ama10002-2 wants to merge 14 commits into dev/preflight-direct-test
…rated format for runner compatibility
…oting steps and clarify GPU allocation details Implement parallel execution for bisect trials in bisect.py and update documentation. The script now launches sibling subsets concurrently by default, enhancing performance. Added support for a `--max-concurrent-trials` argument to revert to sequential execution if needed. Updated README and run-preflight-without-container.md to reflect these changes and clarify usage.
…isites and an example template for node bisection. Remove outdated README for preflight_bisect as its content is now integrated into the documentation. Update example commands to enhance clarity and usability.
…oduce detailed instructions in preflight-direct.md, including prerequisites and example usage. Update run-preflight-without-container.md to reference the new bisection tool and remove outdated content. Modify bisect.py to align with the new wrapper and clarify environment variable handling.
…structions and has been superseded by updated documentation. This cleanup aligns with recent enhancements and clarifications made in the preflight documentation.
…ated instructions and has been superseded by updated documentation. This cleanup aligns with recent enhancements and clarifications made in the preflight documentation.
…or bisect validation Add fake_runner.sh as a drop-in bisect.py --runner replacement that simulates a bad node via BAD_NODE env var without running real preflight. Add two opt-in integration tests (test_preflight_bisect_slurm.py) that run bisect.py end-to-end against a live Slurm cluster: one with the real runner to confirm a clean nodeset reports no suspects, and one with the fake runner to confirm a seeded bad node is correctly identified. Tests are skipped unless BISECT_NODELIST/BISECT_PARTITION/etc. are set. Register the slurm pytest mark in conftest.py.
…t.py. The test file was deleted as part of a cleanup effort, and the Slurm marker registration was also removed to streamline the test configuration.
…le handling and documentation. Introduce helper functions for resolving nodelist, partition, and bad node, allowing for more flexible test execution. Update test requirements to clarify optional overrides and streamline usage instructions.
…nce documentation and streamline usage. Revise environment variable instructions, clarify defaults, and provide clearer examples for running bisect tests in Slurm. Ensure consistency in the handling of nodelists and output directories across documentation.
Summary
This PR adds a Slurm-based preflight bisection helper script for isolating suspect nodes when multi-node preflight perf tests fail or hang. The new workflow runs preflight on recursively smaller node subsets, records per-trial logs, and writes a final summary of suspected faulty nodes.
It also adds two pytest tests that validate the bisection using a real preflight check in a Slurm allocation.
In collaboration with @akasharidas.
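The recursive narrowing described in the summary can be sketched as follows. This is a minimal illustration, not the code in `bisect.py`: the function name `bisect_nodes` and the `trial_passes` callback are hypothetical, and the callback stands in for launching preflight on a subset via `srun` and treating a timeout as a failure. The demo seeds one bad node, in the spirit of the fake runner.

```python
from typing import Callable, List, Sequence


def bisect_nodes(
    nodes: Sequence[str],
    trial_passes: Callable[[Sequence[str]], bool],
) -> List[str]:
    """Return suspect singleton nodes by recursively halving failing subsets.

    `trial_passes` is a hypothetical stand-in for one preflight trial:
    it returns True when the subset passes, False when it fails or hangs.
    """
    if not nodes or trial_passes(nodes):
        return []  # nothing to blame in this subset
    if len(nodes) == 1:
        return [nodes[0]]  # a failing singleton is a suspect
    mid = len(nodes) // 2
    # Recurse into both halves; in this sketch they run one after the other.
    suspects = bisect_nodes(nodes[:mid], trial_passes)
    suspects += bisect_nodes(nodes[mid:], trial_passes)
    # If both halves pass alone, the failure needs nodes from both halves;
    # this sketch reports no suspects in that case.
    return suspects


# Simulated trial: a subset fails whenever it contains the seeded bad node.
nodes = [f"node{i:02d}" for i in range(1, 9)]
bad_nodes = {"node08"}


def fake_trial(subset: Sequence[str]) -> bool:
    return not (bad_nodes & set(subset))


print(bisect_nodes(nodes, fake_trial))  # ['node08']
```

With one bad node among N, each level of recursion re-tests both halves of the failing subset, so the number of trials grows roughly with log2(N) rather than with N.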
Motivation
This reduces the manual effort of debugging NCCL/RCCL hangs or node-specific failures when using our preflight checks.
Technical Details
- `tools/preflight_bisect/bisect.py`, which expands a Slurm nodelist, runs `runner/run_preflight_direct.sh --perf-test` on subsets via `srun`, and recursively bisects failing or hanging subsets until suspect singleton nodes are identified.
- `tools/preflight_bisect/fake_runner.sh`, a drop-in test runner that simulates one bad node on a live Slurm allocation without requiring an actually broken host.
- `tests/unit_tests/tools/test_preflight_bisect_slurm.py` for opt-in live-Slurm integration coverage. The tests can derive `BISECT_NODELIST` from `SLURM_NODELIST`, omit `--partition` when no partition is known, and default the fake bad node to the last resolved hostname.
- `docs/preflight-direct.md` with the automated node-bisection workflow and usage example.
Test Plan
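The defaults the integration tests rely on (nodelist taken from `SLURM_NODELIST`, no `--partition` when none is known, bad node defaulting to the last hostname) might be resolved along these lines. This is a sketch under assumptions: `expand_nodelist` and `resolve_settings` are hypothetical names, reading the bad-node override from `BAD_NODE` is assumed, and the pure-Python expander only handles simple `prefix[a-b,c]` forms so that the example is self-contained. A real helper would more likely call `scontrol show hostnames`.

```python
import re
from typing import Dict, List, Mapping, Optional


def expand_nodelist(nodelist: str) -> List[str]:
    """Expand a simple Slurm nodelist such as 'node[01-03,07],gpu5'."""
    hosts: List[str] = []
    # Split on commas that are not inside a [...] group.
    for part in re.split(r",(?![^\[]*\])", nodelist):
        match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", part)
        if match is None:
            hosts.append(part)  # plain hostname, nothing to expand
            continue
        prefix, body, suffix = match.groups()
        for item in body.split(","):
            if "-" in item:
                low, high = item.split("-", 1)
                width = len(low)  # keep zero padding, e.g. 01..03
                for number in range(int(low), int(high) + 1):
                    hosts.append(f"{prefix}{number:0{width}d}{suffix}")
            else:
                hosts.append(f"{prefix}{item}{suffix}")
    return hosts


def resolve_settings(env: Mapping[str, str]) -> Optional[Dict[str, object]]:
    """Resolve test inputs; None means no nodelist is known (skip the test)."""
    nodelist = env.get("BISECT_NODELIST") or env.get("SLURM_NODELIST")
    if not nodelist:
        return None
    hosts = expand_nodelist(nodelist)
    partition = env.get("BISECT_PARTITION")
    return {
        "hosts": hosts,
        # Omit --partition entirely when no partition is known.
        "partition_args": ["--partition", partition] if partition else [],
        # Assumed override variable; defaults to the last resolved hostname.
        "bad_node": env.get("BAD_NODE") or hosts[-1],
    }


print(resolve_settings({"SLURM_NODELIST": "node[01-03,07]"}))
```

Passing the environment in as a mapping, rather than reading `os.environ` directly, keeps the resolution logic checkable without a live allocation.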
Test Result
Screenshot of passing test
The file output matches the expected output: one test in which all nodes pass, and one test that identifies the last node as the bad node.