Add NCCL RAS monitoring for distributed training diagnostics #104
asaiacai wants to merge 2 commits into
NCCL RAS gossips OOB across all ranks, so a single query on rank 0 returns the full job state. We poll the local RAS TCP socket on the same cadence as system metrics and route the output through the existing console-log sync path with logType=RAS so the UI can render it in a separate tab.
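A minimal sketch of a single poll over the raw RAS socket, for illustration only. The helper name `query_ras`, the timeout, and the read-until-close loop are assumptions rather than code from this PR; the host and port mirror the default address listed in the PR description, and `verbose status` is the command named in the second commit message. It assumes the RAS server closes the connection once the report has been sent.

```python
import socket


def query_ras(host: str = "127.0.0.1", port: int = 28028,
              command: str = "verbose status", timeout: float = 5.0) -> str:
    """Send one text command to the RAS socket and return whatever comes back.

    Hypothetical helper: reads until the peer closes the connection or the
    socket times out, then returns the collected text.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode() + b"\n")
        while True:
            try:
                data = sock.recv(65536)
            except socket.timeout:
                break  # no more data within the timeout: return what we have
            if not data:
                break  # peer closed the connection: the reply is complete
            chunks.append(data)
    return b"".join(chunks).decode(errors="replace")
```

Because RAS state is shared across ranks, one such call on the head process would be enough per poll; the returned text can then be handed to the console-log path as a single record.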
…t text
Probe for the ncclras binary at thread start and check its `--help` output for `-f` and `json` to confirm JSON support (NCCL >= 2.28.7). When JSON is available, run `ncclras -f json -v` on each poll and ship the full JSON document as a single console-log record. Otherwise, fall back to the raw RAS socket and send `verbose status` over TCP.
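The probe described above could look roughly like the following. This is a sketch under assumptions, not the PR's implementation: the function names `help_advertises_json` and `detect_mode` are hypothetical, the 5-second timeout is arbitrary, and the check is a plain substring match on the help text. The two mode constants are taken from the diff under review.

```python
import shutil
import subprocess

MODE_NCCLRAS_JSON = "ncclras-json"
MODE_SOCKET_TEXT = "socket-text"


def help_advertises_json(help_text: str) -> bool:
    """True when the help text mentions both the -f flag and json."""
    return "-f" in help_text and "json" in help_text.lower()


def detect_mode(binary: str = "ncclras") -> str:
    """Pick the JSON mode only if the binary exists and its help mentions it."""
    path = shutil.which(binary)
    if path is None:
        return MODE_SOCKET_TEXT  # no binary on PATH: use the raw socket
    try:
        proc = subprocess.run([path, "--help"], capture_output=True,
                              text=True, timeout=5)
    except (OSError, subprocess.TimeoutExpired):
        return MODE_SOCKET_TEXT  # probe failed: stay on the safe fallback
    # Some tools print help to stderr, so inspect both streams.
    if help_advertises_json(proc.stdout + proc.stderr):
        return MODE_NCCLRAS_JSON
    return MODE_SOCKET_TEXT
```

Running the probe once at thread start keeps the per-poll path cheap, and any probe failure degrades to the socket mode instead of disabling monitoring.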
Code Review
This pull request introduces a new NcclRasMonitor to poll NCCL RAS data from a local socket on rank 0 and integrate it into the console-log pipeline. The monitor is managed within the Op lifecycle and includes new configuration settings. Feedback was provided to improve the robustness of the rank-zero detection logic by handling potential conversion errors and consolidating environment variable checks.
MODE_NCCLRAS_JSON = 'ncclras-json'
MODE_SOCKET_TEXT = 'socket-text'


def _is_rank_zero() -> bool:
    """Return True on the head process. Defaults to True when not distributed."""
    for var in ('RANK', 'SLURM_PROCID'):
        v = os.environ.get(var)
The current method of checking for rank zero is not robust and can lead to an unhandled ValueError if an environment variable contains a string like '--1'. The lstrip('-').isdigit() check can pass, but int() will fail. Using a try-except block for the conversion is a safer and cleaner approach. Additionally, the logic for checking LOCAL_RANK can be merged into the main loop to avoid code duplication.
MODE_NCCLRAS_JSON = 'ncclras-json'
MODE_SOCKET_TEXT = 'socket-text'


def _is_rank_zero() -> bool:
    """Return True on the head process. Defaults to True when not distributed."""
    for var in ('RANK', 'SLURM_PROCID', 'LOCAL_RANK'):
        v = os.environ.get(var)
        if v is not None:
            try:
                return int(v) == 0
            except ValueError:
                pass  # Not a valid integer, try next env var
    return True
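To make the failure mode concrete, here is a self-contained variant of the suggested function. The only deviations from the suggestion are assumptions made for testability: it is named `is_rank_zero` and accepts an optional `env` mapping so it can be exercised without touching `os.environ`.

```python
import os
from typing import Mapping, Optional


def is_rank_zero(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True on the head process; default to True when not distributed."""
    env = os.environ if env is None else env
    for var in ("RANK", "SLURM_PROCID", "LOCAL_RANK"):
        value = env.get(var)
        if value is not None:
            try:
                return int(value) == 0
            except ValueError:
                continue  # not a valid integer, try the next variable
    return True


# The pitfall from the review: the string check passes, the conversion fails.
looks_numeric = "--1".lstrip("-").isdigit()
try:
    int("--1")
    conversion_ok = True
except ValueError:
    conversion_ok = False
```

With this shape, a malformed `RANK` no longer raises; the loop moves on to `SLURM_PROCID` and `LOCAL_RANK`, and only falls back to `True` when no variable holds a usable integer.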
Description
This PR adds NCCL RAS (Reliability, Availability, and Serviceability) monitoring to capture diagnostic information from the NCCL RAS socket during distributed training runs.
Changes
New module
pluto/nccl_ras.py: Implements NcclRasMonitor class that:
- … NCCL_RAS_ADDR environment variable or settings

Modified
pluto/op.py:
- … NcclRasMonitor in Op.__init__
- … Op.start() alongside the system metrics monitor
- … Op._teardown() to prevent enqueuing after sync store shutdown

Modified
pluto/sets.py: Added three new settings:
- x_nccl_ras_enabled: Enable/disable RAS monitoring (default: True)
- x_nccl_ras_addr: RAS socket address (default: '127.0.0.1:28028')
- x_nccl_ras_log_type: Log type label for RAS output (default: 'RAS')

Design Notes
- … sync_manager.enqueue_console_batch()

Tests
Tested (run the relevant ones):
- bash format.sh

https://claude.ai/code/session_01QWDUivZ2pPDgNZTQ6DG9q8
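As an illustration of how the three settings in pluto/sets.py and the `NCCL_RAS_ADDR` variable might fit together, the sketch below models them as a dataclass and resolves a `host:port` address. The class name `RasSettings`, the function `resolve_addr`, and the choice to let the environment variable take precedence over the setting are all assumptions; the PR description does not state the precedence.

```python
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass
class RasSettings:
    """Hypothetical container using the setting names and defaults from the PR."""
    x_nccl_ras_enabled: bool = True
    x_nccl_ras_addr: str = "127.0.0.1:28028"
    x_nccl_ras_log_type: str = "RAS"


def resolve_addr(settings: RasSettings, env: Mapping[str, str]) -> Tuple[str, int]:
    """Return (host, port), preferring NCCL_RAS_ADDR (assumed precedence)."""
    raw = env.get("NCCL_RAS_ADDR") or settings.x_nccl_ras_addr
    # Split on the last colon so bracketed IPv6 hosts such as [::1]:28028 work.
    host, _, port = raw.rpartition(":")
    host = host.strip("[]")
    if not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {raw!r}")
    return host, int(port)
```

Validating the address once, before the monitor thread starts, would surface a malformed value as a clear error instead of a connection failure on every poll.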