Skip to content

Add NCCL RAS monitoring for distributed training diagnostics#104

Open
asaiacai wants to merge 2 commits into
mainfrom
claude/nccl-ras-log-polling-08hRB
Open

Add NCCL RAS monitoring for distributed training diagnostics#104
asaiacai wants to merge 2 commits into
mainfrom
claude/nccl-ras-log-polling-08hRB

Conversation

@asaiacai
Copy link
Copy Markdown

@asaiacai asaiacai commented May 7, 2026

Description

This PR adds NCCL RAS (Reliability, Availability, and Serviceability) monitoring to capture diagnostic information from the NCCL RAS socket during distributed training runs.

Changes

  • New module pluto/nccl_ras.py: Implements NcclRasMonitor class that:

    • Polls the local NCCL RAS socket on rank 0 only (RAS gossips OOB across all ranks)
    • Runs in a background daemon thread with configurable polling interval
    • Enqueues RAS output through the existing console-log pipeline for centralized logging
    • Handles connection failures gracefully with exponential backoff logging
    • Supports configurable RAS address via NCCL_RAS_ADDR environment variable or settings
  • Modified pluto/op.py:

    • Instantiates NcclRasMonitor in Op.__init__
    • Starts RAS monitor in Op.start() alongside the system metrics monitor
    • Stops RAS monitor first in Op._teardown() to prevent enqueuing after sync store shutdown
  • Modified pluto/sets.py: Added three new settings:

    • x_nccl_ras_enabled: Enable/disable RAS monitoring (default: True)
    • x_nccl_ras_addr: RAS socket address (default: '127.0.0.1:28028')
    • x_nccl_ras_log_type: Log type label for RAS output (default: 'RAS')

Design Notes

  • RAS monitoring is only active on rank 0 to avoid duplicate logging across ranks
  • The monitor gracefully handles cases where NCCL RAS is disabled or not listening
  • Output is integrated into the existing console logging infrastructure via sync_manager.enqueue_console_batch()
  • Thread lifecycle is properly managed with timeout-based join during teardown

Tests

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Existing integration tests pass with RAS monitoring enabled/disabled

https://claude.ai/code/session_01QWDUivZ2pPDgNZTQ6DG9q8

claude added 2 commits May 7, 2026 01:03
NCCL RAS gossips OOB across all ranks, so a single query on rank 0 returns
the full job state. We poll the local RAS TCP socket on the same cadence
as system metrics and route the output through the existing console-log
sync path with logType=RAS so the UI can render it in a separate tab.
…t text

Probe for the ncclras binary at thread start and check its --help for `-f`
+ `json` to confirm JSON support (NCCL >= 2.28.7). When available, run
`ncclras -f json -v` per poll and ship the full JSON document as a single
console-log record. Otherwise fall back to the raw RAS socket with
`verbose status` over TCP.
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new NcclRasMonitor to poll NCCL RAS data from a local socket on rank 0 and integrate it into the console-log pipeline. The monitor is managed within the Op lifecycle and includes new configuration settings. Feedback was provided to improve the robustness of the rank-zero detection logic by handling potential conversion errors and consolidating environment variable checks.

Comment thread pluto/nccl_ras.py
Comment on lines +19 to +26
MODE_NCCLRAS_JSON = 'ncclras-json'
MODE_SOCKET_TEXT = 'socket-text'


def _is_rank_zero() -> bool:
"""Return True on the head process. Defaults to True when not distributed."""
for var in ('RANK', 'SLURM_PROCID'):
v = os.environ.get(var)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current method of checking for rank zero is not robust and can lead to an unhandled ValueError if an environment variable contains a string like '--1'. The lstrip('-').isdigit() check can pass, but int() will fail. Using a try-except block for the conversion is a safer and cleaner approach. Additionally, the logic for checking LOCAL_RANK can be merged into the main loop to avoid code duplication.

Suggested change
MODE_NCCLRAS_JSON = 'ncclras-json'
MODE_SOCKET_TEXT = 'socket-text'
def _is_rank_zero() -> bool:
"""Return True on the head process. Defaults to True when not distributed."""
for var in ('RANK', 'SLURM_PROCID'):
v = os.environ.get(var)
for var in ('RANK', 'SLURM_PROCID', 'LOCAL_RANK'):
v = os.environ.get(var)
if v is not None:
try:
return int(v) == 0
except ValueError:
pass # Not a valid integer, try next env var
return True

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants