Use get_or_init to create Profiler only on first init, reuse on subsequent cycles. Always respawn daemon when INIT_FLAG=0 (daemon stops on finalize).
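A minimal sketch of that init/finalize cycle, assuming the profiler lives in a `std::sync::OnceLock` and that `INIT_FLAG` is an atomic flag cleared on finalize. The `Profiler` fields, `profiler_init`, `profiler_finalize`, and the spawn counter are hypothetical stand-ins, not the names used in this PR:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::OnceLock;

// Hypothetical stand-in for the real profiler state.
struct Profiler {
    // Counts daemon spawns so the respawn behaviour is observable.
    daemon_spawns: AtomicUsize,
}

static PROFILER: OnceLock<Profiler> = OnceLock::new();
// Assumed meaning: false (0) = no daemon running, true (1) = daemon running.
static INIT_FLAG: AtomicBool = AtomicBool::new(false);

fn profiler_init() -> &'static Profiler {
    // The closure runs only on the very first call; later cycles reuse the value.
    let profiler = PROFILER.get_or_init(|| Profiler {
        daemon_spawns: AtomicUsize::new(0),
    });
    // swap returns the previous value: respawn only if the flag was 0.
    if !INIT_FLAG.swap(true, Ordering::AcqRel) {
        // Stand-in for spawning the daemon thread.
        profiler.daemon_spawns.fetch_add(1, Ordering::Relaxed);
    }
    profiler
}

fn profiler_finalize() {
    // The daemon stops on finalize, so the next init must respawn it.
    INIT_FLAG.store(false, Ordering::Release);
}

fn main() {
    let first = profiler_init();
    profiler_finalize();
    let second = profiler_init();
    // Same Profiler instance across cycles, but the daemon was spawned twice.
    assert!(std::ptr::eq(first, second));
    assert_eq!(second.daemon_spawns.load(Ordering::Relaxed), 2);
}
```

Keeping the `Profiler` in a `OnceLock` avoids rebuilding state on every cycle, while the separate flag lets the daemon lifetime follow init/finalize independently.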
Implement phase-aware telemetry to track NCCL operations by training phase (FORWARD/BACKWARD/OPTIMIZER). Uses reference counting for correct attribution, replacing unreliable PhaseFlush control messages.

Architecture:
- PhaseScope tracking with atomic reference counting
- TelemetryPool: shared memory FIFO for Python client access
- Phase API: ncclProfilerBeginPhase/EndPhase with phase IDs
- NcclOp-based accounting: capture phase on creation, account on completion

Core components:
- src/phase_scope.rs: Phase tracking with reference counting
- src/phase_api.rs: C FFI for phase control
- src/telemetry_pool.rs: Shared memory export mechanism
- src/telemetry_ffi.rs: C FFI for telemetry access

Changes since v12:
- Fixes OPTIMIZER phases showing 0 operations
- Fixes first phase duration race condition
- Production verified with 8-rank MoE training (11.2GB model, 10 steps)
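The "capture phase on creation, account on completion" idea can be sketched as follows. This is an illustrative model only: the struct layouts, field names, and methods are assumptions, and the real `src/phase_scope.rs` may differ. Each operation holds a reference to the scope that was current when it was created, so a late completion is still attributed to the right phase:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Phase {
    Forward,
    Backward,
    Optimizer,
}

// Hypothetical scope: one per Begin/End pair, with atomic counters.
struct PhaseScope {
    phase: Phase,
    in_flight: AtomicU64, // operations created but not yet completed
    completed: AtomicU64, // operations attributed to this phase
}

impl PhaseScope {
    fn new(phase: Phase) -> Arc<Self> {
        Arc::new(PhaseScope {
            phase,
            in_flight: AtomicU64::new(0),
            completed: AtomicU64::new(0),
        })
    }
}

// Hypothetical operation record that pins its originating scope.
struct NcclOp {
    scope: Arc<PhaseScope>,
}

impl NcclOp {
    // Capture the phase at creation time.
    fn new(scope: &Arc<PhaseScope>) -> Self {
        scope.in_flight.fetch_add(1, Ordering::AcqRel);
        NcclOp { scope: Arc::clone(scope) }
    }

    // Account against the captured scope, not whichever phase is current now.
    fn complete(self) {
        self.scope.completed.fetch_add(1, Ordering::AcqRel);
        self.scope.in_flight.fetch_sub(1, Ordering::AcqRel);
    }
}

fn main() {
    let forward = PhaseScope::new(Phase::Forward);
    let op = NcclOp::new(&forward);

    // The training loop moves on before the operation finishes.
    let backward = PhaseScope::new(Phase::Backward);
    let optimizer = PhaseScope::new(Phase::Optimizer);

    op.complete();
    for scope in [&forward, &backward, &optimizer] {
        println!("{:?}: {}", scope.phase, scope.completed.load(Ordering::Acquire));
    }
}
```

Because the scope is reference counted, it stays alive until its last in-flight operation completes, which removes the need for a separate flush message to close out a phase.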
1. Increase CTRL_FIFO_SZ from 256 to 4096
- Supports more concurrent thread registrations
- 256 was too small for large-scale deployments
(e.g., 64 GPUs x 4 process groups x 3 threads = 768)
2. Handle ctrl_fifo overflow gracefully
- Previously: .unwrap() panics if queue full
- Now: log error and continue (thread loses telemetry)
- Training continues instead of crashing
3. Improve PROFILER.get() error message
- Previously: generic unwrap panic
- Now: clear message explaining the race condition
- Helps debugging if NCCL calls handlers before init completes
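The overflow handling in item 2 can be sketched with a bounded standard-library channel standing in for ctrl_fifo. The `Registration` type and `register_thread` function are hypothetical, and the actual queue in this PR may be a different data structure; the point is only the switch from a panicking `.unwrap()` to a logged, non-fatal failure:

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

const CTRL_FIFO_SZ: usize = 4096;

// Hypothetical control message for a thread registering with the profiler.
struct Registration {
    thread_id: u64,
}

// Returns true if the registration was queued. On failure the error is
// logged and the caller carries on without telemetry for that thread.
fn register_thread(tx: &SyncSender<Registration>, thread_id: u64) -> bool {
    match tx.try_send(Registration { thread_id }) {
        Ok(()) => true,
        Err(TrySendError::Full(msg)) => {
            eprintln!(
                "ctrl_fifo full ({} slots): thread {} loses telemetry",
                CTRL_FIFO_SZ, msg.thread_id
            );
            false
        }
        Err(TrySendError::Disconnected(msg)) => {
            eprintln!("ctrl_fifo closed: thread {} loses telemetry", msg.thread_id);
            false
        }
    }
}

fn main() {
    // Keep the receiver alive so the channel stays connected.
    let (tx, _rx) = sync_channel::<Registration>(CTRL_FIFO_SZ);

    // The sizing example from item 1: 64 GPUs x 4 process groups x 3 threads.
    let needed: u64 = 64 * 4 * 3;
    let queued = (0..needed).filter(|id| register_thread(&tx, *id)).count();
    println!("queued {} of {} registrations", queued, needed);
}
```

With the old size of 256, the same 768 registrations would overflow the queue unless the consumer drained it fast enough, which is the situation where the previous `.unwrap()` would have panicked.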
92f7186 to fb13bc4
Works in tandem with https://github.com/poolsideai/nccl-telemetry-py