[WIP] Enable concurrent Multi-LoRA training and GPU time-sharing for RLix control plane #378
Open
taoluo wants to merge 108 commits into alibaba:main
Conversation
- Make Ray namespace configurable via ROLL_RAY_NAMESPACE
- Keep SharedStorage job-global in GLOBAL_STORAGE_NAMESPACE
- Add library mode (SCHEDRL_LIBRARY_MODE) to avoid ray stop/shutdown
- Make master rendezvous key pipeline-scoped and port claims atomic
- Use SharedStorage-backed free ports for SGLang
- Emit SchedRL progress from GroupQueueManager.put() (train only)
- Add atomic shrink/expand sequencing with locks
- Allow shrink-to-zero and handle active_dp_ranks==empty
- Preserve canonical request metadata across generate postprocess
- Remove colocated gating for vLLM/strategy offload
Add a small SchedRL adapter module used by driver scripts to register/admit pipelines and launch the concurrent pipeline under a per-pipeline runtime_env.
Use ast.literal_eval instead of eval() when parsing structured config values.
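The swap above is worth illustrating: `ast.literal_eval` only accepts Python literal syntax, so it parses structured values without ever executing code. A minimal sketch (the function name `parse_config_value` is illustrative, not the PR's actual helper):

```python
import ast

def parse_config_value(raw: str):
    """Parse a structured config value (list/dict/number/bool) safely.

    ast.literal_eval accepts only Python literals, so arbitrary
    expressions such as __import__('os').system(...) are rejected,
    unlike eval().
    """
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a literal: keep the raw string (e.g. a plain path).
        return raw
```

Non-literal strings fall through unchanged rather than raising, which matches how config values that are plain identifiers or paths are typically handled.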
- Add resize_infer() to adapter replacing shrink_workers/expand_workers
- Implement pipeline-scoped actor naming with PIPELINE_ID prefix
- Add per-pipeline namespace isolation via ROLL_RAY_NAMESPACE env var
- Update GlobalCounter, GlobalLimiter, model_update_locker with pipeline_id
- Add SCHEDRL_CONTROL_PLANE check to prevent ray.shutdown() in library mode
- Implement abort+retry semantics for shrink with proper ACK handling
- Add topology validation hooks in worker and strategy initialization
- Add multi-pipeline test examples (start_multi_pipeline_test.py)

Refs: 2026-02-05-ENG-123-roll-multipipeline-extraction.md
…r GPU mgmt

Adapter changes:
- Remove _require_ray() pattern, use top-level import ray
- Remove PipelineRegistration dataclass and cluster_tp_configs/device_mappings
- Simplify API: merge ensure_coordinator + start_pipeline into create_coordinator
- resize_infer now returns ActionResponse instead of dict
- Add max_concurrency=1000 to coordinator for concurrent RPCs

ConcurrentPipeline changes:
- Rename _SchedRLAgenticPipeline to SchedRLConcurrentPipeline
- Remove wrapper class pattern, methods now on main class
- Add GPU request/release for static clusters (critic, reference, actor_train)
- Reference model offload during shrink/expand phases

RolloutScheduler changes:
- Improve error messages for scheduler resolution failures
- Fix _estimate_total_required to use rollout_batch_size * num_return_sequences

Example changes:
- Remove _inject_system_envs (moved to adapter)
- Simplify adapter creation flow
- Add schedrl_env_vars() helper to constants.py for consistent env propagation
- Pass runtime_env with schedrl env vars to named actors (ExceptionMonitor, GlobalCounter, etc.)
- Use ROLL_RAY_NAMESPACE from env instead of hardcoded RAY_NAMESPACE in generate_scheduler
- Ensures PIPELINE_ID and control-plane vars are visible in all worker processes
- Use Ray 'GPU' resource key when num_gpus_per_node > 0 for proper bundle scheduling
- Fix placement group bundle creation to use ray_device_key consistently
- Add CUDA_VISIBLE_DEVICES and RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES to CpuPlatform
- Pass runtime_env with schedrl env vars to GlobalLimiter actor
- Cache bucket as raw bytes + metadata instead of pickled tensors
- Avoid torch multiprocessing reduction issues with vLLM v1 workers
- Promote active checkpoint after building bucket cache for next expand/broadcast
- Pass runtime_env with schedrl env vars to model_update_locker actor
…tibility
- Add PYTHONPATH to pipeline env vars for Ray worker imports
- Refactor concurrent_pipeline with explicit initialize_pipeline() for lazy init
- Add build_latest_bucket_cache to Worker wrapper
- Set thread limits in cluster env vars to avoid OS limits
- Reduce max_concurrency from 1000 to 32 to prevent thread pool exhaustion
- Patch vLLM v1 _dummy_run to fix numpy.int64 tensor indexing
- Support bucket_bytes format in vLLM worker update_parameter_in_bucket
- Reduce num_gpus_per_node and TP sizes for single-node smoke tests
- Add VLLM_USE_V1=1 to vLLM strategy configs
- Improve start_multi_pipeline_test.py repo/ROLL root detection
- Add runtime_env propagation and local Ray init for smoke tests
- Relax Ray version pin; use flash-attn wheel URL for build stability
Adapt ROLL to the refactored SchedRL scheduler that removes completion-driven suspension.

Key changes:
- Remove planned_release_gpu_ids from notify_ready_to_release calls
- Add release_and_request_static_cluster for atomic train->critic GPU handoff
- Add RollResourceManagerProxy singleton for shared placement groups across pipelines
- Add state verification after shrink/expand operations to catch desync bugs
- Fix shutdown to use timeout and handle hanging tasks gracefully
- Schedule coordinator in node-0 PG bundle with num_gpus=0.01 for CUDA visibility
- Add get_active_dp_ranks() for post-shrink state verification

Passes smoke test: python external/ROLL_schedrl/examples/multi_pipeline/start_multi_pipeline_test.py --config_name pipeline1_sokoban_grpo
- init_collective_group and create_collective_group now accept timeout_s so NCCL ProcessGroup init can be bounded (avoids indefinite hangs). - get_group_by_name and destroy_collective_group now raise KeyError instead of silently logging a warning and returning; callers that skip missing groups must handle the exception explicitly. - Add enter/exit logging to init_collective_group for diagnostics. - Add teardown_collective_groups() to InferenceStrategy: batch-destroys named groups and removes their comm-plan bookkeeping entries in one call. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
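The contract change above (raise `KeyError` instead of warning and returning) is the kind of thing a small sketch makes concrete. The class below is a minimal stand-in, not the actual ROLL implementation; it only records `timeout_s` where the real code would thread it into NCCL ProcessGroup initialization:

```python
class CollectiveGroupRegistry:
    """Minimal sketch of the lookup contract described above:
    missing groups raise KeyError instead of being silently skipped,
    so callers that tolerate missing groups must catch it explicitly."""

    def __init__(self):
        self._groups = {}

    def create_collective_group(self, name, ranks, timeout_s=600):
        # In the real code timeout_s bounds NCCL ProcessGroup init so a
        # hung peer fails fast; here we only record it.
        self._groups[name] = {"ranks": ranks, "timeout_s": timeout_s}

    def get_group_by_name(self, name):
        if name not in self._groups:
            raise KeyError(f"collective group {name!r} does not exist")
        return self._groups[name]

    def destroy_collective_group(self, name):
        if name not in self._groups:
            raise KeyError(f"collective group {name!r} does not exist")
        del self._groups[name]

    def teardown_collective_groups(self):
        # Batch-destroy every named group in one call, mirroring the
        # InferenceStrategy helper added in this commit.
        for name in list(self._groups):
            del self._groups[name]
```

Callers that previously relied on the silent-skip behavior now wrap lookups in `try/except KeyError`.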
… fix

ModelUpdateService (model_update_service.py):
- Rewrote sync_selected_workers to select one sender per PP rank (dp_rank==0, tp_rank==0, cp_rank==0) via _select_sender_ranks_by_pp().
- _build_comm_plan_for_sender() allocates a dedicated NCCL group per PP rank, excluding colocated targets (same physical GPU) from the NCCL group so they use the IPC path instead.
- Pre-setup collective groups on both sender and receivers before issuing sync calls; per-group timeouts via ROLL_SELECTIVE_MODEL_UPDATE_PG_TIMEOUT_S.
- Removed finally-block teardown: groups are now destroyed inside selective_sync_active_cache (sender side) BEFORE dist.barrier() to prevent ncclCommDestroy from blocking after the barrier.

megatron_strategy.py (selective_sync_active_cache):
- Extended signature: model_update_name, comm_plan, is_leader params.
- Replaced self-driven group setup with comm_plan-based setup passed from ModelUpdateService; is_leader flag identifies broadcast sender.
- Teardown happens BEFORE dist.barrier() inside the sender block to avoid ncclCommDestroy blocking on already-synchronized processes.
- Added extensive [schedrl][selective_sync] logging throughout.

worker.py:
- Added _maybe_await() for calling async strategy methods from sync contexts.
- Added destroy_collective_group() and teardown_collective_groups() wrappers.
- selective_sync_active_cache() now forwards comm_plan, model_update_name, is_leader to the strategy layer.
- process_weights_after_loading() made async with awaitable dispatch.

vllm_strategy.py / async_llm_engine.py / third_party/vllm/worker.py:
- setup_collective_group() supports both comm_plan (dynamic) and legacy (master_address/port) call styles.
- Added teardown_collective_groups() to VllmStrategy.
- Added await to all collective_rpc() calls in CustomAsyncLLMEngine.
- Added enter/exit logging and timeout_s threading for vllm worker collective.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…async actors
Cluster.__init__ now accepts resolve_topology=True (default).
When False, rank2devices and worker2nodes are set to {} without issuing any
ray.get() calls, and master addr/port resolution is also skipped.
This is required for RolloutScheduler, which is an async Ray actor: calling
blocking ray.get() inside an async actor's __init__ triggers Ray's
"Using blocking ray.get inside async actor" warning and can stall the event
loop during startup. The env Cluster created by RolloutScheduler does not need
topology info, so resolve_topology=False is safe there.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
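The pattern in this commit, keeping blocking lookups out of an async actor's constructor, can be sketched without Ray. The classes below are a simplified model of the described `resolve_topology` flag; the stand-in `_resolve_topology_blocking` replaces what would be `ray.get()` calls on worker handles:

```python
import asyncio

class Cluster:
    """Sketch of the resolve_topology flag: when False, no blocking
    lookups run in __init__ and topology maps stay empty."""

    def __init__(self, resolve_topology: bool = True):
        self.rank2devices = {}
        self.worker2nodes = {}
        if resolve_topology:
            self._resolve_topology_blocking()

    def _resolve_topology_blocking(self):
        # Stand-in for the ray.get() calls the real Cluster issues.
        self.rank2devices = {0: ["GPU:0"]}
        self.worker2nodes = {0: "node-0"}

class RolloutScheduler:
    """Async-actor-style owner of an env Cluster: its __init__ must
    never block, or the actor's event loop stalls during startup."""

    def __init__(self):
        self.env_cluster = Cluster(resolve_topology=False)

    async def get_topology(self):
        # If topology were ever needed, resolve it here (off __init__),
        # where awaiting is legal and the loop is not blocked.
        await asyncio.sleep(0)
        return self.env_cluster.rank2devices
```

With `resolve_topology=False` the constructor is pure bookkeeping, which is exactly why Ray's "blocking ray.get inside async actor" warning disappears.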
…nd order

generate_scheduler.py:
- In SchedRL mode (SCHEDRL_CONTROL_PLANE=schedrl), reordered expand steps: sync_selected_workers → process_weights_after_loading → load_states_partial. Previously load_states_partial ran before sync, which held KV allocations during weight sync and caused transient OOM.
- Non-SchedRL path unchanged: load_states_partial only.
- Added per-request dispatch logging (request_id, dp_rank, global_step).
- Added slow-request warning (>= 30 s) for generate_one_request.

resource_manager.py (RollResourceManagerProxy):
- allocate_placement_group() is now computed locally from cached PG state without issuing a remote ray.get(). The previous implementation blocked the async actor's event loop during RolloutScheduler construction.

rollout_scheduler.py:
- env Cluster created with resolve_topology=False (no blocking calls in ctor).
- es_manager.initialize() submitted as non-blocking refs (_es_initialize_refs); awaited lazily in get_batch() on first call (_es_initialized flag).
- get_batch() now has a configurable timeout (ROLL_ROLLOUT_GET_BATCH_TIMEOUT_S, default 1800 s) to fail fast instead of hanging indefinitely.
- get_active_dp_ranks() made async (await instead of ray.get()).
- Added detailed INFO logging at each construction and rollout phase.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… residual OOM
Root cause: megatron_strategy.save_checkpoint() calls load_states() internally
to read model weights before saving, but never calls offload_states() afterward.
When _release_static_cluster() ran immediately after do_checkpoint(), the
scheduler saw the GPU as idle while 4+ GiB of model params remained physically
loaded. Peer pipelines then requested the same GPUs and hit vLLM KV-cache OOM.
Fix:
- Introduce should_checkpoint (mirrors existing save_steps/max_steps condition)
and defer_actor_train_release_for_checkpoint.
- For ALL checkpoint steps: set defer=True so do_checkpoint() runs first, then
call offload_states() to flush the weights load_states() left behind.
- GPU release logic:
- Intermediate checkpoint steps: offload only; keep GPU allocated so the next
step's Phase 4 can do an atomic release_and_request.
- Last step: offload + _release_static_cluster (no next Phase 4 will run).
Additional improvements in run():
- Phase 0: if previous step's notify_ready was missed (e.g. validation path),
send it at the start of the next step (last_notify_ready_step guard).
- Phase 3: model_update() removed from the pipeline loop; selective model update
is now entirely scheduler-driven via resize_infer/expand.
- Phase 4: for step>0, use atomic _release_and_request_static_cluster to hand
off actor_train GPUs to actor_infer in a single scheduler round-trip, avoiding
a window where both clusters compete for the same physical GPUs.
- Renamed _request_and_expand_actor_infer → _request_actor_infer_gpus.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
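The release rules in this commit message reduce to a small decision table. The function below is an illustrative sketch of that table only (the name `post_step_gpu_actions` and the tuple shape are invented for this example, not the PR's API):

```python
def post_step_gpu_actions(should_checkpoint: bool, is_last_step: bool):
    """Illustrative sketch of the deferred-release rules above.

    Returns (defer_for_checkpoint, offload_after_checkpoint, release_now):
    - on checkpoint steps, do_checkpoint() runs first (defer=True), then
      offload_states() flushes the weights its internal load_states()
      left resident on the GPU;
    - intermediate checkpoint steps keep the GPU allocated so the next
      step's Phase 4 can perform an atomic release_and_request;
    - the last step releases directly, since no further Phase 4 runs.
    """
    if should_checkpoint:
        defer, offload = True, True
    else:
        defer, offload = False, False
    # Only the final step releases here; every other step leaves the
    # release to the next step's atomic handoff.
    release = is_last_step
    return defer, offload, release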
…nfig updates

adapter.py:
- resize_infer changed from async to sync (ray.get instead of asyncio.wrap_future). SchedRL calls this from a sync context.

environment_worker.py:
- Use asyncio.get_running_loop() instead of deprecated get_event_loop() (called inside an async def where a running loop always exists).
- Guard ThreadPoolExecutor against max_workers=0 when env_managers is empty.
- pool.shutdown(wait=False): threads already finished when gather() returns.

policy_proxy.py / base_worker.py / traj_env_manager.py:
- Added per-request INFO/WARNING logging (schedrl_request_id, src_rank, global_step, elapsed_s) for end-to-end request tracing across proxy, infer worker, request scheduler, and env manager.

examples/multi_pipeline/ (pipeline1/2 sokoban YAML):
- Output and checkpoint dirs moved to /tmp/roll_output/ to avoid writing to the repo workspace.
- Added ROLL_SELECTIVE_MODEL_UPDATE_PG_TIMEOUT_S and ROLL_ROLLOUT_GET_BATCH_TIMEOUT_S env vars.
- offload_nccl: true to free NCCL handles between uses.
- gpu_memory_utilization lowered to 0.65 (was 0.7) to account for Megatron residual memory during multi-pipeline overlap.
- pipeline2: reduced batch/env sizes (rollout_batch_size=4, sequence_length=2048) for lighter smoke-test runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove per-step notify_ready_to_release call and instead perform a final cleanup release at the end of the pipeline. This aligns with ROLL_multi_pipeline pattern and allows the scheduler to control the entire step lifecycle.
Port routing utilities from ROLL_multi_lora with ROLL_schedrl adaptation. Supports both `domain` (ROLL_schedrl) and `lora_name` (ROLL_multi_lora) conventions for adapter routing.
Add `adapters: Dict[str, LoraArguments]` to ModelArguments to support per-adapter LoRA configurations in multi-LoRA training scenarios.
Implement lora_optimizer_mode ('shared' | 'per_adapter') in MegatronTrainStrategy:
- shared mode: single optimizer for all adapters (existing behavior)
- per_adapter mode: dedicated optimizer + scheduler per adapter
New methods:
- zero_grad(), forward_backward_only(), optimizer_step_only()
- train_step_lora() with adapter routing via domain/lora_name
- get_lora_tensors(), set_lora_tensors(), copy_lora_params() for weight mgmt
Modified load_states/offload_states for per_adapter mode compatibility.
Ensures adapter isolation: N mixed-domain steps == N single-domain steps.
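The shared-versus-per-adapter distinction is easy to demonstrate with stub optimizers. The sketch below models only optimizer identity, not Megatron's real optimizer or scheduler objects (`LoraOptimizerBank` and `_StubOptimizer` are invented names for this example):

```python
class _StubOptimizer:
    """Stand-in for a real optimizer; tracks step count only."""

    def __init__(self):
        self.step_count = 0

    def step(self):
        self.step_count += 1

class LoraOptimizerBank:
    """Sketch of lora_optimizer_mode: 'shared' routes every adapter to
    one optimizer (existing behavior); 'per_adapter' gives each adapter
    its own, so stepping one adapter never advances another's state."""

    def __init__(self, adapters, mode="shared"):
        assert mode in ("shared", "per_adapter")
        self.mode = mode
        if mode == "shared":
            shared = _StubOptimizer()
            self._opts = {name: shared for name in adapters}
        else:
            self._opts = {name: _StubOptimizer() for name in adapters}

    def optimizer_step_only(self, adapter):
        self._opts[adapter].step()

    def steps(self, adapter):
        return self._opts[adapter].step_count
```

Per-adapter isolation is precisely what makes the equivalence claim testable: N mixed-domain steps must leave each adapter's optimizer in the same state as N single-domain runs.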
Expose per-adapter multi-LoRA functionality in SFTWorker:
- train_step_lora(): multi-LoRA training step with adapter routing
- get_lora_tensors()/set_lora_tensors(): read/write adapter weights
- copy_lora_params(): in-place parameter copy between adapters

All methods use ONE_TO_ALL dispatch for TP/DP consistency.
Verify correctness claim: N mixed-domain microbatches in one call produce identical parameter updates to N separate single-domain calls.

Tests:
- Multi-adapter gradient accumulation and optimizer stepping
- Weight equivalence after independent training steps
- Adapter isolation via domain-based routing
…eep levels
- vllm_strategy/sglang_strategy: restore is_actor_infer_colocated or DO_TIME_SHARING guard in offload_states so dedicated inference workers are not incorrectly slept
- vllm/worker: add TensorLoraManager.register(); defer _lora_names update until after vLLM confirms add_lora success, so phantom ids are never tracked
- vllm/worker: extend LoRA cache eviction to sleep_level=1 (was level=2 only); LoRA tensors use default CuMem tag so level=1 discards them too
- fsdp2/model_update: pass explicit adapter_name="default" to add_lora.remote() to match the new required parameter signature
Always use CUDA IPC for model-update serialization. Containers must
provide --ipc=host or --cap-add SYS_PTRACE; blocked IPC now fails fast.
- Remove _cuda_ipc_available global and _probe_cuda_ipc() from send_recv_utils.py
- Remove bucket_bytes deserialization branch from worker.update_parameter_in_bucket
- serialize_named_weights always emits {"bucket": tensor, "tensors_meta": ...}
- worker.py deserializes single format; calls named_tensors_from_bucket with explicit kwargs
…ns, and clean up megatron strategy
- Extract _translate_offload_include helper to remove duplicated OffloadStateType mapping
- Use is_lora_optimizer_isolated instead of has_multi_adapter for load/offload conditions
- Add monkey_patch_torch_reductions before IPC serialize for UUID-aware device pickling
- Rename _safe_dist_barrier param to subgroup, add docstring
- Remove dead code in _separated_model_update (unused co_infer_rank)
- Rename test to test_isolated_single_lora_step_equivalence
- Clean up agentic_multi_lora_pipeline, worker, and strategy files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nd add multi-LoRA fail-fast guards
- Branch set_adapter on is_mca: Megatron activates all adapters for grad buffer allocation, non-Megatron activates only the first adapter to match upstream single-adapter semantics
- Add fail-fast guards in DeepSpeed collect_lora_params and FSDP2 WeightUpdater.__init__ before any silent single-adapter export
- Fix is_lora derivation in DS/FSDP2 setup_model_update to use adapters (not lora_target) so explicit multi-LoRA configs are recognized
- Replace defensive getattr with direct LoraArguments attribute access
- Add docstring noting pure target-module resolution vs upstream mutation
- Extend regex detection chars in _resolve_lora_target_modules and _normalize_adapters to match full set

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e_lora_name helper Extract the identical ~13-line LoRA adapter routing block (repeated 10 times across 5 env_manager files) into a single _resolve_lora_name() method on TrajEnvManager. The helper accepts an explicit tag parameter to support the placeholder rollout path which uses env_config tag instead of rollout_cache tag. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
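A deduplicated routing helper of this shape can be sketched as follows. The exact resolution heuristic below (exact-match first, then a domain prefix split on `:`) is an assumption for illustration; only the function's role, mapping a tag to an adapter name under both the `domain` and `lora_name` conventions, comes from the PR:

```python
def resolve_lora_name(tag, adapters):
    """Illustrative sketch of the extracted routing helper: map an env
    tag to a registered adapter name. `adapters` stands in for the
    per-adapter config mapping; the fallback rule is hypothetical."""
    if not adapters:
        return None  # single-adapter / non-LoRA run: no routing needed
    if tag in adapters:
        return tag
    # Assumed fallback: strip a sub-tag suffix, e.g. "sokoban:easy"
    # routes to the "sokoban" adapter.
    domain = tag.split(":", 1)[0]
    if domain in adapters:
        return domain
    raise KeyError(f"no LoRA adapter registered for tag {tag!r}")
```

Taking the tag as an explicit parameter is what lets the placeholder rollout path pass the env_config tag while the normal path passes the rollout_cache tag.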
- Fix nested metrics bug: _shrink_workers/_expand_workers now discard val metrics instead of nesting them as val_result dict
- Move target_gpus_to_dp_ranks helpers from standalone functions to BasePipeline methods, removing duplicate params
- Move _broadcast_non_tensor_batch responsibility to callers (pipelines) with fail-fast guards in workers
- Add PPO epoch loop to train_step_lora matching train_step behavior
- Use actual backward step count instead of static formula
- Remove debug timing/logging from InferWorker and PolicyProxy
- Simplify environment_worker async event loop handling
- Remove rlix-specific init offloading and rlix_env_vars() usage
- Fix inverted load_states_partial condition in InferWorker

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…l sync, and pipeline cleanup
- Forward *args/**kwargs in Worker.offload_states/load_states (root cause: include= was silently dropped)
- Remove dead include=OffloadStateType.other_params from all pipeline callers (intent was always full offload)
- Multi-LoRA: sync all adapters after full offload (not just dirty ones), add per-LoRA trackers, per-LoRA validation, checkpoint resume, val rollout schedulers, batch_balance, and GAE guard
- Simplify partial_gpu_mode config validation and shrink/expand into reusable helpers
- Add ensure_min_traj_per_env to AgenticConfig, fix base_pipeline checkpoint is_last_step
- Add create_lora_tracker factory to tracking.py for per-adapter W&B/TB/Swanlab runs
- Update example yaml configs with wandb tracker settings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…lora Currently model_update suspends ALL rollout schedulers and syncs ALL adapters. Added TODO to track the improvement: only abort the just-trained adapter's in-flight requests and sync its weights alone.
… crash
Bug 1: process_weights_after_loading deadlock
- InferWorker inherited sync process_weights_after_loading from base Worker,
which used _maybe_await → spawned background thread with asyncio.run() →
vLLM's ZMQ IPC client bound to main event loop couldn't work from that thread
- Fix (base_worker.py): Added async override of process_weights_after_loading
in InferWorker that awaits the strategy coroutine directly on the actor loop
Bug 2: destroy_collective_group deadlock (same root cause as Bug 1)
- InferWorker did not override destroy_collective_group, falling through to
sync Worker._maybe_await path which deadlocks collective_rpc_async
- Fix (base_worker.py): Added async override of destroy_collective_group in
InferWorker matching the pattern of all other async method overrides
Bug 3: Multi-LoRA broadcast OOM
- GPU1 (broadcast-only worker) OOM during add_lora — each adapter called
load_states() (weights + KV cache), consuming too much memory
- Fix: Added wake_after_add flag — non-final adapters call reload_model()
(weights only), final adapter calls load_states() (weights + KV cache)
- model_update.py: pass wake_after_add to colocated and broadcast add_lora
- worker.py: custom_add_lora conditionally wakes based on wake_after_add
- vllm_strategy.py: add_lora accepts/passes wake_after_add, removed
post-registration visibility RPCs that caused reentrancy stalls
Bug 4: load_states ordering deadlock
- reset_prefix_cache() called before model.load_states() when
is_model_in_gpu=False could block indefinitely on uninitialized engine
- Fix (vllm_strategy.py): Moved reset_prefix_cache() after model.load_states()
Bug 5: _expand_workers signature mismatch (TypeError crash)
- AgenticMultiLoraPipeline._expand_workers() missing train_skip_load param
- Fix (agentic_multi_lora_pipeline.py): Added train_skip_load parameter
Bug 6: Missing multi-adapter validation in AgenticPipeline
- Multi-adapter config silently ran on AgenticPipeline giving wrong step count
- Fix (agentic_pipeline.py): Added is_multi_lora check raising RuntimeError
Also: Added module-level is_group_exist() to collective.py, added diagnostic
logging to selective_sync_active_cache, uncommented pipeline_cls in
agentic_val_sokoban_mulit_lora_partial_overlap.yaml config.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…transfer CUDA IPC requires CAP_SYS_PTRACE on Linux 5.6+, which restricted containers (RunPod, Vast.ai) lack. Add model_update_transport config field with "cpu_pickle" option that serializes weight buckets via standard pickle on CPU, bypassing CUDA IPC entirely. Covers both normal colocated path (serialize_named_weights) and rlix selective sync path (_transport_bucket_sequence). Receiver uses pickle.loads() which handles both cuda_ipc and cpu_pickle payloads. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
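The transport fallback can be sketched end to end without torch. Below, tensors are modeled as `(shape, raw bytes)` pairs so the example stays self-contained; the real code serializes torch tensors, and the function names are illustrative:

```python
import pickle

def serialize_bucket_cpu_pickle(named_tensors):
    """Sketch of the cpu_pickle transport: stage everything as CPU-side
    bytes plus metadata and pickle it, bypassing CUDA IPC entirely.
    Tensors are modeled here as (shape, bytes) pairs."""
    bucket = {
        name: {"shape": shape, "data": data}
        for name, (shape, data) in named_tensors.items()
    }
    return pickle.dumps({"transport": "cpu_pickle", "bucket": bucket})

def deserialize_bucket(payload):
    # The receiver calls pickle.loads() unconditionally, which is why
    # one deserialization path can serve both transport styles.
    return pickle.loads(payload)
```

The trade-off is explicit: cpu_pickle pays a device-to-host copy per bucket, but it works in containers that cannot grant CAP_SYS_PTRACE.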
Verify weights after transfer by comparing sender-side sum/max/min stats against receiver-side live state. Base model uses end-to-end verification (live GPU parameters with TP aggregation). LoRA uses transport-level verification (raw received HF-format tensors from _staged_weights). Covers all three sync paths: colocated model update, separated model update (with PP-stage aggregation), and selective sync. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
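The stats-comparison idea is simple enough to sketch on plain Python lists (the real code computes these reductions on tensors; function names here are illustrative):

```python
import math

def weight_stats(named_tensors):
    """Sketch of sender-side stats: per-tensor (sum, max, min), the
    fingerprint compared against receiver-side state after transfer."""
    return {
        name: (math.fsum(vals), max(vals), min(vals))
        for name, vals in named_tensors.items()
    }

def verify_transfer(sender_stats, receiver_tensors, atol=1e-6):
    """Recompute stats on the receiver and report mismatched tensors."""
    recv = weight_stats(receiver_tensors)
    mismatched = []
    for name, (s_sum, s_max, s_min) in sender_stats.items():
        r_sum, r_max, r_min = recv[name]
        if (abs(s_sum - r_sum) > atol
                or abs(s_max - r_max) > atol
                or abs(s_min - r_min) > atol):
            mismatched.append(name)
    return mismatched
```

Sum/max/min is a cheap fingerprint rather than a proof of equality, but it catches the common failure modes (stale weights, wrong adapter, truncated bucket) with negligible bandwidth.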
Gate all post-sync weight verification behind a per-pipeline YAML flag. When False (default), sender-side stats computation and receiver-side verification RPCs are both skipped — zero overhead on the sync path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move compute_weight_stats() call from cpu_named_weights (after .to(cpu)) to hf_named_weights (GPU tensors). GPU reductions are ~20-40x faster than CPU for large models (e.g. ~90ms vs ~3s for 30B). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ht double-counting Include env_manager_config.name in RequestScheduler actor name so per-tag train schedulers don't collide. Use remove_duplicate=True (default) for named_parameters so tied weights are counted once, matching gather_all_hf_weights. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…iction mismatch add_lora_count increments per build_request call, producing different int_ids for the same adapter across sync cycles. vLLM's LRU cache then evicts old IDs while _lora_names holds the latest, causing verify_model failures. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
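The fix implied by this bug is a stable name-to-id mapping. A minimal sketch (the class name is invented; only the "one stable int_id per adapter name" behavior comes from the PR):

```python
class LoraIdAllocator:
    """Sketch of the fix: allocate one stable int_id per adapter name
    instead of incrementing a counter on every build_request call, so
    the engine's LRU cache never sees two ids for the same adapter."""

    def __init__(self):
        self._ids = {}
        self._next_id = 1

    def int_id(self, adapter_name: str) -> int:
        if adapter_name not in self._ids:
            self._ids[adapter_name] = self._next_id
            self._next_id += 1
        return self._ids[adapter_name]
```

With per-call counters, sync cycle N and cycle N+1 produce different ids for the same adapter, so the LRU cache evicts the id that `_lora_names` still tracks; memoizing by name removes that divergence.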
Add rlix_compat.py compatibility layer with try/except imports and fallback implementations. ROLL now works standalone without rlix installed; RLix features activate automatically when rlix is present. Raises clear RuntimeError if RLIX_CONTROL_PLANE=rlix without rlix.
Replace pickle with torch.save/torch.load for ~1.6x serialization speedup on large tensors. Add pinned memory DMA for GPU↔CPU copies (~10x faster than pageable, 270ms vs 2.7s at 3.4GB on PCIe 4.0). Rename cpu_pickle → cpu_serialize across all configs and code. Thread model_update_transport parameter through vLLM call chain so the receiver knows which deserializer to use. Combined improvement: 16s → 7.2s (2.2x) for 1.5B model weight sync. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Decouple progress tracking from queue advancement to match the "next batch only" contract for RLix scheduling.
- Remove progress reset from advance_step() and clear() — these are queue operations, not request boundaries
- Remove init-time progress emit — no demand before first get_batch()
- Add begin_progress_batch(): activates tracking, resets counters, emits new_batch report to coordinator
- Add end_progress_batch(): sets _progress_active=False to suppress late put() emissions, then clears coordinator stream
- Guard _maybe_emit_progress() with _progress_active flag
- Wire begin/end into RolloutScheduler.get_batch() with try/finally to ensure deactivation on success, empty batch, and exception paths
- Await end_progress_batch (not fire-and-forget) to serialize with next begin call and prevent lifecycle races on max_concurrency>1 GQM

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Match rlix protocol cleanup: remove queued_trajectories, inflight_trajectories, percent_completed, and oldest_unfinished_creation_ts from ProgressReport constructor. Local percent_completed computation kept for GQM bucket gating. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace wire-level `remaining` with raw `collected` (unclamped trajectory count) in the progress report metrics. The downstream coordinator now derives clamped `completed` and the scheduler derives `remaining` internally, so the producer no longer needs to send either. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
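The derivation the coordinator and scheduler now perform is small enough to state exactly. A sketch of the contract (the function name is illustrative; the clamping rules follow from the description above):

```python
def derive_progress(collected: int, target: int):
    """Sketch of the new wire contract: the producer reports only the
    raw, unclamped `collected` count. Downstream, the coordinator
    derives a clamped `completed` and the scheduler derives
    `remaining`; neither value is sent over the wire anymore."""
    completed = min(collected, target)
    remaining = max(target - collected, 0)
    return completed, remaining
```

Sending only the raw count keeps the producer stateless about batch targets and moves the clamping policy to the side that owns it.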
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This PR introduces two major features: Multi-LoRA training and GPU time-sharing support for RLix (https://github.com/rlops/rlix).
Multi-LoRA training
This extends the training stack to support multiple LoRA adapters concurrently, similar in spirit to Twinkle's (https://github.com/modelscope/twinkle) multi-tenant LoRA training model.
Changes included in this PR:
- A train_step_lora() RPC in the Megatron strategy, with per-adapter optimizer, scheduler, and RNG isolation.
- LoRARequest routing, with correct per-adapter KV-cache segmentation.
- An AgenticMultiLoraPipeline that extends the existing partial-overlapping scheduling to async multi-LoRA training.

RLix GPU time-sharing mode
This PR also adds DO_TIME_SHARING support for the RLix control-plane protocol.

Changes included in this PR:
- Weight bucket caches (latest and active) built on the cache-owner rank after each training tick.
- Atomic infer shrink/expand sequencing guarded by _infer_resize_lock.
- An RLIX_AVAILABLE guard, allowing ROLL to run without the RLix control plane.