Draft
Conversation
Adds the faithful DiLoCo design contract for Plato, including the server-side outer optimizer sign convention, exact local-step H semantics, small-H sampling requirements, client-local optimizer and scheduler state ownership, and the implementation dependency graph.

Covers Linear issue DT-408.
Adds the DiLoCo aggregation strategy with server-side SGD, momentum SGD, and Nesterov outer optimizer behavior over Plato-style client deltas. Covers uniform and sample-weighted aggregation, validates configuration values, and adds focused tests for sign handling, FedAvg equivalence under matching weighting, momentum state persistence, reset, and stale-key cleanup.

Validation reported by worker:
- uv run pytest tests/servers/test_diloco_strategy.py
- uv run pytest tests/servers/test_fedavg_strategy.py
- uv run ruff check . --select I

Covers Linear issue DT-410.
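The outer update described above can be sketched in a few lines. This is a hypothetical scalar sketch, not Plato's actual API: deltas follow the `global_before - client_after` sign convention, so the averaged delta acts as a pseudo-gradient for the outer optimizer, and the momentum/Nesterov arithmetic mirrors PyTorch's SGD formulation.

```python
def diloco_outer_step(global_before, avg_delta, momentum,
                      lr=0.7, mu=0.9, nesterov=True):
    """One illustrative DiLoCo outer step over per-parameter scalars.

    avg_delta is the aggregated global_before - client_after delta,
    treated as a pseudo-gradient by the server-side outer optimizer.
    """
    new_weights, new_momentum = {}, {}
    for name, g in avg_delta.items():
        buf = mu * momentum.get(name, 0.0) + g       # momentum buffer
        step = g + mu * buf if nesterov else buf     # Nesterov lookahead
        new_weights[name] = global_before[name] - lr * step
        new_momentum[name] = buf
    return new_weights, new_momentum
```

With `lr=1.0` and `mu=0.0` the outer step reduces to plain delta subtraction, which is the FedAvg-equivalence condition the tests above check under matching weighting.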
Adds trainer.local_steps_per_round support to ComposableTrainer so local work can stop after an exact number of completed optimizer steps, including mid-epoch DiLoCo-style runs. The trainer counts optimizer steps rather than raw batches, avoids finalization after the limit is reached, preserves existing epoch behavior when unset, and adds focused tests for delayed optimizer stepping, cleanup, and invalid values.

Validation reported by worker:
- uv run pytest tests/trainers/test_composable_trainer.py -k local_steps
- uv run pytest tests/trainers/test_composable_trainer.py
- uv run ruff check . --select I

The broader collection is blocked by the missing optional dependency opacus in unrelated DP tests:

```
============================= test session starts ==============================
platform darwin -- Python 3.13.12, pytest-8.4.2, pluggy-1.6.0
rootdir: /Users/bli/Playground/plato
configfile: pyproject.toml
plugins: anyio-4.13.0
collected 106 items / 1 error / 75 deselected / 31 selected
==================================== ERRORS ====================================
_______ ERROR collecting tests/trainers/test_dp_data_loader_strategy.py ________
ImportError while importing test module '/Users/bli/Playground/plato/tests/trainers/test_dp_data_loader_strategy.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
../../.local/share/uv/python/cpython-3.13.12-macos-aarch64-none/lib/python3.13/importlib/__init__.py:88: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/trainers/test_dp_data_loader_strategy.py:4: in <module>
    from plato.trainers.diff_privacy import DPDataLoaderStrategy
plato/trainers/diff_privacy.py:15: in <module>
    from opacus import GradSampleModule
E   ModuleNotFoundError: No module named 'opacus'
=========================== short test summary info ============================
ERROR tests/trainers/test_dp_data_loader_strategy.py
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
======================= 75 deselected, 1 error in 0.11s ========================
```

Covers Linear issue DT-416.
DT-417 review found that the built-in GradientAccumulationStepStrategy did not publish optimizer_step_completed, so local_steps_per_round counted raw batches instead of optimizer steps when accumulation_steps > 1. Set optimizer_step_completed only when the accumulation strategy actually performs optimizer.step(), and add a regression test that uses the real built-in accumulation strategy with H=2 and accumulation_steps=3.

Validation:
- uv run pytest tests/trainers/test_composable_trainer.py -k local_steps
- uv run pytest tests/trainers/test_composable_trainer.py
- uv run ruff check . --select I
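The interaction between gradient accumulation and the local-step limit can be sketched as follows. The loop and names are illustrative, not Plato's actual strategy API: the point is that an optimizer step is only "completed" on accumulation boundaries, and the round stops on completed steps, not raw batches.

```python
def run_local_steps(batches, local_steps_per_round, accumulation_steps):
    """Count completed optimizer steps (H), not raw batches.

    A hypothetical sketch of the DT-417 behavior: optimizer.step()
    fires every accumulation_steps batches, and the round ends once
    local_steps_per_round completed steps have been observed.
    """
    completed_steps = 0
    processed_batches = 0
    for i, _batch in enumerate(batches, start=1):
        processed_batches += 1
        # Only an actual optimizer.step() counts toward H.
        optimizer_step_completed = i % accumulation_steps == 0
        if optimizer_step_completed:
            completed_steps += 1
            if completed_steps >= local_steps_per_round:
                break  # stop mid-epoch once H steps are done
    return completed_steps, processed_batches
```

With H=2 and accumulation_steps=3, as in the regression test above, the round consumes six batches but only two optimizer steps.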
DT-412 makes the DiLoCo outer optimizer apply to trainable floating parameters by default while preserving full state_dict safety for frozen parameters and buffers. The new apply_outer_optimizer_to option supports parameters and all_floating modes, validates unsupported values clearly, resolves trainable parameter names from context.trainer.model for the default mode, and keeps momentum state only for tensors that receive outer optimization. Tests cover trainable parameters, frozen parameters, floating buffers, integer and boolean buffers, all_floating behavior, missing model context, invalid config values, and the existing DiLoCo aggregation math.

Validation:
- uv run pytest tests/servers/test_diloco_strategy.py
- uv run pytest tests/servers/test_fedavg_strategy.py
- uv run ruff check . --select I
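The eligibility policy described above can be sketched as a small pure function. The names and the payload representation are illustrative, not Plato's actual types: `payload_is_floating` stands in for a per-key check of tensor dtype.

```python
def eligible_for_outer_update(payload_is_floating, trainable_names,
                              mode="parameters"):
    """Sketch of the apply_outer_optimizer_to policy.

    payload_is_floating: dict mapping payload keys to whether the
    tensor is a floating-point tensor. 'parameters' restricts outer
    optimization to trainable floating tensors; 'all_floating' covers
    every floating payload leaf (including frozen params and buffers).
    """
    if mode == "all_floating":
        return {k for k, is_float in payload_is_floating.items() if is_float}
    if mode == "parameters":
        return {k for k, is_float in payload_is_floating.items()
                if is_float and k in trainable_names}
    raise ValueError(f"Unsupported apply_outer_optimizer_to value: {mode!r}")
```

Integer and boolean buffers (e.g. `num_batches_tracked`) are never eligible in either mode, which preserves full state_dict safety.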
DT-413 review found that default parameter eligibility missed trainable adapter parameters when adapter payload keys omit PEFT adapter-name segments, such as lora_A.weight versus lora_A.default.weight. Resolve trainable payload aliases from model adapter metadata and intersect them with the actual floating payload leaves, so exact state_dict keys still work while PEFT-style adapter payloads receive outer optimization and momentum. Added a PEFT-like regression test that fails without the alias mapping and verifies the payload key receives SGDM scaling and momentum state.

Validation:
- uv run pytest tests/servers/test_diloco_strategy.py -k peft_adapter -q
- uv run pytest tests/servers/test_diloco_strategy.py
- uv run pytest tests/servers/test_fedavg_strategy.py
- uv run ruff check . --select I
DT-413 re-review found that adapter-name aliasing could include a separate floating payload key when the exact trainable parameter key was also present. Make exact payload key matches take precedence over adapter-name removal, so alias candidates are only considered when the original trainable parameter name is absent from the payload. Added a negative collision regression to keep unrelated payload keys on the plain averaged-delta path.

Validation:
- uv run pytest tests/servers/test_diloco_strategy.py -k "adapter_payload_names or alias_collisions" -q
- uv run pytest tests/servers/test_diloco_strategy.py
- uv run pytest tests/servers/test_fedavg_strategy.py
- uv run ruff check . --select I
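The alias precedence from the two reviews above can be sketched as a small resolver. The function and the adapter-name list are illustrative, not Plato's actual code: an exact payload match always wins, and adapter-name removal (e.g. `lora_A.default.weight` vs `lora_A.weight`) is only tried when the exact key is absent.

```python
def resolve_payload_key(trainable_name, payload_keys,
                        adapter_names=("default",)):
    """Map a trainable parameter name to the payload key it governs.

    Exact state_dict keys take precedence; PEFT-style aliases with the
    adapter-name segment removed are a fallback only. Returns None when
    no payload key should receive outer optimization for this name.
    """
    if trainable_name in payload_keys:
        return trainable_name  # exact match takes precedence
    for adapter in adapter_names:
        alias = trainable_name.replace(f".{adapter}.", ".")
        if alias != trainable_name and alias in payload_keys:
            return alias  # PEFT-style adapter-name alias
    return None
```

The precedence rule is what keeps an unrelated floating payload key on the plain averaged-delta path when both the exact key and its alias appear in the payload.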
DT-418 adds trainer.preserve_optimizer_state for the PyTorch ComposableTrainer in-process path so client-local AdamW and scheduler state survive communication rounds without entering client-server payloads. The trainer caches optimizer and scheduler state per logical client, restores it after creating the next round's optimizer/scheduler, and discards cached state when optimizer type, scheduler type, parameter names, shapes, dtypes, or optimizer parameter ordering no longer match. Focused tests cover AdamW moment persistence, scheduler LR progress, logical-client isolation, payload locality, disabled behavior, optimizer changes, parameter-order changes, and shape/dtype/scheduler compatibility rejection.

Validation:
- uv run pytest tests/trainers/test_composable_optimizer_state.py -q
- uv run pytest tests/trainers/test_composable_trainer.py -q
- uv run pytest tests/trainers -k "optimizer_state or scheduler_state or composable" -q --ignore=tests/trainers/test_dp_data_loader_strategy.py
- uv run ruff check . --select I

The unignored trainer selector still hits the repo's optional opacus collection dependency in tests/trainers/test_dp_data_loader_strategy.py.
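The compatibility gate described above can be sketched as a pure-function check over cached metadata. The metadata keys here are illustrative of the listed properties, not Plato's actual field names:

```python
def is_cached_state_compatible(cached_meta, current_meta):
    """Decide whether cached optimizer/scheduler state may be restored.

    A sketch of the discard rule: restore only when optimizer type,
    scheduler type, parameter names, shapes, dtypes, and optimizer
    parameter ordering all match; otherwise fall back to fresh state.
    """
    keys = ("optimizer_type", "scheduler_type", "param_names",
            "param_shapes", "param_dtypes", "param_order")
    return all(cached_meta.get(k) == current_meta.get(k) for k in keys)
```

Rejecting on any mismatch is the conservative choice: restoring AdamW moments onto reshaped or reordered parameters would silently corrupt training rather than fail loudly.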
DT-414 adds server.type=diloco as a FedAvg-compatible server that injects DiLoCoAggregationStrategy while keeping algorithm.type=fedavg. The FedAvg delta path now filters non-weight reports before compute_weight_deltas(), so feature or metrics payloads cannot crash delta-only strategies before strategy eligibility handling. DiLoCo remains on aggregate_deltas and does not use inherited direct weight aggregation. Server-level tests cover registry/config selection, delta-path dispatch, inherited aggregate_weights avoidance, non-weight payload filtering, and existing FedAvg delta-strategy behavior.

Validation:
- uv run pytest tests/servers/test_diloco_strategy.py -q
- uv run pytest tests/servers/test_fedavg_strategy.py -q
- uv run ruff check . --select I
DT-420 extends trainer.preserve_optimizer_state to ComposableTrainer subprocess training by using a local optimizer-state sidecar under the configured model path. Child training loads any preserved sidecar before train_model(), saves updated optimizer and scheduler state after training, and the parent reloads the sidecar after the trained model is loaded. Missing, unreadable, invalid, or incompatible state falls back to fresh optimizer/scheduler state with explicit logging. Tests cover parent reload, optimizer state persistence across two subprocess rounds, scheduler progress, invalid sidecar reset, disabled behavior, and payload non-leakage. State remains local and is not added to network payloads.

Validation:
- uv run pytest tests/trainers/test_composable_optimizer_state.py -k "subprocess and (optimizer_state or scheduler_state)" -q
- uv run pytest tests/trainers/test_composable_optimizer_state.py -q
- uv run pytest tests/trainers/test_composable_trainer.py -q
- uv run pytest tests/trainers -k "subprocess and (optimizer_state or scheduler_state)" -q --ignore=tests/trainers/test_dp_data_loader_strategy.py
- uv run ruff check . --select I
- git diff --check
DT-421 review found that a missing optimizer sidecar could leave inherited parent cache active in the child, and that parent reload could confuse stale input with current child output. The child now clears inherited cache when the input sidecar is missing. Subprocess training writes to a unique child output sidecar; the parent loads that output, promotes it to the stable input sidecar for the next round, and removes stale stable state if child output is missing or invalid. Added regressions for missing input sidecars clearing inherited cache and missing child output removing stale input sidecars.

Validation:
- uv run pytest tests/trainers/test_composable_optimizer_state.py -k "subprocess or sidecar" -q
- uv run pytest tests/trainers/test_composable_optimizer_state.py -q
- uv run pytest tests/trainers/test_composable_trainer.py -q
- uv run pytest tests/trainers -k "subprocess and (optimizer_state or scheduler_state)" -q --ignore=tests/trainers/test_dp_data_loader_strategy.py
- uv run ruff check . --select I
- git diff --check
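The parent-side promotion step of that sidecar lifecycle can be sketched as below. The paths and function are illustrative, not Plato's actual implementation: a fresh child output becomes the next round's stable input, and a missing output invalidates any stale input.

```python
import os


def promote_child_sidecar(output_path, input_path):
    """Promote the child's unique output sidecar to the stable input.

    Sketch of the DT-421 fix: when the child produced output, move it
    over the stable input sidecar for the next round; when the child
    produced nothing, remove stale stable state so the next round
    starts from fresh optimizer/scheduler state.
    """
    if os.path.exists(output_path):
        os.replace(output_path, input_path)  # promote fresh state atomically
        return True
    if os.path.exists(input_path):
        os.remove(input_path)  # drop stale stable state
    return False
```

Using a unique output path per round is what lets the parent distinguish current child output from a leftover input sidecar of an earlier round.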
DT-422 adds regression tests proving client-local optimizer and scheduler state remains local when trainer.preserve_optimizer_state is enabled. Client tests now cover the FedAvg/DiLoCo-compatible in-process path and the subprocess sidecar path, asserting outbound payloads contain exactly model state tensors and reject optimizer_state, scheduler_state, global_step, local metadata, and sidecar filename keys. Trainer tests also verify model-update payloads stay model-only while optimizer and scheduler state are persisted locally.

Validation:
- uv run pytest tests/clients -k "payload or simple" -q
- uv run pytest tests/trainers -k "optimizer_state or scheduler_state" -q --ignore=tests/trainers/test_dp_data_loader_strategy.py
- uv run ruff check . --select I
- git diff --check
DT-428 prevents exact local-step training from replaying the same deterministic sampler prefix when H is smaller than one epoch and the train loader is recreated each round. The data-loader strategies now materialize supported sampler streams only when trainer.local_steps_per_round is active, rotate the stream by the deterministic round offset, and leave epoch-based training unchanged when local-step limits are unset. Unsupported non-materializable sampler objects log a clear warning and fall back unchanged. Added focused red/green coverage showing two short local-step rounds for the same client consume different prefixes while repeated runs with the same round sequence remain deterministic.

Validation:
- uv run pytest tests/trainers -k "local_steps or data_loader or sampler" -q
- uv run pytest tests/samplers -q
- uv run ruff check . --select I
- git diff --check
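The rotation idea can be sketched as follows, assuming a materializable sampler stream; the offset formula and names are illustrative, not Plato's actual strategy code:

```python
def round_aware_indices(sampler_indices, local_steps, batch_size, round_id):
    """Rotate a materialized sampler stream by a per-round offset.

    Sketch of the DT-428 behavior: when a short local-step round only
    consumes local_steps * batch_size samples and the loader is rebuilt
    each round, rotating by a deterministic round offset makes each
    round consume a different prefix, while the same round sequence
    always yields the same order.
    """
    indices = list(sampler_indices)  # materialize the stream
    offset = (round_id * local_steps * batch_size) % len(indices)
    return indices[offset:] + indices[:offset]
```

Epoch-based training would simply skip this wrapper, leaving the original sampler untouched, which matches the "unchanged when local-step limits are unset" behavior above.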
DT-429 review found that the round-aware local-step sampler wrapper only treated TypeError as an unsupported materialization path. Samplers that raise NotImplementedError during iteration should also warn and fall back unchanged instead of failing while setting up the data loader. This patch catches NotImplementedError in the same warning/fallback path and adds regression coverage with a non-materializable sampler to verify the warning and unchanged sampler handoff.

Validation:
- uv run pytest tests/trainers/test_composable_trainer.py -k "non_materializable or local_steps" -q
- uv run pytest tests/trainers -k "local_steps or data_loader or sampler" -q
- uv run pytest tests/samplers -q
- uv run ruff check plato/trainers/strategies/data_loader.py tests/trainers/test_composable_trainer.py --select I
- git diff --check
DT-424 adds a small MNIST/LeNet DiLoCo smoke config that uses the faithful configuration contract: server.type=diloco, algorithm.type=fedavg, local_steps_per_round=2, preserve_optimizer_state=true, AdamW inner optimizer, Nesterov outer optimizer, uniform weighting, and parameter-only outer updates. The docs now explain how to run the smoke config, distinguish algorithm mechanics from reproducing the paper's C4/model/pretraining setup, and document H semantics, mid-epoch stopping, round-aware small-H sampling, local-only optimizer and scheduler state, FedAvg equivalence conditions, and the parameter/buffer policy. The integration smoke test loads the real config, verifies the contract values, and checks that the server registry selects the DiLoCo server and DiLoCo aggregation strategy.

Validation:
- uv run pytest tests/integration/test_smoke_configs.py -k diloco -q
- uv run ruff check . --select I
- git diff --check
DT-426 adds final integration coverage for the faithful DiLoCo path using the exact MNIST smoke config. The test builds the configured DiLoCo server and simple client, runs local training with local_steps_per_round=2 and preserved optimizer state, verifies the outbound client payload remains model weights only, and processes deterministic server updates through the DiLoCo delta aggregation path. The validation would fail if the config selected ordinary FedAvg server aggregation, if local step control were ignored, or if the server bypassed aggregate_deltas. It directly checks that the Nesterov outer update differs from ordinary FedAvg averaging, while relying on reviewed lower-level tests for small-H mid-epoch stopping, round-aware sampler non-replay, scheduler sidecar persistence, and broader payload leak coverage.

Validation:
- uv run pytest tests/integration/test_smoke_configs.py -k diloco -q
- uv run pytest tests/servers/test_diloco_strategy.py -q
- uv run pytest tests/trainers -k "local_steps or optimizer_state or scheduler_state or data_loader" -q
- uv run pytest tests/clients -k "payload or simple" -q
- uv run ruff check . --select I
- git diff --check
Preserved AdamW state was loaded before ComposableTrainer moved the model to the trainer device. On GPU, PyTorch therefore mapped restored optimizer tensors to CPU, and later optimizer.step() saw CUDA parameters with CPU Adam state, producing mixed-device runtime errors in later rounds. Move the model to the trainer device before optimizer construction and preserved-state restore so optimizer.load_state_dict() maps state tensors onto the same device as the optimizer parameters. Add a regression test that fails if preserved optimizer state is restored before model.to().

Validation:
- uv run pytest tests/trainers/test_composable_optimizer_state.py -k "restores_after_model_moves_to_device or optimizer_state or scheduler_state" -q
- uv run pytest tests/trainers -k "local_steps or optimizer_state or scheduler_state or data_loader" -q
- uv run pytest tests/integration/test_smoke_configs.py -k diloco -q
- uv run pytest tests/clients -k "payload or simple" -q
- uv run ruff check . --select I
- git diff --check
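The corrected ordering can be sketched with plain PyTorch; this mirrors the fix but is not Plato's actual trainer code, and the helper name is illustrative:

```python
import torch


def build_optimizer_with_preserved_state(model, device, cached_state=None):
    """Build an optimizer with optional preserved state, in safe order.

    Sketch of the device-ordering fix: 1) move the model first, so the
    parameters the optimizer tracks already live on the target device;
    2) create the optimizer over the moved parameters; 3) restore
    preserved state last, so load_state_dict maps cached state tensors
    onto the parameters' device instead of leaving them on CPU.
    """
    model.to(device)                                   # 1) move model first
    optimizer = torch.optim.AdamW(model.parameters())  # 2) then build optimizer
    if cached_state is not None:
        optimizer.load_state_dict(cached_state)        # 3) restore state last
    return optimizer
```

Restoring before step 1 is exactly the failure mode described above: `load_state_dict` places state on the parameters' current (CPU) device, and the later `model.to(device)` does not move optimizer state with it.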
Emit a server-side info log each time DiLoCo applies the configured outer optimizer to averaged client deltas. The log includes optimizer settings, aggregation weighting, apply policy, eligible update count, and optimized tensor count so runs show where the server update occurs.

Validation:
- uv run pytest tests/servers/test_diloco_strategy.py -q
- uv run pytest tests/integration/test_smoke_configs.py -k diloco -q
- uv run ruff check plato/servers/strategies/aggregation/diloco.py tests/servers/test_diloco_strategy.py --select I
- git diff --check

Co-authored-by: Codex <codex@openai.com>
Summary
This PR implements DiLoCo in Plato. The implementation keeps
keeps `algorithm.type = "fedavg"` for the existing weight extraction and model-loading path, while `server.type = "diloco"` selects a FedAvg-compatible server that applies the DiLoCo outer optimizer to averaged client deltas.
- Outer updates are computed as `global_before - client_after` with SGD, momentum SGD, and Nesterov support
- `H` counts completed client optimizer steps, not epochs or raw batches
- Works with `H` smaller than one epoch