
Added support for DiLoCo. #412

Draft

baochunli wants to merge 21 commits into main from diloco-faithful-implementation

Conversation


baochunli (Collaborator) commented Apr 29, 2026

Summary

This PR implements DiLoCo in Plato. The implementation keeps algorithm.type = "fedavg" for the existing weight extraction and model-loading path, while server.type = "diloco" selects a FedAvg-compatible server that applies the DiLoCo outer optimizer to averaged client deltas.

  • server-side outer optimizer over global_before - client_after with SGD, momentum SGD, and Nesterov support (see the sketch after this list)
  • exact local work H as completed client optimizer steps, not epochs or raw batches
  • client-local optimizer and scheduler state persistence, including subprocess handoff
  • model-only client/server payloads; optimizer and scheduler state stay local
  • parameter-only outer optimizer policy by default, with conservative buffer synchronization
  • round-aware supported sampler streams for H smaller than one epoch
  • exact smoke config and docs for running the Plato mechanics without claiming paper-scale C4/model/pretraining reproduction
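
For orientation, here is a minimal sketch of the server-side mechanics these bullets describe; the helper names and default hyperparameters are illustrative, not the actual Plato strategy API. The inputs are dicts of tensors keyed by state_dict name.

```python
def averaged_delta(global_before, client_afters, weights=None):
    """Average per-client deltas, defined as global_before - client_after."""
    if weights is None:
        weights = [1.0 / len(client_afters)] * len(client_afters)
    return {
        name: sum(w * (before - after[name])
                  for w, after in zip(weights, client_afters))
        for name, before in global_before.items()
    }


def outer_step(global_before, delta, momentum_buf,
               lr=0.7, momentum=0.9, nesterov=True):
    """One outer SGD step that treats the averaged delta as a gradient."""
    new_global = {}
    for name, grad in delta.items():
        buf = momentum_buf.get(name)
        buf = grad if buf is None else momentum * buf + grad
        momentum_buf[name] = buf
        update = grad + momentum * buf if nesterov else buf
        new_global[name] = global_before[name] - lr * update
    return new_global
```

With lr=1.0 and momentum=0.0, outer_step reduces to global_before - delta, i.e. plain averaging of the client models, which is the FedAvg-equivalence case the aggregation tests below rely on.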

Adds the faithful DiLoCo design contract for Plato, including the server-side outer optimizer sign convention, exact local-step H semantics, small-H sampling requirements, client-local optimizer and scheduler state ownership, and the implementation dependency graph.

Covers Linear issue DT-408.
Adds the DiLoCo aggregation strategy with server-side SGD, momentum SGD, and Nesterov outer optimizer behavior over Plato-style client deltas. Covers uniform and sample-weighted aggregation, validates configuration values, and adds focused tests for sign handling, FedAvg equivalence under matching weighting, momentum state persistence, reset, and stale-key cleanup.

Validation reported by worker:
- uv run pytest tests/servers/test_diloco_strategy.py
- uv run pytest tests/servers/test_fedavg_strategy.py
- uv run ruff check . --select I

Covers Linear issue DT-410.
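
A hedged sketch of the per-key momentum bookkeeping those tests exercise; the class and method names are hypothetical stand-ins for the DiLoCoAggregationStrategy internals.

```python
class OuterMomentumState:
    """Illustrative per-payload-key momentum buffers kept across rounds."""

    def __init__(self):
        self.buffers = {}

    def update(self, delta, momentum=0.9):
        # Stale-key cleanup: drop buffers for tensors no longer in the payload.
        for stale in set(self.buffers) - set(delta):
            del self.buffers[stale]
        for name, grad in delta.items():
            previous = self.buffers.get(name)
            self.buffers[name] = grad if previous is None else momentum * previous + grad
        return self.buffers

    def reset(self):
        self.buffers.clear()
```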
Adds trainer.local_steps_per_round support to ComposableTrainer so local work can stop after an exact number of completed optimizer steps, including mid-epoch DiLoCo-style runs. The trainer counts optimizer steps rather than raw batches, avoids finalization after the limit is reached, preserves existing epoch behavior when unset, and adds focused tests for delayed optimizer stepping, cleanup, and invalid values.

Validation reported by worker:
- uv run pytest tests/trainers/test_composable_trainer.py -k local_steps
- uv run pytest tests/trainers/test_composable_trainer.py
- uv run ruff check . --select I

The broader tests/trainers collection is blocked by the missing optional dependency opacus in unrelated DP tests:

============================= test session starts ==============================
platform darwin -- Python 3.13.12, pytest-8.4.2, pluggy-1.6.0
rootdir: /Users/bli/Playground/plato
configfile: pyproject.toml
plugins: anyio-4.13.0
collected 106 items / 1 error / 75 deselected / 31 selected

==================================== ERRORS ====================================
_______ ERROR collecting tests/trainers/test_dp_data_loader_strategy.py ________
ImportError while importing test module '/Users/bli/Playground/plato/tests/trainers/test_dp_data_loader_strategy.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
../../.local/share/uv/python/cpython-3.13.12-macos-aarch64-none/lib/python3.13/importlib/__init__.py:88: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tests/trainers/test_dp_data_loader_strategy.py:4: in <module>
    from plato.trainers.diff_privacy import DPDataLoaderStrategy
plato/trainers/diff_privacy.py:15: in <module>
    from opacus import GradSampleModule
E   ModuleNotFoundError: No module named 'opacus'
=========================== short test summary info ============================
ERROR tests/trainers/test_dp_data_loader_strategy.py
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
======================= 75 deselected, 1 error in 0.11s ========================

Covers Linear issue DT-416.
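
A minimal sketch of what counting completed optimizer steps (rather than raw batches) looks like, including the gradient-accumulation case tightened by DT-417 below; the loop and argument names are illustrative, not ComposableTrainer code.

```python
def train_for_h_steps(model, optimizer, loss_fn, data_loader,
                      local_steps_per_round, accumulation_steps=1):
    """Stop after exactly H completed optimizer steps, possibly mid-epoch."""
    completed_steps = 0
    optimizer.zero_grad()
    for batch_idx, (examples, labels) in enumerate(data_loader):
        loss = loss_fn(model(examples), labels) / accumulation_steps
        loss.backward()
        # Only a real optimizer.step() advances H; accumulation-only batches
        # must not move the counter.
        if (batch_idx + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            completed_steps += 1
            if completed_steps >= local_steps_per_round:
                break
    return completed_steps
```

With accumulation_steps=3 and local_steps_per_round=2, this consumes six batches before stopping, which matches the regression described under DT-417.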
DT-417 review found that the built-in GradientAccumulationStepStrategy did not publish optimizer_step_completed, so local_steps_per_round counted raw batches instead of optimizer steps when accumulation_steps > 1.

Set optimizer_step_completed only when the accumulation strategy actually performs optimizer.step(), and add a regression test that uses the real built-in accumulation strategy with H=2 and accumulation_steps=3.

Validation: uv run pytest tests/trainers/test_composable_trainer.py -k local_steps; uv run pytest tests/trainers/test_composable_trainer.py; uv run ruff check . --select I.
DT-412 makes the DiLoCo outer optimizer apply to trainable floating parameters by default while preserving full state_dict safety for frozen parameters and buffers.

The new apply_outer_optimizer_to option supports parameters and all_floating modes, validates unsupported values clearly, resolves trainable parameter names from context.trainer.model for the default mode, and keeps momentum state only for tensors that receive outer optimization.

Tests cover trainable parameters, frozen parameters, floating buffers, integer and boolean buffers, all_floating behavior, missing model context, invalid config values, and the existing DiLoCo aggregation math.

Validation: uv run pytest tests/servers/test_diloco_strategy.py; uv run pytest tests/servers/test_fedavg_strategy.py; uv run ruff check . --select I.
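
A sketch of how eligibility could be resolved for the two modes described above; the helper name and the requires_grad / floating-point checks are assumptions based on this description, not the exact implementation.

```python
import torch


def eligible_payload_keys(model, payload, policy="parameters"):
    """Decide which payload tensors receive the outer optimizer."""
    floating = {
        name for name, tensor in payload.items()
        if torch.is_tensor(tensor) and tensor.is_floating_point()
    }
    if policy == "all_floating":
        return floating
    if policy == "parameters":
        trainable = {
            name for name, param in model.named_parameters() if param.requires_grad
        }
        # Frozen parameters and buffers (BatchNorm running stats, integer or
        # boolean counters) stay on the plain averaged-delta path.
        return floating & trainable
    raise ValueError(f"Unsupported apply_outer_optimizer_to value: {policy!r}")
```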
DT-413 review found that default parameter eligibility missed trainable adapter parameters when adapter payload keys omit PEFT adapter-name segments, such as lora_A.weight versus lora_A.default.weight.

Resolve trainable payload aliases from model adapter metadata and intersect them with the actual floating payload leaves, so exact state_dict keys still work while PEFT-style adapter payloads receive outer optimization and momentum.

Added a PEFT-like regression test that fails without the alias mapping and verifies the payload key receives SGDM scaling and momentum state.

Validation: uv run pytest tests/servers/test_diloco_strategy.py -k peft_adapter -q; uv run pytest tests/servers/test_diloco_strategy.py; uv run pytest tests/servers/test_fedavg_strategy.py; uv run ruff check . --select I.
DT-413 re-review found that adapter-name aliasing could include a separate floating payload key when the exact trainable parameter key was also present.

Make exact payload key matches take precedence over adapter-name removal, so alias candidates are only considered when the original trainable parameter name is absent from the payload. Added a negative collision regression to keep unrelated payload keys on the plain averaged-delta path.

Validation: uv run pytest tests/servers/test_diloco_strategy.py -k "adapter_payload_names or alias_collisions" -q; uv run pytest tests/servers/test_diloco_strategy.py; uv run pytest tests/servers/test_fedavg_strategy.py; uv run ruff check . --select I.
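
A hedged sketch of the alias resolution described in these two DT-413 fixes: exact state_dict keys take precedence, and an adapter-name alias (for example lora_A.default.weight matching payload key lora_A.weight) is only tried when the exact name is absent from the payload. The function and the adapter-name list are illustrative.

```python
def resolve_trainable_payload_keys(trainable_names, payload_keys,
                                   adapter_names=("default",)):
    """Map trainable parameter names onto payload keys, PEFT aliases included."""
    payload_keys = set(payload_keys)
    resolved = set()
    for name in trainable_names:
        if name in payload_keys:
            # Exact matches win; never let an alias collide with an unrelated
            # payload key that also happens to exist.
            resolved.add(name)
            continue
        for adapter in adapter_names:
            alias = name.replace(f".{adapter}.", ".")
            if alias != name and alias in payload_keys:
                resolved.add(alias)
                break
    return resolved
```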
DT-418 adds trainer.preserve_optimizer_state for the PyTorch ComposableTrainer in-process path so client-local AdamW and scheduler state survive communication rounds without entering client-server payloads.

The trainer caches optimizer and scheduler state per logical client, restores it after creating the next round optimizer/scheduler, and discards cached state when optimizer type, scheduler type, parameter names, shapes, dtypes, or optimizer parameter ordering no longer match.

Focused tests cover AdamW moment persistence, scheduler LR progress, logical-client isolation, payload locality, disabled behavior, optimizer changes, parameter-order changes, and shape/dtype/scheduler compatibility rejection.

Validation: uv run pytest tests/trainers/test_composable_optimizer_state.py -q; uv run pytest tests/trainers/test_composable_trainer.py -q; uv run pytest tests/trainers -k "optimizer_state or scheduler_state or composable" -q --ignore=tests/trainers/test_dp_data_loader_strategy.py; uv run ruff check . --select I. Without the --ignore flag, the trainer selector still fails during collection on the repository's optional opacus dependency in tests/trainers/test_dp_data_loader_strategy.py.
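
A minimal sketch of the per-client cache with the compatibility guard described above; the cache layout and the checked fields are assumptions drawn from this description.

```python
def restore_if_compatible(optimizer, scheduler, cache, client_id):
    """Reload cached optimizer/scheduler state only when it still matches."""
    entry = cache.get(client_id)
    if entry is None:
        return False
    shapes = [tuple(p.shape)
              for group in optimizer.param_groups for p in group["params"]]
    if (entry["optimizer_cls"] != type(optimizer).__name__
            or entry["scheduler_cls"] != type(scheduler).__name__
            or entry["param_shapes"] != shapes):
        # Optimizer, scheduler, or parameter layout changed: discard the cache
        # and fall back to fresh state.
        del cache[client_id]
        return False
    optimizer.load_state_dict(entry["optimizer_state"])
    scheduler.load_state_dict(entry["scheduler_state"])
    return True
```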
DT-414 adds server.type=diloco as a FedAvg-compatible server that injects DiLoCoAggregationStrategy while keeping algorithm.type=fedavg.

The FedAvg delta path now filters non-weight reports before compute_weight_deltas(), so feature or metrics payloads cannot crash delta-only strategies before strategy eligibility handling. DiLoCo remains on aggregate_deltas and does not use inherited direct weight aggregation.

Server-level tests cover registry/config selection, delta-path dispatch, inherited aggregate_weights avoidance, non-weight payload filtering, and existing FedAvg delta-strategy behavior.

Validation: uv run pytest tests/servers/test_diloco_strategy.py -q; uv run pytest tests/servers/test_fedavg_strategy.py -q; uv run ruff check . --select I.
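
A small sketch of the filtering step, assuming each update carries a report whose type flag distinguishes weight payloads from feature or metrics payloads; the attribute names are hypothetical.

```python
def weight_updates_only(updates):
    """Keep only weight-bearing reports before compute_weight_deltas()."""
    return [
        update for update in updates
        if getattr(update.report, "type", "weights") == "weights"
    ]
```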
DT-420 extends trainer.preserve_optimizer_state to ComposableTrainer subprocess training by using a local optimizer-state sidecar under the configured model path.

Child training loads any preserved sidecar before train_model(), saves updated optimizer and scheduler state after training, and the parent reloads the sidecar after the trained model is loaded. Missing, unreadable, invalid, or incompatible state falls back to fresh optimizer/scheduler state with explicit logging.

Tests cover parent reload, optimizer state persistence across two subprocess rounds, scheduler progress, invalid sidecar reset, disabled behavior, and payload non-leakage. State remains local and is not added to network payloads.

Validation: uv run pytest tests/trainers/test_composable_optimizer_state.py -k "subprocess and (optimizer_state or scheduler_state)" -q; uv run pytest tests/trainers/test_composable_optimizer_state.py -q; uv run pytest tests/trainers/test_composable_trainer.py -q; uv run pytest tests/trainers -k "subprocess and (optimizer_state or scheduler_state)" -q --ignore=tests/trainers/test_dp_data_loader_strategy.py; uv run ruff check . --select I; git diff --check.
DT-421 review found that a missing optimizer sidecar could leave inherited parent cache active in the child, and that parent reload could confuse stale input with current child output.

The child now clears inherited cache when the input sidecar is missing. Subprocess training writes to a unique child output sidecar, the parent loads that output, promotes it to the stable input sidecar for the next round, and removes stale stable state if child output is missing or invalid.

Added regressions for missing input sidecars clearing inherited cache and missing child output removing stale input sidecars.

Validation: uv run pytest tests/trainers/test_composable_optimizer_state.py -k "subprocess or sidecar" -q; uv run pytest tests/trainers/test_composable_optimizer_state.py -q; uv run pytest tests/trainers/test_composable_trainer.py -q; uv run pytest tests/trainers -k "subprocess and (optimizer_state or scheduler_state)" -q --ignore=tests/trainers/test_dp_data_loader_strategy.py; uv run ruff check . --select I; git diff --check.
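
A hedged sketch of the parent-side sidecar promotion described across DT-420 and DT-421; the file names and layout are illustrative, not the actual paths Plato writes.

```python
import os

import torch


def promote_child_sidecar(model_path, client_id, round_id):
    """Promote the child's output sidecar to the stable input for the next round."""
    stable = os.path.join(model_path, f"optimizer_state_{client_id}.pt")
    child_out = os.path.join(
        model_path, f"optimizer_state_{client_id}_round{round_id}.out.pt")
    if not os.path.exists(child_out):
        # Missing or invalid child output: remove stale stable state so a
        # previous round's optimizer state cannot leak into the next round.
        if os.path.exists(stable):
            os.remove(stable)
        return None
    state = torch.load(child_out, map_location="cpu")
    torch.save(state, stable)
    os.remove(child_out)
    return state
```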
DT-422 adds regression tests proving client-local optimizer and scheduler state remains local when trainer.preserve_optimizer_state is enabled.

Client tests now cover the FedAvg/DiLoCo-compatible in-process path and subprocess sidecar path, asserting outbound payloads contain exactly model state tensors and reject optimizer_state, scheduler_state, global_step, local metadata, and sidecar filename keys.

Trainer tests also verify model-update payloads stay model-only while optimizer and scheduler state are persisted locally.

Validation: uv run pytest tests/clients -k "payload or simple" -q; uv run pytest tests/trainers -k "optimizer_state or scheduler_state" -q --ignore=tests/trainers/test_dp_data_loader_strategy.py; uv run ruff check . --select I; git diff --check.
DT-428 prevents exact local-step training from replaying the same deterministic sampler prefix when H is smaller than one epoch and the train loader is recreated each round.

The data-loader strategies now materialize supported sampler streams only when trainer.local_steps_per_round is active, rotate the stream by the deterministic round offset, and leave epoch-based training unchanged when local-step limits are unset. Unsupported non-materializable sampler objects log a clear warning and fall back unchanged.

Added focused red/green coverage showing two short local-step rounds for the same client consume different prefixes while repeated runs with the same round sequence remain deterministic.

Validation: uv run pytest tests/trainers -k "local_steps or data_loader or sampler" -q; uv run pytest tests/samplers -q; uv run ruff check . --select I; git diff --check.
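
A minimal sketch of the round-aware rotation, including the fallback path for non-materializable samplers that the DT-429 follow-up below also covers; the function name and warning text are illustrative.

```python
import logging


def round_aware_indices(sampler, round_offset, local_steps, batch_size):
    """Rotate a materialized sampler stream so short rounds see fresh prefixes."""
    try:
        indices = list(sampler)  # May fail for non-materializable samplers.
    except (TypeError, NotImplementedError):
        logging.warning("Sampler %r cannot be materialized; using it unchanged.",
                        sampler)
        return sampler
    if not indices:
        return indices
    # Deterministic rotation: the same client and round sequence always
    # reproduces the same prefixes, but consecutive short rounds do not
    # replay the same one.
    shift = (round_offset * local_steps * batch_size) % len(indices)
    return indices[shift:] + indices[:shift]
```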
DT-429 review found that the round-aware local-step sampler wrapper only treated TypeError as an unsupported materialization path. Samplers that raise NotImplementedError during iteration should also warn and fall back unchanged instead of failing while setting up the data loader.

This patch catches NotImplementedError in the same warning/fallback path and adds regression coverage with a non-materializable sampler to verify the warning and unchanged sampler handoff.

Validation: uv run pytest tests/trainers/test_composable_trainer.py -k "non_materializable or local_steps" -q; uv run pytest tests/trainers -k "local_steps or data_loader or sampler" -q; uv run pytest tests/samplers -q; uv run ruff check plato/trainers/strategies/data_loader.py tests/trainers/test_composable_trainer.py --select I; git diff --check.
DT-424 adds a small MNIST/LeNet DiLoCo smoke config that uses the faithful configuration contract: server.type=diloco, algorithm.type=fedavg, local_steps_per_round=2, preserve_optimizer_state=true, AdamW inner optimizer, Nesterov outer optimizer, uniform weighting, and parameter-only outer updates.

The docs now explain how to run the smoke config, distinguish algorithm mechanics from reproducing the paper C4/model/pretraining setup, and document H semantics, mid-epoch stopping, round-aware small-H sampling, local-only optimizer and scheduler state, FedAvg equivalence conditions, and the parameter/buffer policy.

The integration smoke test loads the real config, verifies the contract values, and checks that the server registry selects the DiLoCo server and DiLoCo aggregation strategy.

Validation: uv run pytest tests/integration/test_smoke_configs.py -k diloco -q; uv run ruff check . --select I; git diff --check.
DT-426 adds final integration coverage for the faithful DiLoCo path using the exact MNIST smoke config. The test builds the configured DiLoCo server and simple client, runs local training with local_steps_per_round=2 and preserved optimizer state, verifies the outbound client payload remains model weights only, and processes deterministic server updates through the DiLoCo delta aggregation path.

The validation would fail if the config selected ordinary FedAvg server aggregation, if local step control were ignored, or if the server bypassed aggregate_deltas. It directly checks the Nesterov outer update differs from ordinary FedAvg averaging, while relying on reviewed lower-level tests for small-H mid-epoch stopping, round-aware sampler non-replay, scheduler sidecar persistence, and broader payload leak coverage.

Validation: uv run pytest tests/integration/test_smoke_configs.py -k diloco -q; uv run pytest tests/servers/test_diloco_strategy.py -q; uv run pytest tests/trainers -k "local_steps or optimizer_state or scheduler_state or data_loader" -q; uv run pytest tests/clients -k "payload or simple" -q; uv run ruff check . --select I; git diff --check.
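
A compact, self-contained illustration of that distinguishing check with made-up two-client numbers; it mirrors the outer Nesterov step on averaged deltas rather than calling the real server code.

```python
import torch

global_before = {"w": torch.tensor([1.0, 2.0])}
client_afters = [{"w": torch.tensor([0.5, 1.5])}, {"w": torch.tensor([1.5, 1.0])}]

# Plain FedAvg: average the client models directly.
fedavg = (client_afters[0]["w"] + client_afters[1]["w"]) / 2

# DiLoCo: treat the averaged delta (global_before - client_after) as an outer
# gradient and take one Nesterov SGD step on the global weights.
delta = sum(global_before["w"] - after["w"] for after in client_afters) / 2
buf = delta                    # first round: momentum buffer starts at the gradient
update = delta + 0.9 * buf     # Nesterov lookahead with momentum 0.9
diloco = global_before["w"] - 0.7 * update

assert not torch.allclose(fedavg, diloco)
```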

netlify Bot commented Apr 29, 2026

Deploy Preview for platodocs canceled.

🔨 Latest commit: 6ffb475
🔍 Latest deploy log: https://app.netlify.com/projects/platodocs/deploys/69f39730be8a0b0008adb0d5

Preserved AdamW state was loaded before ComposableTrainer moved the model to the trainer device. On GPU, PyTorch therefore mapped restored optimizer tensors to CPU and later optimizer.step() saw CUDA parameters with CPU Adam state, producing mixed-device runtime errors in later rounds.

Move the model to the trainer device before optimizer construction and preserved-state restore so optimizer.load_state_dict() maps state tensors onto the same device as the optimizer parameters. Add a regression test that fails if preserved optimizer state is restored before model.to().

Validation: uv run pytest tests/trainers/test_composable_optimizer_state.py -k "restores_after_model_moves_to_device or optimizer_state or scheduler_state" -q; uv run pytest tests/trainers -k "local_steps or optimizer_state or scheduler_state or data_loader" -q; uv run pytest tests/integration/test_smoke_configs.py -k diloco -q; uv run pytest tests/clients -k "payload or simple" -q; uv run ruff check . --select I; git diff --check.
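
A minimal sketch of the corrected ordering, assuming an AdamW inner optimizer; the function name is illustrative.

```python
import torch


def build_optimizer_with_preserved_state(model, device, preserved_state=None):
    """Move the model first, then build the optimizer and restore saved state."""
    model.to(device)  # Must happen before the optimizer sees the parameters.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    if preserved_state is not None:
        # load_state_dict() maps restored exp_avg / exp_avg_sq tensors onto the
        # devices of the parameters already registered with the optimizer, so
        # CUDA parameters get CUDA state instead of stale CPU tensors.
        optimizer.load_state_dict(preserved_state)
    return optimizer
```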
baochunli changed the title from "[codex] Add faithful DiLoCo support" to "Added support for DiLoCo." on Apr 30, 2026
baochunli and others added 4 commits April 30, 2026 02:07
Emit a server-side info log each time DiLoCo applies the configured
outer optimizer to averaged client deltas. The log includes optimizer
settings, aggregation weighting, apply policy, eligible update count, and
optimized tensor count so runs show where the server update occurs.

Validation:
- uv run pytest tests/servers/test_diloco_strategy.py -q
- uv run pytest tests/integration/test_smoke_configs.py -k diloco -q
- uv run ruff check plato/servers/strategies/aggregation/diloco.py tests/servers/test_diloco_strategy.py --select I
- git diff --check

Co-authored-by: Codex <codex@openai.com>