Xiao elo by xiaol827 · Pull Request #12 · Belilovsky-Lab/pylo

xiaol827 · 2026-06-20T05:13:47Z

Add ELO series LOs

Port the inference-time forward pass of the CELO2 / ELO-CELO2 learned optimizers from the original JAX/optax implementation into pure PyTorch, as drop-in pylo optimizers.

- pylo/models/CELO2_MLP.py: CELO2MLP, the split-input per-parameter MLP backbone (14 split first-layer weights + dense layers, HF Hub mixin).
- pylo/optim/CELO2_naive.py: CELO2_naive optimizer, with momentum / RMS / factored Adafactor accumulators, the CELO2 feature stack, Newton-Schulz orthogonalization for 2D+ params, AdamW for 1D params over the shared accumulators, and a warmup + cosine LR schedule. Loads its meta-model from HuggingFace (default DiamondXL/celo2); a local converted checkpoint or an explicit network takes precedence.
- pylo/optim/ELO_CELO2_naive.py: ELO_CELO2_naive. At inference the ELO expert mechanism is disabled, so it reduces to the CELO2 forward with the ELO default hyper-parameters (weight_decay=0.1, clip_grad=True); default meta-model DiamondXL/elo-celo2.
- scripts/convert_celo2_checkpoint.py: convert a JAX/Haiku theta checkpoint into a CELO2MLP state_dict.
- tests/test_celo2.py: step/update, higher-rank param, state-dict resume, and a JAX numerical-alignment test (2D update matches the reference to ~3e-6; auto-skips when the JAX source is unavailable).
- Register the new classes in the pylo / pylo.optim / pylo.models inits.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
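For readers unfamiliar with the Newton-Schulz step mentioned in this commit, here is a minimal sketch of the classic cubic iteration in plain Python. It is an illustration only: the helper names (`matmul`, `transpose`, `newton_schulz`), the Frobenius-norm pre-scaling, and the step count are assumptions, and the PR's implementation may well use different polynomial coefficients.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Transpose a matrix given as nested lists."""
    return [list(col) for col in zip(*a)]

def newton_schulz(g, steps=15):
    """Approximate the orthogonal polar factor of g.

    Uses the cubic iteration X <- 1.5 X - 0.5 X X^T X, which converges when
    the singular values of the starting X lie in (0, sqrt(3)). Dividing by
    the Frobenius norm (an assumption of this sketch) guarantees that.
    """
    norm = math.sqrt(sum(v * v for row in g for v in row)) + 1e-12
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxt_x)]
    return x

# A symmetric positive-definite input has the identity as its polar factor.
q = newton_schulz([[3.0, 1.0], [1.0, 2.0]])
```

The iteration pushes every singular value toward 1 while leaving the singular vectors unchanged, which is why it can stand in for an SVD-based orthogonalization.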

Port the inference-time forward pass of the ELO learned optimizer from the original JAX implementation (ELO_AdafacMLPLOpt) into pure PyTorch. At inference the ELO expert mechanism is disabled, so the update reduces to the Adafactor-MLP forward, with features and meta-model (MetaMLP, 39 inputs / 2 outputs) identical to AdafacLO. ELO differs only in using raw accumulator decays, a warmup-then-constant (optionally cosine) LR schedule, and the update rule p -= lr * (dir*exp(mag*exp_mult) + wd*p).

- pylo/optim/ELO_naive.py: ELO_naive optimizer (reuses MetaMLP and the AdafacLO feature helpers); default meta-model DiamondXL/elo.
- scripts/convert_elo_checkpoint.py: convert a JAX/Haiku ELO theta into a MetaMLP state_dict (transposes the dense weights).
- tests/test_elo.py: step/update, state-dict resume, and a JAX numerical-alignment test (matches the reference to ~1.5e-8; auto-skips when the JAX source is unavailable).
- Register ELO_naive / ELO in the pylo and pylo.optim inits.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
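The update rule quoted in this commit can be written out for a single scalar parameter as follows. This is a sketch, not the PR's code: the function name `elo_update` and the default value of `exp_mult` are assumptions chosen for illustration, and `direction` / `magnitude` stand for the two MetaMLP outputs.

```python
import math

def elo_update(p, direction, magnitude, lr, exp_mult=0.001, weight_decay=0.0):
    """Apply p -= lr * (dir * exp(mag * exp_mult) + wd * p) to one scalar."""
    step = direction * math.exp(magnitude * exp_mult)
    return p - lr * (step + weight_decay * p)

# With magnitude 0 the exponential factor is 1, so the step is lr * direction.
p_plain = elo_update(1.0, direction=1.0, magnitude=0.0, lr=0.1)
# Weight decay adds lr * wd * p inside the same bracket.
p_decayed = elo_update(1.0, direction=1.0, magnitude=0.0, lr=0.1, weight_decay=0.1)
```

Because the weight-decay term sits inside the bracket multiplied by `lr`, any external schedule applied to `lr` scales the decay as well.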

…ests

End-to-end comparison against the real JAX optimizers revealed that the standalone Celo2LOpt drives its LR schedule through an optax chain whose step count starts at 0 (so the first update uses schedule(0)), whereas ELO_Celo2LOpt evaluates the schedule at iteration+1. CELO2_naive was 1-indexed (matching ELO-CELO2), leaving a ~1.8e-3 warmup-phase discrepancy vs Celo2LOpt.

- Add a per-class LR-schedule offset: CELO2_naive is 0-indexed (matches Celo2LOpt), ELO_CELO2_naive overrides to 1-indexed (matches ELO_Celo2LOpt). AdamW bias correction stays 1-indexed in both.
- Add test_jax_end_to_end_alignment: drives the real Celo2LOpt / ELO_Celo2LOpt over a multi-step trajectory with a 2D weight + 1D bias, nonzero weight decay and enabled gradient clipping, exercising the full step() (1D AdamW, schedule, weight decay, global-norm clipping). Both match the reference to ~6e-8 (previously only the 2D core was verified).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
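The indexing difference described in this commit is easy to see on a toy warmup + cosine schedule. The sketch below is an assumption-laden illustration: `warmup_cosine`, `lr_at`, and all numeric values are hypothetical and only mimic the general shape of such a schedule, not the exact one used by either optimizer.

```python
import math

def warmup_cosine(count, init_lr, peak_lr, end_lr, warmup_steps, total_steps):
    """Linear warmup from init_lr to peak_lr, then cosine decay to end_lr."""
    if count < warmup_steps:
        return init_lr + (peak_lr - init_lr) * count / warmup_steps
    progress = min(1.0, (count - warmup_steps) / max(1, total_steps - warmup_steps))
    return end_lr + 0.5 * (peak_lr - end_lr) * (1.0 + math.cos(math.pi * progress))

def lr_at(iteration, offset, **sched):
    """Evaluate the schedule at iteration + offset (0- or 1-indexed)."""
    return warmup_cosine(iteration + offset, **sched)

sched = dict(init_lr=0.0, peak_lr=1e-3, end_lr=1e-5, warmup_steps=100, total_steps=1000)
lr_zero_indexed = lr_at(0, offset=0, **sched)  # first update uses schedule(0)
lr_one_indexed = lr_at(0, offset=1, **sched)   # first update uses schedule(1)
```

During warmup the two conventions differ by one warmup increment at every step. When `init_lr == peak_lr == end_lr` the schedule is constant and the two indexings coincide.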

Keep the existing VeLO_CUDA Quick Start intact and append an ELO-CELO2 example plus an "ELO series" entry with the arXiv link. Pure additions (no content removed from the xiao_elo README).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Merge the CUDA path from the ELO-torch line into xiao_elo (additive; the existing naive optimizers and VeLO/AdafacLO kernels are untouched):

- pylo/csrc/celo2_kernel.cu: fused feature-construction + split-input MLP forward kernel for CELO2 / ELO-CELO2
- pylo/optim/CELO2_cuda.py, ELO_CELO2_cuda.py: Python wrappers
- tests/test_celo2_cuda.py: CUDA-vs-naive numerical alignment tests
- setup.py: register the celo2_cuda_kernel CUDAExtension
- pylo/optim/__init__.py: override CELO2 / ELO_CELO2 to the CUDA variants when the extension is available, falling back to naive otherwise
- .gitignore: ignore *.pickle LO checkpoints

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
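The "CUDA when available, naive otherwise" wiring described for `pylo/optim/__init__.py` follows a common optional-extension pattern. A minimal sketch, assuming a hypothetical helper `pick_backend` and placeholder implementation objects (the real module does not necessarily look like this):

```python
import importlib

def pick_backend(extension_name, cuda_impl, naive_impl):
    """Return cuda_impl if the compiled extension imports, else naive_impl."""
    try:
        importlib.import_module(extension_name)
    except ImportError:
        # Extension was not built or cannot be loaded: use the pure-PyTorch path.
        return naive_impl
    return cuda_impl

# Placeholder strings stand in for the optimizer classes.
chosen_missing = pick_backend("extension_that_is_not_installed", "cuda", "naive")
chosen_present = pick_backend("math", "cuda", "naive")  # stdlib module always imports
```

Resolving the alias once at import time keeps user code identical on machines with and without the compiled kernel.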

Optimizers no longer carry an internal warmup/cosine schedule; drive it with an external torch.optim.lr_scheduler via a plain `lr`. Applies to CELO2 (naive + cuda), ELO-CELO2 (naive + cuda), and ELO (naive).

- CELO2_naive/CELO2_cuda: AdamW branch for 1D/embedding params now keeps its own exp_avg/exp_avg_sq moments (adam_betas/adam_eps), decoupled from the learned optimizer's momentum/RMS accumulators (numerically identical at default betas).
- ELO_CELO2_naive/ELO_CELO2_cuda: no longer subclass CELO2; standalone classes that keep the original shared-accumulator AdamW design.
- ELO_naive: drop num_steps/step_mult/warmup schedule, keep iteration counter for the tanh_embedding meta-model feature.
- Add ELO_CUDA: CUDA ELO reusing the cuda_lo kernel (shared with AdafacLO), raw decays, lr-only scaling, exact pre-step weight-decay match; wired in __init__ and covered by tests/test_elo_cuda.py.
- Fix device selection in naive optimizers to follow the parameters' device instead of torch.cuda.is_available() (CELO2/ELO/ELO-CELO2/AdafacLO/VeLO).
- README quick start + tests updated to the new lr API; JAX end-to-end alignment pinned to a constant schedule (init_lr==peak_lr==end_lr) so it still holds.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
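As a reference for the decoupled AdamW branch mentioned in this commit, here is the textbook AdamW step for one scalar with its own `exp_avg` / `exp_avg_sq` state and 1-indexed bias correction. It is a generic sketch under standard AdamW assumptions, not the PR's implementation; the function name `adamw_step` and the default values are illustrative.

```python
import math

def adamw_step(p, g, exp_avg, exp_avg_sq, step,
               lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0):
    """One AdamW update for a scalar parameter; `step` starts at 1."""
    beta1, beta2 = betas
    # First and second moment estimates, kept separately from any other state.
    exp_avg = beta1 * exp_avg + (1.0 - beta1) * g
    exp_avg_sq = beta2 * exp_avg_sq + (1.0 - beta2) * g * g
    # Bias correction uses the 1-indexed step count.
    m_hat = exp_avg / (1.0 - beta1 ** step)
    v_hat = exp_avg_sq / (1.0 - beta2 ** step)
    # Decoupled weight decay, then the Adam step.
    p = p * (1.0 - lr * weight_decay)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, exp_avg, exp_avg_sq

p, m, v = adamw_step(1.0, 2.0, exp_avg=0.0, exp_avg_sq=0.0, step=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr` regardless of gradient scale.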

Pauljanson002

Looks good to me

Copilot

Pull request overview

Adds ELO-series learned optimizers (ELO, CELO2, and ELO-CELO2) to PyLO, including both naive (pure PyTorch) and CUDA-accelerated implementations, plus alignment tests and checkpoint-conversion utilities.

Changes:

- Introduces new optimizer implementations: ELO_naive/ELO_CUDA, CELO2_naive/CELO2_CUDA, and ELO_CELO2_naive/ELO_CELO2_CUDA.
- Adds CUDA kernel + build plumbing for celo2_cuda_kernel, and wires new optimizers into pylo.optim/top-level exports.
- Adds CPU/JAX and CUDA-vs-naive numerical-alignment tests, plus scripts to convert JAX checkpoints to PyTorch state_dicts.

Reviewed changes

Copilot reviewed 22 out of 23 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
tests/test_elo.py	ELO naive smoke/resume + optional JAX alignment
tests/test_elo_cuda.py	CUDA ELO vs naive trajectory alignment tests
tests/test_celo2.py	CELO2/ELO-CELO2 smoke/resume + optional JAX alignment
tests/test_celo2_cuda.py	CUDA CELO2/ELO-CELO2 vs naive alignment tests
setup.py	Adds build of `celo2_cuda_kernel` extension
scripts/convert_elo_checkpoint.py	Converts JAX ELO theta → `MetaMLP` state_dict
scripts/convert_celo2_checkpoint.py	Converts JAX CELO2 theta → `CELO2MLP` state_dict
README.md	Documents ELO-series + updates Quick Start example
pylo/optim/Velo_naive.py	Device selection now follows params’ device
pylo/optim/velo_cuda.py	Device selection now follows params’ device
pylo/optim/ELO_naive.py	New pure-PyTorch ELO optimizer
pylo/optim/ELO_cuda.py	New CUDA ELO optimizer using `cuda_lo` kernel
pylo/optim/ELO_CELO2_naive.py	New pure-PyTorch ELO-CELO2 optimizer
pylo/optim/ELO_CELO2_cuda.py	New CUDA ELO-CELO2 optimizer using `celo2_cuda_kernel`
pylo/optim/CELO2_naive.py	New pure-PyTorch CELO2 optimizer
pylo/optim/CELO2_cuda.py	New CUDA CELO2 optimizer using `celo2_cuda_kernel`
pylo/optim/AdafacLO_naive.py	Device selection now follows params’ device
pylo/optim/__init__.py	Exports/wires CELO2/ELO/ELO-CELO2 + CUDA fallbacks
pylo/models/CELO2_MLP.py	Adds CELO2 split-input MLP meta-model
pylo/models/__init__.py	Exports `CELO2MLP`
pylo/csrc/celo2_kernel.cu	Adds fused CELO2 feature+MLP CUDA kernel
pylo/__init__.py	Re-exports new optimizers and aliases
.gitignore	Ignores `*.pickle` learned-optimizer checkpoints
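The two conversion scripts in the table map JAX/Haiku parameters onto PyTorch state_dicts, and one commit notes that the dense weights are transposed. The sketch below shows why: Haiku's `hk.Linear` stores `w` as (in, out), while `torch.nn.Linear.weight` is (out, in). The layer names, the `name_map` argument, and the nested-list representation are all hypothetical; the real scripts are not shown in this excerpt.

```python
def transpose(w):
    """Transpose a matrix given as nested lists."""
    return [list(col) for col in zip(*w)]

def convert_dense_layers(theta, name_map):
    """Build a state_dict-like mapping from a Haiku-style parameter tree.

    theta:    {haiku_module_name: {"w": (in, out) matrix, "b": bias vector}}
    name_map: {haiku_module_name: torch_prefix}  (hypothetical naming)
    """
    state = {}
    for haiku_name, torch_prefix in name_map.items():
        state[torch_prefix + ".weight"] = transpose(theta[haiku_name]["w"])
        state[torch_prefix + ".bias"] = list(theta[haiku_name]["b"])
    return state

theta = {"mlp/linear_0": {"w": [[1, 2, 3], [4, 5, 6]], "b": [0, 0, 0]}}
state = convert_dense_layers(theta, {"mlp/linear_0": "network.0"})
```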


+    naive_traj = _run(CELO2_naive, init_vals, grads_seq, net, **cfg)
+    cuda_traj = _run(CELO2_CUDA, init_vals, grads_seq, net, **cfg)
+
+    assert _max_traj_diff(naive_traj, cuda_traj) < 1e-5
+


+    naive_traj = _run(ELO_CELO2_naive, init_vals, grads_seq, net, **cfg)
+    cuda_traj = _run(ELO_CELO2_CUDA, init_vals, grads_seq, net, **cfg)
+
+    assert _max_traj_diff(naive_traj, cuda_traj) < 1e-5
+


+    naive_traj = _run(CELO2_naive, init, grads, net, **cfg)
+    cuda_traj = _run(CELO2_CUDA, init, grads, net, **cfg)
+
+    assert _max_traj_diff(naive_traj, cuda_traj) < 1e-5
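The three excerpts above all compare a naive and a CUDA trajectory with a `_max_traj_diff` helper whose body is not shown here. One plausible reading, sketched in plain Python under the assumption that a trajectory is a list of per-step flat parameter lists (the name `max_traj_diff` and this data layout are hypothetical):

```python
def max_traj_diff(traj_a, traj_b):
    """Largest absolute element-wise difference between two trajectories."""
    assert len(traj_a) == len(traj_b), "trajectories must have the same number of steps"
    worst = 0.0
    for step_a, step_b in zip(traj_a, traj_b):
        for a, b in zip(step_a, step_b):
            worst = max(worst, abs(a - b))
    return worst

diff = max_traj_diff([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [3.0, 4.5]])
```

Taking the maximum over every step, rather than only the final parameters, catches divergence that appears mid-trajectory and later cancels out.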


xiaol827 and others added 7 commits June 9, 2026 17:23

Update README.md

6ed7398

xiaol827 requested a review from Pauljanson002 June 20, 2026 05:16

Pauljanson002 reviewed Jun 22, 2026


bentherien requested a review from Copilot June 22, 2026 21:47

Copilot started reviewing on behalf of bentherien June 22, 2026 21:47

Copilot AI reviewed Jun 22, 2026


xiaol827 merged commit d05c6c1 into main Jun 23, 2026
1 check passed

xiaol827 merged 7 commits into main from xiao_elo

3 participants
