P1: Anima loss=nan with Automagic or CAME optimizer #51

@wochenlong

User report

A user reported that Anima training reaches loss=nan within the first few steps when using the Automagic or pytorch_optimizer.CAME optimizer.

Local report bundle (maintainer-only, not in repo):

  • automagic.json / came.json (reporter TOML exports)
  • screenshot showing finite loss then avr_loss=nan within the first few steps; log shows unet dtype: torch.bfloat16

Reported optimizer configs

CAME case:

  • optimizer_type = "pytorch_optimizer.CAME"
  • learning_rate = "2e-5"
  • unet_lr = "2e-5"
  • lr_scheduler = "constant_with_warmup"
  • lr_warmup_steps = 100
  • resolution = "768,768"
  • save_precision = "bf16"

Automagic case:

  • optimizer_type = "Automagic"
  • learning_rate = "1e-6"
  • unet_lr = "1e-6"
  • lr_scheduler = "constant"
  • resolution = "768,768"
  • save_precision = "bf16"

Local reproduction attempt

Test environment (generic):

  • Local Anima base model under sd-models/anima/
  • Small private test dataset (not named in this issue)
  • PyTorch: 2.6.0+cu124
  • GPU: RTX 4090

Short 8-step smoke runs did not reproduce NaN locally:

  • pytorch_optimizer.CAME: completed 8 steps, final avr_loss=0.113
  • Automagic: completed 8 steps, final avr_loss=0.102

This suggests the NaN is not an unconditional optimizer failure. More likely it needs a risky precision/optimizer combination together with at least one further factor: the reporter's environment, data order, a longer run, or PyTorch/runtime differences.

How to try reproducing the old failure mode

  1. Use reporter configs above with Anima LoRA training.
  2. Force the risky combo in the adapted *-sd-scripts.toml (or WebUI advanced): mixed_precision="bf16" and full_bf16=true (or full_fp16=true).
  3. Run at least 50 steps, preferably up to 200 (8-step smoke runs may stay finite).
  4. Check the adapted TOML: if full_bf16 is absent, the backend mitigation is active (apply_anima_training_defaults strips it for CAME/Automagic).

On current main, the backend normally blocks step 2. To force the risky combination you have to bypass the backend or use a build that predates the mitigation.

Suspected risk factor

The backend previously auto-promoted Anima mixed_precision=bf16 into full_bf16=true. That puts the trainable LoRA weights and their gradients in bf16 as well.

Upstream Anima docs recommend mixed_precision="bf16" but do not require full_bf16. They also note that if the loss becomes NaN, you should make sure PyTorch is 2.5 or newer.

For adaptive optimizers like Automagic and pytorch_optimizer.CAME, full half-precision trainable weights are a high-risk default.
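One known hazard of bf16 trainable weights is that small updates are rounded away, because bfloat16 keeps only 7 explicit mantissa bits. The sketch below illustrates that effect with the standard library only. It emulates bf16 rounding by keeping the top 16 bits of the float32 bit pattern, and it uses a made-up constant gradient together with the reported CAME learning rate. It does not reproduce the NaN and does not claim to be the mechanism behind this report.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even).

    Emulation for illustration only: bfloat16 is the upper 16 bits of the
    float32 pattern. NaN and infinity handling is deliberately left out.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


lr = 2e-5    # learning rate from the reported CAME config
grad = 1.0   # assumed constant gradient, purely illustrative

w_bf16 = 1.0  # weight stored in (emulated) bf16
w_high = 1.0  # weight stored in higher precision

for _ in range(100):
    w_bf16 = to_bf16(w_bf16 - lr * grad)  # update rounded back to bf16
    w_high = w_high - lr * grad           # update kept at full precision

print(w_bf16)  # still 1.0: every update was rounded away
print(w_high)  # has moved by about 100 * lr
```

Near 1.0 the spacing between adjacent bf16 values is far larger than 2e-5, so the bf16 weight never moves, while the higher-precision copy does. Keeping trainable weights in fp32 avoids this class of problem.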

Mitigation (implemented on main)

  • Stop auto-enabling full_bf16 / full_fp16 for Anima just because mixed_precision is set.
  • If Anima optimizer is Automagic or pytorch_optimizer.CAME, automatically remove full_bf16 / full_fp16 from submitted config and log a warning.
  • Keep mixed_precision=bf16 valid, but keep trainable LoRA weights in fp32 for stability.
  • Tests: tests/test_anima_training_defaults.py
  • Docs: docs/anima-training.md (NaN troubleshooting)
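The stripping rule described in the list above can be sketched as follows. This is a hypothetical stand-in, not the actual apply_anima_training_defaults code: the function name, the logger name, the dict-based config, and the case-insensitive optimizer match are all assumptions made for illustration.

```python
import logging

logger = logging.getLogger("anima_defaults_sketch")  # hypothetical logger name

# Optimizers named in this issue, lower-cased for matching (an assumption).
ADAPTIVE_OPTIMIZERS = {"automagic", "pytorch_optimizer.came"}
FULL_HALF_KEYS = ("full_bf16", "full_fp16")


def strip_full_half_precision(config: dict) -> dict:
    """Return a copy of config without full_bf16 / full_fp16 for risky optimizers.

    Hypothetical illustration of the mitigation, not the backend's code.
    mixed_precision is left untouched.
    """
    adapted = dict(config)
    optimizer = str(adapted.get("optimizer_type", "")).lower()
    if optimizer in ADAPTIVE_OPTIMIZERS:
        for key in FULL_HALF_KEYS:
            # Remove the key if present; warn only when it was actually enabled.
            if adapted.pop(key, None):
                logger.warning(
                    "Removed %s for optimizer %s to keep trainable weights in fp32",
                    key,
                    adapted["optimizer_type"],
                )
    return adapted


submitted = {
    "optimizer_type": "pytorch_optimizer.CAME",
    "mixed_precision": "bf16",
    "full_bf16": True,
}
print(strip_full_half_precision(submitted))
```

Returning a copy rather than mutating the submitted config keeps the original available for logging and comparison against the adapted TOML.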

Validation

  • python -m unittest tests.test_anima_training_defaults tests.test_anima_train_wrapper tests.test_anima_backend_adapter
  • Short CAME / Automagic smoke training (8 steps) with mitigation enabled

Follow-up

If users still report NaN after this mitigation, collect:

  • PyTorch / CUDA / GPU model
  • Full training TOML after backend adaptation (*-sd-scripts.toml)
  • Step where loss first becomes NaN
  • Whether full_bf16 / full_fp16 is still present in the adapted TOML

Metadata

Labels

  • P1: High priority issue that affects user experience but is not an immediate release blocker
  • bug: Something isn't working
