User report
A user reported that Anima training can quickly turn into loss=nan when using the Automagic or pytorch_optimizer.CAME optimizer.
Local report bundle (maintainer-only, not in repo):
- automagic.json / came.json (reporter TOML exports)
- screenshot showing finite loss, then avr_loss=nan within the first few steps; log shows unet dtype: torch.bfloat16
Reported optimizer configs
CAME case:
optimizer_type = "pytorch_optimizer.CAME"
learning_rate = "2e-5"
unet_lr = "2e-5"
lr_scheduler = "constant_with_warmup"
lr_warmup_steps = 100
resolution = "768,768"
save_precision = "bf16"
Automagic case:
optimizer_type = "Automagic"
learning_rate = "1e-6"
unet_lr = "1e-6"
lr_scheduler = "constant"
resolution = "768,768"
save_precision = "bf16"
Local reproduction attempt
Test environment (generic):
- Local Anima base model under sd-models/anima/
- Small private test dataset (not named in this issue)
- PyTorch: 2.6.0+cu124
- GPU: RTX 4090
Short 8-step smoke runs did not reproduce NaN locally:
pytorch_optimizer.CAME: completed 8 steps, final avr_loss=0.113
Automagic: completed 8 steps, final avr_loss=0.102
This suggests the issue is not an unconditional optimizer crash. It is likely triggered by a risky precision/optimizer combination plus user environment, data order, longer training, or PyTorch/runtime differences.
How to try reproducing the old failure mode
1. Use the reporter configs above with Anima LoRA training.
2. Force the risky combo in the adapted *-sd-scripts.toml (or WebUI advanced): mixed_precision="bf16" and full_bf16=true (or full_fp16=true).
3. Run ≥ 50–200 steps (8-step smoke runs may stay finite).
4. Confirm in the adapted TOML: if full_bf16 is absent, the backend mitigation is active (apply_anima_training_defaults strips it for CAME/Automagic).
On current main, step 2 is normally blocked unless you bypass the backend or use a build before the mitigation.
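The confirmation step can be scripted. The following is a minimal sketch, assuming the adapted file is valid TOML and that the flags are stored as booleans. The helper names find_enabled_flags and check_adapted_toml are hypothetical and not part of the backend. The sketch walks nested tables because the exact layout of *-sd-scripts.toml is not described in this issue.

```python
from typing import Any

# Flags that switch trainable weights to full half precision.
RISKY_FLAGS = ("full_bf16", "full_fp16")


def find_enabled_flags(config: dict[str, Any]) -> list[str]:
    """Return the risky flags that are set to a truthy value anywhere in config."""
    found: list[str] = []
    for key, value in config.items():
        if isinstance(value, dict):
            # Recurse into TOML tables; the file layout is an assumption here.
            found.extend(find_enabled_flags(value))
        elif key in RISKY_FLAGS and value:
            found.append(key)
    return found


def check_adapted_toml(path: str) -> list[str]:
    """Load an adapted TOML file and report which risky flags are enabled."""
    import tomllib  # standard library in Python 3.11+

    with open(path, "rb") as fh:
        return find_enabled_flags(tomllib.load(fh))
```

An empty list means neither flag is enabled, which is the expected result for CAME/Automagic when the mitigation is active.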
Suspected risk factor
The backend previously auto-promoted Anima mixed_precision=bf16 into full_bf16=true. That makes trainable LoRA weights/gradients run in bf16 too.
Upstream Anima docs recommend mixed_precision="bf16", but do not require full_bf16. They also mention that if loss becomes NaN, PyTorch should be 2.5 or newer.
For adaptive optimizers like Automagic and pytorch_optimizer.CAME, full half-precision trainable weights are a high-risk default.
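To give a feel for how coarse bf16 is at the reported learning rates, here is a small standard-library sketch that emulates bf16 rounding. It does not reproduce the NaN and is not a claim about the root cause; it only shows that an update of the size implied by learning_rate = 1e-6 can vanish entirely when the weight itself is stored in bf16. The weight value 0.01 and the gradient magnitude 1.0 are assumptions chosen for illustration.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    bits += 0x7FFF + ((bits >> 16) & 1)                  # round to nearest, ties to even
    bits &= 0xFFFF0000                                   # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_fp32(x: float) -> float:
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


weight = to_bf16(0.01)   # assumed weight magnitude, exactly representable in bf16
update = 1e-6 * 1.0      # reported Automagic learning rate times an assumed gradient of 1.0

# In bf16 the spacing between neighbouring values near 0.01 is about 6e-5,
# so subtracting 1e-6 rounds straight back to the original weight.
update_lost_in_bf16 = to_bf16(weight - update) == weight

# In fp32 the spacing near 0.01 is about 1e-9, so the same update is kept.
update_kept_in_fp32 = to_fp32(weight - update) != weight

print(update_lost_in_bf16, update_kept_in_fp32)
```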
Mitigation (implemented on main)
- Stop auto-enabling full_bf16 / full_fp16 for Anima just because mixed_precision is set.
- If the Anima optimizer is Automagic or pytorch_optimizer.CAME, automatically remove full_bf16 / full_fp16 from the submitted config and log a warning.
- Keep mixed_precision=bf16 valid, but keep trainable LoRA weights in fp32 for stability.
- Tests: tests/test_anima_training_defaults.py
- Docs: docs/anima-training.md (NaN troubleshooting)
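The stripping rule described above could look roughly like the sketch below. This is an illustration only, not the actual apply_anima_training_defaults code: the function name strip_full_half_precision, the flat dict config shape, and the case-insensitive optimizer match are all assumptions.

```python
import logging

logger = logging.getLogger(__name__)

FULL_HALF_FLAGS = ("full_bf16", "full_fp16")
# Optimizers named in this issue; compared in lower case (an assumption).
SENSITIVE_OPTIMIZERS = {"automagic", "pytorch_optimizer.came"}


def strip_full_half_precision(config: dict) -> dict:
    """Return a copy of config without full_bf16 / full_fp16 for sensitive optimizers.

    Hypothetical illustration of the mitigation, not the backend's real function.
    """
    cleaned = dict(config)  # do not mutate the submitted config
    optimizer = str(cleaned.get("optimizer_type", "")).lower()
    if optimizer not in SENSITIVE_OPTIMIZERS:
        return cleaned

    for flag in FULL_HALF_FLAGS:
        # pop() drops the key either way; warn only if it was actually enabled.
        if cleaned.pop(flag, None):
            logger.warning(
                "Removed %s for optimizer %s; trainable weights stay in fp32.",
                flag,
                cleaned["optimizer_type"],
            )
    return cleaned
```

Note that mixed_precision is left untouched, so bf16 autocast remains available while the trainable LoRA weights stay in fp32.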
Validation
- python -m unittest tests.test_anima_training_defaults tests.test_anima_train_wrapper tests.test_anima_backend_adapter
- Short CAME / Automagic smoke training (8 steps) with mitigation enabled
Follow-up
If users still report NaN after this mitigation, collect:
- PyTorch / CUDA / GPU model
- Full training TOML after backend adaptation (*-sd-scripts.toml)
- Step where loss first becomes NaN
- Whether full_bf16 / full_fp16 is still present in the adapted TOML
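To make the first item easy for reporters to provide, a short script along these lines could be shared. It is a sketch that assumes the script is run in the same Python environment as training; collect_environment is a hypothetical helper name. It uses only torch.__version__, torch.version.cuda, torch.cuda.is_available() and torch.cuda.get_device_name(), and falls back to None values if PyTorch is not importable.

```python
import platform


def collect_environment() -> dict:
    """Gather the PyTorch / CUDA / GPU details requested for NaN reports."""
    info = {
        "python": platform.python_version(),
        "torch": None,
        "cuda": None,
        "gpu": None,
    }
    try:
        import torch
    except ImportError:
        return info  # PyTorch missing in this environment; report what we have

    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None for CPU-only builds
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```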