P1: Anima loss=nan with Automagic or CAME optimizer #51

## User report

A user reported that Anima training can quickly turn into `loss=nan` when using the `Automagic` or `pytorch_optimizer.CAME` optimizer.

Local report bundle (maintainer-only, not in repo):

- `automagic.json` / `came.json` (reporter TOML exports)
- screenshot showing finite loss then `avr_loss=nan` within the first few steps; log shows `unet dtype: torch.bfloat16`

## Reported optimizer configs

CAME case:

- `optimizer_type = "pytorch_optimizer.CAME"`
- `learning_rate = "2e-5"`
- `unet_lr = "2e-5"`
- `lr_scheduler = "constant_with_warmup"`
- `lr_warmup_steps = 100`
- `resolution = "768,768"`
- `save_precision = "bf16"`

Automagic case:

- `optimizer_type = "Automagic"`
- `learning_rate = "1e-6"`
- `unet_lr = "1e-6"`
- `lr_scheduler = "constant"`
- `resolution = "768,768"`
- `save_precision = "bf16"`

## Local reproduction attempt

Test environment (generic):

- Local Anima base model under `sd-models/anima/`
- Small private test dataset (not named in this issue)
- PyTorch: `2.6.0+cu124`
- GPU: RTX 4090

Short 8-step smoke runs did **not** reproduce NaN locally:

- `pytorch_optimizer.CAME`: completed 8 steps, final `avr_loss=0.113`
- `Automagic`: completed 8 steps, final `avr_loss=0.102`

This suggests the issue is not an unconditional optimizer crash. More likely it needs a risky precision/optimizer combination together with at least one other factor: the user's environment, data order, a longer run, or PyTorch/runtime differences.

### How to try reproducing the *old* failure mode

1. Use reporter configs above with Anima LoRA training.
2. Force the risky combo in the **adapted** `*-sd-scripts.toml` (or WebUI advanced): `mixed_precision="bf16"` **and** `full_bf16=true` (or `full_fp16=true`).
3. Run **≥ 50–200 steps** (8-step smokes may stay finite).
4. Check the adapted TOML: if `full_bf16` / `full_fp16` is absent, the backend mitigation is active (`apply_anima_training_defaults` strips it for CAME/Automagic) and the run is not exercising the old failure mode.

On current `main`, step 2 is normally blocked unless you bypass the backend or use a build before the mitigation.

## Suspected risk factor

The backend previously auto-promoted Anima `mixed_precision=bf16` into `full_bf16=true`. With that flag set, the trainable LoRA weights and their gradients are kept in bf16 as well, not just the forward/backward compute.

Upstream Anima docs recommend `mixed_precision="bf16"`, but do not require `full_bf16`. They also mention that if loss becomes NaN, PyTorch should be 2.5 or newer.

For adaptive optimizers like `Automagic` and `pytorch_optimizer.CAME`, full half-precision trainable weights are a high-risk default.
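To get a feel for how coarse bf16 storage is, the sketch below emulates bf16 rounding in pure Python (no PyTorch needed). It only illustrates the precision loss of bf16 weights; it does not reproduce the NaN, and the exact mechanism behind the reported NaN is not established in this issue. `to_bf16` is an illustrative helper, and the unit-size optimizer step is an assumption.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even).

    bfloat16 keeps the top 16 bits of the float32 encoding, so we round
    the float32 bit pattern and zero the low 16 bits.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even bias
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    weight = to_bf16(1.0)
    # Reported CAME learning rate times an assumed unit-size optimizer step.
    update = 2e-5 * 1.0
    new_weight = to_bf16(weight - update)
    # The update is smaller than half the bf16 spacing near 1.0,
    # so the stored weight does not change at all.
    print(new_weight == weight)
```

With fp32 trainable weights the same update is retained, which is one reason to keep `mixed_precision="bf16"` for compute while leaving the LoRA weights in fp32.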

## Mitigation (implemented on main)

- Stop auto-enabling `full_bf16` / `full_fp16` for Anima just because `mixed_precision` is set.
- If Anima optimizer is `Automagic` or `pytorch_optimizer.CAME`, automatically remove `full_bf16` / `full_fp16` from submitted config and log a warning.
- Keep `mixed_precision=bf16` valid, but keep trainable LoRA weights in fp32 for stability.
- Tests: `tests/test_anima_training_defaults.py`
- Docs: `docs/anima-training.md` (NaN troubleshooting)
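The second bullet can be pictured roughly as follows. This is a hedged sketch of the rule, not the actual `apply_anima_training_defaults` code: the function name `strip_full_half_precision`, the case-insensitive optimizer match, and the plain-dict config are all assumptions made for illustration.

```python
import logging

logger = logging.getLogger(__name__)

# Optimizers named in this issue, compared case-insensitively (assumption).
RISKY_OPTIMIZERS = {"automagic", "pytorch_optimizer.came"}
FULL_HALF_FLAGS = ("full_bf16", "full_fp16")


def strip_full_half_precision(config: dict) -> dict:
    """Return a copy of `config` without full_bf16/full_fp16 for risky optimizers.

    Hypothetical illustration of the mitigation rule; `mixed_precision`
    is left untouched so bf16 autocast stays valid.
    """
    adapted = dict(config)
    optimizer = str(adapted.get("optimizer_type", "")).strip().lower()
    if optimizer in RISKY_OPTIMIZERS:
        for flag in FULL_HALF_FLAGS:
            # Remove the key either way; warn only if it was actually enabled.
            if adapted.pop(flag, False):
                logger.warning(
                    "Removed %s for optimizer %s to keep LoRA weights in fp32",
                    flag,
                    adapted["optimizer_type"],
                )
    return adapted


if __name__ == "__main__":
    submitted = {
        "optimizer_type": "pytorch_optimizer.CAME",
        "mixed_precision": "bf16",
        "full_bf16": True,
    }
    print(strip_full_half_precision(submitted))
```

Returning a copy keeps the submitted config intact, which makes it easy to log or diff what the backend changed.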

## Validation

- `python -m unittest tests.test_anima_training_defaults tests.test_anima_train_wrapper tests.test_anima_backend_adapter`
- Short CAME / Automagic smoke training (8 steps) with mitigation enabled

## Follow-up

If users still report NaN after this mitigation, collect:

- PyTorch / CUDA / GPU model
- Full training TOML after backend adaptation (`*-sd-scripts.toml`)
- Step where loss first becomes NaN
- Whether `full_bf16` / `full_fp16` is still present in the adapted TOML
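Reporters could gather the first and third items with a small script along these lines. It is only a sketch: `collect_environment` and `first_nan_step` are hypothetical helper names, and the loss values are assumed to have been copied out of the training log by hand.

```python
import math
import platform


def collect_environment() -> dict:
    """Collect Python / PyTorch / CUDA / GPU details for a bug report."""
    info = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        info["torch"] = "not installed"
        return info
    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None on CPU-only builds
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
    else:
        info["gpu"] = "no CUDA device visible"
    return info


def first_nan_step(losses):
    """Return the 1-based step of the first NaN loss, or None if all are finite."""
    for step, loss in enumerate(losses, start=1):
        if math.isnan(loss):
            return step
    return None


if __name__ == "__main__":
    print(collect_environment())
    # Example loss values typed in from a log (made up for illustration).
    print(first_nan_step([0.12, 0.11, float("nan"), float("nan")]))
```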
