
NaN loss during training #10

@ItsThanhTung

Hi team, thanks for sharing this great work. I have a problem when training with train.sh on a 40GB A100. I set batch_size=2 and gradient_accumulation_steps=16, and tried LR=5e-5 and LR=2.5e-5. The training loss becomes NaN for both learning rates.
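For context, this is the kind of guard I can wrap around the update step to find where the loss first goes non-finite. It is only a minimal sketch of a standard PyTorch loop, not the repo's training script; `optimizer`, `loss`, and `step` are placeholders for whatever train.sh actually uses:

```python
# Minimal sketch (not the repo's training code): report and skip a non-finite
# loss instead of letting it poison the weights on the next optimizer step.
import torch

torch.autograd.set_detect_anomaly(True)  # slow, but names the op that produced the NaN/Inf


def guarded_step(loss: torch.Tensor, optimizer: torch.optim.Optimizer, step: int) -> bool:
    """Backward + update only when the loss is finite; otherwise skip this step."""
    if not torch.isfinite(loss):
        print(f"step {step}: non-finite loss {loss.item()}, skipping update")
        optimizer.zero_grad(set_to_none=True)
        return False
    loss.backward()
    # Clip gradients so a single spike does not blow up later steps.
    params = [p for group in optimizer.param_groups for p in group["params"]]
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return True
```

With gradient accumulation the check would sit on the accumulated loss just before each optimizer.step(), but the idea is the same.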

```
03/21/2024 18:26:37 - INFO - __main__ - Loaded lora parameters into model
03/21/2024 18:26:37 - INFO - accelerate.checkpointing - All model weights loaded successfully
03/21/2024 18:26:37 - INFO - accelerate.checkpointing - All optimizer states loaded successfully
03/21/2024 18:26:37 - INFO - accelerate.checkpointing - All scheduler states loaded successfully
03/21/2024 18:26:37 - INFO - accelerate.checkpointing - All random states loaded successfully
03/21/2024 18:26:37 - INFO - accelerate.accelerator - Loading in 0 custom states
Steps:   0%|          | 0/1571000 [00:00<?, ?it/s]
03/21/2024 18:26:37 - INFO - __main__ - Running validation...
{'timestep_spacing'} was not found in config. Values will be initialized to default values.
Loaded scheduler as PNDMScheduler from `scheduler` subfolder of stabilityai/stable-diffusion-2-1-base.
Loaded feature_extractor as CLIPImageProcessor from `feature_extractor` subfolder of stabilityai/stable-diffusion-2-1-base.
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of stabilityai/stable-diffusion-2-1-base.
Loading pipeline components...: 100%|██████████| 6/6 [00:00<00:00, 20.30it/s]
{'use_karras_sigmas', 'solver_type', 'lambda_min_clipped', 'timestep_spacing', 'sample_max_value', 'dynamic_thresholding_ratio', 'solver_order', 'thresholding', 'variance_type', 'algorithm_type', 'lower_order_final'} was not found in config. Values will be initialized to default values.
100%|██████████| 50/50 [00:09<00:00,  5.04it/s]
writing inference outputs failed module 'ffmpeg' has no attribute 'input'
03/21/2024 18:26:49 - INFO - __main__ - Running training...
Steps:   0%|          | 0/1571000 [00:31<?, ?it/s, lr=2.5e-5, step_loss=0.0496]
/lustre/scratch/client/vinai/users/tungdt33/env/viewdiff/lib/python3.10/site-packages/torch/autograd/__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1280, 1280, 1, 1], strides() = [1280, 1, 1280, 1280]
bucket_view.sizes() = [1280, 1280, 1, 1], strides() = [1280, 1, 1, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
/lustre/scratch/client/vinai/users/tungdt33/env/viewdiff/lib/python3.10/site-packages/torch/autograd/__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1280, 1280, 1, 1], strides() = [1280, 1, 1280, 1280]
bucket_view.sizes() = [1280, 1280, 1, 1], strides() = [1280, 1, 1, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
03/21/2024 18:27:11 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
03/21/2024 18:27:11 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
Steps:   0%|          | 8/1571000 [04:03<10584:31:31, 24.25s/it, lr=2.5e-5, step_loss=nan]
```

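Side note on the `writing inference outputs failed module 'ffmpeg' has no attribute 'input'` line: that usually means the environment has the PyPI package named `ffmpeg` rather than `ffmpeg-python`, which is the binding that provides `ffmpeg.input()`. A quick sanity check (my assumption, and separate from the NaN itself):

```python
# Check that the installed ffmpeg binding is ffmpeg-python, i.e. that
# ffmpeg.input() exists; otherwise writing the inference videos keeps failing.
import ffmpeg

if not hasattr(ffmpeg, "input"):
    raise ImportError(
        "This 'ffmpeg' module has no input(); install the 'ffmpeg-python' package "
        "(and remove the unrelated 'ffmpeg' PyPI package) so ffmpeg.input() is available."
    )
```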
Do you have any suggestions?
Thanks!
