
NaN loss during training #10

@ItsThanhTung

Hi team, thanks for sharing this great work. I have a problem when training with train.sh on a 40GB A100. I set batch_size=2 and gradient_accumulation_steps=16, and tried LR=5e-5 and LR=2.5e-5. The training loss becomes NaN for both learning rates.
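For context, this is the kind of guard I can wrap around the update step to find where the loss first goes non-finite. It is only a minimal sketch of a standard PyTorch loop, not the repo's training script; `optimizer`, `loss`, and `step` are placeholders for whatever train.sh actually uses:

```python
# Minimal sketch (not the repo's training code): report and skip a non-finite
# loss instead of letting it poison the weights on the next optimizer step.
import torch

torch.autograd.set_detect_anomaly(True)  # slow, but names the op that produced the NaN/Inf


def guarded_step(loss: torch.Tensor, optimizer: torch.optim.Optimizer, step: int) -> bool:
    """Backward + update only when the loss is finite; otherwise skip this step."""
    if not torch.isfinite(loss):
        print(f"step {step}: non-finite loss {loss.item()}, skipping update")
        optimizer.zero_grad(set_to_none=True)
        return False
    loss.backward()
    # Clip gradients so a single spike does not blow up later steps.
    params = [p for group in optimizer.param_groups for p in group["params"]]
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return True
```

With gradient accumulation the check would sit on the accumulated loss just before each optimizer.step(), but the idea is the same.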

```
03/21/2024 18:26:37 - INFO - __main__ - Loaded lora parameters into model
03/21/2024 18:26:37 - INFO - accelerate.checkpointing - All model weights loaded successfully
03/21/2024 18:26:37 - INFO - accelerate.checkpointing - All optimizer states loaded successfully
03/21/2024 18:26:37 - INFO - accelerate.checkpointing - All scheduler states loaded successfully
03/21/2024 18:26:37 - INFO - accelerate.checkpointing - All random states loaded successfully
03/21/2024 18:26:37 - INFO - accelerate.accelerator - Loading in 0 custom states
Steps:   0%|          | 0/1571000 [00:00<?, ?it/s]
03/21/2024 18:26:37 - INFO - __main__ - Running validation...
{'timestep_spacing'} was not found in config. Values will be initialized to default values.
Loaded scheduler as PNDMScheduler from `scheduler` subfolder of stabilityai/stable-diffusion-2-1-base.
Loaded feature_extractor as CLIPImageProcessor from `feature_extractor` subfolder of stabilityai/stable-diffusion-2-1-base.
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of stabilityai/stable-diffusion-2-1-base.
Loading pipeline components...: 100%|██████████| 6/6 [00:00<00:00, 20.30it/s]
{'use_karras_sigmas', 'solver_type', 'lambda_min_clipped', 'timestep_spacing', 'sample_max_value', 'dynamic_thresholding_ratio', 'solver_order', 'thresholding', 'variance_type', 'algorithm_type', 'lower_order_final'} was not found in config. Values will be initialized to default values.
100%|██████████| 50/50 [00:09<00:00,  5.04it/s]
writing inference outputs failed module 'ffmpeg' has no attribute 'input'
03/21/2024 18:26:49 - INFO - __main__ - Running training...
Steps:   0%|          | 0/1571000 [00:31<?, ?it/s, lr=2.5e-5, step_loss=0.0496]
/lustre/scratch/client/vinai/users/tungdt33/env/viewdiff/lib/python3.10/site-packages/torch/autograd/__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1280, 1280, 1, 1], strides() = [1280, 1, 1280, 1280]
bucket_view.sizes() = [1280, 1280, 1, 1], strides() = [1280, 1, 1, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
/lustre/scratch/client/vinai/users/tungdt33/env/viewdiff/lib/python3.10/site-packages/torch/autograd/__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1280, 1280, 1, 1], strides() = [1280, 1, 1280, 1280]
bucket_view.sizes() = [1280, 1280, 1, 1], strides() = [1280, 1, 1, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:323.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
03/21/2024 18:27:11 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
03/21/2024 18:27:11 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
Steps:   0%|          | 8/1571000 [04:03<10584:31:31, 24.25s/it, lr=2.5e-5, step_loss=nan]
```

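Side note on the `writing inference outputs failed module 'ffmpeg' has no attribute 'input'` line: that usually means the environment has the PyPI package named `ffmpeg` rather than `ffmpeg-python`, which is the binding that provides `ffmpeg.input()`. A quick sanity check (my assumption, and separate from the NaN itself):

```python
# Check that the installed ffmpeg binding is ffmpeg-python, i.e. that
# ffmpeg.input() exists; otherwise writing the inference videos keeps failing.
import ffmpeg

if not hasattr(ffmpeg, "input"):
    raise ImportError(
        "This 'ffmpeg' module has no input(); install the 'ffmpeg-python' package "
        "(and remove the unrelated 'ffmpeg' PyPI package) so ffmpeg.input() is available."
    )
```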
Do you have any suggestions?
Thanks!
