Hi,
I am pretraining TinyLlama on a Chinese dataset using your code. It has been very helpful for me, thanks.
Benchmark results

| Model | GPU | Distribution Type | Batch Size Per GPU | Gradient Accumulation Steps | GPU Memory | Speed (tokens/s) |
| --- | --- | --- | --- | --- | --- | --- |
| tinyllama | 2*RTX4090 | DeepSpeed Zero-2 | 3 | 4 | 21G | 1.8k |
| tinyllama | 2*RTX4090 | DDP | 3 | 4 | 21G | 2.7k |
| tinyllama | 2*RTX4090 | DDP | 3 | 1 | 21G | 1.5k |
| tinyllama | 1*RTX4090 | N/A | 3 | 4 | 21G | 1.8k |
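For context, the tokens/s figures can be cross-checked against the run settings with simple arithmetic. This is a minimal sketch; the sequence length of 2048 is my assumption (TinyLlama's default context window), and the step time is a hypothetical value chosen to illustrate the 2.7k tokens/s DDP row:

```python
def tokens_per_second(batch_per_gpu, grad_accum, num_gpus, seq_len, step_time_s):
    """Tokens processed per optimizer step divided by measured step time.

    One optimizer step covers grad_accum micro-batches on each GPU.
    """
    tokens_per_step = batch_per_gpu * grad_accum * num_gpus * seq_len
    return tokens_per_step / step_time_s

# Example: the 2*RTX4090 DDP row (3 per GPU x 4 accumulation x 2 GPUs x 2048 tokens).
# A measured ~18.2 s per optimizer step would correspond to roughly 2.7k tokens/s.
print(tokens_per_second(3, 4, 2, 2048, 18.2))
```

Measuring the wall-clock time of a few optimizer steps and plugging it in here makes it easy to see whether a configuration change actually moved throughput.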
Some issues:
- The token throughput is much slower than on 8*RTX3090, and DeepSpeed Zero-2 performed worse than DDP, and even no better than a single RTX4090.
- I can't set per_device_train_batch_size to a value greater than 3; if I set it to 4, auto_find_batch_size resets it to 2.
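One thing worth trying for the second issue: when auto_find_batch_size is enabled, the Trainer halves the batch size after an OOM retry rather than failing, which can silently land you on 2. A sketch of launch flags that disable it (flag names are from transformers' TrainingArguments; the script name pretrain.py is a placeholder for your actual entry point):

```shell
# Disable auto_find_batch_size so an OOM surfaces as an error instead of
# the Trainer silently shrinking per_device_train_batch_size.
torchrun --nproc_per_node=2 pretrain.py \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 4 \
  --auto_find_batch_size False
```

If this then OOMs outright at batch size 4, the limit really is GPU memory rather than the auto-finder being conservative.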
Environments
deepspeed 0.9.5
transformers 4.37.2
torch 2.0.1+cu118
flash-attn 2.4.2
Any ideas how I can improve the throughput?