Hi,
I am pretraining TinyLlama on a Chinese dataset using your code. It has been very helpful for me, thanks.
Benchmark results

| Model | GPU | Distribution Type | Batch Size Per GPU | Gradient Accumulation Steps | GPU Memory | Speed (tokens/s) |
| --- | --- | --- | --- | --- | --- | --- |
| tinyllama | 2*RTX4090 | DeepSpeed Zero-2 | 3 | 4 | 21G | 1.8k |
| tinyllama | 2*RTX4090 | DDP | 3 | 4 | 21G | 2.7k |
| tinyllama | 2*RTX4090 | DDP | 3 | 1 | 21G | 1.5k |
| tinyllama | 1*RTX4090 | N/A | 3 | 4 | 21G | 1.8k |
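For context, the tokens/s figures can be cross-checked against the run settings with simple arithmetic. This is a minimal sketch; the sequence length of 2048 is my assumption (TinyLlama's default context window), and the step time is a hypothetical value chosen to illustrate the 2.7k tokens/s DDP row:

```python
def tokens_per_second(batch_per_gpu, grad_accum, num_gpus, seq_len, step_time_s):
    """Tokens processed per optimizer step divided by measured step time.

    One optimizer step covers grad_accum micro-batches on each GPU.
    """
    tokens_per_step = batch_per_gpu * grad_accum * num_gpus * seq_len
    return tokens_per_step / step_time_s

# Example: the 2*RTX4090 DDP row (3 per GPU x 4 accumulation x 2 GPUs x 2048 tokens).
# A measured ~18.2 s per optimizer step would correspond to roughly 2.7k tokens/s.
print(tokens_per_second(3, 4, 2, 2048, 18.2))
```

Measuring the wall-clock time of a few optimizer steps and plugging it in here makes it easy to see whether a configuration change actually moved throughput.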
Some issues:
- The token throughput is much slower than on 8*RTX3090, and DeepSpeed Zero-2 performed worse than DDP, and even no better than a single RTX4090.
- I can't set per_device_train_batch_size to a value greater than 3; if I set it to 4, auto_find_batch_size resets it to 2.
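One thing worth trying for the second issue: when auto_find_batch_size is enabled, the Trainer halves the batch size after an OOM retry rather than failing, which can silently land you on 2. A sketch of launch flags that disable it (flag names are from transformers' TrainingArguments; the script name pretrain.py is a placeholder for your actual entry point):

```shell
# Disable auto_find_batch_size so an OOM surfaces as an error instead of
# the Trainer silently shrinking per_device_train_batch_size.
torchrun --nproc_per_node=2 pretrain.py \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 4 \
  --auto_find_batch_size False
```

If this then OOMs outright at batch size 4, the limit really is GPU memory rather than the auto-finder being conservative.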
Environments
deepspeed 0.9.5
transformers 4.37.2
torch 2.0.1+cu118
flash-attn 2.4.2
Any ideas how I can improve the throughput?