BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
More GPUs means more data per batch (i.e., a larger total batch size), and the gradients over a batch are averaged for back-propagation.
If the total learning rate for one batch is fixed, then the effective learning rate per example becomes smaller as the batch size grows.
If the learning rate per example is fixed, then the total learning rate for one batch becomes larger as the batch size grows.
So it seems: more GPUs --> larger total learning rate per batch --> faster training.
But since a usable learning rate is confined to roughly the 1e-5 to 1e-4 range (training diverges if it is set too high), train_samples_per_second becomes the evaluation metric for speed, and the total batch size is the determining factor.
Using 1 GPU (batch size 100) vs. 4 GPUs (batch size 400) with the same learning rate (0.00001) and the same number of pre-training steps (1,000,000) makes a difference of less than 0.1% in downstream-task accuracy.
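To make the averaging argument concrete, here is a minimal NumPy sketch (the gradient values are hypothetical placeholders, not anything produced by BERT):

```python
import numpy as np

# Illustrative per-example gradients for the two batch sizes.
rng = np.random.default_rng(0)
grads_1gpu = rng.normal(loc=1.0, scale=0.1, size=100)   # batch size 100
grads_4gpu = rng.normal(loc=1.0, scale=0.1, size=400)   # batch size 400

lr = 1e-5  # fixed learning rate, as in the comparison above

# Because gradients are averaged over the batch, the per-step update has
# roughly the same magnitude for both batch sizes ...
print(lr * grads_1gpu.mean())   # ~1e-5
print(lr * grads_4gpu.mean())   # ~1e-5

# ... so the effective step per example shrinks as the batch grows:
print(lr / 100)   # per-example learning rate at batch size 100
print(lr / 400)   # per-example learning rate at batch size 400
```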
Python 3
TensorFlow 1.14 - 1.15
0. Edit the input and output file names in create_pretraining_data.py and run_pretraining_gpu.py.
1. Run create_pretraining_data.py.
2. Run run_pretraining_gpu_v2.py.
Set n_gpus in run_pretraining_gpu_v2.py to the number of GPUs to use (see the sketch below).
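For reference, this is roughly how a multi-GPU MirroredStrategy is wired into a TF 1.14/1.15 Estimator; a hedged sketch, not the exact code in run_pretraining_gpu_v2.py (model_fn, input_fn, and the model_dir path are placeholders):

```python
import tensorflow as tf

n_gpus = 4  # set to the number of GPUs you want to use

# Mirror the model across GPUs; gradients are averaged with
# hierarchical-copy all-reduce (TF 1.14/1.15 tf.distribute API).
strategy = tf.distribute.MirroredStrategy(
    devices=["/gpu:%d" % i for i in range(n_gpus)],
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

run_config = tf.estimator.RunConfig(
    model_dir="pretraining_output",   # placeholder output directory
    train_distribute=strategy,
    save_checkpoints_steps=1000)

# estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)
# estimator.train(input_fn=input_fn, max_steps=1000000)
```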
In sample_text.txt, each sentence ends with \n and paragraphs are separated by an empty line, as in the example below.
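For example (contents are illustrative):

```
This is the first sentence of the first document.
This is its second sentence.

This is the first sentence of the second document.
This is its second sentence.
```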
Results on the Quora question pairs English dataset:
Official BERT: ACC 91.2, AUC 96.9
This BERT (pre-training loss 2.05): ACC 90.1, AUC 96.3
With MirroredStrategy and HierarchicalCopyAllReduce, the reported global_step/sec is the sum over all GPUs' steps.
Note that batch_size is the per-GPU batch size, not the global batch size.
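So when comparing speed across GPU counts, derive the global batch size and throughput yourself; a small sketch under the assumptions above (the logged rate of 8.0 is an illustrative number):

```python
n_gpus = 4
batch_size = 100                          # per-GPU, as noted above
global_batch_size = n_gpus * batch_size   # 400 examples per optimizer step

# Assumption: the logged global_step/sec is summed across GPUs, so the
# actual optimizer steps per second is the logged value divided by n_gpus.
logged_global_step_per_sec = 8.0          # illustrative value from logs
steps_per_sec = logged_global_step_per_sec / n_gpus
train_samples_per_second = steps_per_sec * global_batch_size
print(train_samples_per_second)           # 800.0 samples/sec
```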