I’m trying to use larger batch sizes.
There are several ways to get a large effective batch size:
data parallelism, gradient accumulation, and simply increasing the per-device batch size.
I ran toy experiments and found that they do not produce exactly the same results.
I compared three settings:
- batch size 64 on 1 GPU
- batch size 32 on 2 GPUs (DDP)
- batch size 32 on 1 GPU with gradient accumulation steps = 2
I used BertModel, the MRPC dataset from GLUE, and the Hugging Face Trainer (I also set the seed).
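To check my understanding, here is a toy sketch of why batch 64 and batch 32 with 2 accumulation steps should be mathematically equivalent (plain Python, no frameworks; the linear model and the numbers are made up): the full-batch gradient is just the average of the micro-batch gradients.

```python
def grad(w, xs, ys):
    """Mean-squared-error gradient of scalar weight w over one (micro-)batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# one big batch of 4
g_full = grad(w, xs, ys)

# two micro-batches of 2, gradients averaged -- what accumulation does
g_accum = (grad(w, xs[:2], ys[:2]) + grad(w, xs[2:], ys[2:])) / 2

print(g_full, g_accum)  # both -22.5: identical in exact arithmetic
```

So in exact arithmetic the update is the same, which is why I expected identical results.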
Someone said that scaling the learning rate is necessary:
How to scale learning rate with batch size for DDP training? #3706
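For reference, the scaling I've seen people mean is the linear scaling rule, i.e. scale the learning rate proportionally to the effective batch size. A minimal sketch with made-up numbers:

```python
# Linear scaling rule sketch (all numbers are hypothetical):
# if the effective batch grows by a factor k, multiply the lr by k.
base_lr, base_batch = 2e-5, 32
effective_batch = 64  # e.g. 32 per GPU x 2 GPUs, or 32 x 2 accumulation steps

scaled_lr = base_lr * effective_batch / base_batch
print(scaled_lr)  # 4e-05
```

But in my comparison all three settings have the same effective batch size of 64, so I don't see why scaling would make them agree.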
But I don’t think the gradients from the 2 GPUs are random variables; I think they are deterministic once the seed is set.
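That said, one source of tiny differences I can think of even with fixed seeds: floating-point addition is not associative, so combining the same per-example gradients in a different grouping (one big accumulated sum vs. per-GPU partial sums that are then all-reduced) can change the last bits. A pure-Python illustration:

```python
# The same three numbers summed in two different groupings:
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c  # e.g. one GPU accumulating one big batch
tree_reduce = a + (b + c)    # e.g. partial sums combined across GPUs

print(left_to_right)  # 0.6000000000000001
print(tree_reduce)    # 0.6
print(left_to_right == tree_reduce)  # False
```

But can last-bit rounding differences like this really explain the gap I'm seeing?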
Why are the results different?