Data parallelism, gradient accumulation and batch size

I’m trying to use larger batch sizes.
There are several techniques for achieving a large effective batch size:
data parallelism, gradient accumulation, and simply increasing the per-device batch size.

I ran toy experiments and found that the results are not exactly the same.

I tried three settings:

  1. batch size 64 with 1 GPU
  2. batch size 32 with 2 GPUs
  3. batch size 32 with 1 GPU and gradient accumulation steps 2

I used a BERT model, the MRPC dataset from GLUE, and the Hugging Face Trainer. (I also set the seed.)
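To make the comparison concrete, here is a small sketch (not the original script) of the effective batch size in each setting. With the Hugging Face Trainer, the per-device size and accumulation steps would correspond to `per_device_train_batch_size` and `gradient_accumulation_steps` in `TrainingArguments`; the function name below is just for illustration.

```python
def effective_batch_size(per_device_bs, num_gpus, grad_accum_steps):
    """Number of examples contributing to one optimizer step."""
    return per_device_bs * num_gpus * grad_accum_steps

# The three settings from the post all reach the same global batch size of 64.
settings = {
    "bs64, 1 GPU":            effective_batch_size(64, 1, 1),
    "bs32, 2 GPUs":           effective_batch_size(32, 2, 1),
    "bs32, 1 GPU, accum=2":   effective_batch_size(32, 1, 2),
}
print(settings)  # every value is 64
```

So all three runs take an optimizer step over 64 examples; the question is why they still diverge.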

Someone said that scaling the learning rate is necessary:
How to scale learning rate with batch size for DDP training? #3706
But I don’t think the gradients from 2 GPUs are random variables; I think they are deterministic when the seed is set.
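For reference, the scaling the linked thread talks about is usually the linear scaling rule: multiply the learning rate by the ratio of the new global batch size to the baseline one. The numbers below are illustrative, not from the original post.

```python
def scale_lr(base_lr, base_batch_size, global_batch_size):
    """Linear scaling rule: lr grows proportionally with global batch size."""
    return base_lr * global_batch_size / base_batch_size

# e.g. a baseline lr of 2e-5 tuned for batch size 32, moved to batch size 64
lr = scale_lr(base_lr=2e-5, base_batch_size=32, global_batch_size=64)
print(lr)  # 4e-05
```

Note this is a heuristic for keeping optimization behavior comparable across batch sizes, not something that makes runs bitwise-identical.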

Why are the results different?

That’s not the case, and you need to enable deterministic algorithms as explained in the docs. However, even then you cannot expect bitwise-identical results, since the workloads differ in batch size, order of communication, etc.
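One concrete source of the discrepancy can be sketched without any GPU: floating-point addition is not associative, so averaging the same 64 per-example gradients in one pass versus as two accumulated micro-batches of 32 can disagree in the last bits, even with a fixed seed. The "gradients" here are just random numbers standing in for per-example gradient values.

```python
import random

random.seed(0)
# Stand-in for 64 per-example scalar gradients.
grads = [random.uniform(-1.0, 1.0) for _ in range(64)]

# Setting 1: one batch of 64, averaged in a single pass.
avg_full = sum(grads) / 64

# Setting 3: two micro-batches of 32, averaged then combined
# (the same grouping gradient accumulation induces).
avg_accum = (sum(grads[:32]) / 32 + sum(grads[32:]) / 32) / 2

# Mathematically identical, but the summation order differs, so the
# float results are only approximately equal.
print(abs(avg_full - avg_accum))
```

The same ordering effect shows up in DDP all-reduce versus single-GPU summation, and the tiny per-step differences compound over training steps.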