Gradients not the same when using different number of GPUs despite using grad accum and same batch ordering

divinho · June 28, 2022, 1:50pm

I have a situation where using 4 GPUs leads to much worse results than using 2 different GPUs (both with DDP). Gradient accumulation is used to equalize the effective batch size, and a seed is set so that at each optimizer step the two setups see the same batches. Despite this, the gradients are not equal.

I have turned off mixed precision, and the gradients still differ (measured just before the weight update, after the gradients have been accumulated). Why could this be?

I’m comparing gradient norms. The differences are of the sort:

1469.666259765625 <> 1518.424072265625
294.58251953125 <> 356.9222106933594
138.08688354492188 <> 168.6200408935547
933.7567138671875 <> 957.8016357421875

These are gradient norms from the last layers in the model. What’s interesting is that one of the two setups has consistently larger gradients.
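
To make the expectation behind this setup concrete, here is a minimal sketch in plain Python. It is not the actual training code: the scalar linear model, the mean-squared-error loss, and the names `grad_mse` and `simulated_ddp_grad` are all illustrative assumptions. It mimics DDP averaging gradients across ranks and assumes each micro-batch loss is divided by the number of accumulation steps and that the batch splits evenly.

```python
import math


def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def simulated_ddp_grad(w, xs, ys, world_size, accum_steps):
    """Gradient after accum_steps micro-batches on each of world_size ranks.

    Hypothetical helper: each micro-batch gradient is scaled by 1/accum_steps
    (as if the loss were divided before backward), accumulated per rank,
    and then averaged across ranks, which is what DDP's all-reduce does.
    """
    n_chunks = world_size * accum_steps
    size = len(xs) // n_chunks  # assumes the batch splits evenly
    rank_grads = [0.0] * world_size
    for i in range(n_chunks):
        rank = i % world_size
        lo, hi = i * size, (i + 1) * size
        rank_grads[rank] += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accum_steps
    return sum(rank_grads) / world_size


# Toy data: one "effective batch" of 8 samples.
xs = [0.5, -1.0, 2.0, 1.5, -0.5, 3.0, 0.25, -2.0]
ys = [1.0, 0.0, 3.5, 2.0, -1.0, 5.0, 0.5, -3.0]
w = 0.7

full = grad_mse(w, xs, ys)  # single-device reference
g4 = simulated_ddp_grad(w, xs, ys, world_size=4, accum_steps=1)
g2 = simulated_ddp_grad(w, xs, ys, world_size=2, accum_steps=2)
print(full, g4, g2)
```

Under these assumptions the three values agree up to floating-point rounding, because the mean of equally sized micro-batch means equals the mean over the full batch. Differences of a few percent or more, as in the norms above, look larger than I would expect from rounding alone.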