I have a situation where training with 4 GPUs (DDP) gives much worse results than training with 2 GPUs. I use gradient accumulation to equalize the effective batch size, and I set a seed so that, for each optimizer step, both runs see the same batches. Yet the gradients are not equal.
I have also turned off mixed precision, and the difference persists (the gradients differ just before the weight update, i.e. after accumulation has finished). Why could this be?
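For reference, this is roughly how I equalize the effective batch size between the two runs. The numbers here are hypothetical, just to illustrate the setup:

```python
# Sketch of the batch-size equalization between the 2-GPU and 4-GPU runs.
# per_gpu_batch and target are hypothetical values for illustration.

def accum_steps(world_size: int, per_gpu_batch: int, target_effective_batch: int) -> int:
    """Gradient-accumulation steps so that
    world_size * per_gpu_batch * steps == target_effective_batch."""
    assert target_effective_batch % (world_size * per_gpu_batch) == 0
    return target_effective_batch // (world_size * per_gpu_batch)

per_gpu_batch = 8   # hypothetical micro-batch per GPU
target = 64         # hypothetical effective batch per optimizer step

steps_2gpu = accum_steps(2, per_gpu_batch, target)  # -> 4 accumulation steps
steps_4gpu = accum_steps(4, per_gpu_batch, target)  # -> 2 accumulation steps

# Both runs process the same total number of samples per optimizer step:
assert 2 * per_gpu_batch * steps_2gpu == 4 * per_gpu_batch * steps_4gpu == target
```

So per optimizer step, both runs consume the same 64 samples; only the split between data parallelism and accumulation differs.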
I’m comparing gradient norms. The differences are of the sort:
```
1469.666259765625  <> 1518.424072265625
294.58251953125    <> 356.9222106933594
138.08688354492188 <> 168.6200408935547
933.7567138671875  <> 957.8016357421875
```
These are gradient norms from the last layers of the model. What's interesting is that one of the two runs consistently has larger gradients.
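To quantify the mismatch, here is a plain-Python sketch computing the relative gap per layer from the four pairs above (`run_a`/`run_b` are just labels for the two configurations):

```python
# Relative gap per layer between the two runs' gradient norms.

def relative_gaps(norms_a: list, norms_b: list) -> list:
    """|a - b| / max(a, b) for each pair of per-layer gradient norms."""
    return [abs(a - b) / max(a, b) for a, b in zip(norms_a, norms_b)]

# The four pairs reported above:
run_a = [1469.666259765625, 294.58251953125, 138.08688354492188, 933.7567138671875]
run_b = [1518.424072265625, 356.9222106933594, 168.6200408935547, 957.8016357421875]

gaps = relative_gaps(run_a, run_b)
# The gaps range from roughly 2.5% up to roughly 18% depending on the layer.
```

So the disagreement is on the order of several percent, not just in the last few decimal places.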