I am training the exact same network in three settings: 1 GPU (A100), 1 GPU with gradient accumulation (Titan), and 2 GPUs (V100) via DataParallel.
The problem is that I get three fairly different training outcomes and three fairly different validation results.
The effective batch size is the same in all three experiments: for the 1 GPU and 2 GPU settings without gradient accumulation it is 6, and for the single GPU with gradient accumulation it is a micro-batch of 3 with 2 accumulation steps (3 × 2 = 6).
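For reference, this is roughly how I set up the 2-GPU run (a minimal sketch with a hypothetical toy model, not my actual network). DataParallel splits each batch of 6 across the two V100s, so each replica sees a micro-batch of 3:

```python
import torch
from torch import nn

# Hypothetical toy model standing in for my network, which uses GroupNorm throughout.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.GroupNorm(4, 8))
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicates the model and splits each batch across GPUs
model = model.cuda()

x = torch.randn(6, 3, 32, 32).cuda()  # batch of 6 -> 3 per GPU on 2 GPUs
out = model(x)
```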
The network uses only group norms (no batch norms), so per-device batch statistics should not be an issue.
I use the same learning rate for all the experiments.
For the single-GPU experiment with gradient accumulation, I divide the loss by the number of accumulation steps, as suggested here. (Is this really necessary?)
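Concretely, my accumulation loop looks roughly like this (a minimal sketch with a hypothetical toy model, optimizer, and random data standing in for my actual training code):

```python
import torch
from torch import nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()
accum_steps = 2  # micro-batch 3 x 2 steps = effective batch 6

# Hypothetical random data standing in for my loader.
loader = [(torch.randn(3, 10), torch.randn(3, 1)) for _ in range(8)]

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = criterion(model(inputs), targets)
    # Scale by accum_steps so the summed gradients from the micro-batches
    # match the gradient of the mean loss over the full effective batch of 6.
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```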
I would appreciate any hints on what I might be doing wrong. Shouldn't the training be fairly similar in these three cases?