Better accuracy when using fewer GPUs w/ DistributedDataParallel

Hi,

I am finding that my model trains a lot better when using 2 GPUs (batch size 512 per GPU) than when using 8 GPUs (batch size 128 per GPU). The learning rate is scaled the same way in both settings, i.e.:

base_lr * batch_size * world_size / 512.
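
Just as a sanity check that this scaling rule gives the same step size in both setups (the base_lr value below is only a placeholder, not my actual config):

```python
# Sanity check of the linear scaling rule above.
base_lr = 0.1  # hypothetical reference LR for a total batch of 512

for gpus, per_gpu_batch in [(2, 512), (8, 128)]:
    scaled_lr = base_lr * per_gpu_batch * gpus / 512
    total = per_gpu_batch * gpus
    print(f"{gpus} GPUs x batch {per_gpu_batch}: total batch = {total}, lr = {scaled_lr}")

# Both settings use a total batch of 1024 and end up with the same scaled LR,
# so the optimizer step size itself is identical.
```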

The observed training losses after each iteration are also pretty much the same in both settings (2 GPUs and 8 GPUs). The testing function also gives the same accuracy whether I use 2 GPUs or 8 GPUs with the same set of checkpoint weights. From my understanding, the validation accuracy should be roughly the same in both of these settings, right? I am using DDP in both cases, so the only difference is the number of GPUs.

Here are some of the logs.
2-GPU val logs:

epoch,acc1,acc5
0,1.99600,6.77200
1,8.25600,22.42600
2,14.43200,33.19000
3,19.72200,41.67400
4,24.84400,49.10000
5,29.33000,54.44000
6,33.52400,59.05400
7,37.01000,62.97200
8,40.27800,66.18400

8-GPU val logs:

epoch,acc1,acc5
0,0.97200,4.01400
1,2.47400,8.75200

=========================================

8-GPU train logs:

epoch,itr,xe_loss,distill_loss,kl_loss
0,0,6.97100,0.2382,4.47523
0,1,6.99785,0.3105,4.11384
0,2,7.02923,0.30575,4.20045
0,3,7.00414,0.28821,3.98422
0,4,6.99461,0.27749,4.11011
0,5,7.01260,0.26803,4.05124
0,6,6.99693,0.28150,4.34939
0,7,6.97944,0.26662,4.38219
0,8,7.00785,0.28510,4.59107

2-GPU train logs:

epoch,itr,xe_loss,distill_loss,kl_loss
0,0,6.97061,0.24541,4.40004
0,1,7.05148,0.30718,4.09486
0,2,7.01019,0.30563,4.32012
0,3,6.98066,0.28823,3.98150
0,4,7.04321,0.27916,4.32263
0,5,7.04827,0.26795,4.03329
0,6,6.93550,0.28153,4.53965
0,7,7.07279,0.27601,4.47247
0,8,7.06095,0.29057,4.46109
0,9,6.99917,0.27686,4.19178

Not necessarily, since the per-GPU batch size differs between the two approaches and will influence e.g. batchnorm layers and their running stats.
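
With DDP, a plain BatchNorm layer normalizes over, and updates its running stats from, only the local per-GPU batch (512 samples vs. 128 samples in your two setups). If that turns out to be the cause, a common mitigation is to synchronize the batchnorm statistics across processes via SyncBatchNorm. A minimal sketch (the toy model and the commented-out DDP wrapping are illustrative, not your actual setup):

```python
import torch.nn as nn

# Toy model standing in for the real network.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)

# Replace every BatchNorm layer with SyncBatchNorm so the statistics are
# computed over the global batch (2 x 512 = 8 x 128 = 1024 samples)
# instead of the local per-GPU batch.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
print(model)  # the BatchNorm2d layer is now a SyncBatchNorm

# Inside the usual DDP setup (after torch.distributed.init_process_group),
# the converted model is wrapped as before, e.g.:
#   model = torch.nn.parallel.DistributedDataParallel(
#       model.to(local_rank), device_ids=[local_rank])
```

Note that SyncBatchNorm adds communication overhead, so it's worth confirming first that the smaller per-GPU batch is really what hurts the 8-GPU run.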