Hi,
I am finding that my model trains a lot better when using 2 GPUs (batch size 512 per GPU) than when using 8 GPUs (batch size 128 per GPU). The learning rate is scaled the same way in both cases, i.e.:
base_lr * batch_size * world_size / 512
so the effective global batch size (1024) and the scaled learning rate are identical in the two settings.
The per-iteration training losses are also pretty much the same in both settings (2 GPUs vs. 8 GPUs), and the testing function gives the same accuracy whether I run it on 2 or 8 GPUs with the same checkpoint weights. From my understanding, the validation accuracy should therefore be roughly the same in both settings, right? I am using DDP in both cases, so the only difference is the number of GPUs.
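For clarity, here is a minimal sketch (Python) of the scaling arithmetic above. The function name and the base_lr value are just illustrative placeholders, not my actual config:

def scaled_lr(base_lr, batch_size_per_gpu, world_size):
    # linear scaling rule: LR grows with the effective (global) batch size
    return base_lr * batch_size_per_gpu * world_size / 512

base_lr = 0.1  # placeholder value, not the one from my config

# both runs have a global batch size of 1024, so the scaled LR comes out identical
print(scaled_lr(base_lr, 512, 2))  # 2 GPUs x 512 per GPU -> 0.2
print(scaled_lr(base_lr, 128, 8))  # 8 GPUs x 128 per GPU -> 0.2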
Here are some of the logs:
2 GPUs val logs:
epoch,acc1,acc5
0,1.99600,6.77200
1,8.25600,22.42600
2,14.43200,33.19000
3,19.72200,41.67400
4,24.84400,49.10000
5,29.33000,54.44000
6,33.52400,59.05400
7,37.01000,62.97200
8,40.27800,66.18400
…
8 GPUs val logs:
epoch,acc1,acc5
0,0.97200,4.01400
1,2.47400,8.75200
…
=========================================
8 GPUs train logs:
epoch,itr,xe_loss,distill_loss,kl_loss
0,0,6.97100,0.2382,4.47523
0,1,6.99785,0.3105,4.11384
0,2,7.02923,0.30575,4.20045
0,3,7.00414,0.28821,3.98422
0,4,6.99461,0.27749,4.11011
0,5,7.01260,0.26803,4.05124
0,6,6.99693,0.28150,4.34939
0,7,6.97944,0.26662,4.38219
0,8,7.00785,0.28510,4.59107
…
2 GPUs train logs:
epoch,itr,xe_loss,distill_loss,kl_loss
0,0,6.97061,0.24541,4.40004
0,1,7.05148,0.30718,4.09486
0,2,7.01019,0.30563,4.32012
0,3,6.98066,0.28823,3.98150
0,4,7.04321,0.27916,4.32263
0,5,7.04827,0.26795,4.03329
0,6,6.93550,0.28153,4.53965
0,7,7.07279,0.27601,4.47247
0,8,7.06095,0.29057,4.46109
0,9,6.99917,0.27686,4.19178
…