If I use a batch size of 8 on one GPU, the gradients of the first 4 samples and the gradients of the last 4 samples are summed together, right?
Now suppose I train on two GPUs with DistributedDataParallel (batch size 4 per GPU, 8 in total). I read in the PyTorch documentation that gradients are averaged across processes during synchronization. So in this case the gradients of the first 4 samples and the gradients of the last 4 samples are averaged, right?
Is my understanding correct in both situations?
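To make sure I understand the single-GPU case, here is a minimal check I plan to run (the `nn.Linear` model and `MSELoss` are just placeholders I made up; note the outcome depends on the loss's `reduction`, and I use `reduction='sum'` here so that sub-batch gradients should add up):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
data = torch.randn(8, 10)
target = torch.randn(8, 1)
loss_fn = nn.MSELoss(reduction='sum')  # with 'sum', sub-batch gradients should add

# Gradients from the full batch of 8
model.zero_grad()
loss_fn(model(data), target).backward()
full_grad = model.weight.grad.clone()

# Gradients from the two halves of 4, accumulated by two backward() calls
model.zero_grad()
loss_fn(model(data[:4]), target[:4]).backward()
loss_fn(model(data[4:]), target[4:]).backward()
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad))  # I expect True
```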
Also, can I make DDP sum the gradients during synchronization, the same way accumulation works on a single GPU? Is there a parameter in DistributedDataParallel that controls this?
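I came across `DistributedDataParallel.register_comm_hook`, which looks like it can replace the default averaging all-reduce. Would something like this sketch be the right idea? (`ddp_model` stands for my DDP-wrapped model; this is just my guess at how to skip the division by world size, not something I've verified.)

```python
import torch
import torch.distributed as dist

def allreduce_sum_hook(state, bucket):
    # All-reduce the flattened gradient bucket with SUM, and skip the
    # division by world_size that the default DDP behavior applies
    fut = dist.all_reduce(
        bucket.buffer(), op=dist.ReduceOp.SUM, async_op=True
    ).get_future()
    return fut.then(lambda f: f.value()[0])

ddp_model.register_comm_hook(state=None, hook=allreduce_sum_hook)
```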
It's not easy to manage the gradients of all parameters by hand. So if I need to sum the gradients across 2 GPUs, do you think adding `loss_total *= 2` after `loss_total = loss_b + loss_m + loss_c + loss_s` would achieve the same effect?
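Concretely, I mean something like this in my training step (`loss_b`, `loss_m`, `loss_c`, `loss_s` are my four loss terms; my reasoning is that scaling the loss scales every gradient, so the averaged gradients become sums again):

```python
import torch.distributed as dist

loss_total = loss_b + loss_m + loss_c + loss_s
# DDP averages gradients (sum / world_size), so multiplying the loss by
# world_size before backward() should turn that average back into a sum
loss_total = loss_total * dist.get_world_size()  # i.e. *= 2 on my 2-GPU setup
loss_total.backward()
```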