How do gradients accumulate when using different batch sizes on one GPU and multiple GPUs?

I have two situations:

  1. If I use batch size 8 on one GPU, the gradients of the first 4 samples and the gradients of the last 4 samples are summed together, right?
  2. Then I train on two GPUs with DistributedDataParallel (batch size 4 per GPU, 8 in total). I read in the PyTorch documentation that gradients are averaged when synchronizing. So the gradients of the first 4 samples and the gradients of the last 4 samples are averaged, right?

Am I right about these two situations?
Besides, can I sum the gradients when synchronizing, just as training on one GPU does? Is there a parameter to control this when using DistributedDataParallel?

Hi,

  1. Yes, they are all summed (see the quick check after this list).
  2. Yes, they are averaged across GPUs only (not within). So for 2 GPUs, you will get the sum/2.
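
For point 1, here is a quick single-GPU check (a toy model with a sum-reduced loss for simplicity, not your actual setup): the gradient from the full batch of 8 matches the gradients of the first 4 and the last 4 samples added together, because `.backward()` accumulates into `.grad`.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 1)
data = torch.randn(8, 10)
target = torch.randn(8, 1)
loss_fn = torch.nn.MSELoss(reduction="sum")  # sum reduction keeps the math simple

# Full batch of 8 at once.
model.zero_grad()
loss_fn(model(data), target).backward()
grad_full = model.weight.grad.clone()

# First 4 and last 4 samples separately; backward() accumulates (sums) into .grad.
model.zero_grad()
loss_fn(model(data[:4]), target[:4]).backward()
loss_fn(model(data[4:]), target[4:]).backward()
grad_halves = model.weight.grad.clone()

print(torch.allclose(grad_full, grad_halves))  # True
```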

I don’t think DDP has an option to disable this averaging. You can multiply the final gradients by the number of devices if you need the sum.
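
A minimal sketch of that workaround, assuming an initialized process group and a DDP-wrapped model (the function and variable names here are illustrative, not from your code): after `backward()`, multiply every gradient by the world size to turn DDP's average back into a sum.

```python
import torch
import torch.distributed as dist

def backward_with_summed_grads(loss, ddp_model):
    # DDP all-reduces and averages the gradients across processes during backward().
    loss.backward()
    # Undo the 1/world_size averaging so the gradients hold the sum across GPUs.
    world_size = dist.get_world_size()
    with torch.no_grad():
        for param in ddp_model.parameters():
            if param.grad is not None:
                param.grad.mul_(world_size)
```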


Hi,
It's not easy to manage the gradients of all parameters manually. So if I need to sum the gradients across 2 GPUs, do you think that adding `loss_total *= 2` below `loss_total = loss_b + loss_m + loss_c + loss_s` can achieve the same effect?

Hi,

Yes, scaling the loss by 2 will also scale all the gradients by 2!
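
To spell it out, a sketch that would sit inside your existing DDP training step (assuming `loss_b`, `loss_m`, `loss_c`, `loss_s` come from your own forward pass): scaling the total loss by the world size before `backward()` cancels DDP's 1/world_size averaging, so the synchronized gradients end up as the sum across GPUs.

```python
import torch.distributed as dist

# Inside the existing DDP training step; the individual loss terms are
# computed by your own forward pass.
loss_total = loss_b + loss_m + loss_c + loss_s
loss_total = loss_total * dist.get_world_size()  # same as loss_total *= 2 on 2 GPUs
loss_total.backward()  # DDP averages the grads; the scaling restores the sum
```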
