If I use a batch size of 8 on one GPU, the gradients of the first 4 samples and the gradients of the last 4 samples are summed together, right?
Now suppose I train on two GPUs with DistributedDataParallel (batch size 4 per GPU, 8 in total). I read in the PyTorch documentation that gradients are averaged across processes during synchronization. So in this case the gradients of the first 4 samples and the gradients of the last 4 samples are averaged, right?
Is my understanding correct in both situations?
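To make sure I understand the single-GPU case, here is a minimal check I plan to run (the `nn.Linear` model and `MSELoss` are just placeholders I made up; note the outcome depends on the loss's `reduction`, and I use `reduction='sum'` here so that sub-batch gradients should add up):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
data = torch.randn(8, 10)
target = torch.randn(8, 1)
loss_fn = nn.MSELoss(reduction='sum')  # with 'sum', sub-batch gradients should add

# Gradients from the full batch of 8
model.zero_grad()
loss_fn(model(data), target).backward()
full_grad = model.weight.grad.clone()

# Gradients from the two halves of 4, accumulated by two backward() calls
model.zero_grad()
loss_fn(model(data[:4]), target[:4]).backward()
loss_fn(model(data[4:]), target[4:]).backward()
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad))  # I expect True
```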
Also, can I make DDP sum the gradients during synchronization, the same way accumulation works on a single GPU? Is there a parameter in DistributedDataParallel that controls this?
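I came across `DistributedDataParallel.register_comm_hook`, which looks like it can replace the default averaging all-reduce. Would something like this sketch be the right idea? (`ddp_model` stands for my DDP-wrapped model; this is just my guess at how to skip the division by world size, not something I've verified.)

```python
import torch
import torch.distributed as dist

def allreduce_sum_hook(state, bucket):
    # All-reduce the flattened gradient bucket with SUM, and skip the
    # division by world_size that the default DDP behavior applies
    fut = dist.all_reduce(
        bucket.buffer(), op=dist.ReduceOp.SUM, async_op=True
    ).get_future()
    return fut.then(lambda f: f.value()[0])

ddp_model.register_comm_hook(state=None, hook=allreduce_sum_hook)
```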
It's not easy to manage the gradients of all parameters by hand. So if I need to sum the gradients across 2 GPUs, do you think adding `loss_total *= 2` after `loss_total = loss_b + loss_m + loss_c + loss_s` would achieve the same effect?
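Concretely, I mean something like this in my training step (`loss_b`, `loss_m`, `loss_c`, `loss_s` are my four loss terms; my reasoning is that scaling the loss scales every gradient, so the averaged gradients become sums again):

```python
import torch.distributed as dist

loss_total = loss_b + loss_m + loss_c + loss_s
# DDP averages gradients (sum / world_size), so multiplying the loss by
# world_size before backward() should turn that average back into a sum
loss_total = loss_total * dist.get_world_size()  # i.e. *= 2 on my 2-GPU setup
loss_total.backward()
```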