I have two situations:
- If I use a batch size of 8 on one GPU, the gradients of the first 4 samples and the gradients of the last 4 samples are summed together, right? (See the toy check below.)
- Then I train on two GPUs with Distributed Data Parallel (batch size 4 per GPU, 8 in total). I read in the PyTorch documentation that gradients are averaged during synchronization. So the gradients of the first 4 samples and the gradients of the last 4 samples are averaged, right?
Am I right about these two situations?
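
To make the first situation concrete, here is a small toy check of what I mean by "summed" (a hypothetical toy model with a sum-reduced loss, not my real training code):

```python
import torch

# Toy single-GPU check (hypothetical model: a plain linear map with a
# sum-reduced squared loss). The gradient of the full batch of 8 should
# equal the sum of the gradients of the first 4 and the last 4 samples.
torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
x = torch.randn(8, 3)

loss_full = (x @ w).pow(2).sum()
grad_full, = torch.autograd.grad(loss_full, w)

loss_first = (x[:4] @ w).pow(2).sum()
grad_first, = torch.autograd.grad(loss_first, w)

loss_last = (x[4:] @ w).pow(2).sum()
grad_last, = torch.autograd.grad(loss_last, w)

print(torch.allclose(grad_full, grad_first + grad_last))  # expect: True
```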
Also, can I sum the gradients during synchronization, the same way training on one GPU does? Is there a parameter to control this when using Distributed Data Parallel?
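
For example, is a custom communication hook the intended way to do this? Below is only a sketch of what I'm imagining, based on my reading of the DDP communication hook docs; `ddp_model` stands for my DistributedDataParallel-wrapped model, which I haven't shown, and I'm not sure this is the recommended approach:

```python
import torch
import torch.distributed as dist

def allreduce_sum_hook(state, bucket):
    # All-reduce the flattened gradients in this bucket with a plain SUM,
    # skipping the divide-by-world-size that the default DDP hook performs.
    fut = dist.all_reduce(bucket.buffer(), op=dist.ReduceOp.SUM, async_op=True).get_future()
    # The resolved future holds a list with one tensor: the summed bucket buffer.
    return fut.then(lambda f: f.value()[0])

# ddp_model is assumed to be an existing torch.nn.parallel.DistributedDataParallel instance.
ddp_model.register_comm_hook(state=None, hook=allreduce_sum_hook)
```

Would this give me the same summed gradients as the single-GPU case, or is there a simpler built-in option I'm missing?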