How can I keep the batch_size per GPU in DDP?

If my model's original batch size is 32 and I use two GPUs (one GPU per node), I use a batch size of 16 (32/ngpu). But if the number of GPUs is 3, or any odd number, should we keep a size like 32/3 ≈ 10, or limit the number of GPUs to 2?
Any help is welcome.

Hey @111344, if you are looking for mathematical equivalence, you will need at least two things:

  1. Each DDP process processes batch_size / num_gpu samples: this lets the DDP processes collectively consume the same number of inputs per iteration as local training.
  2. loss_fn(model([sample1, sample2])) == (loss_fn(model([sample1])) + loss_fn(model([sample2]))) / 2: this is because DDP uses AllReduce to compute the average gradient across all processes. If this condition is not met, the averaged gradients across all DDP processes are not equivalent to the local-training gradients.

However, in practice, applications usually do not have to satisfy the above conditions. Did you see any training accuracy degradation when scaling up to 3 GPUs?
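For anyone landing here later, here is a minimal sketch of what condition 1 looks like in code. It is not from the post above: the function name `train`, the `global_batch_size=32` default, and the MSE loss are illustrative assumptions, and it assumes a torchrun-style launch with one process per GPU.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(model, dataset, global_batch_size=32, epochs=1):
    dist.init_process_group("nccl")            # assumes torchrun set the rank/world-size env vars
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    device = rank % torch.cuda.device_count()
    torch.cuda.set_device(device)

    # Condition 1: each process loads batch_size / num_gpu samples, and
    # DistributedSampler gives each rank a distinct shard of the dataset.
    per_process_batch = global_batch_size // world_size
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=per_process_batch, sampler=sampler)

    ddp_model = DDP(model.cuda(), device_ids=[device])
    loss_fn = torch.nn.MSELoss()               # reduction="mean" averages over the batch, in line with condition 2
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)               # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.cuda(), y.cuda()
            optimizer.zero_grad()
            loss_fn(ddp_model(x), y).backward()  # backward() triggers the AllReduce that averages grads
            optimizer.step()

    dist.destroy_process_group()
```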

I haven’t tested with 3 GPUs yet, but I will take the time to verify it and then reply to you.

I think there is no real difference between 2 and 3 GPUs.
In my experiment:
batch_size=8, 2 GPUs --> batch_size=4 per GPU
batch_size=8, 3 GPUs --> batch_size=2 per GPU (so the total batch size is 6)
A total batch size of 8 or 6 does not have much impact on performance under normal circumstances.
Some tasks that are very sensitive to the batch size may need to take this into account.
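Just to spell out the arithmetic behind those numbers (a throwaway snippet of mine, `per_gpu_batch` is not a real API): integer division drops the remainder, which is where the 8 → 6 effective total comes from.

```python
def per_gpu_batch(global_batch_size, num_gpus):
    # Integer division drops the remainder when the global batch size
    # is not divisible by the number of GPUs.
    return global_batch_size // num_gpus

for n in (2, 3):
    per_gpu = per_gpu_batch(8, n)
    print(f"{n} GPUs: {per_gpu} per GPU, effective total {per_gpu * n}")
# 2 GPUs: 4 per GPU, effective total 8
# 3 GPUs: 2 per GPU, effective total 6
```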

@mrshenli

When you say each process processes batch_size / num_gpu samples, does that mean each GPU is assigned one process, each with its own DataLoader instance whose batch_size equals whatever you set in your code? So each GPU (not each node) will process an equal split of the programmer-set batch size?

And then when you say DDP computes the average gradient across all processes, does that mean it divides by the number of GPUs (not the number of nodes)?

Thanks for the clarification!
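For what it's worth, a small sketch of how I read it, assuming the usual torchrun one-process-per-GPU launch (`demo` and the tiny Linear model are just for illustration): each process runs the same script and uses whatever batch_size you pass to its own DataLoader, and the AllReduce average is taken over dist.get_world_size(), i.e. the total number of processes/GPUs across all nodes, not the number of nodes.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def demo(per_process_batch_size=16):
    dist.init_process_group("nccl")        # assumes a torchrun-style one-process-per-GPU launch
    world_size = dist.get_world_size()     # = nnodes * gpus_per_node, i.e. total GPUs, not nodes
    device = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(device)

    model = DDP(torch.nn.Linear(4, 1).cuda(), device_ids=[device])
    x = torch.randn(per_process_batch_size, 4, device="cuda")
    model(x).sum().backward()              # backward() runs the AllReduce that averages gradients

    # Every rank now holds the same gradient: the per-rank gradients summed
    # and divided by world_size (the number of processes, one per GPU).
    grad = model.module.weight.grad
    print(f"rank {dist.get_rank()}: effective global batch = "
          f"{per_process_batch_size * world_size}, grad[0, 0] = {grad[0, 0].item():.4f}")

    dist.destroy_process_group()
```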