When you say each process handles batch_size / num_gpu samples, does that mean each GPU is assigned one process, each of which has its own DataLoader instance whose batch_size equals whatever you set in your code? So each GPU (not each node) processes an equal split of the programmer-set batch size?
And then when you say DDP computes the average gradient across all processes, does that mean it divides the summed gradients by the total number of GPUs across all nodes (not by the number of nodes)?
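To make my mental model concrete, here's a small torch-free sketch of what I think happens. The function names `shard_indices` and `allreduce_mean` are my own stand-ins for `DistributedSampler`'s index splitting and DDP's gradient all-reduce, so please correct me if this is off:

```python
# Simulating my understanding of how DDP splits data and averages gradients.
# world_size = total number of processes (one per GPU), spanning ALL nodes.

def shard_indices(num_samples, world_size, rank):
    """Round-robin split, mimicking DistributedSampler without shuffling."""
    return list(range(rank, num_samples, world_size))

def allreduce_mean(per_process_grads):
    """DDP-style averaging: sum across all processes, divide by world_size."""
    world_size = len(per_process_grads)
    return [sum(g) / world_size for g in zip(*per_process_grads)]

world_size = 4        # e.g. 2 nodes x 2 GPUs each -> divide by 4, not by 2
per_gpu_batch = 8     # the batch_size passed to each process's DataLoader
num_samples = 64

# Each process sees an equal, disjoint shard of the dataset.
shards = [shard_indices(num_samples, world_size, r) for r in range(world_size)]
assert all(len(s) == num_samples // world_size for s in shards)

# Effective global batch per optimizer step (if my reading is right):
global_batch = per_gpu_batch * world_size

# One toy gradient (two parameters) per process; averaged over world_size.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
avg = allreduce_mean(grads)   # [4.0, 5.0]
```

Is that the right picture, i.e. the divisor is `world_size` (total GPUs), and the DataLoader's `batch_size` is per-process?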
Thanks for the clarification!