How can I keep the batch_size per GPU in DDP?

@mrshenli

When you say each process processes batch_size / num_gpu samples, do you mean that each GPU is assigned one process, and each of those processes has its own DataLoader instance whose batch_size is whatever I set in my code? In other words, each GPU (not each node) processes an equal split of the programmer-set batch size — roughly like the sketch below?
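Here is a minimal sketch of how I currently understand the per-process setup (the dataset and `per_gpu_batch_size` are just placeholders I made up; normally `DistributedSampler` would pick up the rank and world size from the initialized process group rather than the explicit values shown here):

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Toy dataset purely for illustration.
dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))

# My understanding: in DDP, each process (one per GPU) builds its own DataLoader.
# DistributedSampler gives each process a disjoint shard of the dataset, so the
# batch_size below is the number of samples *that one GPU* sees per step, and the
# effective global batch is batch_size * world_size.
per_gpu_batch_size = 32  # the value I set in my code

# num_replicas/rank are hard-coded here only so the snippet runs standalone;
# with init_process_group they default to the process group's world size and rank.
sampler = DistributedSampler(dataset, num_replicas=4, rank=0)
loader = DataLoader(dataset, batch_size=per_gpu_batch_size, sampler=sampler)
```

Is that the right mental model?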

And then when you say DDP computes the average gradient across all processes, does that mean it divides by the number of GPUs (not the number of nodes)?
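In other words (and please correct me if this is wrong), is the averaging conceptually equivalent to something like the following manual all-reduce, where `world_size` is the total number of processes across all nodes (i.e. the total number of GPUs)? This is just my sketch of the idea, not a claim about DDP's actual implementation:

```python
import torch.distributed as dist

def average_gradients(model):
    # Total number of processes across all nodes (one per GPU), not number of nodes.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            # Sum gradients from every process, then divide by the process count.
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
```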

Thanks for the clarification!