Can DDP be equivalent to a single process (GPU) update if the batch size is divided by the number of GPUs? Could there be a situation where it is slower than a single GPU?

Q1: Equivalence

From my understanding, DDP spawns one process per GPU for computing gradients. During the backward pass, each process combines its gradients with those of all the other processes and averages them via an all-reduce operation.
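For intuition, here is a minimal sketch of that averaging step written directly with `torch.distributed` (this is not DDP's actual implementation, which buckets gradients and overlaps communication with the backward pass, but the math is the same):

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    # Conceptual version of DDP's gradient synchronization:
    # every rank contributes its local gradient, all_reduce sums them,
    # and dividing by the world size leaves the average on every rank.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```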

For example, if a researcher has published a paper stating that they trained with SGD, LR = 0.1, and a batch size of 64 for 100 epochs, would I get equivalent results by running DDP with SGD, LR = 0.1, and a per-GPU batch size of 16 for 100 epochs across 4 GPUs? I want to know whether it is equivalent so that I can scale experiments done on a single GPU to a distributed setting for faster training. It seems to me that it should be, unless I have misunderstood and the gradients from every process are not averaged by each individual process in parallel.
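As a quick sanity check of the arithmetic (no distributed setup needed), averaging the mean-loss gradients of four chunks of 16 samples reproduces the gradient of a single mean-loss pass over all 64 samples. The toy model and data below are just placeholders:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 1)
x, y = torch.randn(64, 10), torch.randn(64, 1)

# Single pass over the full batch of 64, as in the paper's recipe.
model.zero_grad()
torch.nn.functional.mse_loss(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Four simulated "ranks", each with a per-GPU batch of 16, gradients averaged.
chunk_grads = []
for xc, yc in zip(x.chunk(4), y.chunk(4)):
    model.zero_grad()
    torch.nn.functional.mse_loss(model(xc), yc).backward()
    chunk_grads.append(model.weight.grad.clone())
avg_grad = torch.stack(chunk_grads).mean(dim=0)

print(torch.allclose(full_grad, avg_grad, atol=1e-6))  # True, up to float error
```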

I see an exact example here: examples/main.py at master · pytorch/examples · GitHub

Q2:

If I am using a workstation with more than 1 GPU, is there any particular situation where I should opt not to use DDP and just use 1 GPU for training?

Regarding your first question: that is correct. DDP should give you the same result as if your training had run in a single process on a single GPU. As you described, DDP by default averages your gradients across all processes, so apart from deviations due to floating-point arithmetic, the outputs should be mathematically identical (assuming you scale your hyperparameters, i.e. the per-GPU batch size, accordingly).
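For reference, a minimal sketch of such a scaled-down setup might look like the following; the dataset and model are placeholders, it assumes a launch along the lines of `torchrun --nproc_per_node=4 train.py`, and the per-GPU batch size is 64 / 4 = 16:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

dataset = TensorDataset(torch.randn(6400, 10), torch.randn(6400, 1))  # placeholder data
sampler = DistributedSampler(dataset)                 # each rank sees a distinct shard
loader = DataLoader(dataset, batch_size=16, sampler=sampler)  # 16 per GPU x 4 GPUs = 64 global

model = DDP(torch.nn.Linear(10, 1).cuda(local_rank), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # same LR as the single-GPU recipe

for epoch in range(100):
    sampler.set_epoch(epoch)                          # reshuffle consistently across ranks
    for xb, yb in loader:
        xb, yb = xb.cuda(local_rank), yb.cuda(local_rank)
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(xb), yb)
        loss.backward()                               # DDP all-reduces (averages) gradients here
        optimizer.step()

dist.destroy_process_group()
```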

There is no definitive answer to your second question. Overall, the best approach is simply to measure the speed and convergence rate of your training for a few epochs. One particular case where it might not be worth using more than one GPU is when your batch size is fairly small and already fits on a single GPU; in that case a multi-GPU setup offers little benefit.
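For the measurement itself, timing one epoch's worth of iterations in each configuration is usually enough. A rough sketch, assuming a `model`, `loader`, and `optimizer` like the ones in the script above:

```python
import time
import torch

def samples_per_second(model, loader, optimizer, device) -> float:
    # Times one pass over the loader and returns throughput on this process.
    # Under DDP, compare wall-clock time per epoch (or multiply the per-rank
    # throughput by the world size) against the single-GPU run.
    model.train()
    torch.cuda.synchronize(device)
    start = time.perf_counter()
    n = 0
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(xb), yb)
        loss.backward()
        optimizer.step()
        n += xb.size(0)
    torch.cuda.synchronize(device)
    return n / (time.perf_counter() - start)

# print(f"{samples_per_second(model, loader, optimizer, 'cuda:0'):.1f} samples/s")
```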
