# Can DDP be equivalent to a single process (GPU) update if the batch size is divided by the number of GPUs? Could there be a situation where it is slower than a single GPU?

Q1: Equivalence

From my understanding, DDP spawns multiple processes, one per GPU. During the backward pass, each process participates in an all-reduce operation that gathers the gradients from all other processes and averages them.

For example, if a researcher has published a paper stating to train using SGD, LR=0.1, batch = 64 for 100 epochs, would it be equivalent for me to use DDP with SGD, LR=0.1, batch = 16 per GPU for 100 epochs, distributed across 4 GPUs? I want to know whether it is equivalent so that I can scale experiments done on a single GPU to a distributed setting for faster training. It seems to me that it is, unless I have misunderstood that the gradients from every process are averaged by each individual process in parallel.
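The averaging claim above can be checked numerically without DDP itself. Below is a toy sketch (all names are made up for illustration) assuming a mean-reduced MSE loss for a 1-D linear model: splitting a batch of 64 into 4 shards of 16, computing each shard's gradient, and averaging the per-shard gradients reproduces the full-batch gradient, which is exactly what DDP's all-reduce computes.

```python
# Toy check: for a mean-reduced loss, the average of per-shard gradients
# equals the full-batch gradient (what DDP's all-reduce produces).
import random

random.seed(0)

def mse_grad(w, data):
    # d/dw of mean((w*x - y)^2) = mean(2 * (w*x - y) * x)
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

w = 0.5
batch = [(random.random(), random.random()) for _ in range(64)]

full_grad = mse_grad(w, batch)

# Split into 4 equal "GPU" shards of 16 samples each, as DDP would.
shards = [batch[i:i + 16] for i in range(0, 64, 16)]
avg_grad = sum(mse_grad(w, s) for s in shards) / len(shards)

# Identical up to floating-point rounding.
print(abs(full_grad - avg_grad))
```

Note that this exact equality relies on the shards being the same size and the loss using mean reduction; with a sum-reduced loss or uneven shards, the learning rate or batch split would need adjusting.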

I see an exact example here: examples/main.py at master · pytorch/examples · GitHub

Q2:

If I am using a workstation with more than 1 GPU, is there any particular situation where I should opt not to use DDP and just use 1 GPU for training?

Regarding your first question: that is correct. DDP should give you the same result as if your training were run in a single process on a single GPU. As you described, DDP by default averages your gradients across all processes, so apart from deviations due to floating-point arithmetic, the outputs should be mathematically identical (assuming you scale your hyperparameters, i.e. per-GPU batch size, accordingly).

There is no definitive answer to your second question. Overall, the best approach is simply to measure the speed and convergence rate of your training for a few epochs. One case where it may not be worth using more than one GPU is when your batch size is fairly small and already fits on a single GPU; there, the communication overhead of the all-reduce can outweigh the gains from parallelism, and a multi-GPU setup makes little sense.
