Yep, this is true for the sum() / batch_size case you mentioned, on the condition that all processes are using the same batch size. Here is the test to verify that:
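A minimal sketch of such a check (the two-process gloo setup, the toy nn.Linear model, and the per-rank seeding are just illustrative choices): every rank computes sum() / batch_size over its own shard, and after backward() the DDP-averaged gradient should match the gradient of the same loss computed locally over the full world_size * batch_size batch.

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def check_ddp_grad(rank, world_size, batch_size=8):
    dist.init_process_group(
        "gloo", init_method="tcp://127.0.0.1:29500",
        rank=rank, world_size=world_size,
    )

    torch.manual_seed(0)                         # identical initial weights on every rank
    model = nn.Linear(4, 1, bias=False)
    ddp_model = DDP(model)

    # each rank draws its own shard, deterministically from its rank
    shard = torch.randn(batch_size, 4, generator=torch.Generator().manual_seed(rank))

    # per-process loss uses the sum() / batch_size convention
    loss = ddp_model(shard).sum() / batch_size
    loss.backward()                              # DDP allreduces and averages the grads here
    ddp_grad = model.weight.grad.clone()

    if rank == 0:
        # local reference: the same loss over the full world_size * batch_size batch
        torch.manual_seed(0)
        ref = nn.Linear(4, 1, bias=False)
        full = torch.cat([
            torch.randn(batch_size, 4, generator=torch.Generator().manual_seed(r))
            for r in range(world_size)
        ])
        (ref(full).sum() / full.shape[0]).backward()
        diff = (ddp_grad - ref.weight.grad).abs().max().item()
        print(f"max |ddp_grad - local_grad| = {diff}")   # should be ~0

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(check_ddp_grad, args=(world_size,), nprocs=world_size, join=True)
```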
In particular, the gradient from DDP might be n_gpu times larger than the one from DP, where n_gpu is the number of GPUs. Even if this is true it would not be a big problem, but it would mean DDP may require a different learning rate than DP. That is just my guess, though, and it needs confirmation.
DDP computes the average of the gradients from all processes, so for the sum() / batch_size case the gradient should have the same value as in local training. What might affect the learning rate is the batch size you configure for each DDP process. If each process uses the same batch_size as local training, then in each iteration the DDP gang collectively processes world_size * batch_size input samples, so you can be more confident in the resulting gradient compared to local training and might want to set the learning rate to a larger value. But this is not guaranteed. See this discussion: Should we split batch_size according to ngpu_per_node when DistributedDataparallel