Average loss in DP and DDP

Yep, this is true for the sum() / batch_size case you mentioned, on the condition that all processes are using the same batch size. Here is the test to verify that:
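The snippet below is a minimal sketch of such a test, assuming two CPU processes on the gloo backend, a toy linear model with a sum() / batch_size loss, and an arbitrary local TCP rendezvous address; the model, data, and seeds are illustrative assumptions, not the original setup.

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def run(rank, world_size, data, target):
    dist.init_process_group(
        "gloo", init_method="tcp://127.0.0.1:23456",
        rank=rank, world_size=world_size,
    )
    torch.manual_seed(0)  # identical initial weights on every rank
    model = torch.nn.Linear(4, 1, bias=False)
    ddp_model = DDP(model)

    # each process sees a distinct shard of the same global batch
    local_bs = data.size(0) // world_size
    x = data[rank * local_bs:(rank + 1) * local_bs]
    y = target[rank * local_bs:(rank + 1) * local_bs]

    # sum() / batch_size loss on the *local* batch
    loss = ((ddp_model(x) - y) ** 2).sum() / local_bs
    loss.backward()  # DDP averages the gradients across all ranks

    if rank == 0:
        # reference: single-process training on the full batch with the same loss formula
        torch.manual_seed(0)
        ref = torch.nn.Linear(4, 1, bias=False)
        ref_loss = ((ref(data) - target) ** 2).sum() / data.size(0)
        ref_loss.backward()
        print(torch.allclose(model.weight.grad, ref.weight.grad))  # expect True

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    torch.manual_seed(42)
    data, target = torch.randn(8, 4), torch.randn(8, 1)
    mp.spawn(run, args=(world_size, data, target), nprocs=world_size)
```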

In particular, the gradient from DDP might be n_gpu times larger than in DP, where n_gpu is the number of GPUs. Even if this is true, it would not be a big problem, but DDP may then require a different learning rate than DP. That is just my guess, and it needs confirmation.

DDP computes the average of the gradients from all processes, so the gradient should have the same value as in local training for the sum() / batch_size case. What might affect the learning rate is the batch size you configured for each DDP process. If each process uses the same batch_size as local training, then in each iteration the DDP gang collectively processes world_size * batch_size input samples, so you can be more confident in the resulting gradient than in local training and might want to set the learning rate to a larger value. But this is not guaranteed. See this discussion: Should we split batch_size according to ngpu_per_node when DistributedDataparallel
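For illustration only, one common heuristic (not something DDP enforces, and not guaranteed to be optimal) is to scale the learning rate linearly with the effective global batch size; the numbers below are assumptions.

```python
import torch

world_size = 4          # number of DDP processes / GPUs (assumed)
per_process_batch = 32  # same batch_size each process uses as in local training (assumed)
base_lr = 0.1           # learning rate tuned for local training (assumed)

# data consumed per iteration by the whole DDP gang
effective_batch = world_size * per_process_batch

# linear-scaling heuristic: grow the learning rate with the effective batch size
scaled_lr = base_lr * world_size

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr)
print(effective_batch, scaled_lr)  # 128, 0.4
```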