Average loss in DP and DDP

Good question. Instead of communicating the loss, DDP communicates gradients. So the loss stays local to every process, but after the backward pass the gradients are globally averaged, so that all processes see the same gradients. This is a brief explanation, and this is a full paper describing the algorithm.
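
To see this behavior, here is a minimal sketch (not from the original post) with two processes on one machine: each rank computes a different local loss, but after loss.backward the DDP-averaged gradient is identical on both ranks. It assumes the gloo backend and an arbitrary free port.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"   # assumption: any free port works
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # DDP broadcasts rank 0's parameters at construction, so weights start in sync.
    model = DDP(torch.nn.Linear(4, 1))

    torch.manual_seed(rank)               # different local inputs per rank
    x = torch.randn(8, 4)
    loss = model(x).pow(2).mean()         # local loss: differs across ranks

    loss.backward()                       # DDP all-reduces (averages) the gradients here

    # The loss is local, but the gradient is identical on every rank after backward.
    print(f"rank {rank}: loss={loss.item():.4f} "
          f"grad[0,0]={model.module.weight.grad[0, 0].item():.6f}")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```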

If that is not correct, then I think we need to do an all_reduce of the loss before calling loss.backward, so that each process has the total loss for computing the correct gradients. Is my thinking correct?

The reason we don't communicate the loss is that it isn't sufficient. To compute gradients we need both the loss and the activations, and the activations depend on the local inputs. So we would have to communicate either the loss plus the activations, or the gradients. DDP does the latter.
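
To illustrate the "communicate gradients" option, here is a hedged sketch of what DDP effectively does, written by hand: each rank backprops through its own local loss (which uses its local activations), then the gradients are summed with all_reduce and divided by the world size. The function name is made up for illustration, and it assumes the process group is already initialized.

```python
import torch
import torch.distributed as dist

def manual_grad_average(model: torch.nn.Module, loss: torch.Tensor) -> None:
    """Backward on the local loss, then average the gradients across ranks."""
    loss.backward()                       # needs the local activations, so it runs on every rank
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)   # sum gradients from all ranks
            p.grad /= world_size                            # ...then average them
```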
