DistributedDataParallel loss compute and backpropagation?

It is not necessary to use another allreduce to sum the losses, and an additional allreduce might have a considerable negative impact on training speed.
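For illustration, here is a minimal sketch (assuming a process group is already initialized and the model is wrapped in DistributedDataParallel): each rank computes its own loss and calls backward(), and DDP's backward hooks average the gradients across ranks automatically, so no extra allreduce of the loss is needed. An allreduce of the loss value is only useful for logging.

```python
import torch
import torch.distributed as dist

def train_step(ddp_model, criterion, optimizer, inputs, targets):
    optimizer.zero_grad()
    outputs = ddp_model(inputs)
    loss = criterion(outputs, targets)  # local loss on this rank's shard
    loss.backward()                     # DDP averages gradients across ranks here
    optimizer.step()

    # Optional: all-reduce the loss only for logging/monitoring purposes.
    with torch.no_grad():
        logged_loss = loss.detach().clone()
        dist.all_reduce(logged_loss, op=dist.ReduceOp.SUM)
        logged_loss /= dist.get_world_size()
    return logged_loss
```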

Could increasing the learning rate by a factor of 4 compensate for the division by the number of GPUs done by the averaging?

This is not guaranteed, and the loss function itself also plays a role here. See this discussion: Should we split batch_size according to ngpu_per_node when DistributedDataparallel

I am trying to get a DDP run equivalent to DataParallel.

There is a subtle difference between DP and DDP. IIUC, with DP, the grads from the replicated models are accumulated (i.e., summed) into the param.grad field of the original model, whereas DDP's gradients are averaged. Not 100% confident, but if we would like DDP to behave as similarly to DP as possible, we should probably multiply DDP's resulting gradients by world_size. Whether that is the same as using a 4x learning rate may depend on the optimizer algorithm.
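As a rough sketch of what that rescaling could look like (this is an assumption about how to mimic DP's summed gradients, not an official recipe; it assumes the model is already wrapped in DistributedDataParallel and backward() has run):

```python
import torch.distributed as dist

def rescale_grads_to_sum(ddp_model):
    # Undo DDP's 1/world_size averaging so gradients match DP's summation.
    world_size = dist.get_world_size()
    for param in ddp_model.parameters():
        if param.grad is not None:
            param.grad.mul_(world_size)

# usage: loss.backward(); rescale_grads_to_sum(ddp_model); optimizer.step()
```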
