Yet another question on DistributedDataParallel and gradient averaging

Do I understand correctly that DP will not average gradients of “replica” batches, while DDP will?

Is it always the case that gradient averaging is correct (assuming the loss averages over the “replica” batch size)? If yes, is that because of the product operation in the chain rule?
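One way to convince yourself: when the loss is a *mean* over the batch, the gradient is linear in the per-sample gradients, so averaging equal-size replica gradients reproduces the full-batch gradient exactly. A plain-Python sketch with a hypothetical toy model (1-D least squares, made-up data):

```python
def grad_mse(w, xs, ys):
    """d/dw of mean((w*x - y)**2) over the given samples."""
    n = len(xs)
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / n

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.5, 2.0, 2.5, 3.0]

full = grad_mse(w, xs, ys)        # gradient over the full batch
g0 = grad_mse(w, xs[:2], ys[:2])  # "replica" 0, half the batch
g1 = grad_mse(w, xs[2:], ys[2:])  # "replica" 1, the other half
avg = (g0 + g1) / 2               # what DDP's averaging all-reduce computes

assert abs(full - avg) < 1e-12    # identical, because the batch halves are equal-sized
```

Note this identity relies on linearity of the gradient, not on the product rule, and it breaks if the replica batch sizes differ (the average would then weight samples unequally).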


Right, DDP averages gradients by default, as it assumes the batch size is the same on each rank. If the batch sizes differ across ranks, or your loss function does not expect averaged gradients, you can call `register_comm_hook()` on the DDP model and define your own communication strategy (sum, or another op).
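A minimal sketch of such a hook, replacing the default average with a plain sum. The single-process gloo group and the `MASTER_ADDR`/`MASTER_PORT` values are just assumptions so the snippet runs standalone; in real use each rank joins with its own rank and world size:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Hypothetical single-process group purely for demonstration.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(4, 2))

def sum_hook(state, bucket):
    # All-reduce with SUM, and do NOT divide by world_size afterwards,
    # which is the step DDP's default hook would add.
    fut = dist.all_reduce(
        bucket.buffer(), op=dist.ReduceOp.SUM, async_op=True
    ).get_future()
    return fut.then(lambda f: f.value()[0])

model.register_comm_hook(state=None, hook=sum_hook)

loss = model(torch.randn(8, 4)).sum()
loss.backward()  # gradients are now summed (not averaged) across ranks
dist.destroy_process_group()
```

The hook must return a `torch.futures.Future` holding the reduced bucket tensor; DDP copies that result back into `param.grad` for every parameter in the bucket.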