How are gradients from multiple GPUs merged?

Hi,

When I use two GPUs to train my model, are the gradients from the two GPUs added or averaged during the backward computation to obtain the gradient used to update the model parameters?

As for the moving-average statistics of the batchnorm layers: if I do not use sync-BN, how are the running statistics of the final model computed?

Hi,

It depends on how you do multi-GPU training. If you use DataParallel, the gradients are accumulated on the first device in the list, and the updated weights are then shared with all the GPUs.
For batchnorm, when not synced, I am not sure, but I would guess it uses the statistics computed on the first device in the list.
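
For reference, a minimal DataParallel sketch (assuming a single machine with two visible GPUs; the model and batch sizes are made up for illustration):

```python
import torch
import torch.nn as nn

# Wrap the model; replicas are created on GPUs 0 and 1 at each forward pass.
model = nn.DataParallel(nn.Linear(10, 2).cuda(), device_ids=[0, 1])

x = torch.randn(64, 10).cuda()  # the batch is split across the two GPUs
loss = model(x).sum()
loss.backward()  # per-GPU gradients are reduced onto device 0 (the first in the list)
# The optimizer step then updates the parameters on device 0, and the
# updated weights are broadcast to the replicas on the next forward pass.
```

If you do want synchronized batchnorm statistics, DistributedDataParallel supports converting a model with torch.nn.SyncBatchNorm.convert_sync_batchnorm before wrapping it in DDP.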

I used DistributedDataParallel. Does this method sum or average the gradients across the GPUs?

I think this affects how the learning rate should be adjusted. According to the linear scaling rule, when the batch size is doubled, the learning rate should also be doubled. If the gradients are summed, their magnitude is already doubled, which means I no longer need to double the learning rate; I would only adjust it according to the change of batch size on a single GPU, rather than the change in global batch size brought by adding GPUs. Could you please give some suggestions on how to adjust the learning rate in multi-GPU mode?
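
To make the rule concrete, a small sketch with made-up numbers (the base learning rate and batch sizes here are hypothetical):

```python
# Hypothetical values: a base LR tuned for a batch size of 32 on one GPU.
base_lr = 0.1
base_batch = 32

per_gpu_batch = 32
num_gpus = 2
global_batch = per_gpu_batch * num_gpus  # 64

# Linear scaling rule: scale the LR with the global batch size.
# Whether this applies across GPUs depends on whether the gradients
# are summed or averaged, which is exactly the question above.
scaled_lr = base_lr * global_batch / base_batch  # 0.2
```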

The gradients are always summed, never averaged.
To adapt the learning rate, I would say it is not as simple as scaling it linearly. Larger batch sizes also reduce the noise in your gradients, so you might need to change your learning rate schedule to take larger steps.
I don't think there is one true rule for how to change the learning rate when you change the batch size, unfortunately.

The doc says:

When this is done, averaged gradients are written to the param.grad field of all parameters.

Hi,

Indeed, DDP does something special here. It actually averages the gradients computed by each worker, with equal weights.

So, is the weight 1/world_size or 1? Could you please provide a link to the code line?

It's 1/world_size; the code is here.
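
A quick way to check this empirically is to run a tiny DDP script and inspect param.grad after backward(). Below is a minimal sketch (the filename is arbitrary; it uses the gloo backend on CPU so it runs without GPUs):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Launch with: torchrun --nproc_per_node=2 check_ddp_grads.py
    dist.init_process_group("gloo")  # use "nccl" for GPU training
    rank = dist.get_rank()

    model = DDP(nn.Linear(4, 1))
    # Give each worker a different input so the per-worker gradients differ.
    x = torch.full((8, 4), float(rank + 1))
    model(x).sum().backward()

    # After backward(), param.grad is identical on every rank and equals the
    # mean of the per-worker gradients, i.e. each weighted by 1/world_size.
    print(f"rank {rank}: grad = {next(model.parameters()).grad}")

if __name__ == "__main__":
    main()
```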
