DistributedDataParallel: modify gradients before averaging

Hi all! I think “DistributedDataParallel” automatically averages the gradients when “loss.backward()” is called. But is it possible to first compute the local gradients of the parameters, then apply some modification to them, and only then average the gradients across the workers?

Thanks!

@anxu tensor.register_hook(customHook) may work for your case; you would need to write customHook to modify the tensor's gradient.
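For reference, here is a minimal, DDP-free sketch of what that looks like; scale_grad_hook and the factor of 0.5 are just placeholders for whatever modification you need, and whether such a hook fires before or after DDP's own gradient averaging would have to be checked for your PyTorch version:

```python
import torch

def scale_grad_hook(grad):
    # Hypothetical modification: the returned tensor replaces the original grad.
    return grad * 0.5

w = torch.randn(3, 3, requires_grad=True)
w.register_hook(scale_grad_hook)  # invoked when w's gradient is computed

loss = (w ** 2).sum()
loss.backward()
print(w.grad)  # equals 2 * w * 0.5, i.e. the hook's output
```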

Hi Yanli,

I am not sure whether tensor.register_hook will work here, because the documentation mentions:

Forward and backward hooks defined on module and its submodules won’t be invoked anymore, unless the hooks are initialized in the forward() method.

Besides, I need to first collect the whole gradient and then modify it. For now I am turning to torch.distributed.all_reduce, but it would be easier if there were a way to do this via DistributedDataParallel.
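In case it helps others, this is roughly the manual all_reduce pattern I mean. It is only a sketch: it assumes the process group is already initialized and that model (a plain nn.Module, not wrapped in DDP), criterion, optimizer, inputs, and targets exist; the clamp is just a placeholder for the actual modification.

```python
import torch
import torch.distributed as dist

# Run the local forward/backward pass first (no DDP wrapper, plain nn.Module).
loss = criterion(model(inputs), targets)
loss.backward()

# Modify the local gradients, then average them across workers by hand.
world_size = dist.get_world_size()
for p in model.parameters():
    if p.grad is not None:
        p.grad.clamp_(-1.0, 1.0)                       # placeholder modification
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum over all workers
        p.grad.div_(world_size)                        # turn the sum into a mean

optimizer.step()
```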

Hi anxu @anxu, you said that “DistributedDataParallel” automatically averages the gradients when calling “loss.backward()”, but I couldn't find the corresponding code in the PyTorch source. Do you know where it is?

DDP averages gradients by all-reducing them across the participating processes (see https://pytorch.org/docs/stable/_modules/torch/distributed/distributed_c10d.html#all_reduce). The gradient-averaging logic lives in the allReduce calls in the reducer: https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/reducer.cpp
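If your PyTorch version is recent enough to have DDP communication hooks, DistributedDataParallel.register_comm_hook is another way to intercept each gradient bucket before it is all-reduced, which matches the original question. A rough sketch, with the clamp again standing in for the actual modification; the exact GradBucket/Future API should be checked against the docs for your version:

```python
import torch
import torch.distributed as dist

def modify_then_allreduce(state, bucket):
    # bucket.buffer() is the flattened local gradient tensor for this bucket.
    grad = bucket.buffer()
    grad.clamp_(-1.0, 1.0)  # placeholder for the local modification

    # All-reduce asynchronously, then divide by world size to get the average.
    fut = dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True).get_future()

    def average(fut):
        return fut.value()[0] / dist.get_world_size()

    return fut.then(average)

# ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank])
# ddp_model.register_comm_hook(state=None, hook=modify_then_allreduce)
```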