DistributedDataParallel with autograd.grad


I am using a loss function that contains the gradient of the network's output w.r.t. its input, which I obtain with autograd.grad.
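For concreteness, a loss of this kind might look like the minimal single-process sketch below. The toy network, the tensor shapes, and the squared-gradient penalty are all assumptions chosen for illustration; they are not taken from the original question.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy network; any differentiable model would do.
model = nn.Sequential(nn.Linear(3, 8), nn.Tanh(), nn.Linear(8, 1))

x = torch.randn(16, 3, requires_grad=True)  # the input must track gradients
target = torch.randn(16, 1)

out = model(x)

# Gradient of the output w.r.t. the input. create_graph=True keeps the graph
# of this computation so the penalty term below is itself differentiable.
(dout_dx,) = torch.autograd.grad(out.sum(), x, create_graph=True)

# Illustrative loss: a data term plus a penalty on the input gradient.
loss = nn.functional.mse_loss(out, target) + dout_dx.pow(2).mean()
loss.backward()  # double backward through dout_dx
```

Without create_graph=True the input gradient would be treated as a constant, and the penalty term would contribute nothing to the parameter gradients.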

I would like to train my model in parallel using the DistributedDataParallel container. However, as a WARNING on the doc page mentions, DistributedDataParallel does not support autograd.grad. If I understand correctly, this is because the local model parameters (not the ones averaged across devices) will be used if I call autograd.grad after the forward call, which would be incorrect.

Looking into the implementation of DistributedDataParallel, I found the method _sync_params is called at the beginning of the forward method to sync params across devices. My question is:

Is it OK for me to call _sync_params once more before I use autograd.grad to compute the gradient of the output w.r.t. the input and then use it in my loss function? That way, the gradient computation would use the averaged parameters. Are there any caveats?

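The problem here is that DistributedDataParallel performs gradient averaging across processes by hooking into the AccumulateGrad function. This allows it to start averaging the gradients that become ready first while autograd is still computing the rest. autograd.grad returns gradients directly instead of accumulating them into the parameters, so it does not go through that path.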
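The difference in accumulation behaviour can be seen in a toy example, independent of DistributedDataParallel. This is only a sketch: the tensor `w` stands in for a model parameter and does not reproduce DDP's internal hooks.

```python
import torch

w = torch.randn(3, requires_grad=True)  # stands in for a model parameter
x = torch.randn(3)
y = (w * x).sum()

# autograd.grad returns the gradient directly and leaves w.grad untouched.
(g,) = torch.autograd.grad(y, w, retain_graph=True)
grad_after_autograd_grad = w.grad  # still None at this point

# backward() accumulates the gradient into w.grad.
y.backward()
```

After the autograd.grad call `w.grad` is still None; only `backward()` populates it, and that accumulation step is where DDP attaches its averaging.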

Would it be possible for you to first compute the initial loss, call autograd.backward instead of autograd.grad, and have it accumulate the first order gradients in the model parameters? Then you could detach those and compute something else before letting the optimizer do its thing. If not, then you’ll have to perform your own averaging, I think.
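If you end up doing your own averaging, one possible shape for it is sketched below. The helper name `average_gradients`, the toy model, and the loss are hypothetical. The sketch assumes the model is not wrapped in DistributedDataParallel and that every process starts from identical parameters (for example, by broadcasting them from rank 0). It sets up a single-process gloo group only so that it runs standalone.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn


def average_gradients(model: nn.Module) -> None:
    """Hypothetical helper: sum each .grad across processes, then divide."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size


# Single-process group so the sketch runs on its own; a real job would start
# one process per device and pass each its own rank and the true world_size.
store_file = os.path.join(tempfile.mkdtemp(), "store")
dist.init_process_group(
    "gloo", init_method=f"file://{store_file}", rank=0, world_size=1
)

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(3, 8), nn.Tanh(), nn.Linear(8, 1))  # no DDP wrapper
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(16, 3, requires_grad=True)
out = model(x)
(dout_dx,) = torch.autograd.grad(out.sum(), x, create_graph=True)
loss = out.pow(2).mean() + dout_dx.pow(2).mean()  # illustrative loss

optimizer.zero_grad()
loss.backward()
average_gradients(model)  # takes the place of DDP's automatic averaging
optimizer.step()

dist.destroy_process_group()
```

Unlike DDP, this version waits for the whole backward pass to finish before communicating, so it gives up the overlap between gradient computation and averaging.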