Hi,
I am using a loss function that contains the gradient of the output w.r.t. the input of the network, which I obtain with `autograd.grad`.
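
Here is a minimal sketch of the kind of loss I mean (the network, the shapes, and the gradient-penalty term are just placeholders for my actual setup):

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)  # stand-in for my actual network
x = torch.randn(16, 3, requires_grad=True)
y = model(x)

# Gradient of the output w.r.t. the input; create_graph=True keeps this
# result differentiable so it can be used inside the training loss.
grad_x, = torch.autograd.grad(y.sum(), x, create_graph=True)

loss = y.pow(2).mean() + grad_x.pow(2).mean()
loss.backward()  # backprops through the autograd.grad result as well
```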
I am interested in training my model in parallel using the `DistributedDataParallel` container. However, as a WARNING on the doc page mentions, `DistributedDataParallel` does not support `autograd.grad`. If I understand correctly, this is because the local model parameters (not the ones averaged across devices) would be used if I called `autograd.grad` after the `forward` call, which would of course be incorrect.
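
For context, the parallel setup I have in mind is the standard one (this assumes `torch.distributed.init_process_group(...)` has already been called and `rank` is this process's GPU index):

```python
from torch.nn.parallel import DistributedDataParallel as DDP

# Wrap the network from the sketch above; one process per GPU.
ddp_model = DDP(model.to(rank), device_ids=[rank])
```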
Looking into the implementation of `DistributedDataParallel`, I found that the method `_sync_params` is called at the beginning of the `forward` method to sync params across devices. My question is:
Is it OK for me to call `_sync_params` once more before I use `autograd.grad` to compute the gradient of the output w.r.t. the input and then use it in my loss function? That way, the gradient computation would use the averaged parameters. Are there any caveats?
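
In code, what I am proposing looks roughly like this, continuing the sketches above (`_sync_params` is the private method I mentioned, so I realize this may change across PyTorch versions):

```python
x = x.to(rank).detach().requires_grad_(True)

y = ddp_model(x)          # forward already calls _sync_params internally
ddp_model._sync_params()  # sync once more before the extra grad call?

# Gradient of the output w.r.t. the input, kept differentiable for the loss
grad_x, = torch.autograd.grad(y.sum(), x, create_graph=True)

loss = y.pow(2).mean() + grad_x.pow(2).mean()
loss.backward()  # DDP's gradient-averaging hooks fire here as usual
```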