I am using a loss function that contains the gradient of the network output w.r.t. the input, which I obtain with `autograd.grad`. I would like to train my model in parallel using the `DistributedDataParallel` container. However, as a WARNING on the doc page mentions, `DistributedDataParallel` does not support `autograd.grad`. If I understand correctly, this is because the local model parameters (not the ones averaged across devices) will be used if I call `autograd.grad` after the `forward` call, which is of course incorrect.
Looking into the implementation of `DistributedDataParallel`, I found that the method `_sync_params` is called at the beginning of the `forward` method to sync the parameters across devices. My question is: is it OK to call `_sync_params` once more before I use `autograd.grad` to compute the gradient of the output w.r.t. the input and then use it in my loss function? That way the gradient computation would use the averaged parameters. Are there any caveats?
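In code, the training step I have in mind would look roughly like this (a sketch only; `_sync_params` is an internal DDP method, and the model, optimizer, and data are placeholders):

```python
import torch
import torch.nn.functional as F

def train_step(ddp_model, optimizer, x, target):
    # ddp_model is the DistributedDataParallel wrapper around my network.
    optimizer.zero_grad()
    x.requires_grad_(True)

    out = ddp_model(x)  # forward() calls _sync_params internally

    # The extra call in question: sync once more right before autograd.grad,
    # so that the input-gradient computation sees the averaged parameters.
    # Note this relies on a private method, so it may change between releases.
    ddp_model._sync_params()
    grad_x, = torch.autograd.grad(out.sum(), x, create_graph=True)

    loss = F.mse_loss(out, target) + grad_x.pow(2).mean()
    loss.backward()
    optimizer.step()
    return loss.detach()
```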