My training works if I use multiple GPUs on a single machine (i.e.,
DataParallel). However, if I try to train across two machines (using
DistributedDataParallel), I get the following error:
one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [256, 256, 5, 5]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
torch.autograd.set_detect_anomaly(True) points me to
spectral_norm's code that updates the weight:
File ".../python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    hook(self, input)
File ".../python3.6/site-packages/torch/nn/utils/spectral_norm.py", line 99, in __call__
    setattr(module, self.name, self.compute_weight(module, do_power_iteration=module.training))
File ".../python3.6/site-packages/torch/nn/utils/spectral_norm.py", line 86, in compute_weight
    weight = weight / sigma
That last line (weight = weight / sigma) definitely does not look like an in-place operation.
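To double-check that, I looked at the tensor version counter that autograd uses to detect this error (a minimal sketch using the private `_version` attribute; the tensor here is made up for illustration). An out-of-place division leaves the counter alone, while a genuinely in-place op bumps it:

```python
import torch

w = torch.randn(4)
v0 = w._version  # autograd's version counter for this tensor

out = w / 2.0            # out-of-place: allocates a new tensor
assert w._version == v0  # w itself is untouched

w.div_(2.0)              # in-place: the kind of op autograd complains about
assert w._version == v0 + 1
```

So `weight = weight / sigma` by itself should not bump the version of anything autograd saved for backward.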
The same error occurs even if I use
DistributedDataParallel on a single machine.
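For reference, here is a minimal sketch of the kind of setup that fails for me. The model and shapes are made up for illustration (my real network uses 256-channel 5x5 convolutions, matching the [256, 256, 5, 5] tensor in the error message), and I've used the CPU gloo backend with a single process just to show the wiring; with multiple processes this is where the error appears:

```python
import os
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.nn.utils import spectral_norm

def main():
    # Single-process process group on CPU, just to illustrate the wiring.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    torch.distributed.init_process_group("gloo", rank=0, world_size=1)

    # Hypothetical stand-in for my network: spectral_norm-wrapped convs.
    model = nn.Sequential(
        spectral_norm(nn.Conv2d(3, 8, 3, padding=1)),
        nn.ReLU(),
        spectral_norm(nn.Conv2d(8, 3, 3, padding=1)),
    )
    model = DDP(model)  # this wrapping is where things go wrong for me

    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    x = torch.randn(2, 3, 16, 16)
    for _ in range(2):
        opt.zero_grad()
        loss = model(x).pow(2).mean()
        loss.backward()  # with >1 process this raises the in-place error
        opt.step()

    torch.distributed.destroy_process_group()
    return loss.item()
```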
Any suggestions or ideas are more than welcome. Thanks in advance.