Hi!
My training works if I use multiple GPUs on a single machine (i.e. `DataParallel`). However, if I try to train on two machines (using `DistributedDataParallel`), I get the following error on `.backward()`:
one of the variables needed for gradient computation has been modified by
an inplace operation: [torch.cuda.FloatTensor [256, 256, 5, 5]] is at version 2;
expected version 1 instead. Hint: the backtrace further above shows the
operation that failed to compute its gradient. The variable in question was
changed in there or anywhere later. Good luck!
`torch.autograd.set_detect_anomaly(True)` points me to `spectral_norm`'s code that updates the weight:
File ".../python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
hook(self, input)
File ".../python3.6/site-packages/torch/nn/utils/spectral_norm.py", line 99, in __call__
setattr(module, self.name, self.compute_weight(module, do_power_iteration=module.training))
File ".../python3.6/site-packages/torch/nn/utils/spectral_norm.py", line 86, in compute_weight
weight = weight / sigma
This definitely does not look like an in-place operation, since `weight / sigma` should return a new tensor rather than modify `weight`.
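To convince myself, I checked the `_version` counter that (as far as I understand) the error message refers to. A minimal sketch, with the shape copied from the error message:

```python
import torch

w = torch.randn(256, 256, 5, 5, requires_grad=True)
sigma = torch.tensor(3.0)

out = w / sigma          # out-of-place: allocates a new tensor
print(w._version)        # 0 -- w itself was never modified

with torch.no_grad():    # needed to allow an in-place op on a leaf tensor
    w /= sigma           # a true in-place division
print(w._version)        # 1 -- this is the counter the error complains about
```

So plain division does not bump the version counter; something else must be touching the weight between forward and backward.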
The same error occurs even if I use `DistributedDataParallel` on a single machine (a minimal sketch of my setup is below).
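In case it helps, here is roughly how I set things up. This is a heavily simplified sketch: `MyModel` is a placeholder for my actual network, which contains `spectral_norm`-wrapped conv layers like the one shown.

```python
import argparse
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Placeholder model -- the real one is larger, but it contains
# spectral_norm-wrapped conv layers like this one.
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = spectral_norm(nn.Conv2d(256, 256, 5, padding=2))

    def forward(self, x):
        return self.conv(x)

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # set by torch.distributed.launch
args = parser.parse_args()

dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(args.local_rank)

model = MyModel().cuda()
model = nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])

x = torch.randn(8, 256, 32, 32).cuda()
loss = model(x).sum()
loss.backward()  # <-- RuntimeError about the in-place modification
```

I launch this on each machine with `python -m torch.distributed.launch`.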
Any suggestions or ideas are more than welcome. Thanks in advance.
Versions
PyTorch: 1.1.0
CUDA: 9.0.176