One of the variables needed for gradient computation has been modified by an inplace operation error

I am trying to run the following SinGAN code (Link), but whenever I train, the following error comes up:

/pytorch/aten/src/ATen/native/TensorFactories.cpp:361: UserWarning: Deprecation warning: In a future PyTorch release torch.full will no longer return tensors of floating dtype by default. Instead, a bool fill_value will return a tensor of torch.bool dtype, and an integral fill_value will return a tensor of torch.long dtype. Set the optional `dtype` or `out` arguments to suppress this warning.
Warning: Error detected in CudnnConvolutionBackward. Traceback of forward call that caused the error:
  File "main_train.py", line 29, in <module>
    train(opt, Gs, Zs, reals, NoiseAmp)
  File "/DATA/rani.1/PHD/SinGAN/SinGAN/training.py", line 39, in train
    z_curr,in_s,G_curr = train_single_scale(D_curr,G_curr,reals,Gs,Zs,in_s,NoiseAmp,opt)
  File "/DATA/rani.1/PHD/SinGAN/SinGAN/training.py", line 156, in train_single_scale
    fake = netG(noise.clone().detach(),prev)
  File "/DATA/rani.1/PHD/mlproj/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/DATA/rani.1/PHD/SinGAN/SinGAN/models.py", line 60, in forward
    x = self.tail(x)
  File "/DATA/rani.1/PHD/mlproj/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/DATA/rani.1/PHD/mlproj/lib64/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/DATA/rani.1/PHD/mlproj/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/DATA/rani.1/PHD/mlproj/lib64/python3.6/site-packages/torch/nn/modules/conv.py", line 353, in forward
    return self._conv_forward(input, self.weight)
  File "/DATA/rani.1/PHD/mlproj/lib64/python3.6/site-packages/torch/nn/modules/conv.py", line 350, in _conv_forward
    self.padding, self.dilation, self.groups)
 (print_stack at /pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:60)
Traceback (most recent call last):
  File "main_train.py", line 29, in <module>
    train(opt, Gs, Zs, reals, NoiseAmp)
  File "/DATA/rani.1/PHD/SinGAN/SinGAN/training.py", line 39, in train
    z_curr,in_s,G_curr = train_single_scale(D_curr,G_curr,reals,Gs,Zs,in_s,NoiseAmp,opt)
  File "/DATA/rani.1/PHD/SinGAN/SinGAN/training.py", line 179, in train_single_scale
    errG.backward(retain_graph=True)
  File "/DATA/rani.1/PHD/mlproj/lib64/python3.6/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/DATA/rani.1/PHD/mlproj/lib64/python3.6/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [3, 32, 3, 3]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

Hi,

The problem is that in this for loop: https://github.com/tamarott/SinGAN/blob/e1384a9f6dfa45497f4aed5f3e52466d4200fcfb/SinGAN/training.py#L173 you reuse the same fake computed with the original netG above.
But once you do the first optimizerG.step(), the weights of netG are modified in place, so you can't backprop through that graph again.
You should recompute fake at every iteration with the updated netG.
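
For illustration, here is a minimal sketch of that fix, assuming a generator-update loop like the one in train_single_scale (netG, netD, optimizerG, noise, prev and opt.Gsteps follow the traceback and repo above; the loss line is only a placeholder, not the exact SinGAN code):

    for j in range(opt.Gsteps):
        netG.zero_grad()

        # Rebuild fake from the current netG on every iteration, so backward()
        # runs on a graph whose weights have not been stepped yet.
        fake = netG(noise.clone().detach(), prev)

        output = netD(fake)
        errG = -output.mean()    # placeholder adversarial loss
        errG.backward()          # retain_graph=True is no longer needed here
        optimizerG.step()        # updates netG's weights in place

Moving the forward pass inside the loop is what removes the version mismatch: each backward() then sees the weight tensors at the version they had during its own forward pass.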


@albanD
I have encountered another problem while trying to get the results (SIFID score) for the same code:

SIFID/sifid_score.py:262: RuntimeWarning: Mean of empty slice.
  print('SIFID: ', sifid_values.mean())
/DATA/rani.1/PHD/mlproj/lib64/python3.6/site-packages/numpy/core/_methods.py:170: RuntimeWarning: invalid value encountered in true_divide
  ret = ret.dtype.type(ret / rcount)
SIFID:  nan
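
For context, the "Mean of empty slice" warning just means sifid_values ended up as an empty array, so NumPy returns NaN for its mean. A minimal illustration (scores stands in for sifid_values):

    import numpy as np

    scores = np.asarray([])            # empty array, like an empty sifid_values
    print('SIFID: ', scores.mean())    # RuntimeWarning: Mean of empty slice -> nan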
@albanD Just solved it… sorry for the inconvenience.
