Variables needed for gradient computation have been modified by an inplace operation

yoad.tewel · December 15, 2020, 10:46pm

Hey!
I’m running PyTorch 1.7.1 with CUDA 11 on an RTX 3080 under Ubuntu 20.04 (installed from binaries).

I tried to train the model from this repository: https://github.com/leftthomas/SRGAN
but I got the following error:

> Traceback (most recent call last):
>   File "/home/work/projects/SRGAN/train.py", line 88, in <module>
>     g_loss.backward()
>   File "/home/yoad/anaconda3/envs/torch1.7.1/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
>     torch.autograd.backward(self, gradient, retain_graph, create_graph)
>   File "/home/yoad/anaconda3/envs/torch1.7.1/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
>     Variable._execution_engine.run_backward(
> RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 1024, 1, 1]] is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

So I ran the code with anomaly detection enabled, and this is the output:

  0%|          | 0/261 [00:00<?, ?it/s][W python_anomaly_mode.cpp:104] Warning: Error detected in CudnnConvolutionBackward. Traceback of forward call that caused the error:
  File "/home/work/projects/SRGAN/train.py", line 80, in <module>
    fake_out = netD(fake_img).mean()
  File "/home/yoad/anaconda3/envs/torch1.7.1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/work/projects/SRGAN/model.py", line 84, in forward
    return torch.sigmoid(self.net(x).view(batch_size))
  File "/home/yoad/anaconda3/envs/torch1.7.1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/yoad/anaconda3/envs/torch1.7.1/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/home/yoad/anaconda3/envs/torch1.7.1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/yoad/anaconda3/envs/torch1.7.1/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 423, in forward
    return self._conv_forward(input, self.weight)
  File "/home/yoad/anaconda3/envs/torch1.7.1/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 419, in _conv_forward
    return F.conv2d(input, weight, self.bias, self.stride,
 (function _print_stack)
  0%|          | 0/261 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/work/projects/SRGAN/train.py", line 90, in <module>
    g_loss.backward()
  File "/home/yoad/anaconda3/envs/torch1.7.1/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/yoad/anaconda3/envs/torch1.7.1/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 1024, 1, 1]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

I’m not sure where to look for the in-place operation. Any ideas?

thanks!
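
For readers wondering what "is at version 2; expected version 1" refers to: autograd keeps a counter on each tensor that increases whenever the tensor is changed in place, and the backward pass compares it against the value recorded during the forward pass. The following is a minimal sketch, assuming a toy parameter and plain SGD (neither is taken from the SRGAN repository) and relying on the internal `_version` attribute, which is useful for debugging but is not a stable public API.

```python
import torch

# Hypothetical stand-in for a single model weight (not from the SRGAN repo).
weight = torch.nn.Parameter(torch.ones(3))
optimizer = torch.optim.SGD([weight], lr=0.1)

# `_version` is autograd's internal in-place modification counter.
version_before = weight._version

loss = (weight * 2.0).sum()
loss.backward()
optimizer.step()  # the optimizer updates `weight` in place

version_after = weight._version
print("version before step:", version_before)
print("version after step: ", version_after)
```

The point of the sketch is that an optimizer step counts as an in-place operation on the parameters, so the "inplace operation" in the error message does not have to be an explicit call such as `relu_()` or `+=` in the model code.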

ptrblck · December 17, 2020, 6:57am

I think you might be running into a similar issue as described here.
Skimming through the code, I’m unsure why retain_graph is used in d_loss.backward(retain_graph=True) and why fake_img and fake_out are recomputed at the end, before optimizerG.step(). Could you explain this workflow a bit and check whether the issue from the link also applies to your use case?

yoad.tewel · December 17, 2020, 2:21pm

@ptrblck
Thanks for the response. Yes, I agree it’s weird that they compute fake_img and fake_out after the loss, just before optimizerG.step().
I tried two different changes that seem to work. Training now runs, but I’m not sure whether my changes affect the intended training loop or the model’s accuracy:
(Changes can be seen here: https://github.com/YoadTew/SRGAN/blob/master/train.py)

First, I moved the computation of fake_img and fake_out so that it happens before g_loss is computed, and that fixed the problem:

            netG.zero_grad()

            fake_img = netG(z)
            fake_out = netD(fake_img).mean()

            g_loss = generator_criterion(fake_out, fake_img, real_img)

            g_loss.backward()
            optimizerG.step()

Second, I removed retain_graph=True from d_loss.backward() and passed fake_img.detach() to netD to compute the fake_out used in the discriminator update.

The thing is, I still don’t know what caused the error in the first place, or whether my changes alter the intended flow of the training script.
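
Putting the two changes together, one training iteration could look roughly like the sketch below. It is only an illustration under stated assumptions: the tiny `nn.Linear` networks, the SGD optimizers, the random tensors, the form of `d_loss`, and the simplified `g_loss` are all hypothetical stand-ins, since the real models and `generator_criterion` are not shown in this thread.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical tiny stand-ins for the SRGAN generator and discriminator.
netG = nn.Linear(4, 4)
netD = nn.Sequential(nn.Linear(4, 1), nn.Sigmoid())
optimizerG = torch.optim.SGD(netG.parameters(), lr=0.1)
optimizerD = torch.optim.SGD(netD.parameters(), lr=0.1)

z = torch.randn(2, 4)         # stand-in for the low-resolution input
real_img = torch.randn(2, 4)  # stand-in for the high-resolution target

# (1) Discriminator update: detach() cuts the graph at fake_img, so no
# gradient reaches netG and retain_graph=True is not needed.
fake_img = netG(z)
netD.zero_grad()
real_out = netD(real_img).mean()
fake_out = netD(fake_img.detach()).mean()
d_loss = 1 - real_out + fake_out  # assumed loss form, for illustration only
d_loss.backward()
g_grads_after_d_backward = [p.grad for p in netG.parameters()]
optimizerD.step()

# (2) Generator update: run a fresh forward pass through the updated netD
# so the graph used by g_loss.backward() matches the current weights.
netG.zero_grad()
fake_img = netG(z)
fake_out = netD(fake_img).mean()
# Simplified stand-in for generator_criterion(fake_out, fake_img, real_img).
g_loss = (1 - fake_out) + ((fake_img - real_img) ** 2).mean()
g_loss.backward()
optimizerG.step()

print("d_loss:", d_loss.item(), "g_loss:", g_loss.item())
```

In this sketch `g_loss.backward()` also leaves gradients in `netD`, which is why `netD.zero_grad()` is called before the next discriminator update. Whether evaluating the generator loss against the already updated discriminator matches the repository author's intent is a separate question that the sketch does not settle.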

ptrblck · December 17, 2020, 7:53pm

The error might have been raised because you tried to compute gradients using already updated parameters and thus stale intermediate activations, as explained in the linked post.
Did you compare your current code logic to the example I posted in the other topic?
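
The stale-activation situation can be reproduced in a few lines. This is a minimal sketch, not the code from the linked post or from the SRGAN repository: the small `nn.Linear` modules, the SGD optimizer, and both loss expressions are hypothetical stand-ins chosen only to trigger the same kind of error on CPU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical tiny stand-ins for the generator and discriminator.
netG = nn.Linear(4, 4)
netD = nn.Linear(4, 1)
optimizerD = torch.optim.SGD(netD.parameters(), lr=0.1)

z = torch.randn(2, 4)
real_img = torch.randn(2, 4)

fake_img = netG(z)
real_out = netD(real_img).mean()
fake_out = netD(fake_img).mean()  # this graph saves netD's current weights

d_loss = 1 - real_out + fake_out  # assumed loss form, for illustration only
netD.zero_grad()
d_loss.backward(retain_graph=True)
optimizerD.step()  # changes netD's weights in place

# g_loss still depends on the forward pass made before the step, so its
# backward pass needs the old netD weights, which no longer exist.
g_loss = 1 - fake_out
error_message = None
try:
    g_loss.backward()
except RuntimeError as err:
    error_message = str(err)

print(error_message)
```

Recomputing `fake_out` with the updated `netD` before calling `g_loss.backward()`, as in the reordered code earlier in the thread, avoids the mismatch because the new graph saves the weights as they are after the step.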