Is flatten, pool, or Conv leading to the error?

I developed a neural network in PyTorch. While backpropagating the loss using .backward(), I got the following error:

Variable._execution_engine.run_backward(

RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling .backward() or autograd.grad() the first time.

I tried many things and narrowed it down to a minimal set of lines that, for me, are responsible for the error. But I am not sure which of these four lines is actually creating the issue, and why. The lines of code that lead to the error are given below:

    t = self.c1(t1)
    t = self.pool(t)
    t = torch.flatten(t,1)
    t = F.relu((self.linear(t)))

Here t1 is a tensor and c1 is a convolutional layer. I don't think the last line is causing the issue, as there are several linear layers before and after the piece of code I provided that are not causing any problem.

Which line may be causing the error? If more information is needed, I am ready to provide it.


Observation:

I came across the following warning in the Autograd mechanics article of the PyTorch documentation:

Performing inplace operations on the input of any of the functions is forbidden as they may lead to unexpected side-effects. PyTorch will throw an error if the input to a pack hook is modified inplace but does not catch the case where the input to an unpack hook is modified inplace.

Does it have any relation to the error I am getting? Although there are multiple usages of ReLU in my network, I checked the above usage of ReLU with inplace=False and the error didn't go away.
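
For reference, here is a toy sketch (not my model) of the kind of inplace modification that warning describes: modifying a tensor that autograd saved for the backward pass makes the later backward call fail.

    # Toy example: an in-place op on a tensor that autograd saved for backward
    # bumps its version counter, so backward raises a version-mismatch error.
    import torch

    x = torch.randn(4, requires_grad=True)
    y = torch.sigmoid(x)   # sigmoid saves its output for the backward pass
    y.add_(1)              # in-place modification of the saved tensor
    try:
        y.sum().backward()
    except RuntimeError as e:
        print(e)           # "... modified by an inplace operation ..."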

I reproduced your model like this,

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class model(nn.Module):
        def __init__(self):
            super(model, self).__init__()
            self.c1 = nn.Conv2d(1, 3, 1)
            self.pool = nn.MaxPool2d(2)
            self.linear = nn.Linear(48, 32)

        def forward(self, t1):
            t = self.c1(t1)
            t = self.pool(t)
            t = torch.flatten(t, 1)
            t = F.relu(self.linear(t))

            return t

    m = model()
    a = torch.rand(1, 1, 8, 8)
    out = m(a)
    loss = out.sum()
    loss.backward()

It works well without the backprop error.
I guess the error does not come from the layers but from the loss function.

loss = -model.forward(....inputs.....).mean()

I am using .mean() instead of .sum(), so I don't think the loss function is leading to the error. If I comment out the layers I mentioned, the program runs without an error, so I think the layers are somehow responsible for the error I am getting.

Recently I updated all the .backward() calls in my code with retain_graph=True, and then placed the backward step that creates the issue inside a with torch.autograd.set_detect_anomaly(True): block. The error then narrowed down to the following:

[W python_anomaly_mode.cpp:85] Warning: Error detected in CudnnConvolutionBackward. No forward pass information available. Enable detect anomaly during forward pass for more information. (function _print_stack)

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [3, 32, 3, 3]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

As I suspected, the issue is due to some inplace operation. But when I search for an inplace operation involving a tensor of the given size, I cannot find any.
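
The warning suggests enabling anomaly detection during the forward pass as well, so it can also print the forward stack trace of the failing op. A rough sketch with a toy model (not my actual training loop):

    # Sketch: wrap BOTH the forward and the backward pass in set_detect_anomaly
    # so a backward failure also reports where the op ran in the forward pass.
    # (Toy model and data, not my actual network.)
    import torch
    import torch.nn as nn

    model = nn.Linear(8, 4)
    inputs = torch.rand(2, 8)

    with torch.autograd.set_detect_anomaly(True):
        out = model(inputs)    # forward pass is now recorded by anomaly mode
        loss = -out.mean()
        loss.backward()        # a failure here would include the forward trace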

Now, I have two questions:

1. Why does the code work well if I comment out the layers I mentioned in the first post?

2. How can I find the exact inplace operation that leads to the error?

To find the exact line, I tried introducing a random tensor after the pool operation, and then the code led to the error shown below. I tried placing random tensors before and after every operation; everywhere except after the pool, the same error (mentioned in post #1) occurred.

    t = self.c1(t1)
    t = self.pool(t)
    t = torch.FloatTensor(t.shape).to('cuda:0')
    t = torch.flatten(t,1)
    t = F.relu((self.linear(t)))

Error

RuntimeError: Function ‘AddmmBackward0’ returned nan values in its 2th output.

Don’t use retain_graph=True unless you explicitly need it and can explain why, as it usually yields other errors.

This is one expected error when using retain_graph=True, as optimizer.step() will update the parameters inplace. The backward call in the next iteration will then try to use the already updated parameters together with the stale forward activations from the first iteration to calculate the gradients, which is mathematically wrong.
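
A minimal sketch of this failure mode (generic linear model, not your code): backward with retain_graph=True, update the parameters in place, then call backward again on the same stale graph.

    # Sketch of the retain_graph=True failure: the second backward reuses a graph
    # whose saved parameters were already updated in place by the optimizer.
    import torch
    import torch.nn as nn

    model = nn.Linear(4, 2)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    x = torch.rand(3, 4)
    loss = model(x).sum()

    loss.backward(retain_graph=True)  # keep the graph and its saved tensors
    opt.step()                        # updates model.weight in place (version bump)

    try:
        loss.backward()               # old graph + updated weight -> version mismatch
    except RuntimeError as e:
        print(e)                      # "... modified by an inplace operation ..."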

Note that you are not creating a random tensor, but an uninitialized one which can contain invalid values (such as NaN, Inf, etc.).
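
For illustration, a quick sketch of the difference:

    # torch.FloatTensor(shape) allocates uninitialized memory (like torch.empty),
    # which may contain arbitrary values such as NaN or Inf; torch.rand actually
    # fills the tensor with random values.
    import torch

    uninit = torch.FloatTensor(2, 3)   # uninitialized: whatever was in memory
    random = torch.rand(2, 3)          # uniform random values in [0, 1)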

The error was not due to the layers I used; the input tensor t1 was leading to it. Since t1 is the output of another neural network, it is necessary to call .detach() on t1 before passing it to the current neural network.
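
A minimal sketch of the fix (placeholder networks, not my actual models):

    # .detach() returns a tensor that shares data but is cut off from the
    # upstream graph, so backward here only touches the second network.
    import torch
    import torch.nn as nn

    first_net = nn.Linear(8, 8)    # placeholder for the upstream network
    second_net = nn.Linear(8, 4)   # placeholder for the current network

    x = torch.rand(2, 8)
    t1 = first_net(x)              # output of the upstream network

    out = second_net(t1.detach())  # detach before feeding it to the next network
    out.sum().backward()           # gradients stop at the detach point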