In-place modification yields RuntimeError

CoolRabi · July 9, 2022, 12:54am

Hi, I'm a torch newbie and am trying to help a student with an issue they ran into when running an RNN for a special application.
The error message is:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [50, 5]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead.

Here is self-contained code that should reproduce the error. I know the code is rough; it is just for playing around to get the loss and the training running (that's why everything is hard-coded at the moment), and we are new to torch.
I added .clone() and similar calls everywhere hoping to get rid of the error, but nothing has worked so far. According to the traceback, the error is thrown on the line "result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,".

import numpy as np
import torch
from torch import autograd, nn
import torchaudio


torch.autograd.set_detect_anomaly(True)

signal_length = 500



input_size = 5
lamb = 0.5
dummy_signal = torch.from_numpy(np.random.rand(1, signal_length).astype(np.float32))
mfcc_comp = torchaudio.transforms.MFCC(sample_rate = 16000, n_mfcc = 20,melkwargs = {'n_mels':60})


def wrapperkwargs(func, kwargs):
    return func(**kwargs)

def wrapperargs(func, args):
    return func(*args)


class SimpleRNN(nn.Module):
  
  def __init__(self, input_size = input_size, output_size = input_size, unit_type = "LSTM", hidden_size = 50,skip = 1, bias_fl = True, num_layers = 1):
    super(SimpleRNN, self).__init__()
    self.input_size = input_size
    self.output_size = output_size
    self.rec = wrapperargs(getattr(nn, unit_type), [input_size, hidden_size, num_layers])
    self.lin = nn.Linear(hidden_size, output_size, bias=bias_fl)
    self.bias_fl = bias_fl
    self.skip = skip
    self.save_state = True
    self.hidden = None
    torch.autograd.set_detect_anomaly(True)

  def forward(self, x):
    if self.skip:
      # save the residual for the skip connection
     # print('x',x), print('size x', x.size())
      
      res = x[ :,0:self.skip].clone().detach()
      x, self.hidden = self.rec(x.clone(), self.hidden)
      return self.lin(x) + res
    else:   
        x, self.hidden = self.rec(x.clone(), self.hidden.clone()).detach()
        self.hidden = self.hidden.detach()
        return self.lin(x)
 # detach hidden state, this resets gradient tracking on the hidden state
  def detach_hidden(self):
         if self.hidden.__class__ == tuple:
             self.hidden = tuple([h.clone().detach() for h in self.hidden])
         else:
             self.hidden = self.hidden.clone().detach()
  # changes the hidden state to None, causing pytorch to create an all-zero hidden state when the rec unit is called
  def reset_hidden(self):
         self.hidden = None

def my_loss(output,targ, inp):
  

  mfccs_output = mfcc_comp(output)
  mfccs_target = mfcc_comp(targ)
  mfccs_input = mfcc_comp(inp)

  min1 = (mfccs_output[0][:3][:].clone()-mfccs_target[0][:3][:].clone())
  min1 = min1**2
  min1 = torch.mean(min1,1)
  min2 = (mfccs_output[0][12:17][:].clone()-mfccs_input[0][12:17][:].clone())
  min2 = min2**2
  min2 = torch.mean(min2,1)
  
  loss = lamb*sum(min1)-(1-lamb)*sum(min2)

  return 1/loss



model = SimpleRNN()
opt = torch.optim.Adam(model.parameters(), lr = 0.001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, 'min', factor=0.5, patience=5, verbose=True)
total_loss  = 0

for epoch in range(1): 
    for signal in range(1): 
        for batch in range(0,2):             
            input_reshape = torch.reshape(dummy_signal, (100,5))
            out_reshape = model(input_reshape.clone())
            out = torch.reshape(out_reshape, (1, 500))

            opt.zero_grad()
            loss = my_loss(out,dummy_signal,dummy_signal)
            loss.backward(retain_graph=True)
            opt.step() 
            print(loss)
            total_loss += loss.item()

Does anyone have an idea what to change to avoid this error? I tried all kinds of solutions I found on the internet, but none has worked so far.

Also, perhaps you know a better way to implement the following:
The RNN is supposed to have an input size N and should process L samples, with L >> N, before the loss function is called and the gradient step is performed. Is there a straightforward way to do this other than what we attempted?

ptrblck · July 9, 2022, 4:26am

Using retain_graph=True is wrong in most cases and is usually added as an attempt to get rid of this error:

RuntimeError: Trying to backward through the graph a second time

Is this also the case here, or do you have a valid use case to retain the graph?
If not, remove it and detach the hidden state via:

  def forward(self, x):
    if self.skip:      
      res = x[ :,0:self.skip].clone().detach()
      x, self.hidden = self.rec(x.clone(), self.hidden)
      self.detach_hidden()
      return self.lin(x) + res

and it should work.
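As a rough, self-contained illustration of that suggestion (not the original model: TinyRNN, the plain MSE loss, and the explicit batch dimension of 1 are assumptions made here so the sketch runs without torchaudio), the loop below detaches the hidden state after every forward pass and calls backward() without retain_graph:

```python
import torch
from torch import nn

torch.manual_seed(0)


class TinyRNN(nn.Module):
    """Hypothetical stand-in for SimpleRNN: an LSTM followed by a linear layer."""

    def __init__(self, input_size=5, hidden_size=50):
        super().__init__()
        self.rec = nn.LSTM(input_size, hidden_size)
        self.lin = nn.Linear(hidden_size, input_size)
        self.hidden = None  # None lets nn.LSTM start from an all-zero state

    def forward(self, x):
        x, hidden = self.rec(x, self.hidden)
        # Keep the state values for the next call, but cut the link to this
        # iteration's graph so the next backward() stops here.
        self.hidden = tuple(h.detach() for h in hidden)
        return self.lin(x)


model = TinyRNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Shape (seq_len, batch, input_size); the batch dimension of 1 is an assumption.
signal = torch.rand(100, 1, 5)

losses = []
for step in range(3):
    opt.zero_grad()
    out = model(signal)
    loss = ((out - signal) ** 2).mean()  # placeholder loss, not the MFCC loss
    loss.backward()                      # no retain_graph needed
    opt.step()
    losses.append(loss.item())
```

Without the detach, the hidden state carried into the second iteration would still reference the first iteration's graph, so the second backward() would most likely walk back through parameters that opt.step() has already updated in place, which is what the version-mismatch message complains about.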

CoolRabi · July 9, 2022, 11:25am

Thank you so much, it finally works! So what is going on under the hood? We started out from RNN code in the GitHub repository of a paper, and as far as I recall it had the retain_graph option set to True.

For completeness:
Before I got your answer, I was able to make it run by changing

 return self.lin(x) + res

to

 return self.lin(x.clone().detach()) + res

as well. Maybe this will help someone sometime.

edit:
In the original code, we actually stored the output of the RNN in the following manner:

out = model(target)
out_aux[mini_batch%len_outaux] = out.clone()

That is, we generated successive outputs of the RNN and concatenated them. Now, even with your change, this throws the error
“RuntimeError: Trying to backward through the graph a second time”
the second time the gradient step is supposed to be performed, pointing to this line as the cause:

out_aux[mini_batch%len_outaux] = out.clone()

(I tried cloning the tensor, but to no avail.)
The reason for this approach is that we have a long input vector that we want to process with the RNN. The RNN has an input size of 5, just for testing, and we would like to process 250 samples and then perform the gradient step. To do this, we repeatedly predict with the RNN and write the results into a new tensor. What would be the best way to do this while avoiding the error PyTorch throws?

ptrblck · July 10, 2022, 12:40am

retain_graph=True is used if you explicitly want to keep the computation graph alive (and thus allow it to grow in each iteration), but that has to fit your use case, which it often does not.
I don’t know what you are working on exactly, so check if you really want to be able to backpropagate through several iterations.
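To make the effect of the flag concrete, here is a tiny sketch on a toy tensor (the tensor w and the squared-sum loss are made up for illustration and have nothing to do with the model above). A graph can be traversed again only if the previous backward() call asked to keep it:

```python
import torch

w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
loss = (w * w).sum()  # the multiplication saves w for its backward pass

loss.backward(retain_graph=True)  # graph is kept alive after this call
loss.backward()                   # allowed; gradients accumulate in w.grad
accumulated = w.grad.clone()      # two passes of d(loss)/dw = 2 * w each

try:
    loss.backward()               # the previous call freed the graph
    third_call_failed = False
except RuntimeError:
    third_call_failed = True
```

In a training loop the same mechanism means that every iteration which keeps the graph also keeps all earlier iterations reachable, so memory grows and later backward passes run through old parameter values.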

CoolRabi:

Before I got your answer, I was able to make it run by changing
 return self.lin(x) + res
to
 return self.lin(x.clone().detach()) + res
as well. Maybe this will help someone sometime.

You are explicitly detaching the outputs of the nn.LSTM module, which might fix the error, but will also not train the LSTM anymore, which I don’t think is intended.

Make sure that the gradients using a specific output are only calculated once or try to explain what your use case is trying to achieve. I guess the new error is raised since you are appending the outputs again and are trying to call .backward() on the current output as well as the previous outputs (which were already used).
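One possible way to arrange this, sketched under assumptions since the actual use case is not fully specified: collect the outputs of several consecutive chunks in a Python list, concatenate them, call backward() exactly once per window, and then drop the list and detach the hidden state. The names chunk_len and chunks_per_step, the MSE placeholder loss, and the batch dimension of 1 are all hypothetical choices for this example.

```python
import torch
from torch import nn

torch.manual_seed(0)

rec = nn.LSTM(5, 50)
lin = nn.Linear(50, 5)
params = list(rec.parameters()) + list(lin.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

# Long signal as (seq_len, batch, input_size); batch of 1 is an assumption.
long_signal = torch.rand(1000, 1, 5)
chunk_len = 10        # time steps per forward call (hypothetical)
chunks_per_step = 5   # 5 chunks * 10 steps * 5 features = 250 samples per update

hidden = None
outputs, targets = [], []
num_updates = 0

for i, chunk in enumerate(long_signal.split(chunk_len)):
    y, hidden = rec(chunk, hidden)  # state flows between chunks in a window
    outputs.append(lin(y))
    targets.append(chunk)

    if (i + 1) % chunks_per_step == 0:
        opt.zero_grad()
        # Placeholder loss over the whole window, not the MFCC loss.
        loss = ((torch.cat(outputs) - torch.cat(targets)) ** 2).mean()
        loss.backward()  # each stored output is used in exactly one backward
        opt.step()
        # Start the next window from the same state values but a fresh graph.
        hidden = tuple(h.detach() for h in hidden)
        outputs, targets = [], []
        num_updates += 1
```

The important part is that the list is emptied after every update; a buffer such as out_aux that still holds outputs from a window that was already backpropagated would pull the freed graph back into the next loss.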