RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

Ashima_Kalathingal · July 14, 2024, 2:08pm

Hello,
I am trying to implement a Neural ODE in PyTorch and I am getting the following error. I am new to PyTorch and cannot tell where the error is coming from.
The error is:
one of the variables needed for gradient computation has been modified by an inplace operation: [torch.DoubleTensor [100, 3]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

This is my code snippet below.
class NeuralODE(nn.Module):
  def __init__(self,x1,x2):
      super(NeuralODE,self).__init__()
      self.x1 = x1
      self.x2 = x2
      l = 100
      self.f = nn.Sequential(nn.Linear(3,l),nn.Tanh(),nn.Linear(l,1))

  def forward(self,t,x):
      ifunc   = self.x1
      t_numpy = np.array([t.item()]) if t.dim() == 0 else t.detach().numpy()
      i       = torch.DoubleTensor([ifunc(t_numpy)]).reshape(-1,1)
      x2    = (self.x2).clone().reshape(-1,1)

      #dy/dt 
      C = 5
      dy_dt = (-1/C)*i
      S       = x[0,:].clone().reshape(-1,1) 
      T       = x[1,:].clone().reshape(-1,1) 

      
      i_S_T = torch.cat((i,S,T),dim=1)

      # dp/dt
      C1    = 0.0015397895
      C2    = 0.020306583

      Q = self.f(i_S_T) # Finding Q through the neural network
      dp_dt = (-C1*(T-x1))+(C2*Q)

      return torch.cat((dy_dt,dp_dt),dim=0)

Can you please help me debug it?

KFrank · July 14, 2024, 10:26pm

Hi Ashima!

The reported shape of the tensor, [100, 3], can be a useful hint – see
below.

For some explanation about how inplace-modification errors occur and
some techniques to debug them, see this post:

"RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [64, 1]], which is output 0 of AsStridedBackward0, is at version 3; expected version 2 instead. Hint: the backtrace further a autograd

Hi Fahmyadan and Sangyoon! Here are some suggestions about how to track down (and maybe fix) inplace-modification errors. Note that an inplace modification in the forward pass is not necessarily an error – it depends on whether and how the tensor that was modified is used in the backward pass. Note that inplace operations can be useful for saving memory – if you replace an innocent inplace operation with an out-of-place equivalent, your training will use more memory (and, to a minor e…

You can sometimes “automatically” fix such errors by using pytorch’s
sweep-inplace-modification-errors-under-the-rug context manager, but
it’s probably good practice to track down and understand the cause of
the error, even if you do use this solution.
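A minimal sketch of that approach, assuming the context manager meant here is torch.autograd.graph.allow_mutation_on_saved_tensors (the post does not name it, and it needs PyTorch 2.0 or later). The tensors a and b are made up for illustration:

```python
import torch

# Assumption: this is the context manager the post alludes to.
with torch.autograd.graph.allow_mutation_on_saved_tensors():
    a = torch.ones(2, 3, requires_grad=True)
    b = a.clone()
    out = (b ** 2).sum()   # autograd saves b to compute d(out)/db = 2 * b
    b.sin_()               # in-place change to a tensor saved for backward
    out.backward()         # works: the original values of b were preserved

grad_a = a.grad            # computed from b as it was before sin_()
```

Outside the context manager, the same sequence raises the inplace-modification error at out.backward().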

It’s quite possible that the shape-[100, 3] tensor reported in the error
message is self.f[0].weight (that is, Linear (3, l).weight). You
are probably training your NeuralODE and you should be aware that calling
optimizer.step() on your model will modify that Linear (3, l).weight
inplace. Are you using .backward (retain_graph = True) anywhere?
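The scenario described here can be reproduced in a few lines. This is only a sketch with a made-up model, input, and loss (not the NeuralODE from the question): a graph is kept alive with retain_graph=True, the optimizer updates the weights in place, and a second backward pass through the old graph then fails.

```python
import torch

torch.manual_seed(0)
# Hypothetical stand-in for the network in the question.
model = torch.nn.Sequential(
    torch.nn.Linear(3, 100), torch.nn.Tanh(), torch.nn.Linear(100, 1)
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 3)

loss = model(x).pow(2).mean()
loss.backward(retain_graph=True)   # graph (and its saved weights) kept alive
opt.step()                         # modifies the weights in place

error_message = None
try:
    loss.backward()                # old graph expects the pre-step weights
except RuntimeError as err:
    error_message = str(err)

print(error_message)
```

The second backward() needs the weights as they were during the forward pass, but opt.step() has already bumped their version counter.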

Best.

K. Frank

Ashima_Kalathingal · July 15, 2024, 9:42am

Hello,
Thank you for your reply.

I am using loss.backward(retain_graph = True) to backpropagate the loss. When I use just loss.backward(), it shows the following error:

Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

I don’t understand how to work around this. Any help would be much appreciated.

Thank you again.

Ashima

Ashima_Kalathingal · July 15, 2024, 11:33am

I printed out the versions of the weights and biases of the neural network. They change from 1 to 2 in the next epoch, so, as my error indicates, the expected version is 1.

I modified the neural network in my code like below. I added clone to the weights. But it did not change the outcome. It is still showing the same error.

      y1    =  torch.nn.functional.linear(i_S_T,self.lin1.weight.clone(),self.lin1.bias)
      
      y2    = self.tanh(y1)
      
      Q     = torch.nn.functional.linear(y2,self.lin2.weight.clone(),self.lin2.bias)

KFrank · July 15, 2024, 4:18pm

Hi Ashima!

Just to be clear, to me “epoch” means that you have iterated over the
entire training set once, but that the training set consists of many batches.

Typically, you would perform one optimization step for each batch:

input, labels = get_batch (...)     # one batch, not whole training set
output = model (input)              # forward pass
loss = loss_fn (output, labels)
opt.zero_grad()
loss.backward()                     # backward pass
opt.step()                          # optimization step

If your error truly shows up only after an entire epoch, rather than occurring
after just a single batch, then you are likely doing something incorrect at the
end (or beginning) of the loop over batches that makes up an epoch.

(A common approach is to compute performance metrics for your validation
set and maybe also for your training set after each epoch. It would likely be
an error if any such per-epoch computations contained a loss.backward()
or opt.step() and something like that could be causing your issue.)

Moving on from the question of whether your error happens within your
loop over batches or only once per epoch:

Do you have more than one loss.backward() (or similar) line of code?
If so, why?

Do you have more than one opt.step() (or similar) line of code, and if
so, why?

In your original post, self.f is a Sequential. But in the code you post
below, you have self.lin1 and self.lin2. Let me use lin1 and lin2
to be concrete:

Please add (where model refers to the instance of NeuralODE that you
are training):

print ('model.lin1.weight._version:', model.lin1.weight._version)
print ('model.lin2.weight._version:', model.lin2.weight._version)
output = model (input)   # or whatever
print ('model.lin1.weight._version:', model.lin1.weight._version)
print ('model.lin2.weight._version:', model.lin2.weight._version)

around each call to model (...), and, similarly:

print ('model.lin1.weight._version:', model.lin1.weight._version)
print ('model.lin2.weight._version:', model.lin2.weight._version)
opt.step()
print ('model.lin1.weight._version:', model.lin1.weight._version)
print ('model.lin2.weight._version:', model.lin2.weight._version)

around each call to opt.step().
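As a rough illustration of what these prints should show, here is a self-contained sketch on a made-up single Linear layer (the names model, opt and the input are hypothetical). The forward and backward passes leave the weight's version alone; opt.step() increases it.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(3, 1)                      # hypothetical toy model
opt = torch.optim.SGD(model.parameters(), lr=0.1)

version_before = model.weight._version
model(torch.randn(4, 3)).sum().backward()          # forward + backward
version_after_backward = model.weight._version
opt.step()                                         # in-place weight update
version_after_step = model.weight._version

print(version_before, version_after_backward, version_after_step)
```

If a version changes at a point where you did not expect an update, that is the place to look for the offending inplace operation.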

Please post the exact code fragments where you make these calls and
please post the exact output you get from the print statements.

Please also run with a with torch.autograd.detect_anomaly(): context
manager and post the full inplace-modification error message, including the
forward-call Traceback that anomaly detection gives you.
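A minimal sketch of running under anomaly detection, again with a made-up model and a deliberately provoked error rather than the code from this thread. The forward pass has to happen inside the context manager so that the forward-call traceback can be recorded.

```python
import torch

torch.manual_seed(0)
# Hypothetical toy model and data, for illustration only.
model = torch.nn.Sequential(
    torch.nn.Linear(3, 100), torch.nn.Tanh(), torch.nn.Linear(100, 1)
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 3)

caught = None
with torch.autograd.detect_anomaly():
    loss = model(x).pow(2).mean()        # forward pass is traced
    loss.backward(retain_graph=True)
    opt.step()                           # in-place weight update
    try:
        loss.backward()                  # fails; forward traceback is reported
    except RuntimeError as err:
        caught = str(err)

print(caught)
```

Anomaly detection slows training down noticeably, so it is best enabled only while debugging.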

If it fits within, say, ten or twenty lines, please post your exact code where
you execute the equivalent of:

output = model (input)
loss = loss_fn (output, labels)
opt.zero_grad()
loss.backward()
opt.step()

Also, just to double-check, please print out model.lin1.weight.shape and
model.lin2.weight.shape at some point after you instantiate model.

> I modified the neural network in my code like below. I added clone to the weights. But it did not change the outcome. It is still showing the same error.
>
>       y1    =  torch.nn.functional.linear(i_S_T,self.lin1.weight.clone(),self.lin1.bias)
>
>       y2    = self.tanh(y1)
>
>       Q     = torch.nn.functional.linear(y2,self.lin2.weight.clone(),self.lin2.bias)

This would be a correct way to fix certain specific inplace-modification
errors, but maybe your error has a somewhat different cause.

As an aside, please use three backticks, ```, to correctly format your code
and output text.

(I use “```python” for code and “```text” for text.)

Best.

K. Frank

Ashima_Kalathingal · July 16, 2024, 2:24pm

Hello, I was able to debug my code. I made a very silly mistake. I will outline my mistake here.
The code is below

loss_batch = 0
for epoch in range(50):
      optimizer.zero_grad()
      output = model(input)
      loss = loss_fn(output,labels)
      loss_batch = loss_batch+loss
      loss_batch.backward()
      optimizer.step()

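I initialized loss_batch = 0 outside my training loop, which was the wrong thing to do: the loss kept accumulating across iterations, so each loss_batch.backward() call also went back through the graphs of all earlier iterations. I corrected it and the code works fine.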
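For anyone hitting the same problem, here is a sketch of the buggy and the corrected pattern side by side. The model, data and hyperparameters are made-up stand-ins, not the ones from my project.

```python
import torch

torch.manual_seed(0)
# Hypothetical stand-ins for the real model and data.
model = torch.nn.Sequential(
    torch.nn.Linear(3, 100), torch.nn.Tanh(), torch.nn.Linear(100, 1)
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()
inputs, labels = torch.randn(16, 3), torch.randn(16, 1)

# Buggy pattern: loss_batch carries the graphs of earlier iterations.
buggy_error = None
loss_batch = 0
try:
    for epoch in range(2):
        optimizer.zero_grad()
        loss_batch = loss_batch + loss_fn(model(inputs), labels)
        loss_batch.backward()      # second pass re-enters the freed graph
        optimizer.step()
except RuntimeError as err:
    buggy_error = str(err)

# Corrected pattern: each iteration builds and frees its own graph.
losses = []
for epoch in range(50):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())     # .item() keeps a float, not a graph
```

If a running total is needed for logging, accumulating loss.item() (or a detached tensor) avoids keeping old graphs alive.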

Thanks for the help.
