How to debug "gradient computation has been modified by an inplace operation" errors?

Hello folks. There are a lot of posts on the forum about debugging this particular error. It seems to occur when a user makes an in-place change to a tensor, which then breaks the gradient computation for that tensor. That all makes sense.

My question is how users should debug these types of errors. I have a model and a training loop. The training loop raises this error at the loss.backward() step, which is very far from the actual source of the problem. The traceback also hides the real location: it only shows the error at loss.backward().

Now the error message indicates that I should use the torch.autograd.detect_anomaly function or context manager to debug the error. But I have not found a realistic example of using this tool. The example I found is a toy: it writes a function MyFunc() and then checks its gradients, which does not show how to debug these errors in a real model.

Can I use something like pdb to debug these types of errors? In my case I created a Sequence-to-Sequence model with Attention. Now the original sequence-to-sequence code worked and it seems to have a lot of inplace operations. However, when I added a global Attention mechanism to the code, then it started to fail. So it is not clear how adding attention suddenly started causing this error.

If it is helpful, here is the full error message. It gives me a sense of which tensor is causing the issue, but not exactly where.

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace 
operation: [torch.cuda.FloatTensor [64, 1, 7]], which is output 0 of UnsqueezeBackward0, is at version 
729; expected version 725 instead. Hint: enable anomaly detection to find the operation that failed to 
compute its gradient, with torch.autograd.set_detect_anomaly(True).

Any suggestions would be helpful.

  1. Identify the tensor. set_detect_anomaly will show the line where it was used. “output 0 of UnsqueezeBackward0” refers to where the tensor was created earlier, which may be less useful than its shape.
  2. Manually search for later uses that mutate that tensor. Its _version attribute will change somewhere.
  3. Yes, this is annoying and cumbersome.
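
For what it is worth, here is a minimal sketch of steps 1 and 2 together. It is not the poster's model: the tensors, the mask, and the checkpoint helper are all hypothetical, chosen only because softmax keeps its output for the backward pass, so editing that output in place reproduces the same class of error. The version numbers in the error message ("is at version 729; expected version 725") are the same counter that _version exposes.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins for attention scores and a padding mask.
scores = torch.randn(64, 7, requires_grad=True)
mask = torch.ones(64, 7)
mask[:, 0] = 0.0

version_log = []

def checkpoint(label, tensor):
    """Record the tensor's in-place version counter at this point in the forward pass."""
    version_log.append((label, tensor._version))

error = None
# Anomaly mode records a forward traceback for every autograd node, so it is slow;
# enable it only while debugging.
with torch.autograd.detect_anomaly():
    weights = torch.softmax(scores, dim=-1)  # softmax saves its output for backward
    checkpoint("after softmax", weights)

    weights.mul_(mask)                       # in-place edit of that saved output
    checkpoint("after masking", weights)

    loss = weights.sum()
    try:
        loss.backward()
    except RuntimeError as err:
        error = str(err)

for label, version in version_log:
    print(f"{label}: _version={version}")
print(error)
```

The anomaly traceback points at the forward call whose backward failed (the softmax line here), not at the line that did the mutation. The _version checkpoints are what narrow down the mutation itself: the in-place op sits between the last checkpoint with the old value and the first one with the new value.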

@googlebot haha, thanks for the tips. This is helpful, since I was not sure if that UnsqueezeBackward0 was something that I could trace or not. Seems like the answer is “not”, based on your answer.

I was able to trace through the model and put .clone() on some assignments, which made the code work, albeit more slowly. I probably have more clones than I need, so I can now go through and remove them until the model breaks again. Haha, yeah, definitely a cumbersome thing, but not surprising.
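
To illustrate the kind of change this involves, here is a small sketch with made-up tensors (not the actual model). It assumes the offending tensor is a softmax output that later gets masked in place. Cloning before the in-place edit leaves the tensor that autograd saved untouched, and an out-of-place multiply achieves the same thing without an explicit clone.

```python
import torch

torch.manual_seed(0)

# Hypothetical inputs; the names are for illustration only.
x = torch.randn(64, 7, requires_grad=True)
mask = torch.ones(64, 7)
mask[:, 0] = 0.0

# Option A: clone, then mutate the copy. The softmax output saved for
# backward is never modified.
weights = torch.softmax(x, dim=-1)
safe = weights.clone()
safe.mul_(mask)
safe.sum().backward()
grad_clone = x.grad.clone()

# Option B: use the out-of-place op, which allocates a new result anyway.
x.grad = None
weights = torch.softmax(x, dim=-1)
masked = weights * mask
masked.sum().backward()
grad_out_of_place = x.grad.clone()

print(torch.allclose(grad_clone, grad_out_of_place))
```

Both options allocate one extra tensor, so when an out-of-place op is available it is usually the simpler fix of the two.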

Haha, it is funny because from a numerical computing standpoint, we are taught to prefer in-place operations because they are non-allocating and accelerate the code. So it is funny to ignore that instinct.

The odd thing was that all of the examples I saw of this error occurred when a user assigned something to itself, for example u[i] = 2*u[i-1] or something like that. In my case I did not have anything like that, so the real issue is probably buried somewhere in the lowered code of one of the function calls. But at least I know what to do for these kinds of errors now. Thanks again for helping to decipher the message.
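
One possible explanation for why the original code tolerated in-place operations while the attention version did not: an in-place edit only breaks backward when the edited tensor was saved by an earlier op for its gradient. The sketch below uses hypothetical helper functions to show this. It also shows two less obvious in-place spellings, += and inplace=True, neither of which looks like an assignment of a tensor to itself.

```python
import torch
import torch.nn.functional as F

def backward_ok(build):
    """Run forward and backward on a fresh input; report whether backward succeeded."""
    x = torch.randn(5, requires_grad=True)
    try:
        build(x).sum().backward()
        return True
    except RuntimeError:
        return False

def mul_then_add(x):
    h = x * 2   # multiplying by a constant does not need h itself for backward
    h += 1      # in-place, but nothing autograd saved is touched
    return h

def exp_then_add(x):
    h = x.exp() # exp saves its output for backward
    h += 1      # in-place edit of that saved output
    return h

def sigmoid_then_inplace_relu(x):
    h = torch.sigmoid(x)            # sigmoid also saves its output
    return F.relu(h, inplace=True)  # bumps h's version even though the values do not change

for fn in (mul_then_add, exp_then_add, sigmoid_then_inplace_relu):
    print(fn.__name__, backward_ok(fn))
```

So adding a layer such as softmax, sigmoid, or tanh in front of existing in-place code can be enough to trigger the error, even when the in-place lines themselves were not touched.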