RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [256]] is at version 3; expected version 2 instead

I am trying to implement NashMTL on a multi-task model. I don’t get this error when no multi-objective optimization (MOO) is applied, i.e. when I simply call backward on each loss immediately after computing it inside the task loop.

For the method, I collect each task’s loss into a losses tensor and pass this tensor on to first compute the gradients of each loss with respect to the shared parameters (extracted by a custom shared-parameter function). The resulting gradients tensor is then passed to the NashMTL calculation, which derives a weighting for each of the losses. Finally, backward is called on the weighted, combined loss. Whether I collect the losses first and then apply autograd, or apply it directly, I get the same error. The model is complicated and has a custom loss function, and I am also using DDP since I am training on multiple GPUs.
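Roughly, the flow looks like this (a simplified sketch; the tiny model, the MSE losses, and the uniform weighting here are just placeholders for the real backbone, custom losses, and NashMTL solver):

```
import torch
import torch.nn.functional as F

# Stand-ins for the real shared backbone and the two task heads.
shared = torch.nn.Linear(8, 8)
heads = torch.nn.ModuleList([torch.nn.Linear(8, 1) for _ in range(2)])

x = torch.randn(4, 8)
targets = [torch.randn(4, 1), torch.randn(4, 1)]

features = shared(x)
losses = torch.stack([F.mse_loss(head(features), t)
                      for head, t in zip(heads, targets)])

# Per-task gradients w.r.t. the shared parameters (the real code uses a
# custom shared-parameter extraction function).
shared_params = list(shared.parameters())
grads = []
for i in range(len(losses)):
    g = torch.autograd.grad(losses[i], shared_params, retain_graph=True)
    grads.append(torch.cat([gi.flatten() for gi in g]))
grads = torch.stack(grads)

# NashMTL turns `grads` into per-task weights; a uniform weighting stands
# in for the real solver here.
weights = torch.full((len(losses),), 1.0 / len(losses))
(weights * losses).sum().backward()
```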

Using anomaly detection doesn’t show anything. The tensors involved in the backward/autograd calls are the losses, but I don’t understand why the tensor in the error has a shape of [256]. Could anyone give me a direction to look into?

Hi azj!

Start by working through the discussion of inplace-modification errors and
how to debug them given in this post:

If you’re still having problems, tell us what specific debugging techniques
you’ve tried and post the entire error message you get including the
forward-pass traceback produced by anomaly detection.

Best.

K. Frank

Hello, thank you for your reply. I have read many solutions, but since my model is not that simple, it’s harder to find exactly where the error is happening. Here are the main parts of the code that I am dealing with.

  1. I initialize grads and train_losses before the task loop, inside the batch loop (I only have two tasks).
  2. Compute the loss for each task, compute the gradients of each loss using autograd, and save the losses and grads in those variables.
  3. Outside the loop I send these grads and losses to the weighting backward, which is the multi-objective optimization method NashMTL (taken from the public MTL framework LibMTL on GitHub).
  4. My error occurs at the final stage of NashMTL, where the final backward is performed.

Enabling anomaly detection doesn’t show me anything; I’ve tried putting it in every file that might be causing the error.
What I’ve done:

  • Replaced all the seemingly in-place operations inside the loss-function file and where the model is built (it’s a YOLO-based model); the sketch after this list shows the kind of rewrites I made
  • Thought combining the losses into a tensor might be causing the problem, so I tried separating each loss and adding them one by one instead (now commented out)
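For illustration, these are the kinds of in-place rewrites I mean (generic toy examples, not the actual loss or model code):

```
import torch

x = torch.randn(4, requires_grad=True)
y = x * 2
z = (y * y).sum()   # the backward of y*y needs the saved y

y += 1              # in-place: bumps y._version, so z.backward() would now fail
# out-of-place rewrite that keeps the saved y intact:
# y = y + 1

# module-level flags were also changed, e.g.
# torch.nn.ReLU(inplace=True)  ->  torch.nn.ReLU(inplace=False)
```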

My questions:
Does it have anything to do with the losses being collected in a tensor? Or is it happening in the loss calculation or the model structure?
Also, would the shared parameters have anything to do with it? (If their extraction is wrong, would it cause such a problem?)
Maybe I am using anomaly detection wrong; how and where should I insert it? It doesn’t show anything no matter where I place it.

Please help, thank you

Additionally, I found out that the tensor in question is one of the parameter tensors during the forward pass, or a gradient during the autograd.grad backward. Before I recheck the forward pass and the loss, I want to know whether any of the simple operations I am performing in the screenshots above are causing the problem, because before applying NashMTL, if I just calculate the loss and call .backward() for each task, I don’t get any error. So I thought there is something wrong with how I’m handling the loss tensor afterwards, or with the shared parameters.
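For reference, here is a minimal, self-contained reproduction of this class of error, assuming the mechanism is an in-place update of a parameter between the forward pass and the final backward (the toy layers here are of course not my real model):

```
import torch

lin1 = torch.nn.Linear(8, 8)
lin2 = torch.nn.Linear(8, 1)
x = torch.randn(4, 8)

h = lin1(x)
loss = lin2(h).sum()        # backward through lin2 saves lin2.weight

with torch.no_grad():
    lin2.weight.mul_(0.9)   # in-place update bumps lin2.weight._version

loss.backward()             # RuntimeError: ... modified by an inplace operation
```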

Hi azj!

That’s why I posted a link to an explanation of various causes of such
errors, together with a number of debugging techniques.

Please don’t post screenshots of textual information as doing so breaks
accessibility, searchability, and copy-paste.

Do you mean by this that anomaly detection doesn’t print out a forward-pass
traceback? This seems unlikely.

Try wrapping your entire training loop in a with autograd.detect_anomaly():
block.
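For example (a toy sketch; your whole forward pass, autograd.grad calls, and final backward all go inside the block):

```
import torch

model = torch.nn.Linear(4, 4)
x = torch.randn(2, 4)

# Both the forward pass (which records the traceback) and the backward pass
# (which reports the anomaly) must run inside the same detect_anomaly block.
with torch.autograd.detect_anomaly():
    out = model(x)
    loss = out.sum()
    loss.backward()
```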

Best.

K. Frank

Hi azj!

Identifying the problem tensor is an important first step.

As suggested in the debugging post I linked to, print out
problem_tensor._version at the beginning, end, and at various
places in your code.

Then use a divide-and-conquer strategy to locate the specific place
where ._version gets increased. That is the location of your inplace
modification.
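For example (a toy illustration; suspect stands in for whichever tensor the error message points at):

```
import torch

suspect = torch.nn.Parameter(torch.randn(256))
print(suspect._version)        # e.g. 0

out = (suspect * 2).sum()      # out-of-place ops leave the counter alone
print(suspect._version)        # still 0

with torch.no_grad():          # no_grad needed for an in-place op on a leaf parameter
    suspect.add_(1.0)          # in-place op bumps the counter
print(suspect._version)        # now 1 -> the culprit lies between the last two prints
```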

Best.

K. Frank


Thanks, I found the problem. As for anomaly detection, I tried wrapping the loss calculation and the gradient calculation, and also just placing it at the beginning of the file; it tells me that there is an in-place operation, but did not show me exactly where, or any other traceback.