Autograd cannot compute loss.backward()

(Kirk86) #1

Hi folks,
I’ve encountered this error today.

Traceback (most recent call last):
  File "", line 162, in <module>
  File "/home/user/miniconda3/envs/torch/lib/python3.6/site-packages/torch/", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/user/miniconda3/envs/torch/lib/python3.6/site-packages/torch/autograd/", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

Here’s a MWE:

def dummy(x):
    res =, x.transpose(dim0=0, dim1=1))
    res = res * -2
    res = res + (x**2).sum(dim=1)[:, None]
    res = res + (x**2).sum(dim=1)
    res, idxs = torch.max(res, 0)
    res.flatten()[::x.size(0) + 1] = 0.0
    res = torch.sqrt(res)
    return res

If I swap res = torch.sqrt(res) with res = res**(1/2), everything seems to work properly. I’m missing something fundamental here, but I don’t know what. Probably the way autograd works internally?
Any insights would be highly appreciated!

(Juan F Montesinos) #2

I guess your main problem is that you are modifying res in-place here:

res.flatten()[::x.size(0) + 1] = 0.0 

That’s not backpropagable.

(Kirk86) #3

Thanks for the help. Are you sure that’s the problem, though? As I mentioned above, if I change torch.sqrt(res) into res**(1/2) without modifying res.flatten()[::x.size(0) + 1] = 0.0, everything seems to work. Any ideas why?

Also, if that is indeed the conflicting line, what is the appropriate way of rewriting it? Should each operation be placed on a separate line?

res = res.flatten()
res[::x.size(0) + 1] = 0.0 --> maybe: temp = res[::x.size(0) + 1]; temp = 0.0

(Juan F Montesinos) #4

The problem is that I don’t know which function ** dispatches to. My guess is that it is not checking whether the graph is broken. For sure you cannot do this: res.flatten()[::x.size(0) + 1] = 0.0. It’s not a matter of how you code it. You are hard-assigning some values of the tensor to 0, and that’s not backpropagable. There are no gradients between that and the previous state, since you manually hard-coded it.

(Kirk86) #5

Thanks Juan,

If that’s the case, then it’s a silent error that the PyTorch folks should look into, because it could let incorrect computations go unnoticed.

Awesome, that’s one of those aha moments that hits you in the face.
I guess I’ll have to come up with a differentiable way to do the equivalent of assigning those values.
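One out-of-place option might be multiplying by a mask instead of assigning into the tensor. A minimal sketch (the sizes and the `mask` name here are just illustrative, not taken from the code above):

```python
import torch

# Hypothetical sketch: zero the diagonal of an n x n matrix without an
# in-place write, so every op stays on the autograd tape.
n = 4
res = torch.rand(n, n, requires_grad=True)
mask = 1.0 - torch.eye(n)   # 0 on the diagonal, 1 elsewhere
out = res * mask            # out-of-place: diagonal entries become 0
out.sum().backward()
print(res.grad)             # grad is the mask: 0 on the diagonal, 1 elsewhere
```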

(Juan F Montesinos) #6

Maybe @albanD can clarify where the ** expression points to, but you should find an alternative to the hard assignment, if I’m not mistaken.


(Alban D) #7


The ** expression routes to the pow operation.

The reason why one works and not the other is because of the way the derivatives are computed:

  • For sqrt, for better performance, the result is reused, as you can see here (you don’t need to understand the semantics, but the point is that result is used here).
  • For pow (or **), the formula is much more general, as you can see here, and does not use result.

Both cases will give you the same result.
As you saw, since sqrt uses the value of the output, you are not allowed to modify it in-place. If you use **, the backward will be slower, but since it does not use result, the output can be modified in-place.

For the assignment, it will mask out some entries in the Tensor (and so no gradient will flow back for these entries) and leave the other ones intact (and so gradients will flow as expected for these ones).
Isn’t that the behavior you were expecting? If not what were you trying to achieve?
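To see the two behaviors side by side, here is a minimal standalone sketch (not your exact dummy; plain tensors and a recent PyTorch are assumed):

```python
import torch

# sqrt's backward reuses its *output* (d sqrt(x)/dx = 1 / (2 * sqrt(x))),
# so an in-place write on that output invalidates the saved tensor.
x = torch.rand(4, requires_grad=True)
y = torch.sqrt(x)
y[0] = 0.0                  # in-place write on sqrt's output
sqrt_failed = False
try:
    y.sum().backward()
except RuntimeError:        # "... modified by an inplace operation"
    sqrt_failed = True

# pow's backward uses the *input* instead, so the same write is allowed;
# the masked entry simply gets no gradient.
x2 = torch.rand(4, requires_grad=True)
z = x2 ** 0.5
z[0] = 0.0                  # masks out this entry's gradient
z.sum().backward()          # works: x2.grad[0] is 0, the rest flow normally
print(sqrt_failed)
print(x2.grad)
```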

(Kirk86) #8

Lovely, thanks for the great explanation. Much appreciated!