When would the error "one of the variables needed for gradient computation has been modified" occur?

It seems that x += 1 will actually modify x in-place. And the error "one of the variables needed for gradient computation has been modified" did occur when I tried to bifurcate the data flow like this:

x = self.conv1(input) # main branch
x_p = self.conv1_p(x) # another branch
x += some_layer # main branch

It worked when I changed it to:

x = self.conv1(input) # main branch
x_p = self.conv1_p(x) # another branch
x = x + some_layer # main branch

So I thought maybe we should use x = x + everywhere. But then I remembered that the official ResNet implementation actually uses the in-place version out += residual, and it works.

So I was just wondering: when do we have to use x = x +, and when is x += acceptable?

Here is what happens:

x = self.conv1(input) # main branch
# x is a Variable containing a tensor whose value is the output of conv1

x_p = self.conv1_p(x) # another branch
# Here, x is used as the input to conv1_p. Since conv1_p will need the value of x
# to compute its backward pass, it marks x to be saved for the backward pass.

x += some_layer # main branch
# Here you try to modify the tensor x in-place. If you do so, it would no longer
# contain the output of conv1, and conv1_p would not be able to compute its
# backward pass, so this is forbidden.
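
For reference, here is a minimal, self-contained sketch of that failure (the layer shapes and the +1 are made up for illustration); autograd only raises the error once backward() is called:

import torch
import torch.nn as nn

conv1 = nn.Conv2d(3, 8, 3, padding=1)
conv1_p = nn.Conv2d(8, 8, 3, padding=1)

x = conv1(torch.randn(1, 3, 16, 16))  # main branch
x_p = conv1_p(x)  # side branch: conv1_p saves x for its backward pass
x += 1            # in-place add bumps x's version counter

# RuntimeError: one of the variables needed for gradient computation
# has been modified by an inplace operation
x_p.sum().backward()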

In your second example, everything is the same except the last line.
When you do x = x + some_layer, you create a new tensor that contains the result of x + some_layer, and then assign this new tensor to the Python variable x. The tensor that contains the output of conv1 has not been changed, so conv1_p can compute its backward pass without problem.

The difference between x = x + and x += is that the first operation creates a new tensor while the second one does not create any new tensor. So the second one can be considered a small optimization, speed- and memory-wise, but it is not always possible.
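
A quick way to see that difference (tensor sizes here are arbitrary) is to compare the underlying storage pointers:

import torch

x = torch.randn(4)
y = torch.randn(4)

ptr = x.data_ptr()
x += y                      # in-place: reuses x's storage
print(x.data_ptr() == ptr)  # True

x = x + y                   # out-of-place: allocates a new tensor
print(x.data_ptr() == ptr)  # False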

Thank you so much for your reply.

So my question now is: why does the official ResNet example use out += residual and still work? Shouldn't modifying out in-place block the gradient?

Because out here is the output of a batchnorm layer, and a batchnorm layer does not require its output to be able to perform the backward pass. So you can modify it in-place without any problem.
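
As a sketch of why that pattern is fine (the layer sizes are made up, but the structure mirrors a residual block): the in-place add is applied to the output of a BatchNorm, which is not needed for its backward pass, so the whole thing backpropagates without complaint:

import torch
import torch.nn as nn

conv = nn.Conv2d(8, 8, 3, padding=1)
bn = nn.BatchNorm2d(8)
relu = nn.ReLU(inplace=True)

x = torch.randn(2, 8, 16, 16, requires_grad=True)
residual = x

out = bn(conv(x))
out += residual       # in-place add on the batchnorm output: allowed
out = relu(out)

out.sum().backward()  # runs without the in-place error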

Got it. Thank you so much! :+1: