When would the error "one of the variables needed for gradient computation has been modified" occur?

It seems that x += 1 will actually modify x in-place. And the error "one of the variables needed for gradient computation has been modified" did occur when I tried to bifurcate the data flow like this:

x = self.conv1(input) # main branch
x_p = self.conv1_p(x) # another branch
x += some_layer # main branch

It worked when I changed it to:

x = self.conv1(input) # main branch
x_p = self.conv1_p(x) # another branch
x = x + some_layer # main branch

So I thought maybe we should use x = x + everywhere. But then I remembered that the official ResNet implementation actually uses the in-place version out += residual, and it works.

So I was just wondering: when do we have to use x = x +, and when is x += acceptable?

Here is what happens:

x = self.conv1(input) # main branch
# x is a Variable containing a tensor whose value is the output of conv1

x_p = self.conv1_p(x) # another branch
# Here, x is used as the input to conv1_p. Since conv1_p will need the value of x
# to compute its backward pass, it marks x to be saved for the backward pass.

x += some_layer # main branch
# Here you try to modify the tensor x in-place. If you do so, it would no longer
# contain the output of conv1, and conv1_p would not be able to compute its
# backward pass, so this is forbidden.
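
For reference, here is a minimal, self-contained sketch of that failure (the layer shapes and the +1 are made up for illustration); autograd only raises the error once backward() is called:

import torch
import torch.nn as nn

conv1 = nn.Conv2d(3, 8, 3, padding=1)
conv1_p = nn.Conv2d(8, 8, 3, padding=1)

x = conv1(torch.randn(1, 3, 16, 16))  # main branch
x_p = conv1_p(x)  # side branch: conv1_p saves x for its backward pass
x += 1            # in-place add bumps x's version counter

# RuntimeError: one of the variables needed for gradient computation
# has been modified by an inplace operation
x_p.sum().backward()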

In your second example, everything is the same except the last line.
When you do x = x + some_layer, you create a new tensor that contains the result of x + some_layer, and then assign this new tensor to the Python variable x. The tensor that contains the output of conv1 has not been changed, so conv1_p can compute its backward pass without problem.

The difference between x = x + and x += is that the first operation creates a new tensor while the second one does not create any new tensor. So the second one can be considered a small optimization, speed- and memory-wise, but it is not always possible.
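
A quick way to see that difference (tensor sizes here are arbitrary) is to compare the underlying storage pointers:

import torch

x = torch.randn(4)
y = torch.randn(4)

ptr = x.data_ptr()
x += y                      # in-place: reuses x's storage
print(x.data_ptr() == ptr)  # True

x = x + y                   # out-of-place: allocates a new tensor
print(x.data_ptr() == ptr)  # False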

Thank you so much for your reply.

So my question now is: why does the official ResNet example use out += residual and still work? Shouldn't modifying out in-place block the gradient?

Because out here is the output of a batchnorm layer, and a batchnorm layer does not require its output to be able to perform the backward pass. So you can modify it in-place without any problem.
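
As a sketch of why that pattern is fine (the layer sizes are made up, but the structure mirrors a residual block): the in-place add is applied to the output of a BatchNorm, which is not needed for its backward pass, so the whole thing backpropagates without complaint:

import torch
import torch.nn as nn

conv = nn.Conv2d(8, 8, 3, padding=1)
bn = nn.BatchNorm2d(8)
relu = nn.ReLU(inplace=True)

x = torch.randn(2, 8, 16, 16, requires_grad=True)
residual = x

out = bn(conv(x))
out += residual       # in-place add on the batchnorm output: allowed
out = relu(out)

out.sum().backward()  # runs without the in-place error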

Got it. Thank you so much! :+1: