It seems that x += 1 will actually modify x in-place. And the problem "one of the variables needed for gradient computation has been modified" did occur when I tried to bifurcate the data flow like this:
x = self.conv1(input) # main branch
x_p = self.conv1_p(x) # another branch
x += some_layer # main branch
It worked when I changed it to:
x = self.conv1(input) # main branch
x_p = self.conv1_p(x) # another branch
x = x + some_layer # main branch
So I thought maybe we should use x = x + everywhere. But then I remembered that the official ResNet implementation actually uses the in-place version out += residual, and it works.
So I was just wondering when we have to use x = x + and when x += is acceptable.
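For reference, here is a minimal, self-contained version of the two variants (the layer shapes and names are made up for illustration, not my actual model):
import torch
import torch.nn as nn

conv1 = nn.Conv2d(3, 8, 3, padding=1)
conv1_p = nn.Conv2d(8, 8, 3, padding=1)
inp = torch.randn(1, 3, 16, 16)
some_tensor = torch.randn(1, 8, 16, 16)

# In-place version: fails at backward time
x = conv1(inp)                      # main branch
x_p = conv1_p(x)                    # another branch; conv1_p saves x for backward
x += some_tensor                    # overwrites the tensor that conv1_p saved
try:
    (x.sum() + x_p.sum()).backward()
except RuntimeError as e:
    print(e)                        # "... has been modified by an inplace operation"

# Out-of-place version: works
x = conv1(inp)                      # main branch
x_p = conv1_p(x)                    # another branch
x = x + some_tensor                 # new tensor; the saved x is untouched
(x.sum() + x_p.sum()).backward()    # runs fine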
x = self.conv1(input) # main branch
# x is a Variable containing a tensor whose value is the output of conv1
x_p = self.conv1_p(x) # another branch
# Here, x is used as the input to conv1_p. Since conv1_p will need the value of x
# to compute its backward pass, it marks x to be saved for the backward pass.
x += some_layer # main branch
# Here you try to modify the tensor x in-place. If you did so, it would no longer
# contain the output of conv1, and conv1_p would not be able to compute its
# backward pass, so this is forbidden.
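To see what autograd is checking, here is a small sketch (hypothetical tensors) using the private _version counter: every in-place operation bumps it, and backward compares it against the value recorded when the tensor was saved.
import torch
import torch.nn as nn

conv1_p = nn.Conv2d(8, 8, 3, padding=1)
inp = torch.randn(1, 8, 16, 16, requires_grad=True)
x = inp * 1.0            # a non-leaf tensor, so in-place ops on it are allowed
print(x._version)        # 0
x_p = conv1_p(x)         # conv1_p saves x (at version 0) for its backward pass
x += 1                   # the in-place add bumps x's version counter
print(x._version)        # 1
try:
    x_p.sum().backward()
except RuntimeError as e:
    print(e)             # recent PyTorch reports the version mismatch explicitly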
In your second example, everything is the same except the last line.
When you do x = x + some_layer, you create a new tensor that contains the result of x + some_layer, then assign this new tensor to the Python variable x. Here the tensor x that contains the output of conv1 has not been changed, so conv1_p can compute its backward pass without any problem.
The difference between x = x + and x += is that the first operation creates a new tensor while the second does not create any new tensor. So the second one can be considered a small optimization, speed- and memory-wise, but it is not always possible.
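A quick way to see this difference (hypothetical tensors) is to compare data_ptr(), which reports the address of the underlying storage:
import torch

x = torch.randn(1000)
y = torch.randn(1000)

before = x.data_ptr()
x = x + y                      # new tensor: freshly allocated storage
print(x.data_ptr() == before)  # False

x = torch.randn(1000)
before = x.data_ptr()
x += y                         # in-place: the existing storage is reused
print(x.data_ptr() == before)  # True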
Because out here is the output of a batchnorm layer, and a batchnorm layer does not need its output to perform the backward pass. So you can modify it in-place without any problem.
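As a sketch (a simplified block, not the actual torchvision code), the residual addition in a ResNet-style block backpropagates fine even though it is in-place, because nothing in the graph needs the batchnorm output that gets overwritten:
import torch
import torch.nn as nn

conv = nn.Conv2d(8, 8, 3, padding=1)
bn = nn.BatchNorm2d(8)

inp = torch.randn(1, 8, 16, 16, requires_grad=True)
residual = inp
out = bn(conv(inp))    # no other branch has saved `out` for its backward
out += residual        # in-place add on a tensor nobody needs to keep
out = torch.relu(out)
out.sum().backward()   # works: no saved tensor was invalidated
print(inp.grad.shape)  # torch.Size([1, 8, 16, 16])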