Will my gradient be computed wrongly if something I need in the middle is modified in place and I just clone it?

I am not familiar with how autograd works at the backend. Currently I have to modify the output of my network before passing it to the criterion. My network outputs two things. The code is as follows:

        x_dot_wT, phi_theta = output
        # clone so the in-place edit below does not touch x_dot_wT itself
        f_y = x_dot_wT.clone()
        batch_size = target.size(0)
        idxs = torch.arange(batch_size, dtype=torch.long)
        # replace the target-class entries with a weighted blend
        f_y[idxs, target] = ((_lambda * f_y[idxs, target]) + phi_theta[idxs, target]) / (1 + _lambda)

        loss = criterion(f_y, target)

I conveniently clone x_dot_wT to avoid the “variable is modified in place” error. I wonder if this will affect autograd negatively and cause my gradient to be computed wrongly. Does it?


As long as you don’t use the now-deprecated .data, the autograd engine will detect anything that would lead to wrong gradients being computed — in your case, this in-place modification. Cloning here is the right fix and will give you the correct gradient.
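For anyone curious, here is a quick standalone check (a toy tensor, not the original model) that .clone() is itself a differentiable op, so gradients flow through the clone back to the original tensor:

```python
import torch

# Toy check: clone + indexed in-place edit, same pattern as the code above.
x = torch.randn(4, 3, requires_grad=True)

f = x.clone()          # differentiable copy; the in-place edit hits f, not x
f[:, 0] = f[:, 0] * 2  # indexed in-place assignment on the clone is allowed
f.sum().backward()

print(x.grad[:, 0])    # column 0 receives gradient 2.0
print(x.grad[:, 1:])   # remaining columns receive gradient 1.0
```

The gradient of `sum(2 * x[:, 0]) + sum(x[:, 1:])` with respect to `x` is 2 in the edited column and 1 elsewhere, which is exactly what `x.grad` holds — the clone did not cut the graph.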

Will the gradient computed in this case be the same as when I move the modification (the code posted above) into the network? It seems to me that there will be a difference. I have not done enough experiments to conclude, though.

Cloning does not change any gradient, wherever you put it, so if the only change you make is putting a .clone() at different position, this will give you the same results.
If you’re doing something else, I’m sorry, I don’t understand what you mean.

When I move the block of code into the forward function of my network, I do not have to clone. I am not sure why, though (PyTorch doesn’t complain). That’s why I have the question of whether cloning will cause the gradient to be computed differently. Within the network, I do not have to clone. Outside of it, I have to.

Or am I doing something wrong, such that the in-place modification should be rejected wherever the code is placed?

It is possible that when you unpack your network outputs, you create multiple tensors that share the same underlying storage, and so in-place operations on them become forbidden.
For example when doing x_dot_wT, phi_theta = output.
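A minimal illustration of that shared-storage situation (using torch.chunk as a stand-in, since I can’t see how output is built): the unpacked pieces are views of one tensor, so in-place edits on them are rejected.

```python
import torch

w = torch.randn(4, requires_grad=True)
a, b = w.chunk(2)  # two views that share storage with w

try:
    a.mul_(2)      # in-place op on a view of a leaf that requires grad
except RuntimeError as e:
    print("forbidden:", e)
```

If `a` and `b` had been independent tensors (e.g. built with separate allocations), the in-place op would have gone through; the error comes purely from the shared storage.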

“It is possible …” - Does this mean by chance?

It means it depends on the rest of your code: which piece you move around, what sits between the other place where you put it and where it is now, etc.

Correct me if I am wrong. PyTorch does not allow in-place modification because it needs to know all the operations carried out so that it can compute gradients later. Right now, since I clone the variable, do something to it, and pass this new variable f_y into the criterion to calculate the loss, wouldn’t autograd be unaware of the operations done to x_dot_wT? It would then assume that the loss is calculated directly from x_dot_wT, so the gradient would be different from the case where autograd knows about these modifications because I do not clone.

So in my code, when I do loss.backward(), does PyTorch think that the loss is computed using x_dot_wT and not f_y? Is that the case?

I’m not sure I understand your question here.
Keep in mind that not all in-place operations are forbidden; it depends on whether the tensor’s value is actually needed for the backward pass. For example, you can change the output of a batchnorm layer in place, but you cannot change the output of a conv layer in place.
Note as well that indexing a tensor, concatenating it, or splitting it count as autograd ops, and so may prevent you from changing the tensor in place (depending on which operation you use).
Whatever happens, the autograd engine will give you the correct gradients for the parameters of your net if it does not raise an error.
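To illustrate the “it depends whether the tensor is needed” point with two simple ops (a sketch, not the batchnorm/conv case itself): add saves nothing for its backward, so its output can be modified in place, while exp saves its own output (since d exp(x)/dx = exp(x)), so modifying that output in place trips the autograd check.

```python
import torch

# add's backward needs no saved tensors: in-place on its output is fine
x = torch.randn(3, requires_grad=True)
y = x + 1
y.mul_(2)
y.sum().backward()  # works; x.grad is all 2s

# exp saves its *output* for backward, so an in-place edit invalidates it
x2 = torch.randn(3, requires_grad=True)
z = x2.exp()
z.mul_(2)
try:
    z.sum().backward()
except RuntimeError:
    print("autograd detected the in-place modification")
```

In both cases the in-place op itself runs without complaint; the error (when there is one) only surfaces at backward time, when autograd notices that a saved tensor has changed since it was recorded.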

Just curious, why can I sometimes modify in place and sometimes not? Is there anywhere I can read a detailed explanation?