Clone and detach in v0.4.0

If you then try to call backward, you will get an error. This kind of problem is just harder to detect during the forward pass, so we only detect it during the backward pass.

the .backward() did NOT catch the in-place operation on a tensor that is in the forward computation graph :cry: @albanD

def error_unexpected_way_to_by_pass_safety():
    import torch

    a = torch.tensor([1, 2, 3.], requires_grad=True)
    # detached tensors are leaf tensors
    a_detached = a.detach()
    # a.fill_(2)  # illegal: errors because a tensor that requires grad is used in
    # an in-place op (the op would not be recorded in the computation graph, so
    # the derivative of the forward path would be wrong)
    a_detached.fill_(2)  # weirdly allowed, and it modifies a's memory too -- this
                         # seems to bypass the error check from the previous comment?!
    print(f'a = {a}')
    print(f'a_detached = {a_detached}')
    a.sum().backward()

output:

a = tensor([2., 2., 2.], requires_grad=True)
a_detached = tensor([2., 2., 2.])

Is this a bug? It’s very surprising; I did expect .backward() to catch it too.

Sorry, I read it too quickly.
.detach() returns a Tensor that is detached from the history and does not require gradients. So there is nothing wrong with modifying this Tensor in place.

My original answer applies to view operations that are differentiable.

It’s modifying the original tensor in the graph! Why do you think that’s ok? I just changed the memory of a without autograd knowing about it. Isn’t that bad?

The whole point of .detach() and torch.no_grad() is to be able to do that.
If you want your ops to be differentiable, you shouldn’t use these constructs.

but it’s modifying the original tensor…that’s what I’m confused about.

I thought .detach() was about cutting off flow of gradients to the original tensors in the computation graph but still being able to use the data of the original (without modifying it in sneaky evil ways that don’t inform the original computation graph).

I guess I don’t see why .detach().clone() shouldn’t just be what .detach() does. I don’t see a use for .detach() on its own, especially if it has these unexpected behaviors.

In most cases, you don’t want to modify the detached Tensor inplace. And so doing a clone would be much slower for no added benefit.
The doc mentions that this is a view very clearly.
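To make the "it's a view" point concrete, here is a minimal sketch (not from the thread) that compares storage pointers to show that .detach() shares memory with the original while .clone() allocates fresh memory:

```python
import torch

a = torch.tensor([1., 2., 3.], requires_grad=True)
a_detached = a.detach()

# detach() returns a view: same underlying storage, no copy.
print(a.data_ptr() == a_detached.data_ptr())   # True

# clone() allocates new memory -- this is the extra cost being discussed.
a_cloned = a.detach().clone()
print(a.data_ptr() == a_cloned.data_ptr())     # False
```

This is why modifying the detached tensor in place also changes the original's values.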

I appreciate your help. I hope you appreciate I am trying to help because I am genuinely confused.

The docs say for detach:

NOTE: Returned Tensor shares the same storage with the original one. In-place modifications on either of them will be seen, and may trigger errors in correctness checks. IMPORTANT NOTE: Previously, in-place size / stride / storage changes (such as resize_ / resize_as_ / set_ / transpose_) to the returned tensor also update the original tensor. Now, these in-place changes will not update the original tensor anymore, and will instead trigger an error. For sparse tensors: In-place indices / values changes (such as zero_ / copy_ / add_) to the returned tensor will not update the original tensor anymore, and will instead trigger an error.

the part that confuses me is:

In-place modifications on either of them will be seen, and may trigger errors in correctness checks.

So it says “may”, meaning it does not always throw errors. I am just trying to understand why it did not throw an error in a case I thought was an unambiguous error. But it’s not an error, which indicates a flaw in my understanding of how .detach is used, and that worries me.

When it does and does not throw an error is what is not clearly explained. “May trigger errors” is ambiguous. What I am after here are the formal semantics for when errors are thrown on in-place operations when one uses detach.

I think the confusion is about what “correctness checks” are.
If the user changes the values of the Tensor in place and then uses it, we don’t consider that to be an error. If you change some values while explicitly hiding it from the autograd with .detach(), we assume you have a good reason to do so.

What happens though is that the forward pass needs to save some Tensor values to be able to compute the backward pass. If you modify one of these saved Tensors before running the backward, then in that case, you will get an error. Because the original value was needed to compute the right gradients and it does not exist anymore (was modified inplace).
For example here, the output of exp() is required in the backward, so if we modify it inplace, you get an error:

>>> a = torch.rand(10, requires_grad=True)
>>> b = a.exp()
>>> b.mul_(2)
tensor([5.0635, 3.1801, 2.5123, 2.1725, 2.6194, 2.4245, 4.1136, 5.2920, 3.9636,
        3.0117], grad_fn=<MulBackward0>)
>>> b.sum().backward()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/albandes/workspace/pytorch_dev/torch/tensor.py", line 183, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/Users/albandes/workspace/pytorch_dev/torch/autograd/__init__.py", line 125, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [10]], which is output 0 of ExpBackward, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
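The "version 1; expected version 0" bookkeeping in that error can be observed directly. As a side note (this uses `_version`, an internal Tensor attribute, shown here only for illustration), each in-place op bumps a version counter, and backward checks it against the version recorded when the tensor was saved:

```python
import torch

a = torch.rand(10, requires_grad=True)
b = a.exp()        # the backward of exp() reuses this output

# In-place ops increment the tensor's internal version counter.
print(b._version)  # 0 -- the version saved for the backward pass
b.mul_(2)
print(b._version)  # 1 -- no longer matches, so backward() will raise
```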

Yes that is exactly what I expected! So if I have a new intermediate tensor and I try to act on it with any in-place operation that should throw an error, right?

Well I think I found a counter example. It doesn’t hold for .clone():

def clone_playground():
    import torch

    a = torch.tensor([1,2,3.], requires_grad=True)
    a_clone = a.clone()
    print(f'a is a_clone = {a is a_clone}')
    print(f'a == a_clone = {a == a_clone}')
    print(f'a = {a}')
    print(f'a_clone = {a_clone}')
    #a_clone.fill_(2)
    a_clone.mul_(2)
    print(f'a = {a}')
    print(f'a_clone = {a_clone}')
    a_clone.sum().backward()

output:

a is a_clone = False
a == a_clone = tensor([True, True, True])
a = tensor([1., 2., 3.], requires_grad=True)
a_clone = tensor([1., 2., 3.], grad_fn=<CloneBackward>)
a = tensor([1., 2., 3.], requires_grad=True)
a_clone = tensor([2., 4., 6.], grad_fn=<MulBackward0>)

why is .clone special? If it’s added to the autograd graph it should be just like any other op.

Because the output of clone is not required during the backward pass. So there is no reason to throw an error if it was changed in place.
This is why I used exp() in my example.

What do you mean it’s not used in the backward pass?

I am explicitly summing the contents of the a_clone vector. It’s explicitly required in the backward because I am using it as the input to sum, which I then take the derivative of:

a_clone.sum().backward()

Also why is mul in the computation graph if it was an in-place operation? I thought those were not recorded by autograd.

What do you mean it’s not used in the backward pass?

To compute the backward of exp(x), we need to compute grad_out * exp(x). So we re-use the result of the forward instead of recomputing it for performance reasons.
For clone, the backward just needs to compute grad_out (a no-op) and so no need to save the value of any Tensor from the forward.
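Putting the two cases side by side, a sketch (an assumption-free extension of the thread's own clone example) showing that gradients still flow correctly through the modified clone, because clone's backward needs no saved values:

```python
import torch

a = torch.tensor([1., 2., 3.], requires_grad=True)
a_clone = a.clone()
a_clone.mul_(2)           # allowed: clone's backward never reads a_clone's values
a_clone.sum().backward()  # no version error, unlike the exp() case

# The mul_ is still recorded in the graph: d(sum(2*a))/da = 2 everywhere.
print(a.grad)             # tensor([2., 2., 2.])
```

So the in-place op is recorded (hence MulBackward0 in the output above), and the error is only raised when an in-place op clobbers a value some backward formula actually needs.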

I thought those were not recorded by autograd.

All the ops (except if you use .detach() or torch.no_grad()) are recorded. Otherwise, we wouldn’t be able to compute the correct gradients.

So what is wrong with this then?

>>> a.mul_(3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

Just record the mul as usual. But PyTorch is warning me of something… what is wrong with using in-place ops with leaf tensors? I always thought in-place ops were bad because they go directly to the data of a tensor, bypassing the graph tracking that autograd does, BUT clearly I was very wrong.

I’m confused as ever X’D

No, inplace ops work just fine with the autograd in pytorch :slight_smile:

The first limitation is that if you overwrite a value that was needed for the backward, you will get an error and will have to replace the in-place op with an out-of-place one.
The other is that in-place ops on leaf tensors are not allowed either. A leaf is a Tensor that requires grad and doesn’t have any history: usually a Tensor for which the user will access the .grad field after calling .backward(). But if you do an in-place op on it, the Tensor will contain new values and thus have a history with respect to the autograd (and won’t be a leaf anymore). That means its .grad field won’t be populated, which is most likely not what the user wants. Hence the error that you see.
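The sanctioned way to modify a leaf in place is to hide the update from autograd, which is essentially what optimizers do for an SGD step. A minimal sketch (the 0.1 learning rate is just an illustrative value):

```python
import torch

w = torch.tensor([1., 2., 3.], requires_grad=True)
w.sum().backward()
print(w.grad)        # tensor([1., 1., 1.])

# w.mul_(3) here would raise the leaf-variable error shown above.
# Wrapping the update in no_grad keeps w a leaf with no history:
with torch.no_grad():
    w -= 0.1 * w.grad

print(w)             # tensor([0.9000, 1.9000, 2.9000], requires_grad=True)
print(w.is_leaf)     # True -- .grad will still be populated on future backwards
```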


Oh I see! Basically because it becomes a non-leaf. That makes sense.

I get the gist now. In-place ops are bad when they overwrite data that PyTorch needs to compute derivatives correctly. That makes sense.

Thanks so much!

The clone thing is still confusing to me but I need to reflect on it to figure out what exactly is confusing me.


Sorry if this is repetitive but I still don’t get it. What is wrong with doing clone first and then detach, i.e. .clone().detach()?

If we clone and then detach, then we still have a new tensor with its own memory, and we’ve blocked the gradient flow to the earlier graph.

If we do .detach().clone()

then we create a tensor that shares the same memory but forgets the old gradient flow, and then we make a clone of it, so now it has new memory (but since it’s a copy of the detached tensor, it still doesn’t have the gradient flow to the earlier part of the graph).

These seem equivalent. Are they not? Is there an error in my reasoning?

Sorry if this is repetitive but I still don’t get it. What is wrong with doing clone first and then detach, i.e. .clone().detach()?

Nothing. They will give an equivalent end result.
The minor optimization of doing detach() first is that the clone operation won’t be tracked: if you do clone first, the autograd info is created for the clone, and after the detach, because it is inaccessible, it is deleted. So the end result is the same, but you do a bit more useless work.
In any meaningful workload you shouldn’t see any perf difference though. So no need to worry too much about it :smiley:
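To confirm the equivalence, a quick sketch comparing the two orderings (nothing here beyond what the thread discusses):

```python
import torch

a = torch.tensor([1., 2., 3.], requires_grad=True)

x = a.clone().detach()   # clone is tracked, then the tracking is discarded
y = a.detach().clone()   # nothing is tracked in the first place

# Both end up as independent copies with no autograd history:
print(torch.equal(x, y))                   # True
print(x.requires_grad, y.requires_grad)    # False False
print(x.data_ptr() == a.data_ptr())        # False -- fresh memory either way
print(y.data_ptr() == a.data_ptr())        # False
```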


I wish I would have known that there was no difference, but it was hard to know a priori if there was anything subtle I could have missed. Glad to know it’s safe!

Thank you! Everything is finally clear to me. I appreciate your feedback! You’re a boss at this, Alban :muscle:
