Clone and detach in v0.4.0

Hi,

Yes, .detach() gives a new Tensor that is a view of the original one, so any in-place modification of one will affect the other.
You should use .clone() if you want a Tensor with the same content backed by new memory.
And .detach().clone() if you want a new Tensor backed by new memory that does not share the autograd history of the original one.
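For concreteness, here is a small sketch of the three cases (toy values, variable names are just for illustration):

import torch

a = torch.tensor([1., 2., 3.], requires_grad=True)

d = a.detach()            # same memory as a, no autograd history
c = a.clone()             # new memory, still attached to the graph
dc = a.detach().clone()   # new memory, no autograd history

d.fill_(7.)               # in-place on the detached view also changes a
print(a)                  # tensor([7., 7., 7.], requires_grad=True)
print(c)                  # tensor([1., 2., 3.], grad_fn=<CloneBackward>)
print(dc)                 # tensor([1., 2., 3.])
print(c.requires_grad, dc.requires_grad)   # True False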

That was extremely useful. I wasn’t trying to solve a specific problem, I was just playing around with the different ops, but those are fantastic use cases to mention.

The thing I was shocked to discover is that I can bypass the safety check of in-place operations if I modify the detached version. Look:

def error_unexpected_way_to_by_pass_safety():
    import torch
    a = torch.tensor([1, 2, 3.], requires_grad=True)
    # are detached tensors leaves? yes they are
    a_detached = a.detach()
    # a.fill_(2)  # illegal: a Tensor that requires grad is used in an in-place op, so the op
    #             # would not be recorded in the computation graph and the derivative of the
    #             # forward path would be wrong
    a_detached.fill_(2)  # weird that this one is allowed; it seems to let me bypass the error check from the previous comment...?!
    print(f'a = {a}')
    print(f'a_detached = {a_detached}')

Is this a PyTorch bug, @albanD? I was shocked to see that the second call is allowed but the first one is not, since the second seems to be a way to cheat around the first check.


Related Stack Overflow question: Why am I able to change the value of a tensor without the computation graph knowing about it in Pytorch with detach?

Can you explain what this means exactly?

Does “new tensor” mean I get a reference to a new instance of the Tensor class? And if it is additionally a view of the “original one”, does that mean the new Tensor instance internally holds a pointer/reference to the other tensor’s storage, where the actual data lives?
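A quick check along these lines (my own sketch, just to make the question concrete):

import torch

a = torch.tensor([1., 2., 3.], requires_grad=True)
d = a.detach()
print(d is a)                          # False: a brand-new Tensor object
print(d.data_ptr() == a.data_ptr())    # True: both point at the same memory
print(d.requires_grad, d.grad_fn)      # False None: no autograd history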

If you try to backward you will get an error. It is just harder to detect during the forward. We only detect it during the backward.

The .backward() call did NOT catch the in-place operation on a tensor that is in the forward computation graph :cry: @albanD

def error_unexpected_way_to_by_pass_safety():
    import torch
    a = torch.tensor([1, 2, 3.], requires_grad=True)
    # are detached tensors leaves? yes they are
    a_detached = a.detach()
    # a.fill_(2)  # illegal: a Tensor that requires grad is used in an in-place op, so the op
    #             # would not be recorded in the computation graph and the derivative of the
    #             # forward path would be wrong
    a_detached.fill_(2)  # weird that this one is allowed; it seems to let me bypass the error check from the previous comment...?!
    print(f'a = {a}')
    print(f'a_detached = {a_detached}')
    a.sum().backward()

output:

a = tensor([2., 2., 2.], requires_grad=True)
a_detached = tensor([2., 2., 2.])

Is this a bug? It’s very surprising; I expected .backward() to catch it too.

Sorry, I read it too quickly.
.detach() detaches the returned Tensor from the autograd history, and the result does not require gradients. So there is nothing wrong with modifying this Tensor in place.

My original answer applies to view operations that are differentiable.
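For example, a minimal sketch of the differentiable-view case (using exp() so that the forward result is actually saved for the backward):

import torch

a = torch.rand(3, requires_grad=True)
b = a.exp()          # the backward of exp() needs the output b
v = b[:]             # a differentiable view of b (it has a grad_fn)
v.zero_()            # in-place through the view bumps b's version counter
b.sum().backward()   # expected to raise the "modified by an inplace operation" RuntimeError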

It’s modifying the original tensor in the graph! Why do you think that’s ok? I just changed the memory of a without autograd knowing about it. Isn’t that bad?

The whole point of .detach() and torch.no_grad() is to be able to do that.
If you want your ops to be differentiable, you shouldn’t use these constructs.
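The canonical example is a manual optimizer step; here is a sketch of a plain SGD update on a toy parameter:

import torch

w = torch.randn(3, requires_grad=True)   # a parameter
loss = (w * w).sum()
loss.backward()

with torch.no_grad():
    w -= 0.1 * w.grad    # intentionally invisible to autograd
w.grad.zero_()           # w is still a leaf that requires grad
print(w.is_leaf, w.requires_grad)        # True True

The update writes to w’s memory directly, yet w stays a leaf, so its .grad field keeps working on the next iteration.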

But it’s modifying the original tensor… that’s what I’m confused about.

I thought .detach() was about cutting off the flow of gradients to the original tensors in the computation graph, while still being able to use the data of the original (without modifying it in sneaky ways that the original computation graph doesn’t know about).

I guess I don’t see why one would ever prefer .detach() over .detach().clone(). I don’t see a use for .detach() on its own, especially if it has these unexpected behaviors.

In most cases you don’t want to modify the detached Tensor in place, so doing a clone would be much slower for no added benefit.
The doc mentions very clearly that this is a view.
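Typical read-only uses look like this (a toy sketch, the names are made up):

import torch

pred = torch.randn(8, requires_grad=True)   # stand-in for a model output
target = torch.randn(8)
loss = ((pred - target) ** 2).mean()

running_loss = loss.detach()      # 0-d tensor outside the graph, no copy made
pred_np = pred.detach().numpy()   # .numpy() refuses Tensors that require grad

Since none of these values are ever modified in place, a clone here would only cost extra memory and time.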

I appreciate your help. I hope you appreciate I am trying to help because I am genuinely confused.

The docs say for detach:

NOTE: Returned Tensor shares the same storage with the original one. In-place modifications on either of them will be seen, and may trigger errors in correctness checks. IMPORTANT NOTE: Previously, in-place size / stride / storage changes (such as resize_ / resize_as_ / set_ / transpose_) to the returned tensor also update the original tensor. Now, these in-place changes will not update the original tensor anymore, and will instead trigger an error. For sparse tensors: In-place indices / values changes (such as zero_ / copy_ / add_) to the returned tensor will not update the original tensor anymore, and will instead trigger an error.

the part that confuses me is:

In-place modifications on either of them will be seen, and may trigger errors in correctness checks.

So it says “may”, meaning it does not always throw errors. I am just trying to understand why it did not throw an error in a case I thought was an unambiguous error. The fact that it is not an error points to a flaw in my understanding of how to use .detach(), which worries me.

When it does and does not throw an error is what is not clearly explained. “May trigger errors” is ambiguous. What I am after here are the formal semantics for when errors are thrown on in-place operations when one uses detach.

I think the confusion is about what “correctness checks” are.
If the user changes the values of the Tensor in place and then uses it, we don’t consider that to be an error. If you change some values while explicitly hiding it from the autograd with .detach(), we assume you have a good reason to do so.

What happens though is that the forward pass needs to save some Tensor values to be able to compute the backward pass. If you modify one of these saved Tensors before running the backward, then you will get an error, because the original value was needed to compute the right gradients and it does not exist anymore (it was modified in place).
For example here, the output of exp() is required in the backward, so if we modify it inplace, you get an error:

>>> a = torch.rand(10, requires_grad=True)
>>> b = a.exp()
>>> b.mul_(2)
tensor([5.0635, 3.1801, 2.5123, 2.1725, 2.6194, 2.4245, 4.1136, 5.2920, 3.9636,
        3.0117], grad_fn=<MulBackward0>)
>>> b.sum().backward()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/albandes/workspace/pytorch_dev/torch/tensor.py", line 183, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/Users/albandes/workspace/pytorch_dev/torch/autograd/__init__.py", line 125, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [10]], which is output 0 of ExpBackward, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Yes, that is exactly what I expected! So if I have a new intermediate tensor and I try to act on it with any in-place operation, that should throw an error, right?

Well, I think I found a counterexample. It doesn’t hold for .clone():

def clone_playground():
    import torch

    a = torch.tensor([1,2,3.], requires_grad=True)
    a_clone = a.clone()
    print(f'a is a_clone = {a is a_clone}')
    print(f'a == a_clone = {a == a_clone}')
    print(f'a = {a}')
    print(f'a_clone = {a_clone}')
    #a_clone.fill_(2)
    a_clone.mul_(2)
    print(f'a = {a}')
    print(f'a_clone = {a_clone}')
    a_clone.sum().backward()

output:

a is a_clone = False
a == a_clone = tensor([True, True, True])
a = tensor([1., 2., 3.], requires_grad=True)
a_clone = tensor([1., 2., 3.], grad_fn=<CloneBackward>)
a = tensor([1., 2., 3.], requires_grad=True)
a_clone = tensor([2., 4., 6.], grad_fn=<MulBackward0>)

Why is .clone() special? If it’s added to the autograd graph, it should be just like any other op.

Because the output of clone is not required during the backward pass, so there is no reason to throw an error if it was changed in place.
This is why I used exp() in my example.

What do you mean it’s not used in the backward pass?

I am explicitly summing the contents of the a_clone vector. It’s explicitly required in the backward because I am using it as the input to sum, which I then take the derivative of:

a_clone.sum().backward()

Also, why is mul in the computation graph if it was an in-place operation? I thought those were not recorded by autograd.

What do you mean it’s not used in the backward pass?

To compute the backward of exp(x), we need to compute grad_out * exp(x). So we re-use the result of the forward instead of recomputing it for performance reasons.
For clone, the backward just needs to return grad_out unchanged (essentially a no-op), so there is no need to save the value of any Tensor from the forward.
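To make the difference concrete, here is a hypothetical re-implementation of both ops as custom autograd Functions (just to illustrate what gets saved; these are not the actual ATen kernels):

import torch

class MyExp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        y = x.exp()
        ctx.save_for_backward(y)   # the forward result is saved and version-checked
        return y

    @staticmethod
    def backward(ctx, grad_out):
        (y,) = ctx.saved_tensors   # d/dx exp(x) = exp(x): reuse the saved output
        return grad_out * y

class MyClone(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.clone()           # nothing saved: the backward is just the identity

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out

x = torch.rand(3, requires_grad=True)
MyExp.apply(x).sum().backward()    # needs the saved output during the backward
MyClone.apply(x).sum().backward()  # nothing saved, so nothing can be invalidated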

I thought those were not recorded by autograd.

All the ops (except if you use .detach() or torch.no_grad()) are recorded. Otherwise, we wouldn’t be able to compute the correct gradients.

So what is wrong with this then?

>>> a.mul_(3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

Just record the mul as usual. But PyTorch is warning me of something… What is wrong with using in-place ops on leaves? I always thought in-place ops were bad because they modify the data of a tensor directly, bypassing the graph tracking that autograd does, but clearly I was very, very wrong.

I’m confused as ever X’D

No, in-place ops work just fine with autograd in PyTorch :slight_smile:

The limitation is that if you overwrite a value that was needed for the backward, you will get an error and you will have to replace the op with an out-of-place one.
The other limitation is that in-place ops on leaves that require grad are not allowed. Such a Tensor requires grad but doesn’t have any history; it is usually a Tensor whose .grad field the user will read after calling .backward(). But if you do an in-place op on it, the Tensor will contain new values and thus have an autograd history (and won’t be a leaf anymore). That means its .grad field won’t be populated, which is most likely not what the user wants. Hence the error that you see.
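A small sketch of the difference (toy values, just to illustrate):

import torch

a = torch.tensor([1., 2., 3.], requires_grad=True)
# a.mul_(3)            # RuntimeError: a leaf Variable that requires grad ...

b = a * 2              # non-leaf: it already has a history
b.add_(1)              # fine: the in-place add is recorded in the graph
b.sum().backward()
print(a.grad)          # tensor([2., 2., 2.])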

Oh I see! Basically because it becomes a non-leaf. That makes sense.

I get the gist now. In-place ops are bad when they overwrite data that PyTorch needs to compute derivatives correctly. That makes sense.

Thanks so much!

The clone thing is still confusing to me but I need to reflect on it to figure out what exactly is confusing me.
