Clone and detach in v0.4.0

Does this mean that it also removes the flow of gradients down its path? i.e. does it remove an edge from the graph, and thus a path for backprop to flow through?

It is a temporary restriction to prevent people from relying on impls for large arrays on a stable compiler just yet.

In the rare chance that we need to give up on const generics and rip it out of the compiler, it avoids getting into a situation where users are already relying on e.g. IntoIterator for [T; 262143] and we have no way to support them.

When const generics are finally stabilized the restriction would be removed and the impl would apply to arrays of arbitrarily large size.

My apologies ziyad, but I am having a hard time following what you are responding to. Are you responding to my question about gradient flows?

Yes, compare printed graphs:

# http://www.bnikolic.co.uk/blog/pytorch-detach.html

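import torch
from torchviz import make_dot

x = torch.ones(10, requires_grad=True)
weights = {'x': x}

# Each assignment below overwrites y; keep only the variant you want to plot.
y = ((x**2)**3)            # no detach: gradients flow from y back to x
y = ((x**2).detach()**3)   # detach in the middle: the path from y to x is cut
y = ((x**2)**3).detach()   # detach at the end: y is a constant for autograd

z = (x + 3) + 4
r = (y + z).sum()

make_dot(r, params=weights)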
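
For readers without torchviz installed, the same comparison can be made numerically by looking at x.grad instead of the plotted graph. This is only a minimal sketch and not part of the original post; the helper name grad_of_r and the variant labels are hypothetical.

```python
import torch

def grad_of_r(variant):
    """Return dr/dx for one of the three definitions of y (hypothetical helper)."""
    x = torch.ones(10, requires_grad=True)
    if variant == "no_detach":
        y = (x**2)**3
    elif variant == "detach_middle":
        y = (x**2).detach()**3
    else:  # "detach_end"
        y = ((x**2)**3).detach()
    z = (x + 3) + 4
    r = (y + z).sum()
    r.backward()
    return x.grad

g_full = grad_of_r("no_detach")      # gradient flows through both y and z
g_mid = grad_of_r("detach_middle")   # only the z path contributes
g_end = grad_of_r("detach_end")      # only the z path contributes
print(g_full, g_mid, g_end)
```

With x equal to ones, the undetached variant should give 6*x**5 + 1 per element, while both detached variants should give only the contribution of z.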

I am trying to understand what “shares storage” means. Does that mean that if I detach a tensor from its original and then call opt.step() to modify the original, both would change?

i.e.

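import torch
from torch import optim

a = torch.tensor([1,2,3.], requires_grad=True)
b = a.detach()
opt = optim.Adam([a], lr=0.01)  # optimizers take an iterable of tensors
a.sum().backward()  # backward() needs a scalar, so reduce first
opt.step() # changes both because they share storage?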

So both change because they share storage? That seems an odd semantics to have or I am missing the point of .detach().


here is the migration guide Richard mentioned:


Related useful links:


Hi,

Yes, .detach() gives a new Tensor that is a view of the original one. So any inplace modification of one will affect the other.
You should use .clone() if you want a Tensor with the same content backed by new memory.
And .detach().clone() if you want a new Tensor backed by new memory that does not share the autograd history of the original one.
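
The three options can be compared directly by checking where each tensor’s data lives. The following is a minimal sketch, assuming a small CPU tensor; the variable names are only for illustration.

```python
import torch

a = torch.tensor([1., 2., 3.], requires_grad=True)

detached = a.detach()             # same memory, no autograd history
cloned = a.clone()                # new memory, still part of the graph
independent = a.detach().clone()  # new memory, no autograd history

# data_ptr() gives the address of the first element of the underlying storage.
shares_detached = detached.data_ptr() == a.data_ptr()
shares_cloned = cloned.data_ptr() == a.data_ptr()
shares_independent = independent.data_ptr() == a.data_ptr()
print(shares_detached, shares_cloned, shares_independent)

detached.fill_(2)   # in-place write through the detached tensor
print(a)            # the original sees the change
print(independent)  # the independent copy does not
print(cloned.requires_grad, independent.requires_grad)
```

Only the plain .detach() result shares memory with a, which is why the fill_ shows up in a but not in the copies.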


That was extremely useful. I didn’t actually need anything specific, I was just playing around with the different ops, but those are fantastic use cases to mention.

The thing I was shocked to discover is that I can bypass the safety check of in-place operations if I modify the detached version. Look:

def error_unexpected_way_to_by_pass_safety():
    import torch
    a = torch.tensor([1,2,3.], requires_grad=True)
    # are detached tensors leaves? yes, they are
    a_detached = a.detach()
    # a.fill_(2)  # illegal: in-place op on a leaf tensor that requires grad (the op would not be recorded in the graph, so the derivative of the forward path would be wrong)
    a_detached.fill_(2)  # weird that this one is allowed; it seems to bypass the error check from the previous comment...?!
    print(f'a = {a}')
    print(f'a_detached = {a_detached}')

Is this a PyTorch bug @albanD? I was shocked to see that this is allowed while the first one is not, since the second seems to be a way to cheat the first.


Why am I able to change the value of a tensor without the computation graph knowing about it in Pytorch with detach? https://stackoverflow.com/questions/62415251/why-am-i-able-to-change-the-value-of-a-tensor-without-the-computation-graph-know

Can you explain what this means exactly?

Does “new tensor” mean I get a reference to a new instance of the Tensor class? And if it is a view of the “original one”, does that mean that this new tensor instance internally holds a pointer/reference to the other tensor’s data, where the actual memory lies?

If you try to backward you will get an error. It is just harder to detect during the forward. We only detect it during the backward.

the .backward() did NOT catch the in-place operation on a tensor that is in the forward computation graph :cry: @albanD

def error_unexpected_way_to_by_pass_safety():
    import torch 
    a = torch.tensor([1,2,3.], requires_grad=True)
    # are detached tensors leaves? yes, they are
    a_detached = a.detach()
    # a.fill_(2)  # illegal: in-place op on a leaf tensor that requires grad (the op would not be recorded in the graph, so the derivative of the forward path would be wrong)
    a_detached.fill_(2)  # weird that this one is allowed; it seems to bypass the error check from the previous comment...?!
    print(f'a = {a}')
    print(f'a_detached = {a_detached}')
    a.sum().backward()

output:

a = tensor([2., 2., 2.], requires_grad=True)
a_detached = tensor([2., 2., 2.])

Is this a bug? It’s very surprising; I expected .backward() to catch it too.

Sorry, I read it too quickly.
.detach() returns a Tensor that is detached from the autograd history and does not require gradients. So there is nothing wrong with modifying this Tensor inplace.

My original answer applies to view operations that are differentiable.

It’s modifying the original tensor in the graph! Why do you think that’s ok? I just changed the memory of a without autograd knowing about it. Isn’t that bad?

The whole point of .detach() and torch.no_grad() is to be able to do that.
If you want your ops to be differentiable, you shouldn’t use these constructs.
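
A common case where hiding an in-place change from autograd is intended is a hand-written parameter update. The sketch below is an illustration under assumed values (the tensor w, the toy loss and the learning rate lr are not from the thread), not a description of how any particular optimizer is implemented.

```python
import torch

w = torch.tensor([1., 2., 3.], requires_grad=True)
loss = (w ** 2).sum()
loss.backward()             # w.grad is now 2 * w

lr = 0.1                    # assumed learning rate, for illustration only
with torch.no_grad():
    w -= lr * w.grad        # in-place update that autograd does not record
w.grad.zero_()              # reset the gradient for the next iteration

print(w)                    # still a leaf that requires grad
```

Without the torch.no_grad() block, the in-place subtraction on a leaf that requires grad would be rejected, which is the check discussed above.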

but it’s modifying the original tensor…that’s what I’m confused about.

I thought .detach() was about cutting off flow of gradients to the original tensors in the computation graph but still being able to use the data of the original (without modifying it in sneaky evil ways that don’t inform the original computation graph).

I guess I don’t see why .detach() doesn’t simply behave like .detach().clone(). I don’t see a use for plain .detach(), especially if it has these unexpected behaviors.

In most cases, you don’t want to modify the detached Tensor inplace. And so doing a clone would be much slower for no added benefit.
The doc mentions that this is a view very clearly.

I appreciate your help. I hope you appreciate I am trying to help because I am genuinely confused.

The docs say for detach:

NOTE: Returned Tensor shares the same storage with the original one. In-place modifications on either of them will be seen, and may trigger errors in correctness checks. IMPORTANT NOTE: Previously, in-place size / stride / storage changes (such as resize_ / resize_as_ / set_ / transpose_) to the returned tensor also update the original tensor. Now, these in-place changes will not update the original tensor anymore, and will instead trigger an error. For sparse tensors: In-place indices / values changes (such as zero_ / copy_ / add_) to the returned tensor will not update the original tensor anymore, and will instead trigger an error.

the part that confuses me is:

In-place modifications on either of them will be seen, and may trigger errors in correctness checks.

So it says “may”, which implies that it does not always throw errors. I am just trying to understand why it did not throw an error in a case I thought was an unambiguous error. Since it’s not an error, there must be a flaw in my understanding of how to use .detach(), which worries me.

What is not clearly explained is when it does and does not throw an error. “May trigger errors” is ambiguous. What I am after here is the formal semantics of when errors are thrown for in-place operations when one uses detach.

I think the confusion is what “correctness checks” are.
If the user changes the values of a Tensor inplace and then uses it, we don’t consider that to be an error. If you change some values while explicitly hiding it from the autograd with .detach(), we assume you have a good reason to do so.

What happens though is that the forward pass needs to save some Tensor values to be able to compute the backward pass. If you modify one of these saved Tensors before running the backward, then in that case, you will get an error. Because the original value was needed to compute the right gradients and it does not exist anymore (was modified inplace).
For example here, the output of exp() is required in the backward, so if we modify it inplace, you get an error:

>>> a = torch.rand(10, requires_grad=True)
>>> b = a.exp()
>>> b.mul_(2)
tensor([5.0635, 3.1801, 2.5123, 2.1725, 2.6194, 2.4245, 4.1136, 5.2920, 3.9636,
        3.0117], grad_fn=<MulBackward0>)
>>> b.sum().backward()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/albandes/workspace/pytorch_dev/torch/tensor.py", line 183, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/Users/albandes/workspace/pytorch_dev/torch/autograd/__init__.py", line 125, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [10]], which is output 0 of ExpBackward, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
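
The same check also fires when the saved tensor is modified through a detached tensor, which is one case covered by the “may trigger errors in correctness checks” wording in the docs. A minimal sketch, reusing the exp() example; the flag name raised is only for illustration:

```python
import torch

a = torch.rand(10, requires_grad=True)
b = a.exp()            # exp() saves its output for the backward pass
b.detach().mul_(2)     # in-place change through a tensor sharing b's storage

try:
    b.sum().backward()
    raised = False
except RuntimeError as err:
    raised = True
    print(err)         # reports that a needed variable was modified inplace

print(raised)
```

Here the detached tensor hides the multiplication from the graph, but the saved output of exp() was still overwritten, so the backward pass refuses to run.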

Yes that is exactly what I expected! So if I have a new intermediate tensor and I try to act on it with any in-place operation that should throw an error, right?

Well I think I found a counter example. It doesn’t hold for .clone():

def clone_playground():
    import torch

    a = torch.tensor([1,2,3.], requires_grad=True)
    a_clone = a.clone()
    print(f'a is a_clone = {a is a_clone}')
    print(f'a == a_clone = {a == a_clone}')
    print(f'a = {a}')
    print(f'a_clone = {a_clone}')
    #a_clone.fill_(2)
    a_clone.mul_(2)
    print(f'a = {a}')
    print(f'a_clone = {a_clone}')
    a_clone.sum().backward()

output:

a is a_clone = False
a == a_clone = tensor([True, True, True])
a = tensor([1., 2., 3.], requires_grad=True)
a_clone = tensor([1., 2., 3.], grad_fn=<CloneBackward>)
a = tensor([1., 2., 3.], requires_grad=True)
a_clone = tensor([2., 4., 6.], grad_fn=<MulBackward0>)

why is .clone special? If it’s added to the autograd graph it should be just like any other op.

Because the output of clone is not required during the backward pass. So there is no reason to throw an error if it was changed inplace.
This is why I used exp() in my example.

What do you mean it’s not used in the backward pass?

I am explicitly adding the contents of the a_clone vector. It’s explicitly required in the backward because I am using it as the input to sum which I then take the derivative of:

a_clone.sum().backward()
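
One way to look at this is to check which values each backward formula actually needs. The sketch below reruns the clone example and inspects a.grad. It relies on the understanding that the backward of sum() only needs the shape of its input and that the backward of multiplying by a constant only needs the constant, so the values of a_clone are never read back.

```python
import torch

a = torch.tensor([1., 2., 3.], requires_grad=True)
a_clone = a.clone()
a_clone.mul_(2)            # recorded in the graph as a multiplication by 2
a_clone.sum().backward()   # runs without error

print(a.grad)              # d/da sum(2 * a) = 2 for every element
```

Compare this with exp(), whose derivative is its own output: there the overwritten values are exactly what the backward pass needs, hence the error.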