Confusion about using .clone

Both models work identically as seen here:

import torch

# Net and Net2 are the two model definitions from the original question
# (not shown here); they use the same parameter names, so the state_dict
# can be loaded directly, and each forward pass returns two outputs.
model1 = Net()
model2 = Net2()
model2.load_state_dict(model1.state_dict())

x = torch.randn(1, 3, 24, 24)

outputs1 = model1(x)
outputs2 = model2(x)

# Compare outputs
print((outputs1[0] == outputs2[0]).all())
print((outputs1[1] == outputs2[1]).all())

# Compare gradients
outputs1[0].mean().backward(retain_graph=True)
outputs1[1].mean().backward()
outputs2[0].mean().backward(retain_graph=True)
outputs2[1].mean().backward()

for p1, p2 in zip(model1.parameters(), model2.parameters()):
    print((p1.grad == p2.grad).all())

So to clarify: clone copies the data to a different memory location, but it does not interfere with gradient backpropagation. In other words, when I use clone, the gradient will backpropagate all the way to the input unless I use detach …


Yes. .clone() is recognized by Autograd and the new tensor will get grad_fn=<CloneBackward>.
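A minimal sketch (the exact grad_fn name can vary a bit between PyTorch versions):

import torch

x = torch.randn(3, requires_grad=True)
y = x.clone()
print(y.grad_fn)  # e.g. <CloneBackward0 object at 0x...>

# Gradients flow back through the clone to x
y.sum().backward()
print(x.grad)  # tensor([1., 1., 1.])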


Why is retain_graph=True used when outputs1[0].mean().backward() is called? Will this accelerate training?

Calling outputs1[0].mean().backward() would free the computation graph, which is still needed for outputs1[1].mean().backward(). That's why retain_graph=True is required; it doesn't accelerate training, it just keeps the graph alive for the second backward pass.
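A minimal standalone sketch of what retain_graph=True does here (any small module works; nn.Linear is just used for illustration):

import torch
import torch.nn as nn

lin = nn.Linear(3, 2)
out = lin(torch.randn(1, 3))  # one forward pass, two backward calls on it

out.mean().backward(retain_graph=True)  # keep the graph alive
out.sum().backward()                    # second backward through the same graph works

# Without retain_graph=True on the first call, the second backward would raise
# "RuntimeError: Trying to backward through the graph a second time ...".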

Does this mean that there is no difference between clone() and = in terms of gradient copying?


I’m not sure I understand your question correctly.
If you are referring to the original question: neither input nor input2 was manipulated after the assignment, so both model architectures yield the same results.

I mean: when we assign a differentiable variable to another one with =, will this also carry the gradients and the whole computation graph over to the newly defined variable? Or should we use clone() to do this explicitly?


The new variable will reference the old one, so that both will see the value changes and gradients:

import torch
import torch.nn as nn

lin = nn.Linear(1, 1)
w = lin.weight  # plain assignment: w references the same tensor as lin.weight

lin(torch.randn(1, 1)).backward()

print(lin.weight.grad)
print(w.grad)                   # same gradient, since it is the same tensor
print(id(w) == id(lin.weight))  # True
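For contrast, a clone() taken at that point would be an independent snapshot, so later changes to lin.weight would not show up in it (a small sketch continuing the snippet above):

w_ref = lin.weight           # reference: follows all future changes
w_snap = lin.weight.clone()  # copy: frozen at the current values

with torch.no_grad():
    lin.weight.add_(1.)      # e.g. what an optimizer step would do

print(torch.equal(w_ref, lin.weight))   # True, it is the same tensor
print(torch.equal(w_snap, lin.weight))  # False, the clone kept the old values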

I never understood this: what is the point of recording .clone() as an operation? It's extremely unintuitive to me. When I see clone I expect something like a deep copy, i.e. getting a fresh new copy of the old tensor. Having copying as an operation in the forward pass (which acts like the identity) but not calling it the identity is extremely confusing.

Am I correct or do I not understand how .clone() works?

I don’t understand what the use case for it is really.


clone can be used e.g. on activations, which should be passed to multiple modules, where each module might manipulate the activation in-place.
Here is a small example:

import torch
import torch.nn as nn

# Setup
module1 = nn.Sequential(
    nn.ReLU(inplace=True),
    nn.Linear(10, 1))

module2 = nn.Sequential(
    nn.Linear(10, 2))


torch.manual_seed(2809)
act = torch.randn(1, 10)
print(act)

# Wrong, since act will be modified inplace
out1 = module1(act)
print(act)
out2 = module2(act)

# Right
torch.manual_seed(2809)
act = torch.randn(1, 10)
print(act)

# Right: only the clone is modified inplace, so act stays intact
out1 = module1(act.clone())
print(act)
out2 = module2(act)

If you don’t clone the activation, the first module would apply the relu on it and the call to module2 would get the wrong tensor.

If act was created by previous operations (layers), Autograd will properly calculate all gradients.


Let me see if I can paraphrase to see if I got it.

.clone() is useful to create a copy of the original variable that doesn't forget the history of ops, so it allows gradient flow and avoids errors with in-place ops. The main error with in-place ops is overwriting data needed for the backward pass (or applying an in-place op to a leaf node; in that case there would be no error message).


Yes, that is correct.
While my example stressed the inplace operation, clone() might of course be useful in other use cases and for specific model architectures (the inplace ops were just the first use case that came to my mind).

If the inplace op is not allowed, PyTorch should raise an error and should not silently fail.
If you encounter such a silent error, please let us know.
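For reference, a minimal sketch of a case where this error is raised (the exact message can differ between versions):

import torch

x = torch.randn(3, requires_grad=True)
y = x.exp()   # exp saves its output for the backward pass
y.add_(1.)    # in-place change of that saved output

# y.sum().backward() would now raise:
# "RuntimeError: one of the variables needed for gradient computation
#  has been modified by an inplace operation"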


Thanks for the explanation! I was wondering if you could give an example of other use cases for cloning besides the inplace operation example. I was trying to think of other cases where it may be necessary/beneficial but couldn’t think of any.

Generally, clone is useful whenever you are dealing with references and would like to use the current value without any potential future changes.
E.g. if you would like to compare values pulled from a state_dict, you would have to use clone() to create the reference values. Otherwise they would be updated in the next optimizer.step() call and your initially stored values would change as well.
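A small sketch of that scenario (the model, optimizer, and learning rate are just placeholders):

import torch
import torch.nn as nn

model = nn.Linear(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

w_ref = model.state_dict()['weight']            # shares storage with the parameter
w_saved = model.state_dict()['weight'].clone()  # independent snapshot

model(torch.randn(1, 2)).sum().backward()
optimizer.step()

print(torch.equal(w_ref, model.weight))    # True: the reference changed with the update
print(torch.equal(w_saved, model.weight))  # False (assuming a nonzero gradient): the clone kept the initial values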

That being said, in a “standard CNN training routine” you probably wouldn’t need to call it.


Thank you for your detailed explanation of the .clone() method.
I face an issue while training a stack of 2 UNet modules. The input to the network is a batch consisting of a single image and some geometric transformations of it. Then I apply the inverse of the augmentation (F⁻¹) to the images and compute loss #1. I also have loss #2 at the end of the network.

(x + F(x)) --> Unet1 --> loss#1(y + F⁻¹(y)) --> Unet2 --> loss#2(z + F⁻¹(z))

In general, I got this error:
“one of the variables needed for gradient computation has been modified by an inplace operation”.
I resolved this error by using y.clone() and then applying the inverse geometric transformation.

(x + F(x)) --> Unet1 --> loss#1(y + F⁻¹(y.clone())) --> Unet2 --> loss#2(z + F⁻¹(z.clone()))

Now it seems to work well, yet I am suspicious about it. Is it OK or not? Does the gradient flow back to Unet1, and does the gradient from Unet2 flow back to Unet1?

Note: I use PyTorch functions for some of the geometric transformations, but not for others; for those, I converted the tensors to numpy.

Based on the error message it seems that F⁻¹ might apply some inplace operations on the input tensor, which replace values needed to calculate the gradients. A .clone() operation might solve this issue.
Nevertheless, you could check whether the desired gradients are calculated by calling backward() on intermediate outputs or just the final loss and inspecting the .grad attributes of all parameters which should get gradients.
If some of these .grad attributes return a None value, Autograd didn't calculate any gradients for them and your computation graph might have been broken.
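A quick sketch of such a check (the model here is just a placeholder):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 1))
out = model(torch.randn(2, 4))
out.mean().backward()

for name, param in model.named_parameters():
    if param.grad is None:
        print(f'{name}: no gradient - the graph to this parameter might be broken')
    else:
        print(f'{name}: grad norm {param.grad.norm().item():.4f}')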

That sounds risky, as this would cut the computation graph, since Autograd isn’t able to backpropagate through numpy operations. You would either need to stick to PyTorch methods or implement a custom autograd.Function as described here.
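A minimal sketch of such a custom autograd.Function, using a numpy exp purely as an illustration of the pattern:

import numpy as np
import torch

class NumpyExp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # leave PyTorch, compute in numpy, come back
        result = torch.from_numpy(np.exp(x.detach().cpu().numpy()))
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result  # d/dx exp(x) = exp(x)

x = torch.randn(3, dtype=torch.float64, requires_grad=True)
y = NumpyExp.apply(x)
y.sum().backward()
print(torch.allclose(x.grad, x.exp()))  # True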


I think that even though A.clone() has resolved the error, some of the gradients may be None. In fact, the network does not train properly and loss #2 just oscillates around a certain value. Maybe the computation graph is broken and the network is not actually being trained.
I will try to stick to PyTorch for the geometric transformations.

I believe one of the cases where this notion really makes sense is the skip connection or residual block, where you do not want the gradients of the residual path to get mixed up with the gradients of the main branch.

Nope, a more realistic application of clone is in seq2seq models, which involve more than one decoding step.

Look at this code from the huggingface BART seq2seq model:

There are two branches from input_ids: the first is input_ids itself, and the second is decoder_input_ids, which is produced by a shift operation that would modify input_ids in place. On the other hand, the forward function needs to keep the gradient for all input_ids elements, as well as for the input_ids elements inside decoder_input_ids.

So you should use clone on this occasion.
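A simplified sketch of that kind of shift (not the exact huggingface code, just to show where the clone matters):

import torch

def shift_tokens_right(input_ids, decoder_start_token_id):
    # Clone first, so the shift does not modify input_ids in place;
    # the unshifted input_ids are still needed by the other branch.
    shifted = input_ids.clone()
    shifted[:, 1:] = input_ids[:, :-1]
    shifted[:, 0] = decoder_start_token_id
    return shifted

input_ids = torch.tensor([[5, 6, 7, 2]])
decoder_input_ids = shift_tokens_right(input_ids, decoder_start_token_id=0)
print(input_ids)          # unchanged: tensor([[5, 6, 7, 2]])
print(decoder_input_ids)  # tensor([[0, 5, 6, 7]])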