Confusion about using .clone

isalirezag · March 12, 2019, 8:20pm

Consider these two nets:


import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        
        self.conv1 = nn.Conv2d(3, 6, kernel_size=1, stride=1, bias=False)
        self.conv2 = nn.Conv2d(6, 6, kernel_size=1, stride=1, bias=False)
        self.conv3 = nn.Conv2d(6, 6, kernel_size=1, stride=1, bias=False)
        
    def forward(self,input):
        input = F.relu(self.conv1(input))
        input2 = input
        
        output = F.relu(self.conv2(input))
        output2 = F.relu(self.conv3(input2))
        
        return output, output2

        
class Net2(nn.Module):
    def __init__(self):
        super(Net2, self).__init__()
        
        self.conv1 = nn.Conv2d(3, 6, kernel_size=1, stride=1, bias=False)
        self.conv2 = nn.Conv2d(6, 6, kernel_size=1, stride=1, bias=False)
        self.conv3 = nn.Conv2d(6, 6, kernel_size=1, stride=1, bias=False)
        
    def forward(self,input):
        input = F.relu(self.conv1(input))
        input2 = input.clone()
        
        output = F.relu(self.conv2(input))
        output2 = F.relu(self.conv3(input2))
        
        return output, output2

What is the difference between Net and Net2? Is the Net model wrong?

When I use clone in Net2 and backpropagate (calling backward on the loss between output2 and target), will the weights in conv1 be updated as well, or only the conv3 weights?

ptrblck · March 12, 2019, 11:43pm

Both models work identically as seen here:

model1 = Net()
model2 = Net2()
model2.load_state_dict(model1.state_dict())

x = torch.randn(1, 3, 24, 24)

outputs1 = model1(x)
outputs2 = model2(x)

# Compare outputs
print((outputs1[0] == outputs2[0]).all())
print((outputs1[1] == outputs2[1]).all())

# Compare gradients
outputs1[0].mean().backward(retain_graph=True)
outputs1[1].mean().backward()
outputs2[0].mean().backward(retain_graph=True)
outputs2[1].mean().backward()

for p1, p2 in zip(model1.parameters(), model2.parameters()):
    print((p1.grad == p2.grad).all())

isalirezag · March 12, 2019, 11:49pm

So to clarify: clone copies the data to new memory, but it does not interfere with gradient backpropagation. In other words, when I use clone, gradients still backpropagate all the way to the input unless I use detach …

ptrblck · March 12, 2019, 11:51pm

Yes. .clone() is recognized by Autograd and the new tensor will get the grad function as grad_fn=<CloneBackward>.
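
To make the contrast with detach concrete, here is a minimal sketch (the tensor names are made up for illustration). It assumes a recent PyTorch release, where the printed node name may carry a suffix such as CloneBackward0:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
h = w * 3                 # non-leaf tensor with autograd history

cloned = h.clone()        # new memory, still attached to the graph
detached = h.detach()     # shares memory with h, but has no history

print(cloned.grad_fn)     # a clone backward node (exact name depends on the version)
print(detached.grad_fn)   # None

# Backpropagating through the clone reaches w as if the clone were not there.
cloned.sum().backward()
print(w.grad)             # tensor([3., 3.])
```

So clone keeps the gradient path intact, while detach is the call that cuts it.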

txytju · May 14, 2019, 6:18am

Why is retain_graph=True used when outputs1[0].mean().backward() is called? Will this accelerate training?

MariosOreo · May 14, 2019, 6:37am

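Calling outputs1[0].mean().backward() frees the computation graph, which is still needed by outputs1[1].mean().backward(), because both outputs share the conv1 part of the graph. That is why the first call needs retain_graph=True. It is not about speed.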
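
A small sketch of that behaviour, using a toy graph with one shared intermediate instead of the models above (the helper shared_graph is hypothetical):

```python
import torch


def shared_graph():
    """Build two scalar outputs that share the intermediate h."""
    w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
    h = w * w                       # shared node; its backward needs the saved w
    return w, h.sum(), (2 * h).sum()


# Without retain_graph, the first backward frees the buffers of the shared node.
w, a, b = shared_graph()
a.backward()
try:
    b.backward()
    second_backward_failed = False
except RuntimeError as err:
    second_backward_failed = True
    print("second backward failed:", err)

# With retain_graph=True on the first call, both backward passes succeed
# and the gradients accumulate in w.grad: 2w + 4w = 6w.
w, a, b = shared_graph()
a.backward(retain_graph=True)
b.backward()
print(w.grad)
```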

Ahmed_m · May 21, 2019, 3:01pm

Does this mean that there is no difference between “clone()” and “=” in terms of copying the gradient?

ptrblck · May 21, 2019, 3:04pm

I’m not sure I understand your question correctly.
If you are referring to the original question, neither input nor input2 were manipulated after the assignment, so that both model architectures yield the same results.

Ahmed_m · May 21, 2019, 3:07pm

I mean: when we assign a differentiable variable to another one with “=”, will this also carry the gradients and the whole computation graph over to the newly defined variable? Or should we use “clone()” to do this explicitly?

ptrblck · May 21, 2019, 10:23pm

The new variable will reference the old one, so that both will see the value changes and gradients:

lin = nn.Linear(1, 1)
w = lin.weight

lin(torch.randn(1, 1)).backward()

print(lin.weight.grad)
print(w.grad)
print(id(w)==id(lin.weight))
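
For contrast, a short sketch of how plain assignment and clone() differ once the original tensor is changed in place (the variable names are illustrative only):

```python
import torch

a = torch.zeros(3)
alias = a             # plain assignment: a second name for the same tensor
copy = a.clone()      # new tensor with its own memory

a.add_(1)             # in-place change of the original

print(alias is a)                        # True
print(alias)                             # tensor([1., 1., 1.])
print(copy)                              # tensor([0., 0., 0.])
print(a.data_ptr() == copy.data_ptr())   # False
```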

pinocchio · June 16, 2020, 4:32pm

I never understood this: what is the point of recording .clone() as an operation? It’s extremely unintuitive to me. When I see clone, I expect something like a deep copy that gives me a fresh new version (copy) of the old tensor. Having copying as an operation in the forward pass (like using the identity) but not calling it the identity is extremely confusing.

Am I correct or do I not understand how .clone() works?

I don’t understand what the use case for it is really.

ptrblck · June 16, 2020, 7:09pm

clone can be used e.g. on activations, which should be passed to multiple modules, where each module might manipulate the activation in-place.
Here is a small example:

# Setup
module1 = nn.Sequential(
    nn.ReLU(inplace=True),
    nn.Linear(10, 1))

module2 = nn.Sequential(
    nn.Linear(10, 2))


torch.manual_seed(2809)
act = torch.randn(1, 10)
print(act)

# Wrong, since act will be modified inplace
out1 = module1(act)
print(act)
out2 = module2(act)

# Right, since module1 receives a copy and act stays untouched
torch.manual_seed(2809)
act = torch.randn(1, 10)
print(act)

out1 = module1(act.clone())
print(act)
out2 = module2(act)

If you don’t clone the activation, the first module would apply the relu on it and the call to module2 would get the wrong tensor.

If act was created by previous operations (layers), Autograd will properly calculate all gradients.
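
As a rough sketch of that last point, assume act comes out of a hypothetical upstream layer (layer0 is not part of the example above). The clone protects act from the in-place ReLU, and gradients from both branches still reach layer0:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer0 = nn.Linear(4, 10)   # hypothetical upstream layer producing the activation
module1 = nn.Sequential(nn.ReLU(inplace=True), nn.Linear(10, 1))
module2 = nn.Sequential(nn.Linear(10, 2))

act = layer0(torch.randn(1, 4))     # non-leaf tensor with history
snapshot = act.detach().clone()     # reference values for the check below

out1 = module1(act.clone())         # the in-place ReLU only touches the copy
out2 = module2(act)                 # module2 sees the original activation

unchanged = torch.equal(act.detach(), snapshot)
print(unchanged)                    # True

(out1.sum() + out2.sum()).backward()
has_grad = layer0.weight.grad is not None
print(has_grad)                     # True
```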

pinocchio · June 17, 2020, 6:42pm

Let me see if I can paraphrase to see if I got it.

.clone() is useful for creating a copy of the original variable that doesn’t forget the history of ops, so that gradients can still flow and errors with in-place ops are avoided. The main error with in-place ops is overwriting data needed for the backward pass (or applying an in-place op to a leaf node; in this case there would be no error message).

ptrblck · June 18, 2020, 5:37am

Yes, that is correct.
While my example emphasized the inplace operation, clone() might of course be useful in other use cases and specific model architectures (the inplace ops were just the first use case that came to my mind).

If the inplace op is not allowed, PyTorch should raise an error and should not silently fail.
If you encounter such a silent error, please let us know.
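
A minimal sketch of the error case, assuming exp as the op whose backward needs its own output (any op that saves its result would behave the same way):

```python
import torch

# In-place change of a tensor that autograd saved for the backward pass.
w = torch.tensor([0.5, 1.0], requires_grad=True)
y = w.exp()          # the backward of exp reuses its output y
y.add_(1)            # overwrites y in place
try:
    y.sum().backward()
    raised = False
except RuntimeError as err:
    raised = True
    print("autograd raised:", err)

# Cloning first leaves the saved tensor untouched.
w = torch.tensor([0.5, 1.0], requires_grad=True)
y = w.exp()
z = y.clone()
z.add_(1)            # only the copy is modified
z.sum().backward()
print(w.grad)        # equals exp(w)
```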

Alex-Fabbri · July 19, 2020, 1:57pm

Thanks for the explanation! I was wondering if you could give an example of other use cases for cloning besides the inplace operation example. I was trying to think of other cases where it may be necessary/beneficial but couldn’t think of any.

ptrblck · July 20, 2020, 12:59am

Generally, clone is useful whenever you are dealing with references and would like to use the current value without any potential future changes.
E.g. if you would like to compare values pulled from a state_dict, you would have to use clone() to create the reference values. Otherwise they would be updated in the next optimizer.step() call and your initially stored values would change as well.

That being said, in a “standard CNN training routine” you probably wouldn’t need to call it.
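
The state_dict case could look roughly like this; the tiny model and the SGD settings are placeholders chosen only for the illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

ref_alias = model.state_dict()["weight"]           # references the parameter's memory
ref_clone = model.state_dict()["weight"].clone()   # frozen copy of the current values

model(torch.ones(1, 2)).sum().backward()
optimizer.step()                                   # updates the weight in place

alias_follows_update = torch.equal(ref_alias, model.weight.detach())
clone_kept_old_values = not torch.equal(ref_clone, model.weight.detach())
print(alias_follows_update)    # True
print(clone_kept_old_values)   # True
```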

SEM · September 4, 2020, 9:07am

Thank you for the detailed explanation of the .clone() method.
I am facing an issue while training a stack of two U-Net modules. The input to the network is a batch consisting of a single image and some geometric transformations of it. I then apply the inverse of the augmentation (F^-1) to the images and compute loss #1. I also have loss #2 at the end of the network.

(x + F(x)) --> Unet1 --> loss#1(y+ F^-1(y)) --> Unet2 --> loss#2(z+ F^-1(z))

In general, I got this error:
“one of the variables needed for gradient computation has been modified by an inplace operation”.
I resolved this error by using y.clone() and then applying the inverse geometric transformation.

(x + F(x)) --> Unet1 --> loss#1(y+ F^-1(y.clone())) --> Unet2 --> loss#2(z+ F^-1(z.clone()))

Now it seems to be working well, yet I am suspicious about it. Is it OK or not? Does the gradient flow back to Unet1, and does the gradient from Unet2 flow back to Unet1?

Note: I use PyTorch functions for some of the geometric transformations, but not for others. For those, I convert the tensors to numpy.

ptrblck · September 4, 2020, 9:34am

Based on the error message it seems that F^(-1) might apply some inplace operations on the input tensor, which replace values needed to calculate the gradients. A .clone() operation might solve this issue.
Nevertheless, you could check, if the desired gradients are calculated by calling backward() on intermediate outputs or just the final loss and checking the .grad attributes of all parameters, which should get gradients.
If some of these .grad attributes return None, Autograd didn’t calculate any gradients for them and your computation graph might be broken.
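
Such a check could be sketched as follows. The two linear layers are hypothetical stand-ins for the two stages, and the detach() call deliberately breaks the graph so that the check has something to report:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.ModuleDict({
    "net1": nn.Linear(4, 4),   # stand-in for the first stage
    "net2": nn.Linear(4, 1),   # stand-in for the second stage
})

x = torch.randn(2, 4)
y = model["net1"](x)
z = model["net2"](y.detach())  # detach() cuts the graph between the stages
z.sum().backward()

# Parameters whose .grad is still None never received a gradient.
missing = [name for name, p in model.named_parameters() if p.grad is None]
print(missing)                 # ['net1.weight', 'net1.bias']
```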

Regarding the conversion to numpy: that sounds risky, as this would cut the computation graph, since Autograd isn’t able to backpropagate through numpy operations. You would either need to stick to PyTorch methods or implement a custom autograd.Function as described here.
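
As an illustration only, here is a sketch of a custom autograd.Function that wraps a numpy operation. It assumes the transformation is a horizontal flip, which is its own inverse, so the backward pass is simply another flip; NumpyHFlip is a made-up name, and a real transformation would need its own hand-written backward:

```python
import numpy as np
import torch


class NumpyHFlip(torch.autograd.Function):
    """Horizontal flip computed in numpy, with a hand-written backward."""

    @staticmethod
    def forward(ctx, inp):
        # np.flip returns a view with negative strides, so copy before converting.
        flipped = np.flip(inp.detach().cpu().numpy(), axis=-1).copy()
        return torch.from_numpy(flipped).to(inp.device)

    @staticmethod
    def backward(ctx, grad_output):
        # The flip only reorders elements, so the gradient is reordered back.
        return torch.flip(grad_output, dims=[-1])


x = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]], requires_grad=True)
weights = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

# A plain numpy round trip is not connected to x anymore.
cut = torch.from_numpy(np.flip(x.detach().numpy(), axis=-1).copy())
print(cut.requires_grad)   # False

# The custom Function keeps the result in the graph.
y = NumpyHFlip.apply(x)
(y * weights).sum().backward()
print(x.grad)              # weights flipped along the last dimension
```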

SEM · September 4, 2020, 9:49am

I think that, although A.clone() resolved the error, some of the gradients may be None. In fact, the network does not train properly and loss #2 just oscillates around a certain value. Maybe the computation graph is broken and the network is not actually being trained.
I will try to stick to PyTorch for the geometric transformations.

behzadtabari · September 25, 2023, 1:54pm

I believe one case where this really makes sense is a skip connection or residual network block, where you do not want the gradients of the residual path to be mixed up with the gradients of the training block.