Detach, no_grad and requires_grad

detach() detaches the output from the computational graph, so no gradient will be backpropagated along this tensor.
torch.no_grad() says that no operation inside the block should build the graph.

The difference is that detach() applies only to the single tensor it is called on, while torch.no_grad() affects all operations taking place within the with statement.
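A minimal sketch of the difference (tensor shapes are made up):

```python
import torch

x = torch.ones(3, requires_grad=True)

# detach(): only the tensor it is called on is cut from the graph.
y = (x * 2).detach()   # y has no history
z = x * 2              # z is still part of the graph

# no_grad(): nothing inside the block records history at all.
with torch.no_grad():
    w = x * 2
```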


Good explanation. Can you give some use cases for both? I am guessing you’d use torch.no_grad during the eval phase? But what would you use detach() for? Any specific use cases would help in understanding them better.


torch.no_grad: yes, you can use it in the eval phase in general.

detach(), on the other hand, should not be needed if you’re doing classic CNN-like architectures. It is usually used for trickier operations.
detach() is useful when you want to compute something that you can’t / don’t want to differentiate through. For example, if you’re computing some indices from the output of the network and then want to use them to index a tensor: the indexing operation is not differentiable with respect to the indices, so you should detach() the indices before using them.
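A small sketch of that pattern (layer sizes and the lookup table are made up). Note that argmax already returns a plain integer tensor; the explicit detach() just makes the intent clear:

```python
import torch

torch.manual_seed(0)
net = torch.nn.Linear(8, 5)
scores = net(torch.randn(2, 8))          # differentiable output

# Indices computed from the network output: indexing is not
# differentiable wrt them, so make sure they carry no history.
idx = scores.argmax(dim=1).detach()

table = torch.randn(5, 4, requires_grad=True)
rows = table[idx]                        # differentiable wrt table, not idx
rows.sum().backward()
```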


Thank you albanD

It definitely helps!

So, if I understand this part properly (please, anyone correct me if I’m wrong), we can use both methods interchangeably and they should give the same result:


        with torch.no_grad():
            return policy_net(state).max(1)[1].view(1, 1)

should have a similar effect to

        return policy_net(state).detach().max(1)[1].view(1, 1)

Thanks again for your help!


The returned result will be the same, but the version with torch.no_grad will use less memory because it knows from the beginning that no gradients are needed, so it doesn’t need to keep intermediary results.
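This is visible in the graph metadata: with a trailing detach() the graph (and its saved intermediates) is still built first, while under no_grad nothing is ever recorded. A quick sketch:

```python
import torch

x = torch.randn(4, requires_grad=True)

# detach() at the end: the whole graph was still built, and its
# intermediate buffers were saved for a backward that never comes.
y = (x * 2).exp()
out_detached = y.detach()

# no_grad(): no graph is recorded, nothing is saved for backward.
with torch.no_grad():
    y2 = (x * 2).exp()
```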


You’re right, and this is a huge difference.

Thank you

all clear now


@albanD Hi, does detach() erase the existing weights?

So if we do:

policy_loss = …

For some reason when I was doing this in a loop the weights of the policy.parameters() [connected to the policy optimizer] stopped changing. I don’t understand the mechanism for why.

The only thing detach() does is return a new Tensor (it does not change the current one) that does not share the history of the original Tensor.
If the only thing you added to your code is policy_loss.detach(), then it does not do anything to your code: you don’t use the result, and detach() does not change anything else.


@albanD I see. Any pointers as to why the weights aren’t changing? I notice that first the biases stop changing, then more and more hidden layers stop changing incrementally.

The gradients were on the order of 1e-2, so not super small, but perhaps combined with the 3e-4 learning rate the update becomes numerically ~0 and is ignored?

It looks like this can happen: that would give you an update around 1e-6, which is where float numbers start to become meaningless.
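To see the mechanism, here is a small sketch in pure Python. It uses struct to round to float32 precision (to_f32 is a hypothetical helper; the exact threshold depends on the weight’s magnitude, since float32 has about 7 significant decimal digits):

```python
import struct

def to_f32(v):
    """Round a Python float to float32 precision (hypothetical helper)."""
    return struct.unpack('f', struct.pack('f', v))[0]

weight = 1.0
update = 1e-8   # well below float32 spacing (~6e-8) around 1.0
new_weight = to_f32(weight - update)   # the update is rounded away entirely
```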

Assume I have a network in two parts like:

def forward(self, x):
  x = self.part_1(x)
  return self.part_2(x)

If I want to update only the second part during the training phase, will this modification do the desired thing?

def forward(self, x):
  with torch.no_grad():
    x = self.part_1(x)
  return self.part_2(x)

Yes, your code will work.
Alternatively, you could call .detach() on x before passing it to part_2:

def forward(self, x):
    x = self.part_1(x)
    return self.part_2(x.detach())
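A quick way to check that part_1 is really frozen (the layer sizes below are made up):

```python
import torch

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.part_1 = torch.nn.Linear(4, 4)
        self.part_2 = torch.nn.Linear(4, 2)

    def forward(self, x):
        x = self.part_1(x)
        return self.part_2(x.detach())   # cut the graph before part_2

net = Net()
out = net(torch.randn(3, 4))
out.sum().backward()
```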

In torch >= 1.0.0, what is the correct way to copy the data of a tensor so that the copy “is just a copy” (i.e. a standalone tensor that doesn’t know anything about graphs, gradients etc.)?

x = my_tensor.detach().clone()
x = my_tensor.clone().detach()

# Now, the two lines above, but augmented with .data in all possible places, e.g.:
x = my_tensor.detach().data.clone()
x = my_tensor.clone().detach().data

It looks like all of these options solve the problem, don’t they? If that’s true, then the options with .data are all redundant, and the question is .detach().clone() vs .clone().detach().
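A quick check of what the copy looks like, assuming detach-then-clone (this order is slightly cheaper, since the clone is then not recorded in the graph; the legacy .data attribute adds nothing here):

```python
import torch

t = torch.ones(3, requires_grad=True) * 2   # a tensor with history

copy = t.detach().clone()   # standalone: no graph, no shared storage

copy[0] = 99.0              # modifying the copy...
original_first = t[0].item()   # ...leaves the original untouched
```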

One more case:

def forward(self, x):
  x = self.root(x)
  out1 = self.branch_1(x)
  out2 = self.branch_2(x.detach())
  return out1, out2

loss = F.mse_loss(out1, target1) + F.mse_loss(out2, target2)

I want the gradients from branch_1 to update the parameters of the root and branch_1, but the gradients of the second branch should only update the branch_2 parameters. I guess the detach() usage will make this happen, but I want to be sure?



Yes, this will do what you want!
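You can verify it by backpropagating each branch separately (layer sizes and targets are made up):

```python
import torch
import torch.nn.functional as F

root = torch.nn.Linear(4, 4)
branch_1 = torch.nn.Linear(4, 2)
branch_2 = torch.nn.Linear(4, 2)

x = root(torch.randn(5, 4))
out1 = branch_1(x)
out2 = branch_2(x.detach())    # cut: no gradient reaches root from here

# Backprop only the detached branch: root receives nothing.
F.mse_loss(out2, torch.zeros(5, 2)).backward()
root_grad_after_branch_2 = root.weight.grad   # still None

# Backprop branch_1: now root gets its gradients.
F.mse_loss(out1, torch.zeros(5, 2)).backward()
```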


For example, in Faster R-CNN, gradients do not backprop through the proposals; therefore, once we have computed the proposals in the RPN, we can detach them and use the detached tensors (RoIs) for the rest of the network.

Another use case is auxiliary losses: if we do not detach the network at the proper points (just after these auxiliary heads), the gradients of some parameters will be accumulated multiple times, which we may not want.

So, if I do .detach() on a variable, then that variable will never be in the computation graph, right?

The returned Tensors will not.
The original Tensor you called detach() on remains unchanged.


Hi @albanD, could you please help me understand why torch.no_grad should consume less memory?

I understand that every operation’s output inside the torch.no_grad block will have requires_grad=False, no matter what the inputs’ requires_grad looks like.
This essentially means no intermediate tensors will be saved, which saves memory; but how is detach() different in this sense?

A detached tensor also has requires_grad=False.
This also means no gradients need to be backpropagated, so no intermediate tensors are saved.

What am I missing? I’m also thinking the memory consumption comparison between the two should depend on the specific code at hand, or is there a general comparison as well?

They really are the same. Just that if you use .detach(), you have to do it for every op, while with the context manager you can disable autograd for a whole block.
So you should use one or the other depending on what is most convenient for your particular use case.
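A sketch of the two styles side by side; note that whenever a graph-tracked tensor re-enters a computation, the per-op style needs another detach():

```python
import torch

x = torch.randn(4, requires_grad=True)

# Per-op: each result that mixes in a tracked tensor must be detached again.
a = (x * 2).detach()
b = (a + x).detach()   # x re-enters, so detach again

# Context manager: one block covers everything inside.
with torch.no_grad():
    c = x * 2
    d = c + x
```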