Does this post answer your question?
Thank you for your response.
The answer you mentioned helped me understand the general idea, but I still don't fully understand the difference between torch.no_grad and detach().
Can I use them interchangeably, or are there any limitations?
detach() detaches the output from the computational graph, so no gradient will be backpropagated along this variable.
torch.no_grad says that no operation should build the graph.
The difference is that detach() applies only to the given variable it's called on, while torch.no_grad affects all operations taking place within the block.
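To make the scope difference concrete, here's a minimal sketch using plain tensors (the variable names are just for illustration):

```python
import torch

x = torch.ones(3, requires_grad=True)

# detach() cuts only the tensor it is called on out of the graph:
y = (x * 2).detach()   # y has no history
z = x * 3              # other operations still record history
print(y.requires_grad, z.requires_grad)  # False True

# torch.no_grad() suppresses graph building for everything inside the block:
with torch.no_grad():
    a = x * 2
    b = x * 3
print(a.requires_grad, b.requires_grad)  # False False
```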
Good explanation. Can you give some use cases for both? I'm guessing you'd use torch.no_grad during the eval phase? But what would you use detach() for? Any specific use cases would help me understand them better.
torch.no_grad: yes, in general you can use it in the eval phase.
detach(), on the other hand, should not be needed if you're doing classic CNN-like architectures. It is usually used for trickier operations.
detach() is useful when you want to compute something that you can't / don't want to differentiate. For example, if you compute some indices from the output of the network and then want to use them to index a tensor: the indexing operation is not differentiable w.r.t. the indices, so you should detach() the indices before using them.
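A minimal sketch of that indexing use case (the tensor names are made up for illustration):

```python
import torch

scores = torch.randn(5, requires_grad=True)
values = torch.randn(5, requires_grad=True)

# topk produces integer indices; indexing is not differentiable
# w.r.t. the indices, so detach them before using them.
idx = scores.topk(2).indices.detach()
picked = values[idx]          # gradients still flow into `values`
picked.sum().backward()

print(values.grad is not None)  # True: the indexed tensor gets gradients
print(scores.grad)              # None: nothing flows back through the indices
```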
Thank you albanD, it definitely helps!
So, if I understand this part properly (please anyone correct me if I'm wrong), we can use both methods interchangeably and they should give the same result:
with torch.no_grad():
    return policy_net(state).max(1)[1].view(1, 1)

should have a similar effect to

return policy_net(state).detach().max(1)[1].view(1, 1)
Thanks again for your help!
The returned result will be the same, but the version with torch.no_grad will use less memory, because it knows from the beginning that no gradients are needed, so it doesn't need to keep intermediary results.
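You can see the difference in what gets recorded (a small sketch with a made-up linear layer standing in for the network):

```python
import torch
import torch.nn as nn

net = nn.Linear(10, 10)
x = torch.randn(1, 10)

# no_grad: no graph is built at all, so no intermediate buffers
# are kept for a backward pass that will never happen.
with torch.no_grad():
    out_ng = net(x)
print(out_ng.grad_fn)  # None

# detach: the graph for net(x) is built first (intermediates are
# stored), and only the final output is cut away from it.
out_det = net(x).detach()
print(out_det.grad_fn)         # None as well
print(net(x).grad_fn is None)  # False: a normal forward records history
```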
You’re right. And this is a huge difference.
All clear now!
@albanD Hi, does detach erase the existing weights?
So if we do:
policy_loss = …
For some reason when I was doing this in a loop the weights of the policy.parameters() [connected to the policy optimizer] stopped changing. I don’t understand the mechanism for why.
The only thing detach does is to return a new Tensor (it does not change the current one) that does not share the history of the original Tensor.
If the only thing you added to your code is policy_loss.detach(), then it does not do anything to your code: you don't use the result, and detach does not change anything else.
@albanD I see, any pointers for why the weights aren’t changing? I notice first the biases stop changing, then more and more hidden layers stop changing incrementally.
The gradients were on the order of E-2 so not super small, but perhaps combined with the 3E-4 learning rate the update becomes numerically ~0 and ignored?
Yes, that can happen: it would give you an update around 1e-6, which is where float32 numbers start to lose precision.
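Whether an update around 1e-6 survives depends on the weight's magnitude, since float32 carries roughly 7 significant digits. A quick sketch:

```python
import torch

print(torch.finfo(torch.float32).eps)  # ~1.19e-07 relative resolution

small_w = torch.tensor(1.0)
large_w = torch.tensor(100.0)

# At magnitude 1, a 1e-6 update is still representable...
print((small_w + 1e-6) == small_w)  # tensor(False)
# ...but at magnitude 100 it falls below the spacing between adjacent
# float32 values and is silently rounded away.
print((large_w + 1e-6) == large_w)  # tensor(True)
```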
Assume I have a network in two parts like:
def forward(self, x):
    x = self.part_1(x)
    return self.part_2(x)
If I want to update only the second part in training phase, will this modification do the desired thing?
def forward(self, x):
    with torch.no_grad():
        x = self.part_1(x)
    return self.part_2(x)
Yes, your code will work.
Alternatively, you could call detach() on the output of part_1 before passing it to part_2:

def forward(self, x):
    x = self.part_1(x)
    return self.part_2(x.detach())
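Either way, you can check that part_1 is untouched by the backward pass (a sketch with small made-up linear layers standing in for the two parts):

```python
import torch
import torch.nn as nn

part_1 = nn.Linear(4, 4)
part_2 = nn.Linear(4, 2)

x = torch.randn(3, 4)
out = part_2(part_1(x).detach())  # cut the graph between the two parts
out.sum().backward()

print(part_1.weight.grad)              # None: no gradient reached part_1
print(part_2.weight.grad is not None)  # True
```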
In torch >= 1.0.0, what is the correct way to copy the data of a tensor so that the copy “is just a copy” (i.e. a standalone tensor that doesn’t know anything about graphs, gradients etc.)?
x = my_tensor.detach().clone()
x = my_tensor.clone().detach()
# >>> Now, the two lines above, but augmented with .data in all possible places:
x = my_tensor.data.detach().clone()
x = my_tensor.detach().data.clone()
# ...
x = my_tensor.clone().detach().data
# <<<
It looks like all of these options solve the problem, don't they? If that's true, then the options with .data are all redundant, and the question is which of the first two is preferable.
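For what it's worth, you can verify that .detach().clone() gives a fully standalone copy (a quick sketch):

```python
import torch

my_tensor = torch.ones(3, requires_grad=True)
copy = my_tensor.detach().clone()

print(copy.requires_grad)                       # False: no graph attached
print(copy.data_ptr() == my_tensor.data_ptr())  # False: its own storage

copy.zero_()            # in-place edits on the copy...
print(my_tensor.sum())  # ...leave the original untouched: still 3
```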
One more case:
def forward(self, x):
    x = self.root(x)
    out1 = self.branch_1(x)
    out2 = self.branch_2(x.detach())
    return out1, out2

loss = F.mse_loss(out1, target1) + F.mse_loss(out2, target2)
loss.backward()
I want the gradients from branch_1 to update the parameters of the root and branch_1, but the gradients of the second branch should only update the branch_2 parameters. I guess this use of detach will make that happen, but I want to be sure?
Yes, this will do what you want!
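You can verify it numerically: the root's gradient in the two-branch setup matches the gradient you'd get from branch_1's loss alone (module names follow the snippet above; I'm sketching them as small linear layers):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

root = nn.Linear(4, 4)
branch_1 = nn.Linear(4, 2)
branch_2 = nn.Linear(4, 2)
x = torch.randn(3, 4)
target = torch.zeros(3, 2)

# Reference: backprop only branch_1's loss and record root's gradient.
F.mse_loss(branch_1(root(x)), target).backward()
ref_grad = root.weight.grad.clone()
root.weight.grad = None

# Two-branch forward with detach on branch_2's input.
h = root(x)
loss = F.mse_loss(branch_1(h), target) + F.mse_loss(branch_2(h.detach()), target)
loss.backward()

# root's gradient is identical: branch_2's loss contributed nothing to it.
print(torch.allclose(root.weight.grad, ref_grad))  # True
```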
For example in Faster R-CNN gradients do not backprop through proposals; therefore, once we have computed the proposals in RPN, we can detach them and use the detached tensors (ROIs) for the rest of the networks (https://github.com/pytorch/vision/blob/8dfcff745a5bd2d4886716bf0deff8dcc8e75fed/torchvision/models/detection/rpn.py#L491).
Another use case: if there are auxiliary losses in the network and we do not detach at the proper points (just after these auxiliary heads), the gradients of some parameters will be computed multiple times, which we may not want.
So, if I do .detach() on a variable, then that variable will never be in the computation graph, right?
The returned Tensors will not.
The original Tensor you called detach() on remains unchanged.
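A short check of that, with one caveat worth knowing: the detached tensor still shares storage with the original.

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x.detach()

print(y.requires_grad)  # False: the returned tensor has no history
print(x.requires_grad)  # True: the original is unchanged

# Caveat: detach() shares memory with the original, so in-place
# edits on y would also modify x; use .detach().clone() for a real copy.
print(y.data_ptr() == x.data_ptr())  # True
```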