Of course I understand we don't want to compute gradients here, but I don't fully understand the difference between all three methods…
Also, if I'm not mistaken, in previous versions of PyTorch we used volatile=True, which was considered more memory efficient (please correct me if I'm wrong) and which is now replaced by with torch.no_grad():
So if we used with torch.no_grad(): in optimize_mode, would that also be OK?
The answer you mentioned helped me understand the general idea, but I still don't fully understand the difference between torch.no_grad and detach.
Can I use them interchangeably, or are there any limitations?
detach() detaches the output from the computational graph, so no gradient will be backpropagated along this variable. torch.no_grad says that no operation should build the graph.
The difference is that detach() applies only to the single tensor it is called on, while torch.no_grad affects all operations taking place within the with statement.
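A minimal sketch of the distinction (the tensor names here are just for illustration):

```python
import torch

x = torch.ones(3, requires_grad=True)

# detach() acts on one tensor: y has no history, but the multiplication
# itself is still recorded into the graph.
y = (x * 2).detach()
print(y.requires_grad)   # False
print((x * 2).grad_fn)   # <MulBackward0 ...>: operations still build a graph

# torch.no_grad() disables graph construction for everything inside it.
with torch.no_grad():
    z = x * 2
print(z.requires_grad)   # False
print(z.grad_fn)         # None: no graph was built at all
```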
Good explanation. Can you give some use cases for both? I'm guessing you'd use torch.no_grad during the eval phase? But what would you use detach() for? Any specific use cases would help me understand them better.
Yes, torch.no_grad is what you would generally use in the eval phase.
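For example, a typical eval loop looks like this (model, criterion and val_loader here are stand-ins, not code from this thread):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)                                     # stand-in model
criterion = nn.CrossEntropyLoss()
val_loader = [(torch.randn(8, 4), torch.randint(2, (8,)))]  # stand-in loader

model.eval()              # switch dropout/batch-norm to eval behaviour
with torch.no_grad():     # skip graph construction: faster, less memory
    for inputs, targets in val_loader:
        outputs = model(inputs)
        val_loss = criterion(outputs, targets)
```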
detach(), on the other hand, should not be needed if you're building classic CNN-like architectures; it is usually reserved for trickier operations. detach() is useful when you want to compute something that you can't, or don't want to, differentiate. For example, if you compute some indices from the output of the network and then want to use them to index a tensor: the indexing operation is not differentiable with respect to the indices, so you should detach() the indices before using them.
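A sketch of that indexing case (table and out are made-up stand-ins for a parameter tensor and a network output):

```python
import torch

table = torch.randn(10, requires_grad=True)
out = torch.randn(10, requires_grad=True)   # stand-in for a network output

# The index computation is not differentiable wrt `out`, so cut it out
# of the graph before using it:
idx = out.detach().argmax()
value = table[idx]          # differentiable wrt `table`, not wrt `out`
value.backward()
print(table.grad)           # one-hot gradient at position idx
print(out.grad)             # None: nothing flowed back to `out`
```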
The returned result will be the same, but the version with torch.no_grad will use less memory, because it knows from the beginning that no gradients are needed, so it doesn't have to keep intermediate results around.
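Roughly, the two versions compare like this (a sketch; the sizes are arbitrary):

```python
import torch

x = torch.randn(1000, 1000, requires_grad=True)

# Builds the full graph first and only drops it at the very end, so the
# intermediate buffers needed for backward are still allocated:
y1 = (x * 2).sum().detach()

# Never builds the graph, so no intermediate buffers are kept for backward:
with torch.no_grad():
    y2 = (x * 2).sum()

print(torch.equal(y1, y2))  # True: same result, different memory profile
```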
For some reason, when I was doing this in a loop, the weights of policy.parameters() [connected to the policy optimizer] stopped changing, and I don't understand the mechanism behind it.
The only thing detach does is return a new Tensor (it does not change the current one) that does not share the history of the original Tensor.
If the only thing you added to your code is policy_loss.detach(), then it does nothing at all: you don't use the result, and detach does not change anything else.
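That is, something like this minimal illustration:

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x * 2

y.detach()                  # returned tensor is discarded: this changes nothing
print(y.requires_grad)      # True: y itself is untouched

z = y.detach()              # only the *returned* tensor has no history
print(z.requires_grad)      # False
print(z.data_ptr() == y.data_ptr())  # True: it still shares the same storage
```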
@albanD I see; any pointers as to why the weights aren't changing? I notice that first the biases stop changing, then more and more hidden layers stop changing incrementally.
The gradients were on the order of 1e-2, so not super small, but perhaps combined with the 3e-4 learning rate the update becomes numerically ~0 and is effectively ignored?
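One quick way to sanity-check that hypothesis in float32 (the 0.5 here is just an assumed typical weight magnitude):

```python
import torch

w = torch.tensor(0.5)       # assumed typical weight magnitude (float32)
update = 1e-2 * 3e-4        # |grad| * lr ≈ 3e-6
print(w + update == w)      # tensor(False): 3e-6 is well above float32
                            # resolution near 0.5 (~6e-8), so the update
                            # is tiny but not rounded away
```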
In torch >= 1.0.0, what is the correct way to copy the data of a tensor so that the copy “is just a copy” (i.e. a standalone tensor that doesn’t know anything about graphs, gradients etc.)?
x = my_tensor.detach().clone()
x = my_tensor.clone().detach()
# >>> Now, the two lines above, but augmented with .data in all possible places:
x = my_tensor.data.detach().clone()
x = my_tensor.detach().data.clone()
# ...
x = my_tensor.clone().detach().data
# <<<
It looks like all of these options solve the problem, don't they? If that's true, then the options with .data are all redundant, and the question is .detach().clone() vs .clone().detach().
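A quick check that the plain .detach().clone() version already gives a fully standalone tensor:

```python
import torch

my_tensor = torch.randn(3, requires_grad=True) * 2   # non-leaf, has history

x = my_tensor.detach().clone()
print(x.requires_grad)                       # False
print(x.grad_fn)                             # None: no graph attached
print(x.data_ptr() == my_tensor.data_ptr())  # False: clone copied the storage
x += 1                                       # in-place edit; my_tensor unaffected
```

Functionally the two orderings are equivalent; .detach().clone() is marginally cheaper because the clone then runs outside the graph, while my_tensor.clone() first records a clone operation that .detach() immediately discards.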
I want the gradients from branch1 to update the parameters of both the root and branch1, but the gradients of the second branch should only update the branch_2 parameters. I guess using detach will make that happen, but I want to be sure?
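Yes, detaching at the branch point does exactly that. A sketch with placeholder modules (the Linear layers just stand in for your real root/branches):

```python
import torch
import torch.nn as nn

root = nn.Linear(10, 10)       # placeholder modules
branch1 = nn.Linear(10, 1)
branch_2 = nn.Linear(10, 1)

x = torch.randn(4, 10)
h = root(x)
loss1 = branch1(h).mean()            # backprops into branch1 and root
loss2 = branch_2(h.detach()).mean()  # backprops into branch_2 only
(loss1 + loss2).backward()
```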
Another use case: if there are auxiliary losses in the network and we do not detach at the proper points (just after these auxiliary points), the gradients of some parameters will be computed multiple times, which we may not want.
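For instance (a hedged sketch; trunk, main_head and aux_head are made-up placeholders, and the right detach point depends on the architecture):

```python
import torch
import torch.nn as nn

trunk = nn.Linear(8, 8)        # placeholder shared trunk
main_head = nn.Linear(8, 1)
aux_head = nn.Linear(8, 1)

x = torch.randn(4, 8)
features = trunk(x)
main_loss = main_head(features).mean()
# Detaching here means the auxiliary loss does not send a second set of
# gradients through the trunk parameters:
aux_loss = aux_head(features.detach()).mean()
(main_loss + aux_loss).backward()
```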