If I have two different neural networks (model1 and model2) with two corresponding optimizers, would the operation below, which uses model1.parameters() without detach(), lead to a change in its gradients? My requirement is to compute the mean squared loss between the two models' parameters but update only the optimizer corresponding to model1.
opt1 = torch.optim.SGD(self.model1.parameters(), lr=1e-3)
opt2 = torch.optim.SGD(self.model2.parameters(), lr=1e-3)
loss = (self.lamb / 2.) * ((torch.nn.utils.parameters_to_vector(self.model1.parameters())
                            - torch.nn.utils.parameters_to_vector(self.model2.parameters())) ** 2).sum()
How can I decide in general whether to use detach for any operation or not?
parameters_to_vector is differentiable, so yes, gradients will flow back to both models.
In general, there are very few cases where you need .detach() within your training function. It is most often used when you want to save the loss for logging, or save a Tensor for later inspection, when you don’t need gradient information.
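To make the answer concrete, here is a minimal sketch (with small hypothetical nn.Linear modules standing in for model1 and model2) showing that gradients from the parameter-distance loss reach both models, and how a .detach() on model2's side restricts the gradient flow to model1:

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector

# Hypothetical stand-ins for model1 / model2 from the question.
model1 = nn.Linear(4, 2)
model2 = nn.Linear(4, 2)
lamb = 0.1

# Without detach: gradients flow back into *both* models.
diff = (parameters_to_vector(model1.parameters())
        - parameters_to_vector(model2.parameters()))
loss = (lamb / 2.) * (diff ** 2).sum()
loss.backward()
print(model1.weight.grad is not None)  # True
print(model2.weight.grad is not None)  # True

model1.zero_grad(set_to_none=True)
model2.zero_grad(set_to_none=True)

# With detach on model2's vector: only model1 receives gradients.
diff = (parameters_to_vector(model1.parameters())
        - parameters_to_vector(model2.parameters()).detach())
loss = (lamb / 2.) * (diff ** 2).sum()
loss.backward()
print(model1.weight.grad is not None)  # True
print(model2.weight.grad is None)      # True
```

Note that even without the detach, only calling opt1.step() (and never opt2.step()) would leave model2's weights unchanged; the difference is that model2's .grad buffers would still be populated and would keep accumulating across iterations.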
Thanks @albanD for the reply.
Follow-up questions for more clarity:
You mentioned parameters_to_vector being differentiable. So, how can I check whether a function is differentiable or not?
Also, can you be more specific about the operations for which detach is required? For instance, if I pass the model parameters to some other function, or use them like params = list(self.model1.parameters()), will these require the use of detach()?
In general, all ops in PyTorch are differentiable.
The main exceptions are .detach() and anything run under with torch.no_grad(), as well as functions that work with
nn.Parameter, which needs to remain a leaf and so cannot have gradient history.
Also, can you be more specific about the operations in which detach is required
You most likely never need it, actually.
The first part makes sense but not the second part. Sorry if that’s annoying.
Detach is used to break the graph to mess with the gradient computation.
In 99% of the cases, you never want to do that.
The only unusual cases where it can be useful are the ones I mentioned above: you want to take a Tensor that was produced by a differentiable function and use it in a function that is not expected to be differentiated. You can use detach() to express that. But this is a rare case.
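A minimal sketch of that rare-but-legitimate case, logging a loss value without keeping the autograd graph alive (the model and training loop here are hypothetical):

```python
import torch

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
losses = []  # for logging only; we never differentiate through these

for step in range(3):
    x = torch.randn(8, 4)
    loss = model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    # detach (or .item()) so the stored value carries no graph with it:
    losses.append(loss.detach())

print(len(losses))              # 3
print(losses[0].requires_grad)  # False
```

Without the detach, each stored loss would keep its entire computation graph alive in memory for as long as the list exists.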
Hi @albanD, I’m in this situation: I need to detach my tensor because the black-box model I use inside my model doesn’t accept tensors. How do I do that properly, please? Everything changes, but my parameters’ gradients are stuck at zero, and I suspect .detach(). Thank you…
If you want to get gradients through something that our autograd cannot see, you will have to use a custom Function so that you can tell the autograd what the backward is: Extending PyTorch — PyTorch 1.11.0 documentation
From within the forward there, you can just unpack the Tensor into whatever you want and pass that to your black box!
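A minimal sketch of that pattern: the black box here is a hypothetical function that only accepts NumPy arrays (standing in for your real one), and since autograd cannot see inside it, the backward must supply the known derivative by hand:

```python
import torch

def black_box(arr):
    # Hypothetical non-PyTorch code: only accepts/returns numpy arrays.
    return arr ** 2

class BlackBoxSquare(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        # Unpack the Tensor to numpy for the black box, wrap the result back.
        out = black_box(x.detach().numpy())
        return torch.from_numpy(out)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Autograd cannot derive this, so we supply it: d(x^2)/dx = 2x.
        return grad_out * 2 * x

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = BlackBoxSquare.apply(x).sum()
y.backward()
print(x.grad)  # tensor([2., 4., 6.])
```

The .detach() inside forward is fine here because the custom Function's backward, not the graph through the black box, is what carries the gradient.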
Thank you so much for your reply. I will try it and get back. I still ask myself why all the parameters’ gradients are stuck at zero and not None. Even if I initialize the biases and weights to certain values, they still change during inference; I’m a beginner and that confuses me. Thank you.
If the gradients are all zeros, the optimizer will still move if you have weight decay or momentum!
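This is easy to see with plain SGD (the numbers below are arbitrary; Adam's momentum terms behave analogously):

```python
import torch

# Weight decay moves a parameter even when its gradient is all zeros:
p = torch.nn.Parameter(torch.ones(2))
opt = torch.optim.SGD([p], lr=0.1, weight_decay=0.5)
p.grad = torch.zeros_like(p)   # as if .detach() had cut the graph
opt.step()
print(p.data)  # tensor([0.9500, 0.9500]): p <- p - lr * wd * p

# With momentum, an earlier nonzero gradient keeps the parameter moving
# even when later gradients are zero:
q = torch.nn.Parameter(torch.ones(2))
opt2 = torch.optim.SGD([q], lr=0.1, momentum=0.9)
q.grad = torch.ones_like(q)
opt2.step()                    # q = 1 - 0.1 * 1 = 0.9
q.grad = torch.zeros_like(q)
opt2.step()                    # buffer 0.9 * 1 = 0.9 -> q = 0.9 - 0.09 = 0.81
print(q.data)  # tensor([0.8100, 0.8100])
```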
Ok, thank you. I’m using Adam without weight decay, but it has momentum, so I get it. I’ll try your suggestion and get back. So the lesson for now is to never use .detach() inside a model.