What is the difference between detach(), detach_(), and with torch.no_grad() in our training phase?

They might yield the same result, but in certain edge cases you can run into different and unexpected behavior:

  • An optimizer that doesn’t hold certain parameters won’t update them. However, other optimizers could still update these parameters if they were passed to them.
  • If you are using weight decay, have updated some parameters in the past, and are now freezing them, these parameters could still be updated if they were passed to the optimizer (see the sketch after this list).
  • detach() could act in the same way as described in the second point.
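
To make the weight-decay point concrete, here is a minimal sketch (the single parameter, learning rate, and weight decay value are made up for illustration). The parameter is updated once, then “frozen” via requires_grad_(False), but it still moves on the next step() because weight decay is applied to every parameter whose .grad is not None:

```python
import torch

# Hypothetical toy example: one parameter, plain SGD with weight decay.
p = torch.nn.Parameter(torch.ones(1))
opt = torch.optim.SGD([p], lr=0.1, weight_decay=0.1)

# First step: the parameter is trainable and receives a gradient.
(p * 2).sum().backward()
opt.step()

# "Freeze" the parameter, but keep it registered in the optimizer.
p.requires_grad_(False)
opt.zero_grad(set_to_none=False)  # grad becomes a zero tensor, not None

# No forward/backward touches p anymore, yet step() still changes it,
# because the weight-decay term is added to the (zero) gradient.
before = p.detach().clone()
opt.step()
print(before, p.detach())  # the values differ -> the "frozen" parameter moved
```

With zero_grad(set_to_none=True) (the default in recent PyTorch versions) the gradient would be None and the optimizer would skip the parameter, which is exactly why relying on such implicit details is fragile.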

That being said, I would prefer the explicit approach instead of implicitly relying on no side effects occurring. :wink:
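
For example, one explicit option (the model and hyperparameters here are made up for illustration) is to freeze the parameters via requires_grad_(False) and pass only the still-trainable ones to the optimizer:

```python
import torch
import torch.nn as nn

# Hypothetical small model just for illustration.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))

# Freeze the first layer explicitly ...
for param in model[0].parameters():
    param.requires_grad_(False)

# ... and hand only the still-trainable parameters to the optimizer,
# so weight decay / momentum cannot touch the frozen ones.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=0.1,
    momentum=0.9,
    weight_decay=1e-4,
)
```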
