What the difference between detach(), detach_(),and with torch.no_grad() in our training phase?

ptrblck · December 18, 2019, 5:43am

They might yield the same result, but in specific edge cases you might get different and unexpected results:

An optimizer, which doesn’t hold certain parameters, won’t update them. However other optimizers could still update these parameters, if they are passed.
If you are using weight decay, have updated some parameters in the past and are now freezing them, these parameters could still be updated, if they were passed to the optimizer.
detach() could act in the same way as the second point.

That being said, I would prefer the explicit way instead of relying implicitly that no side effects can occur.