What is the difference between detach(), detach_(), and with torch.no_grad() in the training phase?

Eric_K · December 18, 2019, 2:45am

When I don’t need a network to update its parameters, I don’t pass that network’s parameters to the optimizer’s parameter argument. And when I run the forward pass, I wrap my “model(input)” call in “with torch.no_grad():”.

Is there anything wrong with this approach?

Thanks…

ptrblck · December 18, 2019, 4:56am

Your assumptions are correct.
Operations wrapped in with torch.no_grad() are not tracked by Autograd, so the intermediate tensors that the backward pass would need are not stored.
That being said, you should not wrap the complete forward pass in this block during training, as you won’t be able to calculate the gradients.
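As a rough sketch of the two statements above, assuming a hypothetical setup with a `frozen` feature extractor and a trainable `head` (both names are invented for illustration): wrapping only the frozen part keeps gradients flowing into the head, while wrapping the complete forward pass leaves nothing to backpropagate.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical modules for illustration: `frozen` should not train, `head` should.
frozen = nn.Linear(4, 3)
head = nn.Linear(3, 1)
x = torch.randn(8, 4)

# Wrap only the part that Autograd should not track.
with torch.no_grad():
    features = frozen(x)

features_tracked = features.requires_grad  # no graph was recorded for `frozen`
loss = head(features).sum()
loss.backward()

frozen_has_grad = frozen.weight.grad is not None
head_has_grad = head.weight.grad is not None

# Wrapping the complete forward pass: the loss has no grad_fn, so backward() fails.
with torch.no_grad():
    full_loss = head(frozen(x)).sum()
try:
    full_loss.backward()
    backward_failed = False
except RuntimeError:
    backward_failed = True
```

In this sketch only `head` receives gradients, and the second `backward()` call raises a RuntimeError.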

detach() operates on a tensor and returns a new tensor that shares the same underlying data but is detached from the computation graph, so that the backward pass will stop at this point.
detach_() is the in-place version of detach().
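A small sketch of that distinction, using made-up tensor names (`w`, `y`, `z`, `u`): detach() leaves the original tensor attached and hands back a detached tensor over the same memory, whereas detach_() changes the tensor it is called on.

```python
import torch

w = torch.ones(3, requires_grad=True)

y = w * 2        # non-leaf tensor that is part of the graph
z = y.detach()   # separate tensor object over the same memory, not part of the graph

is_new_object = z is not y
shares_memory = z.data_ptr() == y.data_ptr()
y_still_attached = y.grad_fn is not None
z_requires_grad = z.requires_grad

# detach_() modifies the tensor it is called on instead of returning a new one.
u = w * 2
u.detach_()
u_requires_grad = u.requires_grad
u_has_grad_fn = u.grad_fn is not None
```

Here `y` can still be used in a backward pass, while `u` cannot, since `u` itself was cut from the graph.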

Eric_K · December 18, 2019, 5:23am

Thanks for your reply~

If I just want to freeze part of a single network, the “detach” method is more convenient for me.
If I want to freeze one network that is separate from the other networks, these three methods are equivalent for me. Is that right?

Thank you~ ^^

ptrblck · December 18, 2019, 5:43am

They might yield the same result, but in specific edge cases you might get different and unexpected results:

1. An optimizer that doesn’t hold certain parameters won’t update them. However, other optimizers could still update these parameters if they were passed to them.
2. If you are using weight decay, have updated some parameters in the past, and are now freezing them, these parameters could still be updated if they were passed to the optimizer.
3. detach() could act in the same way as described in the second point.

That being said, I would prefer the explicit way instead of implicitly relying on no side effects occurring.
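One way the weight decay edge case can show up, as a hedged sketch rather than the only scenario: two parameters (hypothetical names `p_zero_grad` and `p_no_grad`) are both passed to an SGD optimizer with weight decay. The one holding a zero gradient is still changed by the decay term, while the one whose .grad is None is skipped.

```python
import torch

# Two parameters with identical values, both passed to the optimizer.
p_zero_grad = torch.nn.Parameter(torch.ones(2))
p_no_grad = torch.nn.Parameter(torch.ones(2))

opt = torch.optim.SGD([p_zero_grad, p_no_grad], lr=0.1, weight_decay=0.5)

# Assumption for illustration: this parameter is "frozen" only in the sense
# that its gradient is all zeros, not None.
p_zero_grad.grad = torch.zeros(2)
# p_no_grad.grad is left as None, so the optimizer has nothing to apply to it.

opt.step()

zero_grad_changed = not torch.equal(p_zero_grad.detach(), torch.ones(2))
no_grad_changed = not torch.equal(p_no_grad.detach(), torch.ones(2))
```

The zero-gradient parameter shrinks because weight decay adds a term proportional to the parameter itself, which is why keeping frozen parameters out of the optimizer is the safer choice.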

Kevin96 · January 31, 2022, 7:36pm

I still don’t understand the difference between detach() and detach_(). Neither of them creates a copy, so I don’t quite understand why we need an in-place version of detach(). Can you clarify the difference between them?

Also, can you clarify why “Views cannot be detached in-place”?
https://pytorch.org/docs/stable/generated/torch.Tensor.detach_.html#torch.Tensor.detach_

Thanks!

ptrblck · January 31, 2022, 11:41pm

The inplace version would also detach the source tensor as seen here:

# inplace, non-view
import torch

a = torch.randn(1, 1, requires_grad=False)
b = torch.randn(1, 1, requires_grad=True)
c = a * b
d = c  # d and c refer to the same tensor object

d.detach_() # detaches d and c since inplace
d[0, 0] = 1.

(c + a).sum().backward()
# > RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

“Views cannot be detached in-place” refers to e.g. this code snippet:

# inplace, view
import torch

a = torch.randn(1, 1, requires_grad=False)
b = torch.randn(1, 1, requires_grad=True)
c = a * b
d = c.view(1, 1)  # d is a view of c

d.detach_()
# > RuntimeError: Can't detach views in-place. Use detach() instead.

Detaching d out of place works if you still want to use c for the backward pass:

# not inplace
import torch

a = torch.randn(1, 1, requires_grad=False)
b = torch.randn(1, 1, requires_grad=True)
c = a * b
d = c.detach()  # d is detached, c stays in the graph

d[0, 0] = 1.

(c + a).sum().backward()

zhiyuanpeng · March 15, 2022, 3:21am

Thanks for the elaboration. I’d like to confirm whether my understanding is right: the best practice is to use one of (detach, detach_, no_grad) and not pass the parameters of the frozen network to the optimizer? Thanks.

ptrblck · March 15, 2022, 3:26am

Yes, I would recommend being as explicit as possible. I.e. if you don’t want the optimizer to update the parameters, don’t pass these parameters to it. Otherwise, you might hit edge cases that are hard to debug.
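A minimal sketch of this explicit approach, again with hypothetical module names (`frozen`, `head`): the optimizer is built from the trainable parameters only, and the output of the frozen network is detached before it is used.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical modules for illustration: `frozen` stays fixed, `head` trains.
frozen = nn.Linear(4, 3)
head = nn.Linear(3, 1)

# Explicit: the frozen parameters are never passed to the optimizer.
opt = torch.optim.SGD(head.parameters(), lr=0.1, weight_decay=0.01)

frozen_before = frozen.weight.detach().clone()
head_before = head.weight.detach().clone()

x = torch.randn(8, 4)
features = frozen(x).detach()  # backward pass stops here
loss = head(features).pow(2).mean()

opt.zero_grad()
loss.backward()
opt.step()

frozen_unchanged = torch.equal(frozen.weight.detach(), frozen_before)
head_changed = not torch.equal(head.weight.detach(), head_before)
```

Since the optimizer never sees the parameters of `frozen`, neither weight decay nor a stale gradient can modify them.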

zhiyuanpeng · March 15, 2022, 3:28am

Got it. Thank you very much!