Hi,
I have read this thread carefully: question, but I am still confused about it.
Let’s say we have 3 operations:
```python
x1 = input
x2 = op1(x1)
x3 = op2(x2)
x4 = op3(x3)
loss = lossfunction(x4)
```
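For concreteness, here is a minimal runnable version of this setup. The concrete choices (each op as an `nn.Linear`, MSE as the loss, a dummy target) are my own assumptions, just to have something executable:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Assumed stand-ins for op1/op2/op3 and lossfunction:
op1, op2, op3 = nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4)
lossfunction = nn.MSELoss()
target = torch.zeros(1, 4)

x1 = torch.randn(1, 4)  # input
x2 = op1(x1)
x3 = op2(x2)
x4 = op3(x3)
loss = lossfunction(x4, target)
loss.backward()
# Without no_grad/detach, all three ops receive gradients:
print([op.weight.grad is not None for op in (op1, op2, op3)])  # [True, True, True]
```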
What is the difference between:
```python
x1 = input
x2 = op1(x1)
with torch.no_grad():
    x3 = op2(x2)
x4 = op3(x3)
loss = lossfunction(x4)
```
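Continuing the runnable snippet above (same assumed `Linear` layers), this is what I mean by the no_grad version, with only op2 inside the context manager:

```python
x1 = torch.randn(1, 4)
x2 = op1(x1)
with torch.no_grad():   # autograd records nothing inside this block
    x3 = op2(x2)        # x3 has no grad history through op2
x4 = op3(x3)            # op3's parameters still require grad
loss = lossfunction(x4, target)
print(x3.requires_grad, x4.requires_grad)  # False True
```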
and this (using `detach`):

```python
x1 = input
x2 = op1(x1)
x3 = op2(x2).detach()
x4 = op3(x3)
loss = lossfunction(x4)
```
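And the detach version, again continuing the same assumed setup:

```python
x1 = torch.randn(1, 4)
x2 = op1(x1)
x3 = op2(x2).detach()   # op2 runs normally, then the result is cut from the graph
x4 = op3(x3)
loss = lossfunction(x4, target)
print(x3.requires_grad, x4.requires_grad)  # False True
```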
Assuming all operations have learnable parameters, how do gradient propagation, memory management, and the stored activations differ between the two versions?
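For the gradient part, this is the check I would run (same assumed `Linear` ops as above); it only shows which parameters receive a `.grad`, not what happens to memory or saved activations, which is the part I'm unsure about:

```python
def grads_received(ops):
    """Which ops' weights got a gradient after backward?"""
    return [op.weight.grad is not None for op in ops]

# no_grad version
for op in (op1, op2, op3):
    op.weight.grad = None   # reset between runs
x2 = op1(torch.randn(1, 4))
with torch.no_grad():
    x3 = op2(x2)
x4 = op3(x3)
lossfunction(x4, target).backward()
print("no_grad:", grads_received([op1, op2, op3]))

# detach version
for op in (op1, op2, op3):
    op.weight.grad = None
x2 = op1(torch.randn(1, 4))
x3 = op2(x2).detach()
x4 = op3(x3)
lossfunction(x4, target).backward()
print("detach: ", grads_received([op1, op2, op3]))
```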