Hi,
I have read this thread carefully: question, but I am still confused about it.
Let’s say we have 3 operations:
```python
x1 = input
x2 = op1(x1)
x3 = op2(x2)
x4 = op3(x3)
loss = lossfunction(x4)
```
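For concreteness, here is a minimal runnable version of this setup. The concrete choices (each op as an `nn.Linear`, MSE as the loss, a dummy target) are my own assumptions, just to have something executable:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Assumed stand-ins for op1/op2/op3 and lossfunction:
op1, op2, op3 = nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4)
lossfunction = nn.MSELoss()
target = torch.zeros(1, 4)

x1 = torch.randn(1, 4)  # input
x2 = op1(x1)
x3 = op2(x2)
x4 = op3(x3)
loss = lossfunction(x4, target)
loss.backward()
# Without no_grad/detach, all three ops receive gradients:
print([op.weight.grad is not None for op in (op1, op2, op3)])  # [True, True, True]
```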
What is the difference between:
```python
x1 = input
x2 = op1(x1)
with torch.no_grad():
    x3 = op2(x2)
x4 = op3(x3)
loss = lossfunction(x4)
```
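Continuing the runnable snippet above (same assumed `Linear` layers), this is what I mean by the no_grad version, with only op2 inside the context manager:

```python
x1 = torch.randn(1, 4)
x2 = op1(x1)
with torch.no_grad():   # autograd records nothing inside this block
    x3 = op2(x2)        # x3 has no grad history through op2
x4 = op3(x3)            # op3's parameters still require grad
loss = lossfunction(x4, target)
print(x3.requires_grad, x4.requires_grad)  # False True
```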
and this (using `detach`):

```python
x1 = input
x2 = op1(x1)
x3 = op2(x2).detach()
x4 = op3(x3)
loss = lossfunction(x4)
```
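And the detach version, again continuing the same assumed setup:

```python
x1 = torch.randn(1, 4)
x2 = op1(x1)
x3 = op2(x2).detach()   # op2 runs normally, then the result is cut from the graph
x4 = op3(x3)
loss = lossfunction(x4, target)
print(x3.requires_grad, x4.requires_grad)  # False True
```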
Assuming all operations have learnable parameters, how do gradient propagation, memory management, and the stored activations differ between the two versions?
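For the gradient part, this is the check I would run (same assumed `Linear` ops as above); it only shows which parameters receive a `.grad`, not what happens to memory or saved activations, which is the part I'm unsure about:

```python
def grads_received(ops):
    """Which ops' weights got a gradient after backward?"""
    return [op.weight.grad is not None for op in ops]

# no_grad version
for op in (op1, op2, op3):
    op.weight.grad = None   # reset between runs
x2 = op1(torch.randn(1, 4))
with torch.no_grad():
    x3 = op2(x2)
x4 = op3(x3)
lossfunction(x4, target).backward()
print("no_grad:", grads_received([op1, op2, op3]))

# detach version
for op in (op1, op2, op3):
    op.weight.grad = None
x2 = op1(torch.randn(1, 4))
x3 = op2(x2).detach()
x4 = op3(x3)
lossfunction(x4, target).backward()
print("detach: ", grads_received([op1, op2, op3]))
```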