What’s the difference between retain_graph and retain_variables for backward()?
The docs say that when we need to backpropagate twice, we have to set retain_variables=True.
But I tried the example below:
import torch
from torch.autograd import Variable

f = Variable(torch.Tensor([2, 3]), requires_grad=True)
g = f[0] + f[1]
g.backward()
print(f.grad)
g.backward()   # second backward pass, without retain_variables=True
print(f.grad)
It works even though I didn’t set retain_variables=True. Can anyone tell me why?
I’m also confused because the docs say the buffers are freed after the first backward pass when retain_variables=True is not set. Why aren’t the buffers simply recreated when the gradients are computed a second time?
When you run the forward pass, the input values are saved so that the gradients can be computed correctly during the backward pass. Once those saved values have been discarded, the gradients can no longer be computed.
Some operations, such as addition, do not need their inputs to compute the gradients: the gradient of a sum with respect to each operand is just 1, regardless of the operands' values, so nothing has to be saved and a second backward pass still works. Multiplication, on the other hand, does need the saved inputs, since the gradient with respect to each factor is the value of the other factor. Try multiplication instead:
f = Variable(torch.Tensor([2, 3]), requires_grad=True)
g = f[0] * f[1]
g.backward()
print(f.grad)   # [3, 2]: gradient w.r.t. each factor is the other factor's value
g.backward()    # raises a RuntimeError: the saved inputs (buffers) have already been freed
print(f.grad)
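If you really do need to call backward twice on this graph, keep the saved inputs around on the first call. Here is a minimal sketch, assuming a PyTorch version whose backward() accepts retain_graph=True (older releases use retain_variables=True for the same purpose):

f = Variable(torch.Tensor([2, 3]), requires_grad=True)
g = f[0] * f[1]
g.backward(retain_graph=True)   # keep the saved inputs so the graph can be reused
print(f.grad)                   # [3, 2]
g.backward()                    # works now; gradients accumulate across calls
print(f.grad)                   # [6, 4], the first gradient added a second time

Note that gradients accumulate across backward calls, so if accumulation is not what you want, zero f.grad (e.g. f.grad.data.zero_()) between the two calls.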