Backward fails even after zeroing gradient?

josmi9966 · November 12, 2017, 3:53pm

I have a very simple piece of code which puzzles me (using Python 3.5.3 and PyTorch version 0.2.0_3, no CUDA).

As far as I understand, in order to run backward() on a variable
again (after already running it once), it is necessary to reset the
leaf gradients to zero first. But even when I do this, PyTorch will still complain
in the following example code:

“RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.”

import torch
from torch.autograd import Variable as V
x = V(torch.ones(2, 2), requires_grad=True)
y = 3 * x * x  # but with y = 3 * x it would work!
y.backward(torch.ones(2, 2))
print("x.grad =", x.grad)
x.grad.data.zero_()  # reset the leaf gradient before the second pass
y.backward(torch.ones(2, 2))  # this call raises the RuntimeError quoted above

This happens when I calculate y=3*x*x but it does NOT happen when I calculate y=3*x!

How can I reset my gradients so that I can run backward a second time in my case? Is there a different, better way to make this work?

tom · November 12, 2017, 4:01pm

Pass retain_graph=True to the first backward.

Best regards

Thomas
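As a minimal sketch of this suggestion, here is the example from the question with retain_graph=True added to the first backward call. The names first_grad and second_grad are introduced only for illustration, and the code assumes the Variable wrapper from the question is still importable in the installed PyTorch version.

```python
import torch
from torch.autograd import Variable as V

x = V(torch.ones(2, 2), requires_grad=True)
y = 3 * x * x

# Keep the graph's saved buffers so that a second backward pass is possible.
y.backward(torch.ones(2, 2), retain_graph=True)
first_grad = x.grad.data.clone()  # dy/dx = 6 * x, so 6 everywhere for x = 1

# Zero the accumulated gradient, then backpropagate through the same graph again.
x.grad.data.zero_()
y.backward(torch.ones(2, 2))
second_grad = x.grad.data.clone()

print("first:", first_grad)
print("second:", second_grad)
```

Without the zero_() call, the second pass would add its result to the gradient already stored in x.grad.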

josmi9966 · November 12, 2017, 4:09pm

Thank you @tom, I know that retain_graph=True should work, but
what I do not understand is why my example does not work while it
works fine when the function is y=3*x. Also, I thought that zeroing
the gradient was a sufficient and safe way to do this - this is shown in
a number of tutorials, I think.
Or is there simply no way to reset the graph and I have to retain it
(and zero it in addition in order to prevent accumulation of gradients)
if I want to call backward again? Then what is the point of not retaining
in the first place?
This is really confusing me.

SimonW · November 12, 2017, 5:11pm

If y=3*x works, it works only by accident, and that might change in the future.

Zeroing the gradient is different from retaining the graph. It makes sense not to retain the graph by default, for two reasons:

After each forward pass, it is common to use the grad values to update the parameters. Your x then changes, and computing a gradient through y is no longer correct, because y was computed from the old x.
Freeing the graph saves memory.

Do not think of x and y as symbolic variables. Think of them as tensors with values.

What I don’t understand is why you don’t like retain_graph + zero_grad. It is a perfectly reasonable thing to do.

josmi9966 · November 13, 2017, 11:05am

OK, here is maybe a better explanation of what I am trying to understand:

When I use some pre-fabricated model for my network, the normal process of training it is

forward the input through the net and get the output
calculate the loss
zero the gradients (through optimizer.zero_grad() or mynetwork.zero_grad())
backward the loss using something like loss.backward()
take an optimizer step
rinse and repeat
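The steps above can be sketched roughly as follows. This is only an illustration, not code from the thread: the tiny linear model, the random data, the learning rate and the names net, inputs, targets and losses are all assumptions, and it assumes a recent PyTorch release (0.4 or later) in which plain tensors can be passed to a module and loss.item() is available.

```python
import torch

torch.manual_seed(0)
inputs = torch.randn(8, 3)    # hypothetical training inputs
targets = torch.randn(8, 1)   # hypothetical training targets

net = torch.nn.Linear(3, 1)   # stand-in for a pre-fabricated model
optimizer = torch.optim.SGD(net.parameters(), lr=0.05)
loss_fn = torch.nn.MSELoss()

losses = []
for step in range(20):
    output = net(inputs)             # forward pass: builds a new graph
    loss = loss_fn(output, targets)  # calculate the loss
    optimizer.zero_grad()            # zero the gradients of the parameters
    loss.backward()                  # backward pass through this iteration's graph
    optimizer.step()                 # take an optimizer step
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Note that backward() is called only once on each loss, because the next iteration computes a new loss from a new forward pass.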

Now, nowhere have I ever seen that in the loss.backward() step we would specify retain_graph=True yet this always works!? In order for this to work, the loss.backward() function has to recursively call somevariable.backward() on all of the parameters of its network but will these calls use the retain_graph option? I assume not. So why does this work in general, but not in my case? Are there rules when it will work and when not?

I mainly want to understand what is going on and how PyTorch works, so I do not even know yet if I “like” retaining the graph.
But from the fact that practically all examples do not retain the graph and still work, my feeling is that there must be an advantage to not retaining it … and I would like to understand that as well!

SimonW · November 13, 2017, 3:32pm

There is a key difference between what you did and the workflow you described. In your example, you are backpropagating through the same graph x -> 3*x -> 3*x*x twice. In the usual workflow, a new graph is built in each iteration and you backpropagate through it only once.

If retain_graph is not specified, the graph is freed after the first backward pass through it.
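A small sketch of that difference, reusing the example from the question: the second backward pass through the same y fails, while recomputing y builds a new graph that can be backpropagated through. The flag second_pass_failed is introduced only for illustration, and the code assumes the Variable wrapper is still importable in the installed PyTorch version.

```python
import torch
from torch.autograd import Variable as V

x = V(torch.ones(2, 2), requires_grad=True)

y = 3 * x * x
y.backward(torch.ones(2, 2))      # first pass; the graph is freed afterwards

try:
    y.backward(torch.ones(2, 2))  # second pass through the same, freed graph
    second_pass_failed = False
except RuntimeError:
    second_pass_failed = True

# Running the forward computation again builds a fresh graph.
y = 3 * x * x
x.grad.data.zero_()               # avoid adding to the gradient from the first pass
y.backward(torch.ones(2, 2))

print("second pass failed:", second_pass_failed)
print("x.grad =", x.grad)
```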

josmi9966 · November 13, 2017, 4:39pm

Thank you @SimonW, I totally forgot that the forward step usually reconstructs the whole graph from scratch, and that this is what usually happens with the canned models!

So, the graph is re-built, but the variables that represent the model parameters are re-used, so the gradients for those parameters need to be set to zero in order to avoid accumulating them.
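To illustrate the accumulation, here is a minimal sketch based on the example from the question. Each backward call runs on a freshly built graph, so no retain_graph is needed; the names after_first, after_second and after_reset are introduced only for illustration, and the code assumes the Variable wrapper is still importable in the installed PyTorch version.

```python
import torch
from torch.autograd import Variable as V

x = V(torch.ones(2, 2), requires_grad=True)

(3 * x * x).backward(torch.ones(2, 2))
after_first = x.grad.data.clone()   # 6 everywhere

# A new forward pass builds a new graph, but x.grad is the same buffer,
# so the new gradient is added to the old one.
(3 * x * x).backward(torch.ones(2, 2))
after_second = x.grad.data.clone()  # 12 everywhere: 6 + 6 accumulated

# Zeroing the gradient first gives the plain gradient again.
x.grad.data.zero_()
(3 * x * x).backward(torch.ones(2, 2))
after_reset = x.grad.data.clone()   # 6 everywhere

print(after_first, after_second, after_reset)
```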