I keep getting the following error when trying to compute gradients with autograd: "RuntimeError: Trying to backward through the graph a second time"

import torch
from torch.autograd import grad

def data(n=10000, d=2, eps=1):
    x = torch.randn(n, d) * eps
    y = x + torch.randn(n, d) * eps
    z = y + torch.randn(n, d)
    return torch.cat((x, z), dim=1), y.sum(dim=1, keepdim=True)

dummy_psi = torch.nn.Parameter(torch.ones(4, 1))
dummy_w = torch.nn.Parameter(torch.Tensor([1.0]))

opt = torch.optim.SGD([dummy_psi], lr=1e-3)
mse = torch.nn.MSELoss(reduction="none")

pairs = data(eps=0.1)

for i in range(50000):
    error = 0
    penalty_value = 0
    for x, y in [pairs]:
        error_e = mse(x @ dummy_psi * dummy_w, y)
        mean_error = error_e.mean()
        g = grad(mean_error, dummy_w)[0]  # gradient of the mean error w.r.t. dummy_w, used for the penalty
        penalty_value += g.sum()
        error += mean_error
    opt.zero_grad()
    (1e-3 * error + penalty_value).backward()  # <- this is the line that raises the RuntimeError
    opt.step()
    if i % 1000 == 0:
        print(dummy_psi)

Try to use retain_graph=True in autograd.grad.
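
For example, here's a minimal standalone version of your inner step with that change (a sketch, not your exact code):

import torch
from torch.autograd import grad

dummy_psi = torch.nn.Parameter(torch.ones(4, 1))
dummy_w = torch.nn.Parameter(torch.tensor([1.0]))
x, y = torch.randn(8, 4), torch.randn(8, 1)

mean_error = ((x @ dummy_psi * dummy_w - y) ** 2).mean()

# retain_graph=True keeps the graph of mean_error alive ...
g = grad(mean_error, dummy_w, retain_graph=True)[0]
penalty_value = g.sum()

# ... so this second traversal of the same graph no longer raises
# "Trying to backward through the graph a second time"
(1e-3 * mean_error + penalty_value).backward()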

Thanks, should I be using retain_graph=True over create_graph=True?

The question I can't answer yet is why .backward() cannot backprop through the gradients. What exactly makes it require retain_graph=True or create_graph=True?

Based on the provided code it seems grad(mean_error, dummy_w) might be freeing the intermediate tensors, which are still needed in (error + penalty_value).backward().
retain_graph=True will make sure to keep these intermediates alive.

That’s very interesting to know. From reading the docs, it seems I would not have been able to tell why that happens. Are there any insights on why grad might freeze intermediate tensors?

Sorry for the typo. I meant the backward call frees (deletes) the intermediate tensors, not freezes. :wink:
This is done to save memory, as the gradients are already calculated and these tensors are not needed anymore.
However, if your use case needs these tensors again, you have to use retain_graph=True.
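
A tiny standalone illustration of the mechanism (not your code):

import torch

a = torch.nn.Parameter(torch.tensor(2.0))
loss = (a * 3) ** 2

loss.backward(retain_graph=True)  # intermediates are kept alive
loss.backward()                   # fine, because the graph was retained above

loss2 = (a * 3) ** 2
loss2.backward()                  # intermediates of this graph are freed here to save memory
# calling loss2.backward() again at this point would raise
# "RuntimeError: Trying to backward through the graph a second time"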

Ah, now that explains a lot. Thanks for this insight.

Although it strikes me as odd that backward() doesn’t understand that I’m still using the intermediate tensors, because the values on which I call backward() depend on those intermediate tensors.

For instance, I would have expected backward() to recognise that penalty_value depends on g (the gradient), since I’m passing it explicitly in the call to backward().

BTW, which of these two options is recommended? Thanks!

  1. .backward(retain_graph=True)
  2. grad(output, input, create_graph=True)

g = grad(mean_error, dummy_w)[0] will already remove the intermediates, so that backward cannot do anything about it anymore.

It depends highly on your use case and I can’t give a general answer. :confused:
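
For a gradient penalty like yours, the practical difference would look roughly like this (a sketch reusing the names from your first post):

import torch
from torch.autograd import grad

dummy_psi = torch.nn.Parameter(torch.ones(4, 1))
dummy_w = torch.nn.Parameter(torch.tensor([1.0]))
x, y = torch.randn(8, 4), torch.randn(8, 1)
mean_error = ((x @ dummy_psi * dummy_w - y) ** 2).mean()

# 1. retain_graph=True only keeps the graph of mean_error alive;
#    the returned gradient is detached, so backpropagating the penalty
#    cannot reach dummy_psi through g1.
g1 = grad(mean_error, dummy_w, retain_graph=True)[0]
print(g1.requires_grad)   # False

# 2. create_graph=True additionally records how g2 itself was computed,
#    so the penalty contributes second-order gradients to dummy_psi.
g2 = grad(mean_error, dummy_w, create_graph=True)[0]
print(g2.requires_grad)   # True
(1e-3 * mean_error + g2.sum()).backward()

If you only want to get rid of the RuntimeError, 1. is enough; if the penalty itself should backpropagate into dummy_psi (i.e. you need higher-order gradients), you would use 2.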

Please correct me if I’m wrong, but what I understand from the docs (unless otherwise stated) is that we should avoid retain_graph=True?

Could you give a two-cent summary of how the choice between 1. and 2. depends on the use case? What would be the major difference in choosing one versus the other?

Thanks a lot!

It seems like with retain_graph=True there’s not much change in the gradients compared to create_graph=True, which is kind of strange?

@ptrblck Sorry to bother you, but do you know if there are any ways to track down memory leaks when using grad() from autograd?

For instance, in the MWE in my first post, if I replace mse(x @ dummy_psi * dummy_w, y) with mse(vgg_like(x) * dummy_w, y) I get a CUDA out of memory error.

When I switch the whole example to CPU I can watch the RAM keep increasing on every iteration until it takes all my available RAM. I’ve tracked the issue down to the grad() call, because when I comment it out everything works fine, but I don’t know why this is happening or where the leak comes from.

Also, this leak happens regardless of whether I use retain_graph=True or create_graph=True.
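
For reference, this is roughly how I'm watching the memory every few iterations (the log_memory helper and psutil are just my own additions for monitoring, not part of the MWE):

import os
import psutil  # third-party, only used here to read the process RSS
import torch

def log_memory(step):
    # resident set size of this Python process, in MB
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1e6
    line = f"step {step}: cpu rss {rss_mb:.1f} MB"
    if torch.cuda.is_available():
        # memory currently held by tensors on the default CUDA device
        line += f", cuda allocated {torch.cuda.memory_allocated() / 1e6:.1f} MB"
    print(line)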