```
import torch
from torch.autograd import grad


def data(n=10000, d=2, eps=1):
    x = torch.randn(n, d) * eps
    y = x + torch.randn(n, d) * eps
    z = y + torch.randn(n, d)
    return torch.cat((x, z), dim=1), y.sum(dim=1, keepdim=True)


dummy_psi = torch.nn.Parameter(torch.ones(4, 1))
dummy_w = torch.nn.Parameter(torch.Tensor([1.0]))
# Only dummy_psi is handed to the optimizer; dummy_w is never updated.
opt = torch.optim.SGD([dummy_psi], lr=1e-3)
mse = torch.nn.MSELoss(reduction="none")
pairs = data(eps=0.1)

for i in range(50000):
    error = 0
    penalty_value = 0
    for x, y in [pairs]:
        error_e = mse(x @ dummy_psi * dummy_w, y)
        mean_error = error_e.mean()
        # Plain grad() call, with neither retain_graph nor create_graph set.
        g = grad(mean_error, dummy_w)[0]
        penalty_value += g.sum()
        error += mean_error
    opt.zero_grad()
    (1e-3 * error + penalty_value).backward()
    opt.step()
    if i % 1000 == 0:
        print(dummy_psi)
```

Try passing `retain_graph=True` to `autograd.grad`.

Thanks, should I be using `retain_graph=True` over `create_graph=True`?

My question, which I cannot answer yet, is why `.backward()` cannot backprop through the gradients. What exactly makes it require `retain_graph=True` or `create_graph=True`?

Based on the provided code, it seems `grad(mean_error, dummy_w)` might be freeing the intermediate tensors, which are still needed in `(error + penalty_value).backward()`.

`retain_graph=True` will make sure to keep these intermediates alive.

That’s very interesting to know. From reading the docs, it seems I would have been unable to tell why that would happen. Are there any insights on why `grad` might freeze intermediate tensors?

Sorry for the typo. I meant the backward call *frees* (deletes) the intermediate tensors, not freezes.

This is done to save memory, as the gradients are already calculated and these tensors are not needed anymore.

However, if your use case needs these tensors again, you have to use `retain_graph=True`.
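To make the freeing behaviour concrete, here is a minimal sketch on a toy problem. The tensors `x`, `y`, `psi`, `w` and the helper `mean_error` are made up for illustration and are not the code from the first post. It shows that a plain `grad()` call releases the saved intermediates, so a later `backward()` on the same output raises a `RuntimeError`, while `retain_graph=True` keeps them around.

```python
import torch
from torch.autograd import grad

torch.manual_seed(0)
x = torch.randn(8, 4)
y = torch.randn(8, 1)
psi = torch.nn.Parameter(torch.ones(4, 1))
w = torch.nn.Parameter(torch.tensor([1.0]))


def mean_error():
    # Hypothetical stand-in for the MSE term in the original loop.
    return ((x @ psi * w - y) ** 2).mean()


# Case 1: plain grad() frees the tensors saved for the backward pass.
err = mean_error()
g = grad(err, w)[0]
try:
    err.backward()  # needs the saved tensors that grad() just released
    raised = False
except RuntimeError:
    raised = True

# Case 2: retain_graph=True keeps the saved tensors, so backward() still works.
err = mean_error()
g = grad(err, w, retain_graph=True)[0]
err.backward()
ok = psi.grad is not None

print(raised, ok)
```

Note that this only shows that the graph survives the first call; it says nothing yet about whether `g` itself can be differentiated.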

Ah, now that explains a lot. Thanks for this insight.

Although it strikes me as odd that `backward()` doesn’t understand that I’m still using the intermediate tensors, because the values on which I call `backward()` depend on those intermediate tensors.

For instance, I would have expected `backward()` to recognise that `penalty_value` depends on `g`, which is the gradient, since I’m passing it explicitly in the call to `backward()`.

BTW, which of these two options is recommended? Thanks!

1. `.backward(retain_graph=True)`
2. `grad(output, input, create_graph=True)`

`g = grad(mean_error, dummy_w)[0]` will already remove the intermediates, so that `backward` cannot do anything about it anymore.

It depends highly on your use case and I can’t give a general answer.

Please correct me if I’m wrong, but what I understand from the docs is that we should avoid `retain_graph=True` unless otherwise stated?

Could you give a two-cent summary of how the choice between 1. and 2. depends on the use case? What would be the major difference in choosing one versus the other?

Thanks a lot!

It seems like with `retain_graph=True` there’s not much change in the gradients compared to `create_graph=True`, which is kind of strange?
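For comparison, here is a small sketch that isolates what each flag does to the penalty term alone. The data and the helper `penalty_grad` are hypothetical and the setup is a toy one, not the original training loop. With `retain_graph=True` the returned gradient `g` is a plain tensor with no `grad_fn`, so a penalty built from it cannot send any gradient back to `psi`. With `create_graph=True` the gradient computation itself is recorded, so the penalty can be differentiated.

```python
import torch
from torch.autograd import grad

torch.manual_seed(0)
x = torch.randn(8, 4)
y = torch.randn(8, 1)


def penalty_grad(**grad_kwargs):
    # Hypothetical helper: build a fresh toy problem and return
    # (is g tracked by autograd?, gradient of the penalty w.r.t. psi).
    psi = torch.nn.Parameter(torch.ones(4, 1))
    w = torch.nn.Parameter(torch.tensor([1.0]))
    err = ((x @ psi * w - y) ** 2).mean()
    g = grad(err, w, **grad_kwargs)[0]
    penalty = g.sum()
    if penalty.requires_grad:
        # Only possible when the gradient computation was recorded.
        penalty.backward()
    return g.grad_fn is not None, psi.grad


tracked_retain, grad_retain = penalty_grad(retain_graph=True)
tracked_create, grad_create = penalty_grad(create_graph=True)

print(tracked_retain, grad_retain)
print(tracked_create, grad_create)
```

In this sketch the two flags are not interchangeable: `retain_graph=True` only keeps the original graph usable for a second pass, while `create_graph=True` is what makes a gradient penalty trainable.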

@ptrblck Sorry to bother you, do you know if there are any ways to catch memory leaks when using `grad()` from `autograd`?

For instance, in the MWE in my first post, if I replace `mse(x @ dummy_psi * dummy_w)` with `mse(vgg_like(x) * dummy_w)` I get a CUDA out-of-memory error.

When I switch the whole example to CPU, I can watch the RAM usage increase on every iteration until it takes all my available RAM. I’ve tracked the issue down to the `grad()` call, because when I comment it out everything works fine, but I don’t know why this is happening or where the leak comes from.

Also, this leak happens regardless of whether I use `retain_graph=True` or `create_graph=True`.
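One way to check whether tensors are piling up across iterations is to count the live tensor objects that Python's garbage collector knows about after each step. The sketch below is only a debugging aid on a toy loop: `count_live_tensors` is a hypothetical helper, the model is a stand-in for the one in the first post, and it does not identify the cause of the leak described above.

```python
import gc
import torch
from torch.autograd import grad


def count_live_tensors():
    # Hypothetical helper: number of tensor objects tracked by Python's GC.
    gc.collect()
    n = 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                n += 1
        except Exception:
            # Some tracked objects misbehave when inspected; skip them.
            pass
    return n


torch.manual_seed(0)
x = torch.randn(32, 4)
y = torch.randn(32, 1)
psi = torch.nn.Parameter(torch.ones(4, 1))
w = torch.nn.Parameter(torch.tensor([1.0]))
opt = torch.optim.SGD([psi], lr=1e-3)

counts = []
for i in range(20):
    err = ((x @ psi * w - y) ** 2).mean()
    g = grad(err, w, create_graph=True)[0]
    loss = 1e-3 * err + g.sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Record how many tensors are alive once this step has finished.
    counts.append(count_live_tensors())

print(counts)
```

If the count climbs steadily from one iteration to the next, something is still holding references to tensors (and possibly their graphs) from earlier iterations, for example a running sum or a list of losses that were never detached. On GPU, `torch.cuda.memory_allocated()` can be logged per iteration in the same way.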