```python
import torch
from torch.autograd import grad

def data(n=10000, d=2, eps=1):
    x = torch.randn(n, d) * eps
    y = x + torch.randn(n, d) * eps
    z = y + torch.randn(n, d)
    return torch.cat((x, z), dim=1), y.sum(dim=1, keepdim=True)

dummy_psi = torch.nn.Parameter(torch.ones(4, 1))
dummy_w = torch.nn.Parameter(torch.Tensor([1.0]))
opt = torch.optim.SGD([dummy_psi], lr=1e-3)  # note: the original snippet referenced an undefined phi here
mse = torch.nn.MSELoss(reduction="none")

pairs = data(eps=0.1)
for i in range(50000):
    error = 0
    penalty_value = 0
    for x, y in [pairs]:
        error_e = mse(x @ dummy_psi * dummy_w, y)
        mean_error = error_e.mean()
        # grad returns a tuple, hence the [0]; this call is where things go wrong
        g = grad(mean_error, dummy_w)[0]
        penalty_value += g.sum()
        error += mean_error
    opt.zero_grad()
    (1e-3 * error + penalty_value).backward()
    opt.step()
    if i % 1000 == 0:
        print(dummy_psi)
```
I keep getting the following error when trying to compute gradients with autograd: "RuntimeError: Trying to backward through the graph a second time".
Try to use retain_graph=True.
Thanks, should I be using retain_graph=True then? My question, which I cannot answer yet, is why .backward() cannot backprop through the gradients. What exactly is it that makes it require retain_graph=True?
Based on the provided code, it seems grad(mean_error, dummy_w) might be freeing the intermediate tensors, which are still needed in (error + penalty_value).backward(). retain_graph=True will make sure to keep these intermediates alive.
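Here is a minimal standalone sketch of this (a toy example I made up, not the model from your post): the grad call walks the graph once, and retain_graph=True keeps the saved buffers alive for the backward call that follows.

```python
import torch
from torch.autograd import grad

w = torch.nn.Parameter(torch.tensor([2.0]))
x = torch.randn(5)
loss = ((x * w) ** 2).mean()

# This first pass would free the saved buffers by default;
# retain_graph=True keeps them alive for the backward call below.
g = grad(loss, w, retain_graph=True)[0]

loss.backward()  # works; without retain_graph=True above it would raise
                 # "Trying to backward through the graph a second time"
print(g, w.grad)
```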
That’s very interesting to know. From reading the docs, it seems I would have been unable to tell why that would happen. Are there any insights on why grad might freeze intermediate tensors?
Sorry for the typo. I meant that the backward call frees (deletes) the intermediate tensors, not freezes them. This is done to save memory, as the gradients have already been calculated and these tensors are not needed anymore. However, if your use case needs these tensors again, you have to use retain_graph=True.
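You can observe the freeing directly. In this toy sketch, calling backward twice on the same small graph raises exactly the error from your first post:

```python
import torch

x = torch.randn(4, requires_grad=True)
y = x.exp().sum()  # exp saves its output for use in the backward pass

y.backward()       # computes the gradient and frees the saved buffers
try:
    y.backward()   # the buffers are gone, so the second pass fails
except RuntimeError as e:
    print(e)       # "Trying to backward through the graph a second time ..."
```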
Ah, now that explains a lot. Thanks for this insight. Although it strikes me as odd that backward() doesn’t understand that I’m still using the intermediate tensors, because the values on which I call backward() depend on those intermediate tensors.
For instance, I would have expected backward() to recognise that penalty_value depends on g, which is the grad, since I’m passing it explicitly in the call of grad().
BTW, which of these two options is recommended? Thanks!

1. grad(output, input, retain_graph=True)
2. grad(output, input, create_graph=True)
g = grad(mean_error, dummy_w) will already remove the intermediates, so that backward cannot do anything about it anymore.
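You can also see the detachment directly: without create_graph=True the tensor returned by grad has no grad_fn, so a penalty built from it cannot influence any later backward, while create_graph=True keeps the gradient itself in the graph. A small sketch with made-up values:

```python
import torch
from torch.autograd import grad

w = torch.nn.Parameter(torch.tensor([3.0]))
loss = (w ** 2).sum()

# Without create_graph, the returned gradient is detached from the graph.
g_plain = grad(loss, w, retain_graph=True)[0]
print(g_plain.requires_grad)   # False: no grad_fn, backward cannot see it

# With create_graph, the gradient itself stays differentiable.
g_diff = grad(loss, w, create_graph=True)[0]
print(g_diff.requires_grad)    # True: part of the graph

penalty = (g_diff ** 2).sum()  # a gradient penalty like penalty_value in the MWE
penalty.backward()             # second-order: d/dw (2w)^2 = 8w
print(w.grad)                  # tensor([24.])
```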
It depends highly on your use case and I can’t give a general answer.
Please correct me if I’m wrong, but from the docs what I understand (unless it’s otherwise stated) is that we should avoid retain_graph=True?
Could you give a two-cent summary on how 1. and 2. depend on the use case? What would be the major difference in choosing one versus the other? Thanks a lot!
It seems like with retain_graph=True there’s not much change in the gradients compared to create_graph=True, which is kind of strange?
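To make this concrete, here is the kind of comparison I ran, reduced to a toy version (made-up shapes and data, not my real setup). With 1. the penalty is detached, so psi’s gradient should come from the error term alone; with 2. it should pick up an extra second-order term:

```python
import torch
from torch.autograd import grad

def psi_grad(create_graph):
    torch.manual_seed(0)  # same data for both runs
    psi = torch.nn.Parameter(torch.ones(2, 1))
    w = torch.nn.Parameter(torch.tensor([1.0]))
    x, y = torch.randn(8, 2), torch.randn(8, 1)
    loss = ((x @ psi * w - y) ** 2).mean()
    g = grad(loss, w, retain_graph=True, create_graph=create_graph)[0]
    (loss + g.sum()).backward()
    return psi.grad

print(psi_grad(False))  # option 1: penalty detached, gradient from loss only
print(psi_grad(True))   # option 2: penalty backprops, adds a second-order term
```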
@ptrblck Sorry to bother you, but do you know if there are any ways to capture memory leaks when using grad()?
For instance, in the MWE in my first post, if I replace mse(x @ dummy_psi * dummy_w, y) with mse(vgg_like(x) * dummy_w, y), I get a CUDA out-of-memory error.
When I switch the whole example to CPU, I can watch the RAM keep increasing on every iteration until it takes all my available RAM. I’ve tracked the issue down to the grad() call, because when I comment it out everything works fine, but I don’t know why this is happening or where the leak comes from.
Also, this leak happens regardless of using retain_graph=True or create_graph=True.
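For now I’ve been trying to narrow it down by counting live tensors with the garbage collector every few iterations (a rough debugging sketch, not an official API) and watching torch.cuda.memory_allocated() on GPU. If the count grows without bound, something is keeping tensors or graphs alive across iterations:

```python
import gc
import torch

def live_tensor_report():
    # Walk every object the garbage collector tracks and tally the tensors.
    count, numel = 0, 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                count += 1
                numel += obj.numel()
        except Exception:
            pass  # some tracked objects raise on inspection
    return count, numel

# Example usage inside the training loop:
# if i % 1000 == 0:
#     print(live_tensor_report())
#     if torch.cuda.is_available():
#         print(torch.cuda.memory_allocated())
```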