Autograd + garbage collection

I am trying to understand how PyTorch’s autograd engine frees up memory.
I referred to this, but I still have questions.

What exactly does freeing the graph mean?

From what I have researched, my understanding is that freeing the backward graph means the graph’s references to its saved tensors are released, but the graph structure itself (the chain of grad_fn nodes) remains in memory.

The following code shows that even after a backward call on k, its backward graph remains attached, and that reference is then carried over to the tensor m as well.

import torch
m = torch.tensor([0.0], requires_grad=True)
for i in range(1):
  x = torch.tensor([3.0], requires_grad=True)
  k = x*2
  k.backward()  # frees k's saved tensors (retain_graph defaults to False)
  m = k         # m now references k, grad_fn and all

print(m.requires_grad)

True

print(m)

tensor([6.], grad_fn=<MulBackward0>)
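
In fact, calling backward() a second time fails, because the saved tensors are gone even though the grad_fn chain is still there. A minimal sketch (the exact error text varies across PyTorch versions):

import torch
x = torch.tensor([3.0], requires_grad=True)
k = x * x      # MulBackward0 saves its inputs for the backward pass
k.backward()   # frees the saved tensors (retain_graph defaults to False)
try:
  k.backward() # the grad_fn chain still exists, but its buffers are gone
except RuntimeError as e:
  print(e)     # "Trying to backward through the graph a second time ..."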


This is also one use of detach(): when we need a variable’s value outside of the training loop, we detach it first, since the backward call only frees the references to saved tensors while the graph itself hangs around in memory.
If we instead use the detached tensor outside the training loop, the backward graph of the original tensor gets garbage collected. See:

import torch
m = torch.tensor([0.0], requires_grad=True)
for i in range(1):
  x = torch.tensor([3.0], requires_grad=True)
  k = x*2
  k.backward()
  m = k.detach()  # m shares k's data but carries no grad_fn

print(m.requires_grad)
print(m)

False
tensor([6.])
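
To see what detach() actually drops, here is a small sketch (data_ptr is used only to show that the storage is shared):

import torch
x = torch.tensor([3.0], requires_grad=True)
k = x*2
m = k.detach()
print(k.grad_fn)                     # <MulBackward0 object at ...>
print(m.grad_fn)                     # None - no link into the graph
print(m.data_ptr() == k.data_ptr())  # True - same underlying storage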


Could you please elaborate on why PyTorch chose to keep those backward graphs in memory when the saved tensors are gone anyway?
What is the utility of keeping them, considering that graphs are created from scratch every time after a backward call?

Firstly, PyTorch works with dynamic graphs. Every time you feed a tensor into an operation, if it is not already part of a graph, a new graph is created so that you can perform all the operations that require one.

Computation graphs are not only used for backprop but also for inference. If you create the graph of a model, you can export that graph and use it for inference, which is more efficient than loading the model with its class and checkpoint and then running inference. With the exported graph, you can load your model for inference without the model file and checkpoint.

Most probably there are other reasons to keep the graph, but this is what comes to mind right now.
You can think of a computation graph as the compiled version of a piece of code: you can either reload the source, compile it, and run it, or you can run the compiled version directly.
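
As a concrete example, TorchScript tracing can export a graph and reload it without the original class definition (a minimal sketch; torch.jit.trace records the forward graph only, and the model and file name here are arbitrary):

import torch

model = torch.nn.Linear(4, 2)              # any nn.Module works here
example = torch.randn(1, 4)
traced = torch.jit.trace(model, example)   # record the forward graph
traced.save("traced_model.pt")

loaded = torch.jit.load("traced_model.pt") # no class definition needed
print(loaded(example).shape)               # torch.Size([1, 2])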


That’s a great answer. I did not know we could export model graphs, too.
In the past I have exported model checkpoints as .pt files, and those always (for reasons unknown to me) give errors when loaded in another notebook/Colab.
Thanks for the answer.

FOR ANYONE READING THIS THREAD:
Please note (in the second block of code) that the backward graph of the original tensor k getting garbage collected means the detached tensor m no longer points to the computation graph; so once the last reference to k is dropped, the graph k points to becomes unreachable and is garbage collected automatically.
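
One way to watch this happen is with a weak reference to k (a minimal sketch, assuming nothing else in the program still references k):

import gc
import weakref
import torch

x = torch.tensor([3.0], requires_grad=True)
k = x*2
m = k.detach()     # m holds no reference into k's graph
r = weakref.ref(k) # watch the tensor that roots the graph
del k              # drop the last reference
gc.collect()
print(r())         # None - k and its backward graph were collected
print(m)           # tensor([6.]) - the detached values survive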

Hi, I’m also curious why the backward graph is not automatically garbage collected when retain_graph=False. This is also confusing because the documentation for the retain_graph parameter seems to suggest otherwise:

retain_graph (bool, optional) – If False, the graph used to compute the grads will be freed.

Is that documentation correct?

@evrimozmermer, you mentioned that graphs have uses other than backprop, such as exporting them for later use. I have heard of exporting forward graphs for later use, such as for inference, but I had not thought about the potential of also exporting backward graphs. Is it possible to export a backward graph for later reuse, even if the saved tensors that were referenced by that backward graph have already been freed?

One example of a use for the backward graph is the teacher-model update in Bootstrap Your Own Latent (BYOL); you can check the paper on arXiv. The teacher (target) model is updated as an exponential moving average of the student (online) model’s weights.
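
For reference, a rough sketch of such a teacher update (the function name and the tau value are illustrative, not taken from the paper’s code):

import torch

@torch.no_grad()  # the update itself needs no gradients
def ema_update(teacher, student, tau=0.996):
  for t, s in zip(teacher.parameters(), student.parameters()):
    t.mul_(tau).add_(s, alpha=1 - tau)  # t = tau * t + (1 - tau) * s

student = torch.nn.Linear(4, 2)
teacher = torch.nn.Linear(4, 2)
ema_update(teacher, student)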