When you do a forward pass for a particular operation where some of the inputs have requires_grad=True, PyTorch needs to hold onto some of the inputs or intermediate values so that the backward pass can be computed.
For example: if you do y = x * x (y = x squared), then the gradient is dL/dx = grad_output * 2 * x. Here, if x requires_grad, then we hold onto x to compute the backward pass.
Take this example:
y = x ** 2
z = y ** 2
del y
Here, even though y is deleted from Python scope, the function z = square(y) that lives in the autograd graph (which effectively is z.grad_fn) holds onto y and, in turn, onto x.
So you might not have visibility into it via the GC, but it still exists until z is deleted from Python scope.
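A minimal sketch that makes this visible (it assumes a CUDA device so that torch.cuda.memory_allocated() can be used to observe the effect; the 1024x1024 size is arbitrary):

import torch

x = torch.randn(1024, 1024, device="cuda", requires_grad=True)
y = x ** 2
z = y ** 2          # z.grad_fn saves y for the backward pass

before = torch.cuda.memory_allocated()
del y
# y is gone from Python scope, but its buffer is still allocated,
# because the autograd graph behind z holds onto it.
print(torch.cuda.memory_allocated() == before)   # True

del z
# Dropping z releases the graph and the tensors it saved.
print(torch.cuda.memory_allocated() < before)    # True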
Thanks @smth. So it sounds like there is no way to programmatically count the referenced data directly in cases like that.
It would be really cool to have a call that can walk a model and count memory, similar to the way the backward pass can compute it. Really, I'd like to be able to better estimate how much memory is consumed by different parts of the computation, whether on CPU or GPU.
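As a rough first approximation (it only counts parameters and buffers, not activations saved by autograd or blocks cached by the allocator), you can already walk a module tree by hand; model_bytes below is a made-up helper name:

import torch

def model_bytes(model: torch.nn.Module) -> int:
    # Sum the storage held by parameters and buffers only.
    total = 0
    for p in model.parameters():
        total += p.numel() * p.element_size()
    for b in model.buffers():
        total += b.numel() * b.element_size()
    return total

model = torch.nn.Linear(1024, 1024)
print(f"{model_bytes(model) / 2**20:.2f} MiB")   # ~4 MiB for this layer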
Here is a minor fix to catch exceptions, since hasattr(obj, 'data') triggers all kinds of failures in Python modules unrelated to PyTorch and its tensors:
import torch
import gc

for obj in gc.get_objects():
    try:
        # Catches plain tensors as well as objects (e.g. nn.Parameter) that
        # wrap a tensor in a .data attribute.
        if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
            print(type(obj), obj.size())
    except Exception:
        pass
edit: oh, I only now noticed a follow-up from @Ben_Usman that suggested the same.
This post actually solved my problem! I use a dynamic batch size and was getting CUDA OOM errors after a few iterations. Starting from the largest possible batch eliminated the problem, and I get higher GPU utilization. Maybe this could be mentioned in the docs on data loading if it is not already, because it can really help increase batch size.
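In case it helps anyone, here is a minimal sketch of the "largest batch first" idea (the model, feature size, and max_batch are made-up placeholders): run one forward/backward pass with the biggest batch you will ever see, so the caching allocator reserves blocks large enough for every later, smaller batch.

import torch

model = torch.nn.Linear(512, 512).cuda()
max_batch = 256   # the largest batch size the run will ever use

# Warm-up pass: forces the allocator to grab blocks sized for the worst case.
warmup = torch.randn(max_batch, 512, device="cuda")
model(warmup).sum().backward()
model.zero_grad(set_to_none=True)
del warmup        # the cached blocks stay reserved for later batches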
Unfortunately, this solution still doesn't work for me :( I see memory_allocated increasing batch after batch while the tensors returned by this function don't change… any thoughts?
I guess based on @smth's comment, there is something in the graph that's being kept around but not reported by gc… I am having a memory bug here (Advice on debugging a GPU memory leak in graph?) where, even when the model is in eval mode and with torch.no_grad(), there is increasing memory. However, I tried creating a minimal working example that creates a node in the graph (via a matmul) by multiplying an input by a parameter that requires a gradient, and then calls a forward pass many times, but I don't see any increase in allocated memory.
For what it's worth, in my other post, if I replace the matmul with a simple +, there's no memory leak…
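For reference, a sketch of that minimal example (sizes and iteration count are arbitrary): a matmul against a parameter that requires grad, run repeatedly under torch.no_grad() while printing allocated memory.

import torch

w = torch.nn.Parameter(torch.randn(1024, 1024, device="cuda"))
x = torch.randn(1024, 1024, device="cuda")

with torch.no_grad():
    for step in range(1000):
        y = x @ w                          # no graph is built under no_grad()
        if step % 100 == 0:
            print(step, torch.cuda.memory_allocated())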
I have this problem and my process is killed due to a memory leak. Can you explain a little more about how you make sure that the largest-length batch goes first? Don't you shuffle the data?
I think you're right that it's a hack.
I just recently tried a similar solution with deletion. It did not work. I used gc.collect() + torch.cuda.empty_cache(), but I still somehow ended up with a memory leak, which was only fixed when I started using reusable tensors.
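The "reusable tensors" pattern I mean is roughly this (a sketch; the shapes are placeholders): allocate the buffers once outside the loop and overwrite them in place each iteration, instead of creating fresh tensors every step.

import torch

buf = torch.empty(256, 1024, device="cuda")      # reused input buffer
result = torch.empty(256, 1024, device="cuda")   # reused output buffer

for _ in range(100):
    batch = torch.randn(256, 1024)               # e.g. data from a loader
    buf.copy_(batch, non_blocking=True)          # overwrite in place
    torch.mul(buf, 2.0, out=result)              # write into preallocated output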