How to debug causes of GPU memory leaks?

When you do a forward pass for a particular operation, where some of the inputs have requires_grad=True, PyTorch needs to hold onto some of the inputs or intermediate values so that the backward pass can be computed.

For example: If you do y = x * x (y = x squared), then the gradient is dl / dx = grad_output * 2 * x. Here, if x requires_grad, then we hold onto x to compute the backward pass.
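
A minimal sketch of that example (the tensor shape is arbitrary):

import torch

x = torch.randn(3, requires_grad=True)
y = x * x                                      # autograd saves x so it can compute 2 * x later
y.backward(torch.ones_like(y))                 # grad_output of ones
print(torch.allclose(x.grad, 2 * x.detach()))  # True: dl/dx = grad_output * 2 * x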

Take an example of:

y = x ** 2
z = y ** 2
del y

Here, even if y is deleted from the Python scope, the function z = square(y) that lives in the autograd graph (effectively z.grad_fn) holds onto y, and in turn onto x.
So you might not have visibility into it via the GC, but it still exists until z is deleted from the Python scope.
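
A hedged sketch of how you can observe that with the caching allocator's counters (assuming a CUDA device; sizes are arbitrary):

import torch

x = torch.randn(1024, 1024, device='cuda', requires_grad=True)
y = x ** 2
z = y ** 2
del y
# y is gone from Python scope, but z.grad_fn still saves it for backward
print(torch.cuda.memory_allocated())   # still includes y's storage
del z
# dropping z frees the graph, and with it the saved y
print(torch.cuda.memory_allocated())   # only x (roughly 4 MiB here) remains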


Thanks @smth. So it sounds like there is no way to programmatically count the referenced data directly in cases like that.

It would be really cool to be able to have a call that can walk a model and count memory, similar to the way the backwards pass can compute it. Really, I'd like to be able to better estimate how much memory is consumed by different parts of the computation, whether on CPU or GPU.
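
In the meantime, a rough approximation (CUDA only, and not the graph walker I'd like) is to hook every submodule and snapshot the caching allocator after each forward; model here stands for any nn.Module:

import torch
import torch.nn as nn

def add_memory_hooks(model: nn.Module):
    # After each submodule's forward, report how much the caching allocator
    # currently holds; deltas between lines give a coarse per-layer estimate.
    def hook(module, inputs, output):
        mb = torch.cuda.memory_allocated() / 1024 ** 2
        print(f"{module.__class__.__name__}: {mb:.1f} MiB allocated so far")
    for m in model.modules():
        m.register_forward_hook(hook)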


Thank you, @smth

Here is a minor fix to catch exceptions, since hasattr(obj, 'data') triggers all kinds of failures in python modules unrelated to pytorch and its tensors:

import torch
import gc

# Walk every object the garbage collector knows about and report the
# tensors (including parameters, via their .data attribute).
for obj in gc.get_objects():
    try:
        if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
            print(type(obj), obj.size())
    except Exception:
        # hasattr/.data can raise in unrelated libraries; just skip those objects
        pass
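
A variation on the same loop gives a rough total for the CUDA tensors that are reachable via gc (tensors saved inside the autograd graph still won't show up here):

total = 0
for obj in gc.get_objects():
    try:
        if torch.is_tensor(obj) and obj.is_cuda:
            total += obj.element_size() * obj.nelement()
    except Exception:
        pass
print(f"~{total / 1024**2:.1f} MiB in CUDA tensors visible to gc")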

edit: oh, I only now noticed a follow-up from @Ben_Usman that suggested the same.


Do we have placeholder tensors in PyTorch? How do you make one?

# Pre-allocate a single "placeholder" tensor on the GPU once...
im_data = torch.FloatTensor(1).cuda()
....

for step in range(...):
    batch_data = next(train_iter)
    # ...then resize it in place to the current batch shape and copy the data in
    im_data.data.resize_(batch_data[0].shape).copy_(batch_data[0])
    scores = net(im_data)
    ....

In the example you give (or more generally), how does one detect that y remains undeleted even if GC doesn't acknowledge it?

Since 2018, have there been any tools for debugging memory leaks?


Great, I think the GPU memory leak issue is in adding new nodes to the graph. Can you share how to add a placeholder in PyTorch?

This post actually solved my problem! I use dynamic batch sizes and I was getting a CUDA OOM error after a few iterations. Starting from the largest possible batch eliminated the problem and I get higher GPU utilization. Maybe this could be mentioned somewhere in the docs regarding data loading if it is not already, because it can really help increase batch size.
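
One way to do it is simply to arrange the batches so the biggest one is seen first, so the caching allocator's peak reservation happens on iteration 0 instead of growing over time (the sizes below are made up):

import torch

lengths = [96, 512, 128, 256]          # hypothetical variable batch sizes
lengths.sort(reverse=True)             # largest batch goes first

for n in lengths:
    x = torch.randn(n, 1024, device='cuda')
    # ... forward/backward with x; later, smaller batches reuse the cached blocks ...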

Unfortunately, this solution still doesn't work for me :( I see memory_allocated increasing batch after batch while the tensors returned by this function don't change… any thoughts?

I guess based on @smth's comment, there is something in the graph that's being kept around but not reported by gc… I am having a memory bug here (Advice on debugging a GPU memory leak in graph?) where, even when the model is in eval mode and with torch.no_grad(), there is increasing memory. However, I tried creating a minimal working example that creates a node in the graph (via a matmul) by multiplying an input by a parameter that requires a gradient, and then calls a forward pass many times, but I don't see any increase in allocated memory.
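
A sketch along the lines of what I tried (shapes and iteration count are arbitrary):

import torch

w = torch.randn(256, 256, device='cuda', requires_grad=True)  # parameter-like tensor

for i in range(100):
    x = torch.randn(64, 256, device='cuda')
    y = x @ w                       # matmul creates a graph node since w requires grad
    # rebinding y next iteration drops the old graph, so allocation stays flat
    if i % 20 == 0:
        print(torch.cuda.memory_allocated())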

For what it's worth, in my other post, if I replace the matmul with a simple +, there's no memory leak…

@Even_Oldridge @ptrblck Is there any documentation about the behaviour @Even_Oldridge describes in this earlier comment? I would like to better understand the mechanisms applied by PyTorch that lead to this behaviour.

I have this problem and my process is killed due to a memory leak. Can you explain a little more about how you make sure that the largest-length batch goes first? Do you not shuffle the data?

I think you're right that it's a hack.
I just recently tried a similar solution with deletion. It did not work. I used gc.collect() + torch.cuda.empty_cache(). I still somehow ended up with a memory leak, which was only fixed when I started using reusable tensors.

Hey, can you tell me what you meant by a reusable tensor?

As in, rather than creating/allocating several new tensors in a loop, I started just reusing one tensor for the same purpose.
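
Something like this (shapes are made up): allocate once, then overwrite in place each iteration.

import torch

buf = torch.empty(64, 3, 224, 224, device='cuda')   # allocated once, outside the loop

for step in range(100):
    batch = torch.randn(64, 3, 224, 224)             # e.g. a CPU batch from a DataLoader
    buf.copy_(batch)                                  # reuse the same GPU storage every step
    # ... forward pass with buf ...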