When you do a forward pass for a particular operation where some of the inputs have requires_grad=True, PyTorch needs to hold onto some of the inputs or intermediate values so that the backward pass can be computed.
For example: if you do y = x * x (y = x squared), then the gradient is dL/dx = grad_output * 2 * x. Here, if x requires_grad, then we hold onto x to compute the backward pass.
Take this example:
y = x ** 2
z = y ** 2
del y
Here, even though y is deleted from Python scope, the function z = square(y) that lives in the autograd graph (which effectively is z.grad_fn) holds onto y and, in turn, onto x.
So you might not have visibility into it via the GC, but it still exists until z is deleted from Python scope.
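A minimal sketch that makes this visible (it assumes a CUDA device so that torch.cuda.memory_allocated() can be used to observe the effect; the 1024x1024 size is arbitrary):

import torch

x = torch.randn(1024, 1024, device="cuda", requires_grad=True)
y = x ** 2
z = y ** 2          # z.grad_fn saves y for the backward pass

before = torch.cuda.memory_allocated()
del y
# y is gone from Python scope, but its buffer is still allocated,
# because the autograd graph behind z holds onto it.
print(torch.cuda.memory_allocated() == before)   # True

del z
# Dropping z releases the graph and the tensors it saved.
print(torch.cuda.memory_allocated() < before)    # True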
Thanks @smth. So it sounds like there is no way to programmatically count the referenced data directly in cases like that.
It would be really cool to have a call that can walk a model and count memory, similar to the way the backward pass can compute it. Really, I'd like to be able to better estimate how much memory is consumed by different parts of the computation, whether on CPU or GPU.
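As a rough first approximation (it only counts parameters and buffers, not activations saved by autograd or blocks cached by the allocator), you can already walk a module tree by hand; model_bytes below is a made-up helper name:

import torch

def model_bytes(model: torch.nn.Module) -> int:
    # Sum the storage held by parameters and buffers only.
    total = 0
    for p in model.parameters():
        total += p.numel() * p.element_size()
    for b in model.buffers():
        total += b.numel() * b.element_size()
    return total

model = torch.nn.Linear(1024, 1024)
print(f"{model_bytes(model) / 2**20:.2f} MiB")   # ~4 MiB for this layer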
Here is a minor fix to catch exceptions, since hasattr(obj, 'data') triggers all kinds of failures in Python modules unrelated to PyTorch and its tensors:
import torch
import gc

for obj in gc.get_objects():
    try:
        # Catches plain tensors as well as objects (e.g. nn.Parameter) that
        # wrap a tensor in a .data attribute.
        if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
            print(type(obj), obj.size())
    except Exception:
        pass
edit: oh, I only now noticed a follow-up from @Ben_Usman that suggested the same.
This post actually solved my problem! I use a dynamic batch size and was getting CUDA OOM errors after a few iterations. Starting from the largest possible batch eliminated the problem, and I get higher GPU utilization. Maybe this could be mentioned in the docs on data loading if it is not already, because it can really help increase batch size.
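In case it helps anyone, here is a minimal sketch of the "largest batch first" idea (the model, feature size, and max_batch are made-up placeholders): run one forward/backward pass with the biggest batch you will ever see, so the caching allocator reserves blocks large enough for every later, smaller batch.

import torch

model = torch.nn.Linear(512, 512).cuda()
max_batch = 256   # the largest batch size the run will ever use

# Warm-up pass: forces the allocator to grab blocks sized for the worst case.
warmup = torch.randn(max_batch, 512, device="cuda")
model(warmup).sum().backward()
model.zero_grad(set_to_none=True)
del warmup        # the cached blocks stay reserved for later batches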
Unfortunately, this solution still doesn't work for me :( I see memory_allocated increasing batch after batch while the tensors returned by this function don't change… any thoughts?
I guess based on @smth's comment, there is something in the graph that's being kept around but not reported by gc… I am having a memory bug here (Advice on debugging a GPU memory leak in graph?) where, even when the model is in eval mode and with torch.no_grad(), there is increasing memory. However, I tried creating a minimal working example that creates a node in the graph (via a matmul) by multiplying an input by a parameter that requires a gradient, and then calls a forward pass many times, but I don't see any increase in allocated memory.
For what it's worth, in my other post, if I replace the matmul with a simple +, there's no memory leak…
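For reference, a sketch of that minimal example (sizes and iteration count are arbitrary): a matmul against a parameter that requires grad, run repeatedly under torch.no_grad() while printing allocated memory.

import torch

w = torch.nn.Parameter(torch.randn(1024, 1024, device="cuda"))
x = torch.randn(1024, 1024, device="cuda")

with torch.no_grad():
    for step in range(1000):
        y = x @ w                          # no graph is built under no_grad()
        if step % 100 == 0:
            print(step, torch.cuda.memory_allocated())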
I have this problem and my process is killed due to a memory leak. Can you explain a little more about how you make sure that the largest-length batch goes first? Don't you shuffle the data?
I think you're right that it's a hack.
I just recently tried a similar solution with deletion. It did not work. I used gc.collect() + torch.cuda.empty_cache(), but I still somehow ended up with a memory leak, which was only fixed when I started using reusable tensors.
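The "reusable tensors" pattern I mean is roughly this (a sketch; the shapes are placeholders): allocate the buffers once outside the loop and overwrite them in place each iteration, instead of creating fresh tensors every step.

import torch

buf = torch.empty(256, 1024, device="cuda")      # reused input buffer
result = torch.empty(256, 1024, device="cuda")   # reused output buffer

for _ in range(100):
    batch = torch.randn(256, 1024)               # e.g. data from a loader
    buf.copy_(batch, non_blocking=True)          # overwrite in place
    torch.mul(buf, 2.0, out=result)              # write into preallocated output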