How to debug causes of GPU memory leaks?

I have a Trainer class which wraps all my training code: model initialization, dataset, optimizers, etc. To do a hyperparameter search I initialize my trainer within a for loop, roughly:

for _ in range(10):
    trainer = Trainer(...)  # fresh model, dataset, optimizers each iteration
    trainer.initialize()
    trainer.fit()

Though I can see that the VRAM usage slowly increases after each trainer.initialize().

To fix this I thought I would add:

# delete all tensors and Variables the garbage collector can see
import torch
import gc

for obj in gc.get_objects():
    if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
        del obj

at the end of every training loop, but it still doesn’t work and my VRAM continues to increase after every iteration regardless.

Are the tensors found in gc.get_objects() all the alive tensors? Or are there others hiding somewhere else?

Did you write the trainer class yourself, or are you using some other API?
In any case, could you post or link to the code of initialize() and fit()?
What exactly is trainer.recorder? Could it be that it's somehow holding a reference to the computation graph?

Hi, thanks for the prompt reply!
The trainer class is one that I wrote myself, and I believe I found the culprit: the recorder class was holding a reference to the computation graph. Though I thought that by deleting all tensors, the reference to the computation graph would be deleted as well?
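A minimal sketch of why deleting is not enough (a plain list stands in for the hypothetical recorder): `del` only removes one Python name, and the tensor plus its graph stay alive as long as any other object still references them.

```python
import torch

# A plain list stands in for the recorder in this sketch.
recorder = []
x = torch.ones(2, requires_grad=True)
loss = (x * x).sum()
recorder.append(loss)   # the recorder keeps the loss (and its graph) alive
del loss                # only the local name is gone
print(recorder[0].grad_fn is not None)  # True: the graph is still reachable
```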


Good to hear you’ve found the bug! :)

I’m not sure if deleting always works without any shortcomings. I’ve never used it, as it seems to be kind of a hack.

Just in case anyone is still facing this issue, I changed @Ben_Usman's code snippet to debug only specific functions, and also to clear the GPU cache periodically in order to analyze how much memory is actually used.

import datetime
import gc
import os

import torch
from py3nvml import py3nvml

# Clears the GPU cache frequently, showing only actual memory usage.
# The profile filename below is a plausible reconstruction; adjust as needed.
gpu_profile_fn = (f"{datetime.datetime.now():%d-%b-%y-%H-%M-%S}"
                  f"-gpu_mem_prof.txt")
if 'GPU_DEBUG' in os.environ:
    py3nvml.nvmlInit()
    print('profiling gpu usage to ', gpu_profile_fn)

_last_tensor_sizes = set()


def _trace_lines(frame, event, arg):
    if event != 'line':
        return
    co = frame.f_code
    func_name = co.co_name
    line_no = frame.f_lineno
    filename = co.co_filename
    mem_used = _get_gpu_mem_used()
    where_str = f"{func_name} in {filename}:{line_no}"
    with open(gpu_profile_fn, 'a+') as f:
        f.write(f"{where_str} --> {mem_used:<7.1f}Mb\n")
        _print_tensors(f, where_str)


def trace_calls(frame, event, arg):
    if event != 'call':
        return
    co = frame.f_code
    func_name = co.co_name
    trace_into = str(os.environ['TRACE_INTO'])
    if func_name in trace_into.split(' '):
        return _trace_lines


def _get_gpu_mem_used():
    # GPU_DEBUG holds the index of the GPU to profile.
    handle = py3nvml.nvmlDeviceGetHandleByIndex(int(os.environ['GPU_DEBUG']))
    meminfo = py3nvml.nvmlDeviceGetMemoryInfo(handle)
    return meminfo.used / 1024**2


def _print_tensors(f, where_str):
    global _last_tensor_sizes
    for tensor in _get_tensors():
        if not hasattr(tensor, 'dbg_alloc_where'):
            tensor.dbg_alloc_where = where_str
    new_tensor_sizes = {(x.type(), tuple(x.shape), x.dbg_alloc_where)
                        for x in _get_tensors()}
    for t, s, loc in new_tensor_sizes - _last_tensor_sizes:
        f.write(f'+ {loc:<50} {str(s):<20} {str(t):<10}\n')
    for t, s, loc in _last_tensor_sizes - new_tensor_sizes:
        f.write(f'- {loc:<50} {str(s):<20} {str(t):<10}\n')
    _last_tensor_sizes = new_tensor_sizes


def _get_tensors(gpu_only=True):
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                tensor = obj
            elif hasattr(obj, 'data') and torch.is_tensor(obj.data):
                tensor = obj.data
            else:
                continue
            if tensor.is_cuda or not gpu_only:
                yield tensor
        except Exception:
            # hasattr(obj, 'data') can raise for objects unrelated to torch
            pass
To set up the profiler:

import os
import sys
from gpu_profile import trace_calls

os.environ['GPU_DEBUG'] = '0'   # index of the GPU to profile
os.environ['TRACE_INTO'] = 'train_epoch'

sys.settrace(trace_calls)

@smth I think that your method for finding all the tensors via Python’s garbage collector does not account for all of them. I suppose a corner case is backpropagation: some tensors might be saved for the backward pass in a context and transformed (probably compressed in some way), so they no longer appear as tensors. I wrote a method to account for the saved_tensors in the context for the backward pass. Could you please check if it extracts all the saved tensors correctly?

def get_tensors(only_cuda=False, omit_objs=[]):
    """
    :return: list of active PyTorch tensors
    >>> import torch
    >>> from torch import tensor
    >>> clean_gc_return = map((lambda obj: del_object(obj)), gc.get_objects())
    >>> device = "cuda" if torch.cuda.is_available() else "cpu"
    >>> device = torch.device(device)
    >>> only_cuda = True if torch.cuda.is_available() else False
    >>> t1 = tensor([1], device=device)
    >>> a3 = tensor([[1, 2], [3, 4]], device=device)
    >>> # print(get_all_tensor_names())
    >>> tensors = [tensor_obj for tensor_obj in get_tensors(only_cuda=only_cuda)]
    >>> # print(tensors)
    >>> # We doubled each t1, a3 tensors because of the tensors collection.
    >>> expected_tensor_length = 2
    >>> assert len(tensors) == expected_tensor_length, f"Expected length of tensors {expected_tensor_length}, but got {len(tensors)}, the tensors: {tensors}"
    >>> exp_size = (2, 2)
    >>> act_size = tensors[1].size()
    >>> assert exp_size == act_size, f"Expected size {exp_size} but got: {act_size}"
    >>> del t1
    >>> del a3
    >>> clean_gc_return = map((lambda obj: del_object(obj)), tensors)
    """
    add_all_tensors = False if only_cuda is True else True
    # To avoid counting the same tensor twice, create a dictionary of tensors,
    # each one identified by its id (the in-memory address).
    tensors = {}

    # omit_obj_ids = [id(obj) for obj in omit_objs]

    def add_tensor(obj):
        if torch.is_tensor(obj):
            tensor = obj
        elif hasattr(obj, 'data') and torch.is_tensor(obj.data):
            tensor = obj.data
        else:
            return

        if (only_cuda and tensor.is_cuda) or add_all_tensors:
            tensors[id(tensor)] = tensor

    for obj in gc.get_objects():
        try:
            # Add the obj if it is a tensor.
            add_tensor(obj)
            # Some tensors are "saved & hidden" for the backward pass.
            if hasattr(obj, 'saved_tensors') and (id(obj) not in omit_objs):
                for tensor_obj in obj.saved_tensors:
                    add_tensor(tensor_obj)
        except Exception as ex:
            pass
            # print("Exception: ", ex)
            # logger.debug(f"Exception: {str(ex)}")
    return tensors.values()  # return a list of detected tensors

@Adam_Dziedzic If I remember correctly, saved_tensors will only be triggered on obj for functions in Python land, or functions that are directly alive. For autograd functions that are no longer alive in Python (but are kept alive because another Python object refers to them as part of a grad_fn chain), those won't show up.

@smth Where does the rest of the memory live? I have an example where walking the gc objects as above gives me a number less than half of the value returned by torch.cuda.memory_allocated(). In my case, the gc object approach gives me about 1.1GB and torch.cuda.memory_allocated() returned 2.8GB.

Where is the rest hiding? This doesn’t seem like it would be simple pytorch bookkeeping overhead.
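One way to quantify the gap is to sum the storage of every CUDA tensor the garbage collector can see and compare it against the allocator's counter. A rough sketch (it deduplicates only by identical data pointers, so views with a storage offset may be double-counted):

```python
import gc
import torch

def gc_visible_cuda_bytes():
    """Sum the bytes of CUDA tensors reachable via the garbage collector."""
    total, seen = 0, set()
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                ptr = obj.data_ptr()
                if ptr not in seen:      # skip tensors sharing the same start address
                    seen.add(ptr)
                    total += obj.element_size() * obj.nelement()
        except Exception:
            pass
    return total

if torch.cuda.is_available():
    # Anything held only by the autograd graph will be missing from the first number.
    print(gc_visible_cuda_bytes(), torch.cuda.memory_allocated())
```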

When you do a forward pass for a particular operation where some of the inputs have requires_grad=True, PyTorch needs to hold onto some of the inputs or intermediate values so that the backward pass can be computed.

For example: if you do y = x * x (y = x squared), then the gradient is dl/dx = grad_output * 2 * x. Here, if x requires_grad, then we hold onto x to compute the backward pass.
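That formula is easy to check numerically: with a grad_output of ones and x = 3, the gradient comes out as 2 * 3 = 6.

```python
import torch

# Verify dl/dx = grad_output * 2 * x for y = x * x.
x = torch.tensor([3.0], requires_grad=True)
y = x * x
y.backward(torch.ones_like(y))  # grad_output = 1
print(x.grad)                   # tensor([6.])
```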

Take an example of:

y = x ** 2
z = y ** 2
del y

Over here, even if y is deleted out of Python scope, the function z = square(y) which is in the autograd graph (which effectively is z.grad_fn) holds onto y and in turn x.
So you might not have visibility into it via the GC, but it still exists until z is deleted out of python scope
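The hidden reference can be made visible by walking the grad_fn chain instead of asking the garbage collector. A minimal sketch (the backward-node class names are an implementation detail and may differ across PyTorch versions):

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x ** 2
z = y ** 2
del y  # gone from Python scope, but not from the autograd graph

# Walk the chain of backward nodes hanging off z.
fns, fn = [], z.grad_fn
while fn is not None:
    fns.append(type(fn).__name__)
    nxt = [f for f, _ in fn.next_functions if f is not None]
    fn = nxt[0] if nxt else None
print(fns)  # e.g. ['PowBackward0', 'PowBackward0', 'AccumulateGrad']
```

The middle node is the square that produced y, which still holds y's input even though the name y is gone.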


Thanks @smth. So it sounds like there is no way to programmatically count the referenced data directly in cases like that.

It would be really cool to be able to have a call that can walk a model and count memory, similar to the way the backwards pass can compute it. Really, I’d like to be able to better estimate how much memory is consumed by different parts of the computation, whether on CPU or GPU.


Thank you, @smth

Here is a minor fix to catch exceptions, since hasattr(obj, 'data') triggers all kinds of failures in Python modules unrelated to PyTorch and its tensors:

import torch
import gc

for obj in gc.get_objects():
    try:
        if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
            print(type(obj), obj.size())
    except:
        pass

edit: oh, only now noticed a follow up from @Ben_Usman, that suggested the same.


Do we have placeholder tensors in PyTorch? How do you make one? Something like:

im_data = torch.FloatTensor(1).cuda()

for step in range(...):
    batch_data = next(train_iter)
    im_data.resize_(batch_data[0].shape).copy_(batch_data[0])
    scores = net(im_data)

In the example you give (or more generally), how does one detect that y remains undeleted even if GC doesn’t acknowledge it?

Since 2018, have there been any tools for debugging memory leaks?


Great, I think the GPU memory leak issue is in adding new nodes in the graph. Can you share how to add placeholder in PyTorch?

This post actually solved my problem! I use a dynamic batch size and I was getting a CUDA OOM error after a few iterations. Starting from the largest possible batch eliminated the problem, and I get higher GPU utilization. Maybe this could be mentioned somewhere in the docs regarding data loading, if it is not already, because it can really help increase the batch size.
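A sketch of the trick, with hypothetical batch sizes: reorder the batches so the largest one runs first, which lets the caching allocator reserve its peak memory up front instead of growing (and possibly fragmenting) its pool later.

```python
# Hypothetical per-batch sizes for a dynamic-batch-size run.
batch_lengths = [17, 64, 3, 128, 42]

# Process indices largest-first so the very first step allocates the peak.
order = sorted(range(len(batch_lengths)),
               key=lambda i: batch_lengths[i],
               reverse=True)
print(order)  # [3, 1, 4, 0, 2] -- index 3 (size 128) goes first
```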

Unfortunately, this solution still doesn’t work for me :( I see memory_allocated increasing batch after batch while the tensors returned by this function don’t change… any thoughts?

I guess based on @smth’s comment, there is something in the graph that’s being kept around but not reported by gc… I am having a memory bug here (Advice on debugging a GPU memory leak in graph?) where, even when the model is in eval mode and with torch.no_grad(), there is increasing memory. However, I tried creating a minimal working example that creates a node in the graph (via a matmul) by multiplying an input by a parameter that requires a gradient, and then calls a forward pass many times, but I don’t see any increase in allocated memory.

For what it’s worth, in my other post, if I replace the matmul with a simple +, there’s no memory leak…

@Even_Oldridge @ptrblck Is there any documentation about the behaviour @Even_Oldridge describes in this earlier comment? I would like to better understand the mechanisms in PyTorch that lead to this behaviour.

I have this problem and my process is killed due to a memory leak. Can you explain a little more about how you make sure that the largest batch goes first? Don't you shuffle the data?

I think you’re right that it’s a hack.
I recently tried a similar solution with deletion, and it did not work. I used gc.collect() + torch.cuda.empty_cache(), but I still somehow ended up with a memory leak, which was only fixed when I started using reusable tensors.
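For reference, a sketch of that cleanup combination (note the correct call is torch.cuda.empty_cache(); it only releases cached blocks that no live tensor uses, which is why reusable tensors were still needed to fix the actual leak):

```python
import gc
import torch

def release_unused_gpu_memory():
    # Drop unreachable Python objects first so their tensors are freed...
    gc.collect()
    # ...then return the now-unused cached blocks to the driver.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

release_unused_gpu_memory()
```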