How to debug causes of GPU memory leaks?

Cool! Your fix works. You saved my day (almost)!
I’ve encountered “out of memory” crash using caffe to extract features with pretrained resnet. So I rewrote the code in pytorch and still met this error. After wrapping the extraction part in a function, it went much further: from 400+ images to 1200+ images. However, in the end it still ran out of memory.
I was watching the GPU RAM usage closely, and saw the RAM increased from 4G to 9G and then 11G before it crashed (the total GPU RAM is 12G).
Really confused why this would happen. Anyway I’m just using some standard model to do minimal processing.

Update:
I managed to reduce RAM use by wrapping the extraction code in a no_grad block:

with torch.no_grad():
    features = model(im).data

Now the code runs through all the 15000+ images without any error :smile:
Though, I still don’t understand why keeping the gradients will cause the memory problem, as I’m looping through each individual image, and in the next loop, the gradients should be dereferenced and released automatically.

Update 2:
Finally I solved the memory problem! I realized that in each iteration I put the input data in a new tensor, and pytorch generates a new computation graph. That causes the used RAM to grow forever. Then I use a placeholder tensor and copy the data to this tensor, and the RAM always stays at a low level :smile:

12 Likes

I have a Trainer which wraps all my training code with model initialization, dataset, optimizers etc. So to do hyperparameter search I will initialize my trainer within a for loop e.g.

for _ in range(10):
    trainer.initialize()
    trainer.fit()
    recorders.append(trainer.recorder)

Though I can see that the VRAM usage slowly increases after each trainer.initialize()

To fix this I though that I would add:

# prints currently alive Tensors and Variables
import torch
import gc
for obj in gc.get_objects():
    if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
        del obj
torch.cuda.empty_cache()

at the end of every training loop, though it still doesn’t work and my VRAM continues to increase after every iteration regardless.

Are the tensors found in gc.get_objects() all the alive tensors? Or are there others hiding somewhere else?

Did you write the trainer class yourself or are you using some other API?
In any case, could you post or link to the code of initialize() and fit()?
What exactly is trainer.recorder? Could it be its somehow holding a reference to the computation graph?

Hi thanks for the prompt reply!
The trainer class is one that I wrote myself, but I believe I found the culprit, the recorder class was holding a reference to the computation graph. Though I thought that by deleting all tensors, wouldn’t that also delete the reference to the computation graph?

1 Like

Good to hear you’ve found the bug! :slight_smile:

I’m not sure if deleting always works without any shortcomings. I’ve never used it as this seems to be kind of a hack.

Just in case anyone is still facing this issue, I changed @Ben_Usman code snippet to actually debug only specific functions, and also to clear the GPU cache periodically to analyze how much memory is used.

import os
import gc
import torch
import datetime

from py3nvml import py3nvml

PRINT_TENSOR_SIZES = True
# clears GPU cache frequently, showing only actual memory usage
EMPTY_CACHE = True
gpu_profile_fn = (f"{datetime.datetime.now():%d-%b-%y-%H:%M:%S}"
                  f"-gpu_mem_prof.txt")
if 'GPU_DEBUG' in os.environ:
    print('profiling gpu usage to ', gpu_profile_fn)

_last_tensor_sizes = set()


def _trace_lines(frame, event, arg):
    if event != 'line':
        return
    if EMPTY_CACHE:
        torch.cuda.empty_cache()
    co = frame.f_code
    func_name = co.co_name
    line_no = frame.f_lineno
    filename = co.co_filename
    py3nvml.nvmlInit()
    mem_used = _get_gpu_mem_used()
    where_str = f"{func_name} in {filename}:{line_no}"
    with open(gpu_profile_fn, 'a+') as f:
        f.write(f"{where_str} --> {mem_used:<7.1f}Mb\n")
        if PRINT_TENSOR_SIZES:
            _print_tensors(f, where_str)

    py3nvml.nvmlShutdown()


def trace_calls(frame, event, arg):
    if event != 'call':
        return
    co = frame.f_code
    func_name = co.co_name

    try:
        trace_into = str(os.environ['TRACE_INTO'])
    except:
        print(os.environ)
        exit()
    if func_name in trace_into.split(' '):
        return _trace_lines
    return


def _get_gpu_mem_used():
    handle = py3nvml.nvmlDeviceGetHandleByIndex(
        int(os.environ['GPU_DEBUG']))
    meminfo = py3nvml.nvmlDeviceGetMemoryInfo(handle)
    return meminfo.used/1024**2


def _print_tensors(f, where_str):
    global _last_tensor_sizes
    for tensor in _get_tensors():
        if not hasattr(tensor, 'dbg_alloc_where'):
            tensor.dbg_alloc_where = where_str
    new_tensor_sizes = {(x.type(), tuple(x.shape), x.dbg_alloc_where)
                        for x in _get_tensors()}
    for t, s, loc in new_tensor_sizes - _last_tensor_sizes:
        f.write(f'+ {loc:<50} {str(s):<20} {str(t):<10}\n')
    for t, s, loc in _last_tensor_sizes - new_tensor_sizes:
        f.write(f'- {loc:<50} {str(s):<20} {str(t):<10}\n')
    _last_tensor_sizes = new_tensor_sizes


def _get_tensors(gpu_only=True):
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                tensor = obj
            elif hasattr(obj, 'data') and torch.is_tensor(obj.data):
                tensor = obj.data
            else:
                continue

            if tensor.is_cuda:
                yield tensor
        except Exception as e:
            pass

To setup the profiler:

        import sys
        from gpu_profile import trace_calls
        os.environ['GPU_DEBUG'] = args.dev
        os.environ['TRACE_INTO'] = 'train_epoch'
        sys.settrace(trace_calls)
5 Likes

@smth I think that your method for finding all the tensor via Python’s garbage collector does not account for all tensors. I suppose that a corner case is for the backpropagation, when some tensor might be saved for the backward pass in a context and transformed (probably compressed in some way), hence they do not appear as tensors anymore. I wrote a method to account for the saved_tensors in the context for the backward pass. Could you please check if it extracts all the saved tensors correctly?

def get_tensors(only_cuda=False, omit_objs=[]):
    """

    :return: list of active PyTorch tensors
    >>> import torch
    >>> from torch import tensor
    >>> clean_gc_return = map((lambda obj: del_object(obj)), gc.get_objects())
    >>> device = "cuda" if torch.cuda.is_available() else "cpu"
    >>> device = torch.device(device)
    >>> only_cuda = True if torch.cuda.is_available() else False
    >>> t1 = tensor([1], device=device)
    >>> a3 = tensor([[1, 2], [3, 4]], device=device)
    >>> # print(get_all_tensor_names())
    >>> tensors = [tensor_obj for tensor_obj in get_tensors(only_cuda=only_cuda)]
    >>> # print(tensors)
    >>> # We doubled each t1, a3 tensors because of the tensors collection.
    >>> expected_tensor_length = 2
    >>> assert len(tensors) == expected_tensor_length, f"Expected length of tensors {expected_tensor_length}, but got {len(tensors)}, the tensors: {tensors}"
    >>> exp_size = (2,2)
    >>> act_size = tensors[1].size()
    >>> assert exp_size == act_size, f"Expected size {exp_size} but got: {act_size}"
    >>> del t1
    >>> del a3
    >>> clean_gc_return = map((lambda obj: del_object(obj)), tensors)
    """
    add_all_tensors = False if only_cuda is True else True
    # To avoid counting the same tensor twice, create a dictionary of tensors,
    # each one identified by its id (the in memory address).
    tensors = {}

    # omit_obj_ids = [id(obj) for obj in omit_objs]

    def add_tensor(obj):
        if torch.is_tensor(obj):
            tensor = obj
        elif hasattr(obj, 'data') and torch.is_tensor(obj.data):
            tensor = obj.data
        else:
            return

        if (only_cuda and tensor.is_cuda) or add_all_tensors:
            tensors[id(tensor)] = tensor

    for obj in gc.get_objects():
        try:
            # Add the obj if it is a tensor.
            add_tensor(obj)
            # Some tensors are "saved & hidden" for the backward pass.
            if hasattr(obj, 'saved_tensors') and (id(obj) not in omit_objs):
                for tensor_obj in obj.saved_tensors:
                    add_tensor(tensor_obj)
        except Exception as ex:
            pass
            # print("Exception: ", ex)
            # logger.debug(f"Exception: {str(ex)}")
    return tensors.values()  # return a list of detected tensors

@Adam_Dziedzic If I remember, saved_tensors will only be triggered on obj for the functions in python land or functions that are directly alive. For autograd functions that are not alive anymore in python (but are alive because another Python object refers to them as part of grad_fn chain), those wont show up.

@smth Where does the rest of the memory live? I have an example where walking the gc objects as above gives me a number less than half of the value returned by torch.cuda.memory_allocated(). In my case, the gc object approach gives me about 1.1GB and torch.cuda.memory_allocated() returned 2.8GB.

Where is the rest hiding? This doesn’t seem like it would be simple pytorch bookkeeping overhead.

when you do a forward pass for a particular operation, where some of the inputs have a requires_grad=True, PyTorch needs to hold onto some of the inputs or intermediate values so that the backwards can be computed.

For example: If you do y = x * x (y = x squared), then the gradient is dl / dx = grad_output * 2 * x. Here, if x requires_grad, then we hold onto x to compute the backward pass.

Take an example of:

y = x ** 2
z = y ** 2
del y

Over here, even if y is deleted out of Python scope, the function z = square(y) which is in the autograd graph (which effectively is z.grad_fn) holds onto y and in turn x.
So you might not have visibility into it via the GC, but it still exists until z is deleted out of python scope

5 Likes

Thanks @smth. So it sounds like there is no way to programmatically count the referenced data directly in cases like that.

It would be really cool to be able to have a call that can walk a model and count memory, similar to the way the backwards pass can compute it. Really, I’d like to be able to better estimate how much memory is consumed by different parts of the computation, whether on CPU or GPU.

2 Likes

Thank you, @smth

Here is a minor fix to catch exceptions, since hasattr(obj, 'data') triggers all kinds of failures in python modules unrelated to pytorch and its tensors:

import torch
import gc
for obj in gc.get_objects():
    try:
        if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
            print(type(obj), obj.size())
    except: pass

edit: oh, only now noticed a follow up from @Ben_Usman, that suggested the same.

3 Likes

Do we have placeholder tensor in PyTorch? How do you make it?

im_data = torch.FloatTensor(1).cuda()
....

for step in range(...):
    batch_data = next(train_iter):
    im_data.data.resize_(batch_data[0].shape).copy_(batch_data[0])
    scores = net(im_data)
    ....
1 Like

In the example you give (or more generally), how does one detect that y remains undeleted even if GC doesn’t acknowledge it?

Since 2018, have there been any tools for debugging memory leaks?

Great, I think the GPU memory leak issue is in adding new nodes in the graph. Can you share how to add placeholder in PyTorch?

This post actually solved my problem! I use dynamic batch size and I was getting OOM CUDA error after a few iterations. Starting from the largest possible batch eliminated the problems and I get highher GPU utilization. Maybe this could be mentioned somewhere in the docs regarding data loading if it is not already, because it can really help increase batch size.

Unfortunately, this solution still doesn’t work for me :( I see memory_allocated increasing batch after batch while the tensors returned by this function don’t change… any thoughts?

I guess based on @smth’s comment, there is something in the graph that’s being kept around but not reported by gc… I am having a memory bug here (Advice on debugging a GPU memory leak in graph?) where, even when the model is in eval mode and with torch.no_grad(), there is increasing memory. However, I tried creating a minimal working example that creates a node in the graph (via a matmul) by multiplying an input by a parameter that requires a gradient, and then calls a forward pass many times, but I don’t see any increase in allocated memory.

For what it’s worth, in my other post, if I replace the matmul with a simple +, there’s no memory leak…

@Even_Oldridge @ptrblck Is there any documentation about the behaviour @Even_Oldridge describes in this earlier comment? I would like to better understand the mechanisms applied by Pytorch that leads to this behaviour.

I have this problem and my process is killed due to a memory leak. Can you explain a little more about how you make sure that the largest length batch goes first ? do not you shuffle the data?