Tensors seem to be held by execution frames

NightMachinery · July 28, 2023, 6:40pm

My PyTorch code has a GPU memory leak.

To debug this issue, I used this function:

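import gc
import torch
from pynight.common_torch import torch_memory_tensor  # helper from my own pynight package


def find_tensors_on_gpu(min_size_mb=10):
    """Print every CUDA tensor of at least min_size_mb MB, plus whatever refers to it."""
    for obj in gc.get_objects():
        try:
            if not (torch.is_tensor(obj) and obj.is_cuda):
                continue
            obj_size = torch_memory_tensor(obj, s=2)  #: MB
            if obj_size < min_size_mb:
                continue
            print(f'Tensor ID: {id(obj)} Type: {type(obj)} Size (MB): {obj_size}')
            for ref in gc.get_referrers(obj):
                try:
                    if isinstance(ref, dict):
                        # A dict referrer is usually a namespace: report the name bound to the tensor.
                        for k, v in ref.items():
                            if v is obj:
                                print(f'Variable Name: {k}')
                    else:
                        print(f"ref: {ref}")
                except Exception:
                    pass  # some referrers cannot be iterated or printed safely
        except Exception:
            pass  # some tracked objects raise on attribute access; skip them


gc.collect()
find_tensors_on_gpu()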

The output of this function is:

Tensor ID: 140152393953856 Type: <class 'torch.Tensor'> Size (MB): 23.0859375
Variable Name: features_out
ref: <frame at 0x7f77c8dae640, file '/home/vit/code/pytorch-image-models/timm/models/decomposition.py', line 1873, code forward>
Tensor ID: 140152393947616 Type: <class 'torch.Tensor'> Size (MB): 1142.75390625
Variable Name: attributions_v
ref: <frame at 0x7f77c8dae640, file '/home/vit/code/pytorch-image-models/timm/models/decomposition.py', line 1873, code forward>
Variable Name: attributions_v
Tensor ID: 140159443545328 Type: <class 'torch.Tensor'> Size (MB): 1142.75390625
Variable Name: residual_attributions_v
ref: <frame at 0x55c5761c0270, file '/home/vit/code/pytorch-image-models/timm/models/vision_transformer.py', line 251, code forward>

There are two problems:

1. My GPU memory usage is at 7299 MiB, while these tensors only sum up to about 3 GB.
2. These tensors seem to be held by a Python execution frame. How do I free them?

Possibly related: "fix a memory leak on exception (caused by the stored traceback)" by stas00, Pull Request #11572, ipython/ipython on GitHub.

ptrblck · July 28, 2023, 10:05pm

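PyTorch uses a caching allocator, as described in the docs, so memory released by deleted tensors stays reserved for reuse and is still reported as used by nvidia-smi. That is why the total can exceed the sum of the live tensors.
The tensors seem to be referenced in the forward method, so they might be needed for the gradient computation in a backward call.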
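
To see the gap between live tensors and what the allocator holds, the two counters can be compared directly. The following is a minimal sketch, not a definitive diagnostic: `report` is a hypothetical helper name, and the 256 MiB buffer is an arbitrary size chosen for illustration. It relies only on `torch.cuda.memory_allocated()`, `torch.cuda.memory_reserved()` and `torch.cuda.empty_cache()`. Note that nvidia-smi typically shows more than the reserved figure, because it also counts the CUDA context.

```python
import torch


def report(tag):
    """Print and return (allocated, reserved) in MiB for the current CUDA device."""
    # allocated: bytes backing live tensors
    # reserved:  bytes the caching allocator currently holds from the driver
    alloc = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"{tag}: allocated={alloc:.1f} MiB, reserved={reserved:.1f} MiB")
    return alloc, reserved


if torch.cuda.is_available():
    # Arbitrary 256 MiB buffer, used only for illustration.
    x = torch.empty(256, 1024, 1024, dtype=torch.uint8, device="cuda")
    report("after allocation")
    del x
    report("after del")          # allocated drops, reserved usually stays (cached)
    torch.cuda.empty_cache()
    report("after empty_cache")  # cached blocks are handed back to the driver
else:
    print("CUDA is not available; nothing to measure")
```

If `allocated` stays high after the suspected tensors should be gone, something still references them; if only `reserved` is high, the memory is merely cached and will be reused by later allocations.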

NightMachinery · July 28, 2023, 10:27pm

Is there a way to free the entire computational graph attached to a model's parameters? Something like model.zero_grad(), but one that frees the stored computational graph.

PS: The Jupyter issue I linked was indeed my primary problem, and the workarounds there worked. In short, Jupyter stores references to local-scope tensors when an exception happens, and this seriously messes up the gc.
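
The mechanism can be reproduced without a GPU or a notebook. This is a minimal sketch under stated assumptions: `FakeTensor` is a hypothetical stand-in for a large CUDA tensor, and the module-level variable `stored` plays the role of the reference an interactive shell keeps to the last exception. It is not the code from the linked pull request.

```python
import gc
import traceback
import weakref


class FakeTensor:
    """Hypothetical stand-in for a large CUDA tensor (no GPU needed)."""


def forward(probes):
    features_out = FakeTensor()                # local that should die with the frame
    probes.append(weakref.ref(features_out))   # lets us observe whether it is alive
    raise RuntimeError("simulated failure inside forward")


probes = []
stored = None
try:
    forward(probes)
except RuntimeError as exc:
    stored = exc  # mimics a shell that keeps the last exception around

gc.collect()
# exception -> traceback -> frame of forward() -> local variable features_out
alive_before = probes[0]() is not None
print("alive while the exception is stored:", alive_before)  # True

# Clear the locals of every finished frame referenced by the traceback.
traceback.clear_frames(stored.__traceback__)
gc.collect()
alive_after = probes[0]() is not None
print("alive after clear_frames:", alive_after)  # False
```

Dropping the stored exception itself (here, `stored = None`) has the same effect once no other reference to the traceback remains.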

ptrblck · July 28, 2023, 11:47pm

You could either call backward() on the output/loss, which would free the intermediates, or del the output of the model, which deletes all references to the intermediates and allows PyTorch to reuse the memory.
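
Both options can be sketched on CPU. In this sketch, `model` and `inputs` are hypothetical stand-ins for the real model and batch, and the failure of a second backward() call is used only as an observable sign that the saved intermediates were released.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 8)  # hypothetical stand-in for the real model
inputs = torch.randn(2, 8)     # hypothetical input batch

# Option 1: backward() releases the tensors saved for gradient computation
# (retain_graph defaults to False).
loss = model(inputs).exp().sum()  # exp() saves its output for the backward pass
loss.backward()
try:
    loss.backward()               # the saved tensors are gone, so this fails
    graph_freed = False
except RuntimeError:
    graph_freed = True
print("saved intermediates freed by backward():", graph_freed)  # True

# Option 2: drop every reference to the output. Once nothing points at the
# graph any more, the intermediates it saved are released as well.
loss = model(inputs).exp().sum()
del loss
```

On a GPU the released memory goes back to PyTorch's caching allocator rather than to the driver, so nvidia-smi may not change even though the memory is available for new tensors.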