My PyTorch code has a GPU memory leak.
To debug this issue, I used this function:
import gc
import torch
from pynight.common_torch import torch_memory_tensor

def find_tensors_on_gpu():
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                if obj.is_cuda:
                    obj_size = torch_memory_tensor(obj, s=2)  #: MB
                    if obj_size >= 10:
                        print(f'Tensor ID: {id(obj)} Type: {type(obj)} Size (MB): {obj_size}')
                        #: Report what still references this tensor:
                        for ref in gc.get_referrers(obj):
                            try:
                                if isinstance(ref, dict):
                                    #: Dict referrer: print the key(s) that map to the tensor.
                                    for k, v in ref.items():
                                        if v is obj:
                                            print(f'Variable Name: {k}')
                                else:
                                    print(f"ref: {ref}")
                            except Exception:
                                pass
        except Exception:
            pass

gc.collect()
find_tensors_on_gpu()
The output of this function is:
Tensor ID: 140152393953856 Type: <class 'torch.Tensor'> Size (MB): 23.0859375
Variable Name: features_out
ref: <frame at 0x7f77c8dae640, file '/home/vit/code/pytorch-image-models/timm/models/decomposition.py', line 1873, code forward>
Tensor ID: 140152393947616 Type: <class 'torch.Tensor'> Size (MB): 1142.75390625
Variable Name: attributions_v
ref: <frame at 0x7f77c8dae640, file '/home/vit/code/pytorch-image-models/timm/models/decomposition.py', line 1873, code forward>
Variable Name: attributions_v
Tensor ID: 140159443545328 Type: <class 'torch.Tensor'> Size (MB): 1142.75390625
Variable Name: residual_attributions_v
ref: <frame at 0x55c5761c0270, file '/home/vit/code/pytorch-image-models/timm/models/vision_transformer.py', line 251, code forward>
There are two problems:
- My GPU memory usage is at 7299 MiB, while these tensors only sum to about 2.3 GB (allocator cross-check sketched below).
- These tensors seem to be held by Python execution frames?! How do I free them? (See the frame-inspection sketch after this list.)
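For the first problem, a minimal sketch of cross-checking the gap against PyTorch's own allocator counters (torch.cuda.memory_allocated, memory_reserved, and memory_summary are standard APIs; the helper name below is just for illustration, and the nvidia-smi figure additionally includes the CUDA context, which these counters do not cover):

import torch

def report_allocator_stats():
    #: Memory currently backing live tensors, according to the caching allocator:
    allocated_mb = torch.cuda.memory_allocated() / 2**20
    #: Memory reserved from the driver, including cached blocks not backing any tensor:
    reserved_mb = torch.cuda.memory_reserved() / 2**20
    print(f"allocated: {allocated_mb:.1f} MiB, reserved: {reserved_mb:.1f} MiB")
    #: Detailed breakdown of the allocator's pools:
    print(torch.cuda.memory_summary(abbreviated=True))

report_allocator_stats()

If allocated roughly matches the ~2.3 GB of tensors while reserved is much larger, the gap is mostly the caching allocator holding freed blocks plus the CUDA context, not live tensors; torch.cuda.empty_cache() returns cached-but-unused blocks to the driver, but it cannot free memory still referenced by live tensors.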
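For the second problem: a frame showing up in gc.get_referrers means some stack frame's locals still point at the tensor, and the frame itself usually stays alive because something else holds it (a stored exception/traceback, a generator or coroutine that has not finished, or a debugger). A hypothetical helper to dig one level deeper (inspect_frame_referrers is just an illustrative name; frame.clear() is CPython's way to drop a suspended frame's locals and raises RuntimeError if the frame is still executing):

import gc
import types

def inspect_frame_referrers(tensor):
    for ref in gc.get_referrers(tensor):
        if isinstance(ref, types.FrameType):
            code = ref.f_code
            print(f"frame: {code.co_filename}:{ref.f_lineno} in {code.co_name}")
            #: Which local names in that frame point at the tensor?
            for name, value in ref.f_locals.items():
                if value is tensor:
                    print(f"  local holding the tensor: {name}")
            #: What keeps the frame itself alive?
            #: (The temporary list created by this gc.get_referrers call shows up here too.)
            for owner in gc.get_referrers(ref):
                print(f"  frame kept alive by: {type(owner)}")
            #: If the frame is no longer executing (e.g. kept alive only via a
            #: saved traceback), ref.clear() drops its locals and releases the tensor:
            # ref.clear()

Deleting whatever holds the traceback (e.g. del on a saved exception object, or leaving the debugger) should release the frame and, with it, the tensors it references.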
Related: