Did you write the trainer class yourself or are you using some other API? In any case, could you post or link to the code of initialize() and fit()? What exactly is trainer.recorder? Could it be that it's somehow holding a reference to the computation graph?
Hi, thanks for the prompt reply!
The trainer class is one that I wrote myself, but I believe I found the culprit: the recorder class was holding a reference to the computation graph. Though I thought that deleting all tensors would also delete the reference to the computation graph?
Good to hear you’ve found the bug!
I’m not sure if deleting always works without any shortcomings. I’ve never used it as this seems to be kind of a hack.
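To illustrate the kind of bug described above, here is a minimal sketch (the Recorder class and its method names are illustrative, not the poster's actual code) of how logging a raw loss tensor keeps the graph alive, and how detaching fixes it:

```python
import torch

class Recorder:
    """Toy stand-in for a training-loop recorder (hypothetical name)."""
    def __init__(self):
        self.losses = []

    def log(self, loss):
        # BUG: storing the raw loss tensor also stores its grad_fn, which
        # keeps the whole computation graph of that iteration alive.
        self.losses.append(loss)

    def log_fixed(self, loss):
        # Detaching (or calling .item()) keeps only the value, not the graph.
        self.losses.append(loss.detach().item())

x = torch.randn(4, requires_grad=True)
rec = Recorder()

rec.log((x * 2).sum())
assert rec.losses[0].grad_fn is not None   # graph is still referenced

rec.losses.clear()
rec.log_fixed((x * 2).sum())
assert isinstance(rec.losses[0], float)    # plain number, graph can be freed
```

Deleting the input tensors does not help in the buggy case, because the stored loss still references the graph through its grad_fn.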
Just in case anyone is still facing this issue, I changed @Ben_Usman's code snippet to debug only specific functions, and also to clear the GPU cache periodically to analyze how much memory is actually used.
import os
import gc
import torch
import datetime
from py3nvml import py3nvml

PRINT_TENSOR_SIZES = True
# clears GPU cache frequently, showing only actual memory usage
EMPTY_CACHE = True

gpu_profile_fn = (f"{datetime.datetime.now():%d-%b-%y-%H:%M:%S}"
                  f"-gpu_mem_prof.txt")
if 'GPU_DEBUG' in os.environ:
    print('profiling gpu usage to ', gpu_profile_fn)

_last_tensor_sizes = set()


def _trace_lines(frame, event, arg):
    if event != 'line':
        return
    if EMPTY_CACHE:
        torch.cuda.empty_cache()
    co = frame.f_code
    func_name = co.co_name
    line_no = frame.f_lineno
    filename = co.co_filename
    py3nvml.nvmlInit()
    mem_used = _get_gpu_mem_used()
    where_str = f"{func_name} in {filename}:{line_no}"
    with open(gpu_profile_fn, 'a+') as f:
        f.write(f"{where_str} --> {mem_used:<7.1f}Mb\n")
        if PRINT_TENSOR_SIZES:
            _print_tensors(f, where_str)
    py3nvml.nvmlShutdown()


def trace_calls(frame, event, arg):
    if event != 'call':
        return
    co = frame.f_code
    func_name = co.co_name
    try:
        trace_into = str(os.environ['TRACE_INTO'])
    except KeyError:
        print(os.environ)
        exit()
    if func_name in trace_into.split(' '):
        return _trace_lines
    return


def _get_gpu_mem_used():
    handle = py3nvml.nvmlDeviceGetHandleByIndex(
        int(os.environ['GPU_DEBUG']))
    meminfo = py3nvml.nvmlDeviceGetMemoryInfo(handle)
    return meminfo.used / 1024**2


def _print_tensors(f, where_str):
    global _last_tensor_sizes
    for tensor in _get_tensors():
        if not hasattr(tensor, 'dbg_alloc_where'):
            tensor.dbg_alloc_where = where_str
    new_tensor_sizes = {(x.type(), tuple(x.shape), x.dbg_alloc_where)
                        for x in _get_tensors()}
    for t, s, loc in new_tensor_sizes - _last_tensor_sizes:
        f.write(f'+ {loc:<50} {str(s):<20} {str(t):<10}\n')
    for t, s, loc in _last_tensor_sizes - new_tensor_sizes:
        f.write(f'- {loc:<50} {str(s):<20} {str(t):<10}\n')
    _last_tensor_sizes = new_tensor_sizes


def _get_tensors(gpu_only=True):
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                tensor = obj
            elif hasattr(obj, 'data') and torch.is_tensor(obj.data):
                tensor = obj.data
            else:
                continue
            if tensor.is_cuda:
                yield tensor
        except Exception:
            pass
To set up the profiler:

import os
import sys
from gpu_profile import trace_calls

os.environ['GPU_DEBUG'] = args.dev
os.environ['TRACE_INTO'] = 'train_epoch'
sys.settrace(trace_calls)
@smth I think that your method of finding all the tensors via Python's garbage collector does not account for all of them. I suppose a corner case is backpropagation, where some tensors might be saved for the backward pass in a context and transformed (probably compressed in some way), hence they no longer appear as tensors. I wrote a method to account for the saved_tensors in the context for the backward pass. Could you please check if it extracts all the saved tensors correctly?
def get_tensors(only_cuda=False, omit_objs=[]):
    """
    :return: list of active PyTorch tensors
    >>> import torch
    >>> from torch import tensor
    >>> clean_gc_return = map((lambda obj: del_object(obj)), gc.get_objects())
    >>> device = "cuda" if torch.cuda.is_available() else "cpu"
    >>> device = torch.device(device)
    >>> only_cuda = True if torch.cuda.is_available() else False
    >>> t1 = tensor([1], device=device)
    >>> a3 = tensor([[1, 2], [3, 4]], device=device)
    >>> # print(get_all_tensor_names())
    >>> tensors = [tensor_obj for tensor_obj in get_tensors(only_cuda=only_cuda)]
    >>> # print(tensors)
    >>> # We doubled each t1, a3 tensors because of the tensors collection.
    >>> expected_tensor_length = 2
    >>> assert len(tensors) == expected_tensor_length, f"Expected length of tensors {expected_tensor_length}, but got {len(tensors)}, the tensors: {tensors}"
    >>> exp_size = (2,2)
    >>> act_size = tensors[1].size()
    >>> assert exp_size == act_size, f"Expected size {exp_size} but got: {act_size}"
    >>> del t1
    >>> del a3
    >>> clean_gc_return = map((lambda obj: del_object(obj)), tensors)
    """
    add_all_tensors = False if only_cuda is True else True
    # To avoid counting the same tensor twice, create a dictionary of tensors,
    # each one identified by its id (the in-memory address).
    tensors = {}

    # omit_obj_ids = [id(obj) for obj in omit_objs]

    def add_tensor(obj):
        if torch.is_tensor(obj):
            tensor = obj
        elif hasattr(obj, 'data') and torch.is_tensor(obj.data):
            tensor = obj.data
        else:
            return
        if (only_cuda and tensor.is_cuda) or add_all_tensors:
            tensors[id(tensor)] = tensor

    for obj in gc.get_objects():
        try:
            # Add the obj if it is a tensor.
            add_tensor(obj)
            # Some tensors are "saved & hidden" for the backward pass.
            if hasattr(obj, 'saved_tensors') and (id(obj) not in omit_objs):
                for tensor_obj in obj.saved_tensors:
                    add_tensor(tensor_obj)
        except Exception:
            pass
            # print("Exception: ", ex)
            # logger.debug(f"Exception: {str(ex)}")
    return tensors.values()  # return a list of detected tensors
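For intuition on where such "saved & hidden" tensors live, here is a small sketch (the Square function is my own toy example): a custom autograd Function stores its input via save_for_backward, and that input is reachable through the graph node's saved_tensors attribute, which is what the loop above inspects.

```python
import torch

# Toy autograd Function that saves its input for the backward pass.
class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * 2 * x

x = torch.randn(3, requires_grad=True)
y = Square.apply(x)

# The saved input is reachable through the graph node, not as a free-standing
# object in the ordinary sense:
saved = y.grad_fn.saved_tensors
assert len(saved) == 1
assert torch.equal(saved[0], x)
```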
@Adam_Dziedzic If I remember correctly, saved_tensors will only be triggered on obj for functions in Python land, or functions that are directly alive. For autograd functions that are no longer alive in Python (but are kept alive because another Python object refers to them as part of a grad_fn chain), those won't show up.
@smth Where does the rest of the memory live? I have an example where walking the gc objects as above gives me a number less than half of the value returned by torch.cuda.memory_allocated(). In my case, the gc object approach gives me about 1.1GB, while torch.cuda.memory_allocated() returned 2.8GB.
Where is the rest hiding? This doesn't seem like it would be simple PyTorch bookkeeping overhead.
When you do a forward pass for a particular operation where some of the inputs have requires_grad=True, PyTorch needs to hold onto some of the inputs or intermediate values so that the backward pass can be computed.
For example: if you do y = x * x (y = x squared), then the gradient is dl/dx = grad_output * 2 * x. Here, if x requires_grad, then we hold onto x to compute the backward pass.
Take the example of:

y = x ** 2
z = y ** 2
del y

Here, even if y is deleted out of Python scope, the function z = square(y), which is in the autograd graph (and effectively is z.grad_fn), holds onto y and in turn x.
So you might not have visibility into it via the GC, but it still exists until z is deleted out of Python scope.
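This retention can be observed directly (a minimal sketch; the graph-node details are inspected via grad_fn, which PyTorch exposes for debugging):

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x ** 2
z = y ** 2
del y  # y is gone from Python scope...

# ...but the graph node z.grad_fn still refers, through its next_functions
# chain, to the node that produced y, which keeps that memory alive.
assert z.grad_fn is not None
assert z.grad_fn.next_functions[0][0] is not None

# Once backward() runs (or z itself is dropped), those buffers can be freed.
z.backward(torch.ones_like(z))
assert x.grad is not None
```

None of this is visible by walking gc.get_objects() for tensors, which is why the counts above can come up short.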
Thanks @smth. So it sounds like there is no way to programmatically count the referenced data directly in cases like that.
It would be really cool to be able to have a call that can walk a model and count memory, similar to the way the backwards pass can compute it. Really, I’d like to be able to better estimate how much memory is consumed by different parts of the computation, whether on CPU or GPU.
Thank you, @smth
Here is a minor fix to catch exceptions, since hasattr(obj, 'data') triggers all kinds of failures in Python modules unrelated to PyTorch and its tensors:

import torch
import gc

for obj in gc.get_objects():
    try:
        if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
            print(type(obj), obj.size())
    except:
        pass

edit: oh, only now noticed a follow-up from @Ben_Usman that suggested the same.
Do we have placeholder tensor in PyTorch? How do you make it?
im_data = torch.FloatTensor(1).cuda()
....
for step in range(...):
    batch_data = next(train_iter)
    im_data.data.resize_(batch_data[0].shape).copy_(batch_data[0])
    scores = net(im_data)
    ....
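The same pattern in a self-contained form (a CPU-only sketch so it runs anywhere; the batch data is dummy input): allocate one buffer up front and resize/copy each batch into it, instead of allocating a fresh tensor every iteration.

```python
import torch

# Placeholder buffer, reused across iterations.
im_data = torch.empty(1)

# Dummy stand-in for a data loader producing variable-sized batches.
batches = [torch.randn(2, 3), torch.randn(4, 3)]
for batch in batches:
    # Resize the existing buffer to fit and copy the batch in, avoiding a
    # new allocation (and CUDA memory fragmentation) per step.
    im_data.resize_(batch.shape).copy_(batch)
    assert im_data.shape == batch.shape
    assert torch.equal(im_data, batch)
```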
In the example you give (or more generally), how does one detect that y remains undeleted even if GC doesn’t acknowledge it?
Since 2018, have there been any tools for debugging memory leaks?
Great, I think the GPU memory leak issue is in adding new nodes in the graph. Can you share how to add placeholder in PyTorch?
This post actually solved my problem! I use dynamic batch sizes and I was getting an OOM CUDA error after a few iterations. Starting from the largest possible batch eliminated the problem and I get higher GPU utilization. Maybe this could be mentioned somewhere in the docs regarding data loading if it is not already, because it can really help increase batch size.
Unfortunately, this solution still doesn't work for me :(
I see memory_allocated increasing batch after batch, while the tensors returned by this function don't change… any thoughts?
I guess, based on @smth's comment, there is something in the graph that's being kept around but not reported by gc… I am having a memory bug here (Advice on debugging a GPU memory leak in graph?) where, even when the model is in eval mode and with torch.no_grad(), there is increasing memory. However, I tried creating a minimal working example that creates a node in the graph (via a matmul) by multiplying an input by a parameter that requires a gradient, and then calls a forward pass many times, but I don't see any increase in allocated memory.
For what it's worth, in my other post, if I replace the matmul with a simple +, there's no memory leak…
@Even_Oldridge @ptrblck Is there any documentation about the behaviour @Even_Oldridge describes in this earlier comment? I would like to better understand the mechanisms in PyTorch that lead to this behaviour.
I have this problem too, and my process is killed due to a memory leak. Can you explain a little more about how you make sure that the largest-length batch goes first? Don't you shuffle the data?
I think you’re right that it’s a hack.
I just recently tried a similar solution with deletion. It did not work. I used gc.collect() + torch.cuda.empty_cache(). I still somehow ended up with a memory leak, which was only fixed when I started using reusable tensors.
Hey, can you tell me what you meant by reusable tensors?