Hi folks, I’m trying to get to the bottom of an out-of-memory issue I’m seeing. I reliably run out of memory partway through my training run. It doesn’t appear to be a leak, since memory usage resets to the same level at the start of each update, but I haven’t been able to locate the problem.
I am dumping the GPU memory in use as follows:
import gc
import torch

seen, total = set(), 0
for obj in gc.get_objects():
    if torch.is_tensor(obj):
        tensor = obj
    elif hasattr(obj, 'data') and torch.is_tensor(obj.data):
        tensor = obj.data
    elif hasattr(obj, 'grad') and torch.is_tensor(obj.grad):
        tensor = obj.grad
    else:
        continue
    if tensor.is_cuda:
        store = tensor.storage()
        if store.data_ptr() not in seen:  # count each underlying storage only once
            seen.add(store.data_ptr())
            total += store.size() * store.element_size()
This should give me the total size, in bytes, of tensor storage resident on the GPU. At the end of an update, after backward() has run, I’m seeing less than 3GB according to the above code. I added the memory dump at numerous places in the forward pass (roughly as sketched below) and don’t see more than 3GB anywhere.
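For what it’s worth, this is roughly how I’m sprinkling the dump through the forward pass. cuda_tensor_bytes() is just the loop above wrapped in a function, and the 'after encoder' label is a placeholder for wherever I happen to call it:

import gc
import torch

def cuda_tensor_bytes():
    # Essentially the loop above, wrapped up so I can call it anywhere.
    seen, total = set(), 0
    for obj in gc.get_objects():
        for t in (obj, getattr(obj, 'data', None), getattr(obj, 'grad', None)):
            if torch.is_tensor(t) and t.is_cuda:
                s = t.storage()
                if s.data_ptr() not in seen:
                    seen.add(s.data_ptr())
                    total += s.size() * s.element_size()
    return total

# Example call site in the forward pass ('after encoder' is just a placeholder):
print('after encoder: %.2f GiB' % (cuda_tensor_bytes() / 1024 ** 3))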
When I pause in the debugger, run torch.cuda.empty_cache(), and then run nvidia-smi, I see 4.5GB usually and sometimes 5.5GB. However, the run still dies with an out-of-memory error during backward() on one of my minibatches. That minibatch isn’t larger than the others, and even if it were, I don’t see where the memory could be going.
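One thing I’m planning to try is catching the OOM right where it happens and dumping the tracked total at that point. This is just a sketch: loss comes from my forward pass, and cuda_tensor_bytes() is the helper sketched above.

try:
    loss.backward()
except RuntimeError as e:
    if 'out of memory' in str(e).lower():
        # Dump what my gc-based count sees at the exact point of failure.
        torch.cuda.empty_cache()
        print('OOM in backward(); tracked tensor storage: %.2f GiB'
              % (cuda_tensor_bytes() / 1024 ** 3))
    raise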
I’m running torch 0.3.0 on a POWER8 machine with a K80 GPU (12GB of GPU memory).
This is a mystery to me. Any help would be appreciated.