Hi folks, I’m trying to get to the bottom of an out-of-memory issue I’m seeing. I reliably run out of memory partway through my training run. It doesn’t appear to be a leak, since memory usage resets to the same level at the start of each update, but I haven’t been able to locate the problem.
I am dumping the GPU memory in use as follows:
import gc
import torch

seen, total = set(), 0
for obj in gc.get_objects():
    if torch.is_tensor(obj):
        tensor = obj
    elif hasattr(obj, 'data') and torch.is_tensor(obj.data):
        tensor = obj.data
    elif hasattr(obj, 'grad') and torch.is_tensor(obj.grad):
        tensor = obj.grad
    else:
        continue
    if tensor.is_cuda:
        store = tensor.storage()
        if store.data_ptr() not in seen:  # count each underlying storage only once
            seen.add(store.data_ptr())
            total += store.size() * store.element_size()
This should give me the total size, in bytes, of tensor storage resident on the GPU. At the end of an update, after backward() has run, I’m seeing less than 3GB according to the above code. I added the memory dump at numerous places in the forward pass (roughly as sketched below) and don’t see more than 3GB anywhere.
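For what it’s worth, this is roughly how I’m sprinkling the dump through the forward pass. cuda_tensor_bytes() is just the loop above wrapped in a function, and the 'after encoder' label is a placeholder for wherever I happen to call it:

import gc
import torch

def cuda_tensor_bytes():
    # Essentially the loop above, wrapped up so I can call it anywhere.
    seen, total = set(), 0
    for obj in gc.get_objects():
        for t in (obj, getattr(obj, 'data', None), getattr(obj, 'grad', None)):
            if torch.is_tensor(t) and t.is_cuda:
                s = t.storage()
                if s.data_ptr() not in seen:
                    seen.add(s.data_ptr())
                    total += s.size() * s.element_size()
    return total

# Example call site in the forward pass ('after encoder' is just a placeholder):
print('after encoder: %.2f GiB' % (cuda_tensor_bytes() / 1024 ** 3))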
When I pause in the debugger, run torch.cuda.empty_cache(), and then run nvidia-smi, I see 4.5GB usually and sometimes 5.5GB. However, the run still dies with an out-of-memory error during backward() on one of my minibatches. That minibatch isn’t larger than the others, and even if it were, I don’t see where the memory could be going.
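One thing I’m planning to try is catching the OOM right where it happens and dumping the tracked total at that point. This is just a sketch: loss comes from my forward pass, and cuda_tensor_bytes() is the helper sketched above.

try:
    loss.backward()
except RuntimeError as e:
    if 'out of memory' in str(e).lower():
        # Dump what my gc-based count sees at the exact point of failure.
        torch.cuda.empty_cache()
        print('OOM in backward(); tracked tensor storage: %.2f GiB'
              % (cuda_tensor_bytes() / 1024 ** 3))
    raise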
I’m running torch 0.3.0 on a POWER8 machine with a K80 GPU (12GB of GPU memory).
This is a mystery to me. Any help would be appreciated.