Keep getting CUDA OOM error with PyTorch failing to allocate all free memory

hi,
i have been getting cuda oom for some time now, but i couldn't find the reason.
memory fragmentation could be the reason…

it was fine until i upgraded to pytorch 1.9.0 (and now 1.10.0), started using ddp (with a single gpu), trained with minibatch size > 1, and evaluated (for validation) with minibatch size 1. also, i just realized that the trainsets of my datasets have a last minibatch that is smaller than the minibatch size of 32 (3, 4, or 8 samples).

more likely cause of oom:

  • oom used to happen more frequently during validation (even on the first run, before training starts). could validation with minibatch size 1 cause a lot of memory fragmentation? now, i call torch.cuda.empty_cache() before starting evaluation and right after all evaluation ends (see the sketch after the traceback below). i ran a few jobs with this strategy and i no longer see oom in the validation part. i also switched from the nccl backend to gloo, since some recommend against nccl for a single gpu and it is not really necessary anyway: there is no communication nor synchronization.

  • now, oom is happening during training. i don't want to call torch.cuda.empty_cache() frequently because i learned that it is expensive. could the last minibatch of the trainset be causing memory fragmentation? the error occurs only after running many epochs, so memory fragmentation is a plausible cause, as it does not show up right away.

    scaler.scale(loss).backward()
  File "home/lib/python3.7/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "home/lib/python3.7/site-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 160.00 MiB (GPU 0; 14.76 GiB total capacity; 12.64 GiB already allocated; 161.75 MiB free; 13.26 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
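
for reference, here is roughly what i mean by the empty_cache strategy around validation (a minimal sketch; model, val_loader, and device are placeholders for my actual objects):

    import torch

    def validate(model, val_loader, device):
        torch.cuda.empty_cache()        # release cached blocks before evaluation
        model.eval()
        with torch.no_grad():           # no autograd graph needed for validation
            for x, y in val_loader:     # validation runs with minibatch size 1
                out = model(x.to(device))
                # ... compute the validation metric from out and y ...
        torch.cuda.empty_cache()        # release cached blocks after evaluation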

the application here is a cnn.
i read somewhere that in nlp apps, they pad the minibatch to a large max sequence size (because sequences have different lengths) in order to:

  1. allocate the same block size for every minibatch (a rough sketch of this padding follows the list).
  2. keep the minibatch size the same when using torch.backends.cudnn.benchmark=True (as i do), because cudnn has already tuned its algorithms for that size, which avoids re-selecting the algorithm every time.
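
for what it is worth, here is the padding idea as i understand it, as a rough sketch (max_len and pad_id are placeholders i made up, not something from my code):

    import torch
    import torch.nn.functional as F

    def pad_to_fixed(seq, max_len, pad_id=0):
        # right-pad a 1-d tensor of token ids up to max_len so that every
        # minibatch ends up with the same shape (and thus the same block size)
        return F.pad(seq, (0, max_len - seq.size(0)), value=pad_id)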

so, probably the last minibatch is the one causing the memory fragmentation leading to oom.
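
if that is the case, the simplest fix i can think of is to drop the incomplete last minibatch in the dataloader. a sketch with a dummy dataset standing in for my trainset (with ddp i would pass a DistributedSampler instead of shuffle=True, and i believe the sampler also accepts drop_last):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # dummy stand-in for my trainset: 100 samples of shape (3, 32, 32)
    train_set = TensorDataset(torch.randn(100, 3, 32, 32),
                              torch.randint(0, 10, (100,)))

    # drop_last=True skips the incomplete last batch (100 % 32 = 4 samples),
    # so every training minibatch has exactly 32 samples
    train_loader = DataLoader(train_set, batch_size=32, shuffle=True,
                              drop_last=True)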

less likely cause of oom:

  • does the pytorch allocator in >= 1.9.0 have an issue? does ddp make it worse?

is this the right way to limit block splitting?

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
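
and, as an alternative to the shell export, i assume the same thing can be set from python as long as it happens before torch touches the cuda allocator (that timing is my assumption, i have not verified it):

    import os
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

    import torch  # imported only after the env var is set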

what is the "best" max_split_size_mb value?
the pytorch doc does not really explain much about this choice. it only mentions that the cost in terms of performance (i assume speed) could range from none to huge.
can you explain a bit about the cost?

i learned here that defragmenting cuda memory is not a thing, yet.

i will first fix the last minibatch before trying PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb, to avoid its potential cost.

thanks
