GPU out of memory error on loss.backward() but plenty of memory available - ModelParallel

I am training models on large volumetric images with a batch size of 1, where a single example is 402 x 420 x 420 voxels, so I cannot make the batch size any smaller, and I also cannot make the training examples any smaller. I am using two NVIDIA Titan Xp GPUs, and I have trained many models on this data set successfully.
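For reference, a single training batch looks roughly like this (I am assuming a single-channel volume here for illustration; the exact channel count is not the point):

```python
import torch

# One single-channel volume per batch: (batch, channels, depth, height, width)
x = torch.randn(1, 1, 402, 420, 420)  # about 284 MB in float32, before any activations
```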

However, over the past week I have been getting strange out-of-memory errors at loss.backward() for some models that are almost identical to models I have run successfully, except for a couple of extra layers that add a tiny number of additional parameters relative to the overall model size.

Here is the error: PyTorch claims that it cannot allocate 1.41 GiB even though it also reports plenty of free memory:

When I put everything on GPU 0 and leave GPU 1 empty:
RuntimeError: CUDA out of memory. Tried to allocate 1.41 GiB (GPU 0; 11.91 GiB total capacity; 4.28 GiB already allocated; 6.42 GiB free; 675.20 MiB cached)

When I use ModelParallel and put part of the model on GPU 0 and the other part on GPU 1 (I have tried splitting it in different ways):
RuntimeError: CUDA out of memory. Tried to allocate 1.41 GiB (GPU 1; 11.91 GiB total capacity; 4.27 GiB already allocated; 6.09 GiB free; 1016.44 MiB cached)
RuntimeError: CUDA out of memory. Tried to allocate 1.41 GiB (GPU 1; 11.91 GiB total capacity; 270.62 MiB already allocated; 11.09 GiB free; 3.38 MiB cached)
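For context, the ModelParallel wrapper is essentially a manual two-GPU split, roughly like the sketch below (the layer groups and names are placeholders, not my actual architecture):

```python
import torch
import torch.nn as nn

class ModelParallel(nn.Module):
    """Manually split a model into two chunks, one per GPU."""
    def __init__(self, part1, part2):
        super().__init__()
        self.part1 = part1.to("cuda:0")  # first group of layers on GPU 0
        self.part2 = part2.to("cuda:1")  # remaining layers on GPU 1

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # move the intermediate activations over to the second GPU
        return self.part2(x.to("cuda:1"))
```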

I am calling torch.cuda.empty_cache(), and since my input size is fixed I have set torch.backends.cudnn.benchmark = True and torch.backends.cudnn.enabled = True (although the error occurs regardless of how I set these flags).
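Concretely, the setup and training step look roughly like this (a stripped-down sketch; model, criterion, and optimizer stand in for my actual code):

```python
import torch

torch.backends.cudnn.benchmark = True   # input size is fixed, so benchmarking should help
torch.backends.cudnn.enabled = True

def train_step(model, criterion, optimizer, volume, target):
    optimizer.zero_grad()
    output = model(volume)                       # the forward pass completes fine
    loss = criterion(output, target.to(output.device))
    loss.backward()                              # <-- the OOM is raised here
    optimizer.step()
    torch.cuda.empty_cache()                     # calling this does not prevent the error
    return loss.item()
```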

I have read that this error can be caused by memory fragmentation, but I do not know what I can do to resolve it, and I also do not understand why fragmentation would be so much worse here, since the models that trigger the error are extremely similar to models that do not.
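To try to narrow it down, I have been printing the allocator statistics right before the failing backward call, roughly like this (on newer PyTorch versions memory_cached is named memory_reserved):

```python
import torch

def report_gpu_memory(tag=""):
    """Print allocated vs. cached memory per GPU to see what the caching allocator is holding."""
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 1024 ** 2
        cached = torch.cuda.memory_cached(i) / 1024 ** 2  # memory_reserved() on newer PyTorch
        print(f"{tag} GPU {i}: {alloc:.1f} MiB allocated, {cached:.1f} MiB cached")

# called e.g. as report_gpu_memory("before backward") right before loss.backward()
```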

I’ve been trying everything I can to solve this problem. If anyone has suggestions about how to troubleshoot it, I would be very grateful.