How to interpret the "CUDA out of memory" error message

Sorry, I am relatively new to PyTorch and I know this is an old and common problem:
RuntimeError: CUDA out of memory. Tried to allocate 160.00 MiB (GPU 0; 14.76 GiB total capacity; 13.24 GiB already allocated; 97.75 MiB free; 13.63 GiB reserved in total by PyTorch)

I have been trying to train a CycleGAN for two days, but I keep running into GPU memory problems like the one above.
While trying to solve this runtime error, I tried to interpret the message. My reading is: PyTorch first tried to allocate 160 MiB on the GPU during my runtime session. This particular GPU has 14.76 GiB of total memory, but only 13.63 GiB is allowed to be used by PyTorch? 13.24 GiB has already been allocated in this runtime session, so only 97.75 MiB of free memory is left, and since the requested 160 MiB is larger than the free 97.75 MiB, it throws the memory error?
Am I interpreting the message correctly?
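
(For reference, here is a rough sketch of how I have been checking those numbers from inside PyTorch; I am assuming a single GPU, i.e. device 0:)

```python
import torch

# Read the same quantities the OOM message reports (device 0 assumed).
free, total = torch.cuda.mem_get_info(0)    # free / total device memory, in bytes
allocated = torch.cuda.memory_allocated(0)  # memory currently held by live tensors
reserved = torch.cuda.memory_reserved(0)    # memory reserved by PyTorch's caching allocator

gib = 1024 ** 3
print(f"total: {total / gib:.2f} GiB, free: {free / gib:.2f} GiB")
print(f"allocated by PyTorch: {allocated / gib:.2f} GiB, "
      f"reserved by PyTorch: {reserved / gib:.2f} GiB")
```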

If so, I don't understand why my training uses so much memory. For CycleGAN, my generator has about 6 million parameters and the discriminator maybe 0.5 million. Adding those up should come to around 40-50 MB. During training my images are 256 x 256 with a batch size of 30, which comes to around 8-9 MB. Even added together, that is way less than 160 MB. I don't think there is a memory leak, because the error is thrown in the first training epoch, during the first forward pass of my generator. So did I mess something up, or is this behavior expected?

Also, is it true that the 13.24 GiB of allocated GPU memory in the error message was allocated by me as well, or is it being used by some other application? I am asking because I am not doing anything with the GPU besides training the model, so how in the world did 13.24 GiB get allocated already?

One last question. I know that to fix the problem I need to reduce either the batch size or the model size. However, I have often found that when I do so, the reported required allocation sometimes even increases. For example, if I halve the channels in the generator's conv layers, the expected memory usage should drop by close to half, but sometimes the error says the requested allocation actually increased by half. Why does that happen? Similarly for the free memory: sometimes it shows, say, 80 MB free, but when I reduce the batch size and re-run it says only 40 MB is left. Why is that happening as well?

Sorry for the many questions, but this memory problem is just really frustrating to solve.

Yes, the issue is that more memory is required than is available given the current allocations. Note that typical training uses far more memory than just the model parameters and the input data: all of the intermediate activations are typically stored for the backward pass, and these often account for the bulk of the memory used during training. So memory requirements far greater than the parameters plus input size are expected.
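
As a rough illustration (a minimal sketch with made-up layer sizes, not your actual CycleGAN, and it assumes a CUDA device is available), you can compare the parameter memory of a small conv stack with the memory allocated during a single forward pass:

```python
import torch
import torch.nn as nn

# Small made-up conv stack, just to illustrate activation memory.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1),
).cuda()

param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"parameters: {param_bytes / 2**20:.2f} MiB")  # well under 1 MiB here

x = torch.randn(30, 3, 256, 256, device="cuda")
before = torch.cuda.memory_allocated()
out = model(x)  # intermediate activations are kept for the backward pass
after = torch.cuda.memory_allocated()
print(f"extra memory after forward: {(after - before) / 2**20:.2f} MiB")  # hundreds of MiB
```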

(A clarification on the input images: assuming single-precision input, a batch of 30 images at 256x256 with 3 channels uses 30 x 256 x 256 x 3 x 4 bytes, which is closer to 23 MB.)
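
(In code, just to make that arithmetic explicit:)

```python
import torch

# A batch of 30 RGB images of 256x256 in float32.
batch = torch.empty(30, 3, 256, 256, dtype=torch.float32)
print(batch.numel() * batch.element_size())        # 23_592_960 bytes
print(batch.numel() * batch.element_size() / 1e6)  # ~23.6 MB
```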

You can check the distribution of memory usage via nvidia-smi or similar tools. You might be able to get a few tens of MB back by killing any graphical environments (e.g., a Linux desktop environment) that are running concurrently on the GPU.
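
From inside the process itself, PyTorch can also print its own allocator statistics (this only covers memory managed by PyTorch in your process, so nvidia-smi is still the way to see other processes):

```python
import torch

# Breakdown of what PyTorch's caching allocator is holding on device 0.
print(torch.cuda.memory_summary(device=0, abbreviated=True))
```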

For this last part, it is likely due to the different sizes of the tensors that need to be allocated. The total memory requirement of the model may be lower when you reduce the channels, but the allocation may fail at a different point in the model, so the error message reports a different number. For example, consider a hypothetical scenario where your model needs to allocate 3 tensors of 30 MiB each but only 50 MiB of GPU memory is available in total. After the first allocation, it will fail with 20 MiB remaining while trying to allocate 30 MiB for the second tensor (30/50 MiB used). If you reduce each tensor to 20 MiB, it will fail after the second allocation while trying to allocate 20 MiB for the third tensor (40/50 MiB used). So the "how much is free" value reported in the error is really just for diagnostic purposes and not an indicator that what you are doing isn't helping.
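
A toy way to see why the reported numbers move around (pure illustration, the sizes are made up):

```python
def simulate(free_mib, tensor_sizes_mib):
    """Pretend allocator: report where an OOM would occur for a sequence of allocations."""
    used = 0
    for i, size in enumerate(tensor_sizes_mib):
        if used + size > free_mib:
            return (f"OOM at tensor {i}: tried to allocate {size} MiB "
                    f"with {free_mib - used} MiB free")
        used += size
    return "everything fits"

print(simulate(50, [30, 30, 30]))  # fails on the 2nd tensor with 20 MiB free
print(simulate(50, [20, 20, 20]))  # smaller tensors, but fails on the 3rd with 10 MiB free
```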


Thanks. But just to clarify: is the 13.24 GiB of allocated GPU memory in the error message also used by my runtime session, or is it used by some other application I am not aware of? Compared to the 160 MiB, it is still a 100x jump, which is a lot if it all comes from my runtime session.

It is probably used by your session, but it is hard to know without running something like nvidia-smi to see which process the memory belongs to.
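
If you prefer to check from inside Python, one option is simply shelling out to nvidia-smi (this assumes nvidia-smi is on your PATH):

```python
import subprocess

# List the processes currently holding GPU memory and how much each one uses.
result = subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```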