Sorry, I'm relatively new to PyTorch, and I know this is an old and common problem:
RuntimeError: CUDA out of memory. Tried to allocate 160.00 MiB (GPU 0; 14.76 GiB total capacity; 13.24 GiB already allocated; 97.75 MiB free; 13.63 GiB reserved in total by PyTorch)
I've been trying to train a CycleGAN for two days, but I keep running into GPU memory errors like the one above.
While trying to fix the CUDA out-of-memory error, I tried to interpret the message. My reading: PyTorch first tried to allocate 160 MiB on the GPU during my session. The GPU has 14.76 GiB of total capacity, but only 13.63 GiB of it is reserved for use by PyTorch. 13.24 GiB has already been allocated in this session, so only 97.75 MiB is left free, and since the requested 160 MiB is larger than the free 97.75 MiB, it throws the memory error.
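To double-check that reading, here is the arithmetic on the numbers from the error message (all values copied from the message above; nothing else is measured):

```python
# Numbers taken directly from the error message (GiB unless noted).
total_capacity = 14.76      # total memory on GPU 0
reserved = 13.63            # reserved in total by PyTorch's caching allocator
allocated = 13.24           # already allocated (occupied by tensors)
free_mib = 97.75            # reported free, in MiB
requested_mib = 160.00      # the allocation that failed, in MiB

# Memory PyTorch has reserved but not currently handed out to tensors:
cached_unused_gib = round(reserved - allocated, 2)
print(cached_unused_gib)            # 0.39 GiB sitting in the allocator's cache

# The request exceeds the reported free memory, hence the error:
print(requested_mib > free_mib)     # True
```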
Am I interpreting the message correctly?
If yes, I don't know why my training uses so much memory. For CycleGAN, my generator has about 6 million parameters and the discriminator maybe 0.5 million. Adding those up should come to around 40-50 MB. During training, my images are 256 × 256 with a batch size of 30, which comes to roughly 8-9 MB per batch. Even added together, that's far less than 160 MB. I don't think there's a memory leak, because the error is thrown in the first training epoch, during the first forward pass of my generator. So did I mess something up, or is this behavior expected?
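For reference, here is the rough fp32 arithmetic behind those estimates (the 64-channel feature map at the end is just an assumed example layer, not my exact architecture, to show that intermediate activations also live on the GPU):

```python
BYTES_FP32 = 4       # bytes per float32 value
MIB = 1024 ** 2      # bytes per MiB

# Parameters: ~6.5M total across generator + discriminator.
param_bytes = 6_500_000 * BYTES_FP32
print(round(param_bytes / MIB, 1))   # 24.8 MiB

# One input batch: 30 RGB images at 256 x 256.
batch_bytes = 30 * 3 * 256 * 256 * BYTES_FP32
print(round(batch_bytes / MIB, 1))   # 22.5 MiB

# But a single intermediate activation can be much larger, e.g. an
# assumed 64-channel 256 x 256 feature map for the same batch of 30:
act_bytes = 30 * 64 * 256 * 256 * BYTES_FP32
print(round(act_bytes / MIB, 1))     # 480.0 MiB
```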
Also, is the 13.24 GiB of GPU memory in the error message allocated by me as well, or is it used by some other application? I ask because I'm not doing anything with the GPU besides training this model, so how in the world did 13.24 GiB get allocated already?
One last question. I know that to fix the problem I need to reduce either the batch size or the model size. But I've often found that when I do, the memory the error says it tried to allocate sometimes even increases. For example, if I halve the channels in my generator's conv layers, I'd expect memory usage to drop by close to half, but sometimes the error reports that the requested allocation actually increased by half. Why does that happen? The same goes for the free memory: sometimes it shows, say, 80 MiB free, but when I reduce the batch size and re-run, it says only 40 MiB is left. Why is that happening as well?
Sorry for the many questions, but this memory problem is just really frustrating to solve.