CUDA Out of Memory even though the model and input fit into memory

there’s this weird thing happening: i have a custom Residual UNet with about 34M params (133MB), and the input is a batch of 512 tensors of shape (6, 192, 192). everything should fit into memory, but it doesn’t; the run crashes after consuming the entire gpu memory.

here’s the model:

here’s the colab file running the model:

i have no clue what the problem is. at one point everything worked, on a P100, but that magic moment never came back, and i couldn’t figure out why it worked then or why it doesn’t work anymore.

is there a memory leak somewhere that i’m too blind to notice?

an even crazier thing is that sometimes, even though i delete the model, i am not able to free the memory it occupied on the gpu; it just stays there until i reset the runtime. so the one time the model accidentally ran successfully, i couldn’t run it in eval mode afterwards, because it crashed while allocating memory.

you need to reduce your batch size.
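
To see why batch size matters so much here, a rough estimate helps. The sketch below assumes float32 tensors and a hypothetical first layer that outputs 64 channels at full resolution (the actual model is not shown in this thread), so treat the numbers as illustrative only.

```python
# Back-of-the-envelope estimate; float32 storage is an assumption.
BYTES_PER_FLOAT32 = 4


def tensor_mib(*shape: int) -> float:
    """Size in MiB of a float32 tensor with the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * BYTES_PER_FLOAT32 / 2**20


# The input batch from the question: 512 x (6, 192, 192).
input_mib = tensor_mib(512, 6, 192, 192)

# Hypothetical 64-channel feature map at the same resolution.
feature_mib = tensor_mib(512, 64, 192, 192)

print(f"input batch:            {input_mib:.0f} MiB")
print(f"one 64-channel feature: {feature_mib:.0f} MiB")
```

Under these assumptions a single full-resolution feature map is already several GiB, and a UNet keeps many such tensors alive for the backward pass, so the 133MB of weights is a small part of the total.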

i reduced the batch_size to 64 and ran the model. it runs in train mode, but then i cannot run the test pass on it; it crashes.
i tried deleting the input tensor and clearing the cache, but that didn’t help.

i was just testing which batch size would work. for now i see that batch size 32 works, but only sometimes. will do a little more testing and come back here.

there is no training loop, i am just checking if i can at least do a forward pass across my entire dataset.
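
If only a forward pass is needed, one thing that may help is running it under torch.no_grad(), so autograd does not keep intermediate activations around. This is a minimal sketch: the tiny model and the list of random batches are stand-ins, since the real UNet and dataset are not shown in this thread.

```python
import torch
from torch import nn

# Tiny stand-in model; the real UNet is not shown in the thread.
model = nn.Sequential(
    nn.Conv2d(6, 8, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 1, 3, padding=1),
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# Placeholder for a DataLoader over the real dataset.
batches = [torch.randn(4, 6, 32, 32) for _ in range(3)]

outputs = []
with torch.no_grad():  # no autograd graph, so activations are not saved for backward
    for batch in batches:
        out = model(batch.to(device))
        outputs.append(out.cpu())  # keep results on the host, not on the GPU

print(outputs[0].shape, outputs[0].requires_grad)
```

Moving each result to the CPU before storing it avoids accumulating output tensors on the GPU across the whole dataset.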


You can check the doc about how we manage the CUDA memory here.
In particular, this will explain why the memory is not returned to the OS when you delete your model.

When trying batch sizes, many things can change the way memory is allocated on the GPU and so, because of the caching allocator, slightly change the memory usage. This is unfortunately expected, so you want to keep a little bit of extra memory free to make sure you don’t have any issues.
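
A small sketch of how the caching allocator shows up in practice: torch.cuda.memory_allocated() counts bytes held by live tensors, while torch.cuda.memory_reserved() counts what the allocator keeps cached. The helper name report and the tensor size are made up for illustration, and the snippet falls back to zeros when no GPU is present.

```python
import torch


def report(tag: str) -> dict:
    """Print and return allocated vs. reserved bytes on the current CUDA device."""
    if torch.cuda.is_available():
        stats = {
            "allocated": torch.cuda.memory_allocated(),  # bytes held by live tensors
            "reserved": torch.cuda.memory_reserved(),    # bytes held by the caching allocator
        }
    else:
        stats = {"allocated": 0, "reserved": 0}  # no GPU: nothing to report
    print(f"{tag:>12}: allocated={stats['allocated'] / 2**20:.1f} MiB, "
          f"reserved={stats['reserved'] / 2**20:.1f} MiB")
    return stats


report("start")
if torch.cuda.is_available():
    x = torch.empty(64, 1024, 1024, device="cuda")  # 256 MiB of float32
    report("after alloc")
    del x  # the tensor is gone, but its block stays in PyTorch's cache
    report("after del")
```

On a GPU, the reserved figure typically stays high after the del, which is the memory that tools like nvidia-smi keep showing as used.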

everything should fit into memory, although it doesn’t, it crashes consuming the entire gpu memory

I don’t think you took into account the memory used for the gradients of each parameter, or the memory used by the intermediate states that must be saved to compute the backward pass.
These states are expected to consume the most memory. If you are tightly memory-limited, you can use modules like the checkpoint one here to reduce the memory used by these states at the cost of extra computation.
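
As a rough illustration of the checkpointing idea, here is a minimal sketch using torch.utils.checkpoint.checkpoint. The Block and CheckpointedNet classes are hypothetical stand-ins, not the model from the question, and use_reentrant=False assumes a reasonably recent PyTorch version.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    """Hypothetical residual block standing in for one stage of a UNet."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)


class CheckpointedNet(nn.Module):
    def __init__(self, channels: int = 8, depth: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList([Block(channels) for _ in range(depth)])

    def forward(self, x):
        for block in self.blocks:
            # Activations inside `block` are not kept; they are recomputed in backward.
            x = checkpoint(block, x, use_reentrant=False)
        return x


net = CheckpointedNet()
x = torch.randn(2, 8, 16, 16)
loss = net(x).sum()
loss.backward()  # triggers the recomputation of each block's forward pass
```

Each checkpointed block only stores its input during the forward pass, so peak memory drops at the price of running every block's forward a second time.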


reduce batch size, reduce image size, clear cache with:

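  import torch, gc
  gc.collect(); torch.cuda.empty_cache()  # collect unreachable objects, then release cached GPU blocks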

i realized my model was way too dense to keep track of gradients, so i converted my ops to addition ops
torch.cuda.empty_cache() works, but the problem is it takes up training time, so i called it once after every epoch, and now everything works.

thanks all

I wouldn’t recommend empty_cache at all. It is called automatically if you’re about to run out of memory and, as you saw, it slows down your training.

What would come the closest to emptying out what’s on the GPU?

The only things PyTorch puts on the GPU are the CUDA runtime (which we don’t control and can’t deallocate) and Tensors.
To remove the Tensors, you simply need to stop referencing them from Python.
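
A minimal sketch of what "stop referencing" means in practice. The names probe and cache are made up for illustration; the weak reference is only there to observe when the tensor actually goes away.

```python
import gc
import weakref

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.zeros(1024, 1024, device=device)
probe = weakref.ref(x)  # lets us observe the tensor without keeping it alive
cache = [x]             # a second, easy-to-forget reference

del x
still_alive = probe() is not None  # `cache` still holds the tensor

cache.clear()
gc.collect()                       # only matters when reference cycles are involved
freed = probe() is None            # no references left, so the memory is released

print(still_alive, freed)
```

Lists of outputs, stored losses that still carry a graph, and notebook variables are common places where such a leftover reference hides.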
