CUDA out of memory, even when I have enough free memory [SOLVED]

EDIT: SOLVED - it was a number-of-workers problem; I fixed it by lowering them

I am using a 24GB Titan RTX for an image segmentation U-Net with PyTorch.

It keeps throwing CUDA out of memory at different batch sizes, even though I have more free memory than it says it needs, and lowering the batch size actually INCREASES the memory it tries to allocate, which doesn’t make any sense.

Here is what I tried:

Image size = 448, batch size = 8

  • “RuntimeError: CUDA error: out of memory”

Image size = 448, batch size = 6

  • “RuntimeError: CUDA out of memory. Tried to allocate 3.12 GiB (GPU 0; 24.00 GiB total capacity; 2.06 GiB already allocated; 19.66 GiB free; 2.31 GiB reserved in total by PyTorch)”

It says it tried to allocate 3.12 GiB while I have 19.66 GiB free, and it still throws an error??

Image size = 224, batch size = 8

  • “RuntimeError: CUDA out of memory. Tried to allocate 28.00 MiB (GPU 0; 24.00 GiB total capacity; 2.78 GiB already allocated; 19.15 GiB free; 2.82 GiB reserved in total by PyTorch)”

Image size = 224, batch size = 6

  • “RuntimeError: CUDA out of memory. Tried to allocate 344.00 MiB (GPU 0; 24.00 GiB total capacity; 2.30 GiB already allocated; 19.38 GiB free; 2.59 GiB reserved in total by PyTorch)”

Reduced the batch size, but it tried to allocate more???

Image size = 224, batch size = 4

  • “RuntimeError: CUDA out of memory. Tried to allocate 482.00 MiB (GPU 0; 24.00 GiB total capacity; 2.21 GiB already allocated; 19.48 GiB free; 2.50 GiB reserved in total by PyTorch)”

Image size = 224, batch size = 2

  • “RuntimeError: CUDA out of memory. Tried to allocate 1.12 GiB (GPU 0; 24.00 GiB total capacity; 1.44 GiB already allocated; 19.88 GiB free; 2.10 GiB reserved in total by PyTorch)”

Image size = 224, batch size = 1

  • “RuntimeError: CUDA out of memory. Tried to allocate 1.91 GiB (GPU 0; 24.00 GiB total capacity; 894.36 MiB already allocated; 20.94 GiB free; 1.03 GiB reserved in total by PyTorch)”

Even with stupidly low image sizes and batch sizes…


I think there is a GPU memory leak. These usually happen when you don’t detach tensors from the computation graph during inference and keep appending them (e.g. losses or predictions) to a list. Can you post your source code here?
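
For example, a pattern like this will keep every iteration’s graph alive and slowly eat GPU memory (toy model and dummy tensors just so the snippet runs, obviously not your actual U-Net):

    import torch
    import torch.nn as nn

    # Toy stand-ins so the snippet is runnable; the real code would be a U-Net.
    model = nn.Conv2d(3, 1, 3, padding=1).cuda()
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters())

    losses = []
    for _ in range(10):
        images = torch.randn(2, 3, 224, 224, device="cuda")
        masks = torch.rand(2, 1, 224, 224, device="cuda")

        preds = model(images)
        loss = criterion(preds, masks)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # losses.append(loss)        # leaks: keeps every graph alive on the GPU
        losses.append(loss.item())   # fine: stores a plain Python float

    # During validation/inference, build no graph at all:
    with torch.no_grad():
        preds = model(torch.randn(2, 3, 224, 224, device="cuda"))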

I think I solved the problem: it was a number-of-workers problem. I lowered them and it seems OK now.
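
For anyone who lands here later, the change was roughly this (dummy dataset just so the snippet runs; the batch size and worker count are simply the values I ended up with):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Dummy data standing in for the real segmentation dataset.
    train_dataset = TensorDataset(
        torch.randn(64, 3, 224, 224),  # images
        torch.rand(64, 1, 224, 224),   # masks
    )

    train_loader = DataLoader(
        train_dataset,
        batch_size=8,
        shuffle=True,
        num_workers=2,   # was set much higher; lowering it made the OOM go away
        pin_memory=True,
    )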

You can always run nvidia-smi to see if the processes that you launch are the only ones consuming GPU memory.

There may be other system processes that use the GPU, but they usually don’t take more than ~100 MB (on Ubuntu).
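
If you want to check from inside the script as well, PyTorch exposes its own counters for the memory its caching allocator manages (this only shows what your script holds, not other processes):

    import torch

    # Memory actually handed out to tensors vs. reserved by the caching allocator
    print(f"allocated: {torch.cuda.memory_allocated() / 1024**2:.1f} MiB")
    print(f"reserved:  {torch.cuda.memory_reserved() / 1024**2:.1f} MiB")

    # Full per-device breakdown, similar to what the OOM message reports
    print(torch.cuda.memory_summary())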