RuntimeError: CUDA out of memory in train mode

Hi,
I get `RuntimeError: CUDA out of memory` in train mode and I cannot reproduce the problem anymore. I have tried all of these (a simplified sketch is below):

- gc.collect()
- torch.cuda.empty_cache()
- setting the batch size to 1
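
For context, the cleanup between runs looks roughly like this (a minimal sketch; the model, data, and optimizer below are just placeholders, not my actual code):

```python
import gc
import torch
import torch.nn as nn

# Placeholder model and data, only to illustrate the cleanup steps above
model = nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = torch.randn(1, 1024, device="cuda")            # batch size reduced to 1
target = torch.randint(0, 10, (1,), device="cuda")

loss = nn.functional.cross_entropy(model(data), target)
loss.backward()
optimizer.step()

# Drop references, collect garbage, and release cached blocks back to the driver
del model, optimizer, data, target, loss
gc.collect()
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated())  # should be (close to) 0 afterwards
```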
What is strange is that the exact same code ran fine before. When I tried to run the same code with slightly different hyperparameters (ones that don't affect the model, like learning rate and decay), it broke during the first iteration of the first epoch. Even when I tried the same hyperparameters as in my first experiment, it failed. Can you help me? Thanks.

Your GPU memory might be allocated by another process, so you could check it via nvidia-smi and make sure the GPU is empty before starting your training.
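
For example, a quick check from Python could look like this (assuming a reasonably recent PyTorch release that provides torch.cuda.mem_get_info; otherwise just run nvidia-smi in a shell and look at the memory usage column):

```python
import torch

# Check how much memory is actually free on GPU 0 before starting the training
free_bytes, total_bytes = torch.cuda.mem_get_info(0)
print(f"free: {free_bytes / 1024**3:.2f} GiB / total: {total_bytes / 1024**3:.2f} GiB")
```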

I’m testing this. Thanks for the reply.

Hi Ptrblck,

I hope you are well. Sorry to bother you: my job was running and after 500 epochs it stopped and gave me this error. Would you please help me with that? I am running the code on the HPC Bracewell.

RuntimeError: CUDA out of memory. Tried to allocate 1.23 GiB (GPU 0; 15.90 GiB total capacity; 14.78 GiB already allocated; 269.75 MiB free; 198.87 MiB cached)

This error indicates that your device is running out of memory, which crashes the run.
This could happen, e.g.:

- if you are storing unnecessary tensors and are thus increasing the memory usage throughout the training (check the memory usage during training via nvidia-smi and make sure it is constant after some iterations/epochs),
- if you are working with variable input shapes and one particular batch contains especially large inputs (use the max. shape in that case), or
- if you are using a shared system and another user allocated memory on the same GPU, etc.
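
A minimal sketch of the first case (the model, data, and loop below are just placeholders): appending the loss tensor itself keeps its entire computation graph alive, while appending loss.item() stores a plain Python float, so the allocated memory stays roughly constant.

```python
import torch
import torch.nn as nn

# Placeholder model and random data to illustrate the "storing unnecessary tensors" pitfall
model = nn.Linear(64, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

losses = []
for step in range(100):
    data = torch.randn(32, 64, device="cuda")
    target = torch.randint(0, 2, (32,), device="cuda")

    optimizer.zero_grad()
    loss = criterion(model(data), target)
    loss.backward()
    optimizer.step()

    # losses.append(loss)       # keeps the computation graph of every step alive -> memory grows
    losses.append(loss.item())  # stores a plain float -> memory stays constant

    if step % 20 == 0:
        print(step, torch.cuda.memory_allocated() // 1024**2, "MiB allocated")
```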