Strange CUDA Memory error

I am getting the following error even though I am using batch size 1.
I was able to train the very same network several times with batch size 6. The code does not raise any error until I call model.cuda(). Before I call model.cuda(), the GPU memory already fills up with 10 GB.

RuntimeError: CUDA out of memory. Tried to allocate 10.24 GiB (GPU 3; 14.76 GiB total capacity; 10.31 GiB already allocated; 4.04 GiB free; 10.52 MiB cached)

When I first trained my model it used 10 GB, but now it tries to allocate another 10 GB.

Could you check which part of the code fills up the GPU memory?
Are you loading a state_dict or any other data directly onto the device?
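
One way to narrow this down is to print the allocated memory before and after each step. This is only a minimal sketch, assuming PyTorch is installed; the helpers `allocated_mib` and `report` and the stand-in `nn.Linear` model are hypothetical placeholders for your own code:

```python
import torch
import torch.nn as nn


def allocated_mib(device=None):
    """Memory held by this process's tensors on the CUDA device, in MiB.

    Returns 0.0 when CUDA is not available.
    """
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.memory_allocated(device) / 1024**2


def report(tag):
    """Print the current allocation with a label for the step just finished."""
    print(f"{tag}: {allocated_mib():.1f} MiB allocated")


report("start")
model = nn.Linear(1024, 1024)  # hypothetical stand-in for your model constructor
report("after constructing the model")
if torch.cuda.is_available():
    model = model.cuda()
report("after model.cuda()")
```

If the number already jumps after the constructor, something inside it is allocating on the GPU. Note that `torch.cuda.memory_allocated` only counts tensors owned by the current process; memory held by other processes shows up in `nvidia-smi` instead.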

This error occurs when I set up the model for training.
The model architecture is instantiated as an object named model. This is where the memory fills up. Then I run model.cuda(); at this point it tries to allocate 10.24 GiB, which results in the CUDA out-of-memory error.

Here is the story:
Training for the 1st time: works fine. Used GPUs 2 and 3 (takes 25 min per epoch) -> process quit.
Training for the 2nd time: works fine. Switched to GPU 2 only (takes 17 min per epoch) -> process quit because I wanted to save the best model, so I made changes to check the validation loss and save.
Training for the 3rd time: the CUDA error described above.

I ran the same code on a different system and it worked like a charm.

This would mean that you are already creating CUDATensors inside the __init__ method of your model. Could you check for to('cuda') and cuda() calls inside it?
The next model.cuda() call would then push all remaining parameters and buffers to the GPU and is apparently running out of memory.
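
A quick way to verify this is to inspect where the parameters and buffers live right after the constructor returns. The following is a rough sketch, not your actual model: `Net` is a hypothetical stand-in and `tensors_by_device` is a helper made up for this example.

```python
import torch
import torch.nn as nn


def tensors_by_device(module):
    """Count the parameters and buffers of `module` per device type."""
    counts = {}
    tensors = list(module.named_parameters()) + list(module.named_buffers())
    for _, tensor in tensors:
        key = tensor.device.type
        counts[key] = counts.get(key, 0) + 1
    return counts


class Net(nn.Module):  # hypothetical stand-in for the real architecture
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3)
        self.bn = nn.BatchNorm2d(8)
        self.fc = nn.Linear(8, 2)


model = Net()
# Before model.cuda() is called, everything should still be on the CPU.
print(tensors_by_device(model))
```

If a `cuda` entry shows up here, some tensor is being created on the GPU inside `__init__`. Plain tensor attributes that are not registered as parameters or buffers are not covered by this check, so searching the constructor for `cuda()` and `to('cuda')` is still worthwhile.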

The issue is resolved. There was a small change in the FC layer: my initial code was written for 256x256 inputs, but with 512x512 inputs the output tensors were huge, nearly 4 times the size of those in the previous model. I reduced the output features in nn.Linear by a factor of 10000.
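
For anyone hitting the same thing, the scaling can be checked with a back-of-the-envelope calculation. The sketch below assumes a hypothetical network whose convolutional stack downsamples by 8, ends with 64 channels, and is flattened into a float32 nn.Linear with 10000 output features. None of these numbers are taken from my actual model; they only illustrate how the layer grows with the input size.

```python
def linear_param_bytes(in_features, out_features, bytes_per_param=4):
    """Bytes needed for the weight and bias of a dense layer (float32 by default)."""
    return (in_features * out_features + out_features) * bytes_per_param


def flattened_features(height, width, channels=64, downsample=8):
    """Features fed to the dense layer after flattening (assumed architecture)."""
    return channels * (height // downsample) * (width // downsample)


for size in (256, 512):
    n_in = flattened_features(size, size)
    gib = linear_param_bytes(n_in, 10000) / 1024**3
    print(f"{size}x{size}: in_features={n_in}, nn.Linear parameters = {gib:.2f} GiB")
```

Doubling the height and width quadruples the flattened input to the FC layer, so its weight matrix grows 4 times as well. Gradients and optimizer state add to that during training.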