CUDA out of memory even though the model has fewer parameters than before

Hi all!

I’ve trained my 80-million-parameter model successfully on the same system.

A few weeks later, I shrank my network to 18 million parameters, but now it is saying that

CUDA out of memory

How is this even possible?


Hi Furkan,

Two potential culprits jump to mind here:

  1. Your system is running some other GPU-consuming process. Run nvidia-smi to rule this out.
  2. You’ve introduced a bug into your training script that is causing it to do something additional that consumes memory. A common example is keeping a running record of your losses by appending the loss tensor from each training batch, rather than appending its value via the .item() method. Storing the tensor keeps its entire computation graph alive, so GPU memory grows with every batch.
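To make the second point concrete, here is a minimal sketch of the leaky pattern versus the safe one, using toy tensors rather than your actual model (all names here are illustrative):

```python
import torch

# A single trainable parameter standing in for a model.
w = torch.ones(1, requires_grad=True)

leaky, safe = [], []
for step in range(3):
    loss = (w * step).sum()  # stand-in for a per-batch loss

    # Leaky: the stored tensor still carries its computation graph,
    # so every batch's graph stays resident on the GPU.
    leaky.append(loss)

    # Safe: .item() extracts a plain Python float and lets the graph
    # be freed once the iteration ends.
    safe.append(loss.item())

print(safe)                      # plain floats, no graph attached
print(leaky[-1].requires_grad)   # the leaky copies are still graph-bound
```

Running this on CPU shows the same distinction; on a GPU, the leaky list is what gradually exhausts memory over many batches.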

Without more detail, it’s hard to give a clear answer, but perhaps you could investigate those two hypotheses and see if anything pops up.



Thanks Andrei. Your suggestions pointed me in the right direction. I also updated torch to 1.13.1, and after that the model ran without a hitch.