How to clean GPU memory after a RuntimeError?


(Stéphane Archer) #1

I try multiple models on my data.
Sometimes I get the following RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/generic/THCTensorMath.cu:35

The model was too big, OK!

The problem is that when I try the next model (a really small one), I get the same RuntimeError, even if I do del old_model first,

which means that the GPU memory is not freed even after a del.

Do you have a way to recover from a CUDA out-of-memory error?
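
Roughly what I am doing, as a minimal sketch (the model sizes here are just placeholders I made up):

```python
import torch

old_model = torch.nn.Linear(8192, 8192).cuda()  # placeholder for the model that is too big
# ... training raises: RuntimeError: cuda runtime error (2) : out of memory ...

del old_model  # my attempt to release the GPU memory before trying the next model

new_model = torch.nn.Linear(16, 16).cuda()  # really small model, but I get the same OOM
```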


#2

FairSeq recovers from OOM issues during training and validation.
Have a look at this code.

Depending on where the OOM error occurred, they either skip the batch or clear the cache.
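
In spirit, the recovery pattern looks something like this minimal sketch (my own simplification under an assumed name, run_batch, not the exact FairSeq code):

```python
import torch

def run_batch(model, inputs, targets, optimizer, criterion):
    """One training step that recovers from a CUDA OOM by skipping the batch.

    A simplified sketch of the FairSeq idea, not their actual implementation.
    """
    try:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
    except RuntimeError as e:
        # Only handle genuine OOMs; re-raise every other RuntimeError.
        if "out of memory" in str(e):
            print("WARNING: ran out of memory, skipping this batch")
            optimizer.zero_grad()     # drop the partial gradients of the failed step
            torch.cuda.empty_cache()  # release cached blocks back to the driver
        else:
            raise
```

Two details matter here: check the error message so you don't silently swallow unrelated RuntimeErrors, and call torch.cuda.empty_cache() only after the tensors from the failed step are no longer referenced, otherwise there is little to release.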

