How to clean GPU memory after a RuntimeError?

(Stéphane Archer) #1

I am trying multiple models on my data.
Sometimes I get the following RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/generic/

The model was too big, ok!

The problem is that when I try the next model (a really small one) I get the same RuntimeError, even if I do del old_model beforehand,

which means that the GPU memory is not freed even after a del.

Do you have a way to recover from a CUDA out-of-memory error?


FairSeq recovers both training and validation when they run into OOM issues.
Have a look at this code.

Depending on where the OOM error occurred, they either skip the batch or clear the cache.
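To make the skip-or-clear idea concrete, here is a minimal sketch of that pattern. It is not FairSeq's actual code: the names is_oom_error, free_gpu_memory and run_step_skipping_oom are made up for illustration, and detecting OOM by looking for "out of memory" in the message is a heuristic. The only PyTorch calls it relies on are torch.cuda.is_available() and torch.cuda.empty_cache().

```python
import gc

try:
    import torch  # optional, so the sketch also runs on a machine without PyTorch
except ImportError:
    torch = None


def is_oom_error(exc):
    """Heuristic (assumption): CUDA OOM shows up as a RuntimeError
    whose message contains 'out of memory'."""
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc)


def free_gpu_memory():
    """Drop unreachable Python objects, then return cached blocks to the driver."""
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()


def run_step_skipping_oom(step_fn, batch):
    """Run step_fn(batch). On OOM, free memory and return None (batch skipped).

    step_fn is a hypothetical callable, e.g. one forward/backward pass.
    Any other RuntimeError is re-raised unchanged.
    """
    oom = False
    try:
        return step_fn(batch)
    except RuntimeError as exc:
        if not is_oom_error(exc):
            raise
        oom = True
    # Clean up outside the except block: while the handler is active, the
    # exception's traceback still references the failed frame and its tensors.
    if oom:
        free_gpu_memory()
    return None
```

You would call it as result = run_step_skipping_oom(train_step, batch) and continue with the next batch when result is None. Note that del old_model alone only removes one reference; anything else that still points at the model or its outputs (optimizer, stored losses, the exception traceback) keeps the memory allocated, and empty_cache() can only release blocks that are no longer referenced.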

