What's the best way to handle exception "cuda runtime error (2) : out of memory"?

david-leon · January 5, 2018, 1:42am

PyTorch tends to use much more GPU memory than Theano, and it raises the exception “cuda runtime error (2) : out of memory” quite often.

Usually I’d do:

1. Catch this exception.
2. Reduce the batch size.
3. Continue the training iteration after sleeping for a few seconds.
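
The steps above can be sketched roughly as follows. This is only a minimal sketch, not an official recipe: run_with_backoff, step_fn and free_cache are hypothetical names, and it assumes the OOM surfaces as a RuntimeError whose message contains "out of memory", as in the error quoted above.

```python
import time


def is_oom_error(exc):
    # Assumption: OOM surfaces as a RuntimeError whose message contains
    # "out of memory", like the error quoted in this thread.
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc)


def run_with_backoff(step_fn, batch, min_batch_size=1, sleep_s=0.0, free_cache=None):
    """Run step_fn on the batch; on OOM, halve the batch size and retry.

    step_fn and free_cache are hypothetical callables supplied by the caller.
    Returns (result, batch_size_used). Samples beyond the reduced size are
    simply dropped in this sketch.
    """
    size = len(batch)
    while True:
        try:
            return step_fn(batch[:size]), size
        except RuntimeError as exc:
            if not is_oom_error(exc):
                raise  # unrelated errors are not swallowed
        # Reached only after an OOM. Recovery happens outside the except
        # block, so the exception and its traceback are already released.
        if size <= min_batch_size:
            raise RuntimeError("out of memory even at the minimum batch size")
        size = max(min_batch_size, size // 2)
        if free_cache is not None:
            free_cache()
        time.sleep(sleep_s)
```

With PyTorch one would presumably pass free_cache=torch.cuda.empty_cache and a step_fn that runs forward, backward and the optimizer step on the sliced batch.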

By doing this I expect not to have to stop and restart the training process. This procedure works fine in Theano: with it I can run training for months, even when a memory exception occasionally occurs on a big input.
However, in PyTorch this procedure doesn’t hold up. Sometimes it works; other times PyTorch keeps raising the memory exception and the training process has to be killed with Ctrl+C.

I’ve tried sleeping longer, up to 10 seconds, and calling torch.cuda.empty_cache(), but the problem remains.

This is annoying because either I have to check the training status manually all the time, or a separate “watchdog” process has to be designed.

So, is there any official recommendation for handling this exception properly?

SimonW · January 5, 2018, 8:19pm

If you are still holding references (explicitly or internally) to graphs and/or tensors when the OOM occurs, they will continue to occupy memory. Occupied memory cannot be freed by empty_cache().
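
One way to see this distinction is to compare the memory held by live tensors with the memory held by the caching allocator. A small sketch, where gpu_memory_report is a hypothetical helper; torch.cuda.memory_reserved is the name in recent releases (older ones called it memory_cached):

```python
import torch


def gpu_memory_report(device=None):
    """Return allocated/reserved bytes for a CUDA device, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    return {
        # Bytes held by tensors that are still referenced somewhere.
        "allocated": torch.cuda.memory_allocated(device),
        # Bytes held by the caching allocator: allocated plus cached-but-unused.
        "reserved": torch.cuda.memory_reserved(device),
    }
```

Calling it before and after empty_cache() should show "reserved" shrinking toward "allocated", while "allocated" only goes down once the references themselves are gone.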

Perhaps manually del-ing the relevant variables will help. But I’m not sure that catching OOM like this won’t have other downsides.
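
A rough sketch of what that could look like, with the same caveat about possible downsides. train_step is a hypothetical function, and it assumes the OOM is reported as a RuntimeError mentioning "out of memory". The cleanup is done after the except block on purpose, because the live exception's traceback can itself keep the failed step's tensors referenced.

```python
import gc

import torch


def train_step(model, optimizer, loss_fn, inputs, targets):
    """Run one step; return the loss as a float, or None if skipped after an OOM."""
    output = loss = None
    try:
        optimizer.zero_grad()
        output = model(inputs)
        loss = loss_fn(output, targets)
        loss.backward()
        optimizer.step()
        return loss.item()
    except RuntimeError as exc:
        # Assumption: OOM is reported as a RuntimeError mentioning "out of memory".
        if "out of memory" not in str(exc):
            raise
    # Reached only after an OOM. Python has already discarded `exc` here,
    # so its traceback no longer keeps the failed step's tensors alive.
    del output, loss
    for p in model.parameters():
        p.grad = None  # drop gradients left over from a partial backward pass
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # release cached blocks that are now unused
    return None
```

The caller can then treat a None return as "skip this batch" or retry with a smaller one.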

david-leon · May 8, 2018, 6:18am

Still no official solution after such a long time?

avalokoska · January 10, 2019, 10:04am

Any solution to this issue?

ptrblck · January 10, 2019, 8:06pm

You could try the approach from FairSeq in this thread.

avalokoska · January 11, 2019, 8:44am

Thanks. Actually, I ran into the problem mentioned here:

When I tried to clear the gradients after the exception, another exception was raised when calling zero_grad().