What's the best way to handle exception "cuda runtime error (2) : out of memory"?

PyTorch tends to use much more GPU memory than Theano, and it raises the exception “cuda runtime error (2) : out of memory” quite often.

Usually I’d do:

  1. catch this exception
  2. reduce the batch size
  3. continue the training iteration after sleeping for a few seconds.

By doing this I expect not to have to interrupt the training process and restart it. The above procedure works fine in Theano: with it I can keep training running for months, even when a memory exception occasionally occurs on large inputs.
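In PyTorch terms, the retry loop I have in mind looks roughly like this (just a minimal sketch; `model`, `optimizer`, and the way the batch is split are placeholders for my own code):

```python
import time
import torch

def step_with_retry(model, optimizer, batch, max_retries=3):
    """Run one training step, retrying with a smaller batch after a CUDA OOM."""
    for attempt in range(max_retries):
        try:
            optimizer.zero_grad()
            loss = model(batch).mean()   # placeholder forward pass / loss
            loss.backward()
            optimizer.step()
            return loss.item()
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise                    # not an OOM error, re-raise it
            print(f"OOM on attempt {attempt}, sleeping and halving the batch")
            torch.cuda.empty_cache()     # release cached blocks
            time.sleep(5)                # wait a few seconds before retrying
            batch = batch[: max(1, len(batch) // 2)]  # reduce the batch size
    raise RuntimeError("still out of memory after retries")
```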
However, in PyTorch this procedure doesn’t always hold. Sometimes it works; other times PyTorch keeps raising the memory exception and the training process has to be killed with Ctrl+C.

I’ve tried sleeping for longer (up to 10 seconds) and calling torch.cuda.empty_cache(), but the problem remains.

This is annoying, because either I have to check the training status manually all the time, or a separate “watchdog” process has to be designed.

So, is there any official recommendation for handling this exception properly?


If you are still holding references (explicitly or internally) to graphs and/or tensors when the OOM occurs, they will continue to occupy memory. Occupied memory cannot be freed by empty_cache().

Manually deleting (del) the relevant variables may help. But I’m not sure that catching OOM like this won’t have other downsides.
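For example (a minimal sketch, assuming `model`, `criterion`, `inputs`, and `targets` are your own objects):

```python
import torch

output, loss = None, None
try:
    output = model(inputs)              # the forward pass may OOM here
    loss = criterion(output, targets)
    loss.backward()                     # ...or the backward pass may
except RuntimeError as e:
    if "out of memory" not in str(e):
        raise
    # Drop the references to the graph and intermediate tensors from the
    # failed iteration so their memory becomes free-able, then release
    # the cached blocks.
    del output, loss
    torch.cuda.empty_cache()
```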

No official solution for such a long time?

Any solution to this issue?

You could try the approach from FairSeq in this thread.
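If I remember correctly, the gist of that approach is to catch the OOM error, free what you can, and skip the offending batch instead of crashing. Very roughly (this is only a paraphrase, not FairSeq’s actual code; `data_loader`, `model`, `criterion`, and `optimizer` are placeholders):

```python
import torch

for inputs, targets in data_loader:
    try:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
    except RuntimeError as e:
        if "out of memory" not in str(e):
            raise
        print("| WARNING: ran out of memory, skipping batch")
        optimizer.zero_grad()           # clear any partial gradients
        torch.cuda.empty_cache()        # release cached memory before continuing
```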

Thanks, I actually ran into the problem mentioned here:

When I tried to clear the gradients after the exception, another exception was raised when calling zero_grad().
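One thing that might avoid this (just a sketch, not an official recommendation, assuming `model` is the module whose gradients I’m trying to clear) is to drop the gradient tensors instead of zeroing them, so the except block doesn’t have to write to them:

```python
import torch

# Instead of optimizer.zero_grad(), which iterates over and writes to every
# .grad tensor, drop the gradients entirely and release the cache.
for p in model.parameters():
    p.grad = None
torch.cuda.empty_cache()
```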