PyTorch tends to use much more GPU memory than Theano, and it quite often raises the exception "cuda runtime error (2): out of memory".
Usually I'd:
- catch this exception
- reduce the batch size
- continue the training iteration after sleeping for a few seconds.
This way I shouldn't have to kill the training process and restart it (a rough sketch of the loop is below). The procedure works fine in Theano: with it I can keep training running for months, even when the memory exception occasionally occurs on large inputs.
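A minimal sketch of what I mean, assuming a single `batch` tensor plus a `model`/`optimizer` pair; the names and the batch-halving policy are placeholders for my actual training code, not anything PyTorch provides:

```python
import time

def step_with_oom_retry(model, optimizer, batch, max_retries=3):
    """One training step that halves the batch, sleeps, and retries on CUDA OOM."""
    for _ in range(max_retries):
        try:
            optimizer.zero_grad()
            loss = model(batch).mean()           # placeholder forward pass / loss
            loss.backward()
            optimizer.step()
            return loss.item()
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise                            # unrelated error: re-raise
            time.sleep(5)                        # wait a few seconds
            batch = batch[: max(1, batch.size(0) // 2)]  # halve the batch and retry
    raise RuntimeError("still out of memory after several retries")
```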
However, in PyTorch this procedure doesn't always work. Sometimes it does, but other times PyTorch keeps raising the memory exception and the training process has to be killed with Ctrl+C.
I've tried sleeping for longer (up to 10 seconds) and calling torch.cuda.empty_cache(), but the problem remains.
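Concretely, the extra recovery steps I added to the except-branch of the sketch above look roughly like this (again just a sketch; the helper name is mine):

```python
import time
import torch

def recover_from_oom(wait_seconds=10):
    """Flush PyTorch's cached GPU memory and back off longer before retrying."""
    torch.cuda.empty_cache()   # release unoccupied cached blocks back to the driver
    time.sleep(wait_seconds)   # longer back-off before the next attempt
```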
This is annoying because I either have to check the training status manually all the time or design a separate "watchdog" process.
So, is there any official recommendation for handling this exception properly?