GPU memory not freed after caught error

I’m running into trouble when training my model on GPU.
The model is rather complicated (varying input batch sizes, among other things), so it is almost impossible to guarantee that there will never be any errors.
At the moment I catch errors both inside my model and during the training loop, which, according to this thread, should be safe:
https://discuss.pytorch.org/t/is-it-safe-to-recover-from-cuda-oom/12754
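
For reference, my error handling follows roughly this pattern (a minimal sketch; `model`, `loader`, and `optimizer` stand in for my actual setup):

```python
import torch

def train(model, loader, optimizer):
    """Training loop that tries to recover from per-batch CUDA errors."""
    for batch in loader:
        try:
            optimizer.zero_grad()
            loss = model(batch).mean()
            loss.backward()
            optimizer.step()
        except RuntimeError as e:
            if "out of memory" in str(e):
                # Skip the offending batch and release cached blocks.
                print("OOM on this batch, skipping:", e)
                torch.cuda.empty_cache()
            else:
                raise
```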

However, as the most recent comment in that thread noted, the memory is not fully freed after an error.
What seems to happen is that when the error occurs, some tensors end up stored somewhere I can't access them (nor clear them with backward()). The result is a gradual increase in memory usage that cannot be cleared at all. This is not just reserved memory; the model will eventually crash with CUDA out of memory errors.
Moving the model to CPU, calling torch.cuda.empty_cache(), and then moving the model back to GPU does not touch this extra memory consumption. Furthermore, if I delete my model and all references to it, the Python garbage collector can still find references to tensors stored on the GPU even though none should exist.
Attempting to delete those tensors after they are found by the gc does not work either.
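
For reference, I'm finding those leftover tensors with the usual gc-based check, roughly:

```python
import gc
import torch

# Scan everything the garbage collector tracks and report any
# CUDA tensors that are still alive.
for obj in gc.get_objects():
    try:
        if torch.is_tensor(obj) and obj.is_cuda:
            print(type(obj), obj.size())
    except Exception:
        # Some tracked objects raise on attribute access; ignore them.
        pass
```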

Any help would be much appreciated!

torch version: 0.4.1
cuda version: 9.0.176
I was working on a reproducible example but have not been able to create a simple version yet.

Does this depend on the method that created the OOM error?
Also, could it be that when the Python error is raised, the exception gets a reference to every local object in the scope where the error occurred? I am not sure how these objects behave, but they might be holding onto some tensors.
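
For example, something along these lines (purely hypothetical code to illustrate the idea) would keep a GPU tensor alive even after the except block, because the saved exception's traceback still references the frame that created it:

```python
import torch

saved_errors = []

def failing_step():
    big = torch.randn(1024, 1024, device="cuda")  # local GPU tensor
    raise RuntimeError("simulated error")

try:
    failing_step()
except RuntimeError as e:
    # Keeping the exception object around also keeps e.__traceback__,
    # whose frames still reference `big` on the GPU.
    saved_errors.append(e)
```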

It was indeed due to the Python error. It turns out I had been printing the error messages incorrectly, which kept references to those tensors. Thanks for the help!
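
In case it helps anyone else, the problematic pattern was roughly of this shape (a sketch with made-up names); keeping only the formatted message instead of the exception object lets the tensors be freed:

```python
import traceback

error_log = []

def run_training_step():
    # Stand-in for the real training step that raised the error.
    raise RuntimeError("simulated failure")

try:
    run_training_step()
except RuntimeError as e:
    # Bad: error_log.append(e) keeps e.__traceback__ (and any GPU tensors
    # referenced by its frames) alive indefinitely.
    # Better: keep only the formatted message and let the exception object go.
    error_log.append("".join(traceback.format_exception(type(e), e, e.__traceback__)))
```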
