I’m running into trouble when training my model on GPU.
The model is rather complicated (varying input batch sizes, among other things), so it is almost impossible to guarantee that no errors will ever occur.
At the moment I catch errors both inside my model and in the training loop, which, according to this thread, should be safe:
https://discuss.pytorch.org/t/is-it-safe-to-recover-from-cuda-oom/12754
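Roughly, my recovery logic looks like this (a simplified sketch, not my exact code; the real model and loop are more involved):

```python
import torch

def train_step(model, optimizer, batch):
    """Run one training step; skip the batch if a CUDA OOM error is raised."""
    try:
        optimizer.zero_grad()
        loss = model(batch).sum()
        loss.backward()
        optimizer.step()
        return True
    except RuntimeError as e:
        if "out of memory" not in str(e):
            raise  # only swallow OOM errors, re-raise everything else
        # Release PyTorch's cached blocks before moving on to the next batch.
        torch.cuda.empty_cache()
        return False
```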
However, as the most recent comment in that thread noted, the memory is not fully freed after an error.
What seems to happen is that when an error occurs, some tensors remain stored somewhere I can’t access them (nor clear them with backward()). The result is a gradual increase in memory usage that cannot be cleared at all. This is not just cached/reserved memory; the model will eventually crash with CUDA out-of-memory errors.
Moving the model to CPU, calling torch.cuda.empty_cache(), and then moving it back to GPU does not touch this extra memory consumption. Furthermore, if I delete my model and all references to it, the Python garbage collector can still find tensors stored on the GPU even though no references should exist.
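For reference, the cleanup I tried is essentially this (sketch; device handling simplified):

```python
import torch

def reset_gpu_memory(model):
    """Move the model to CPU, release PyTorch's cached GPU blocks,
    then move it back. In my case this did not reclaim the leaked memory."""
    model.cpu()
    torch.cuda.empty_cache()  # no-op if CUDA was never initialized
    if torch.cuda.is_available():
        model.cuda()
    return model
```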
Attempting to delete those tensors after they are found by the gc does not work either.
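This is how I enumerate the leaked tensors; it is the usual gc-based search pattern (adapted for posting, not my exact code):

```python
import gc
import torch

def find_cuda_tensors():
    """Return every CUDA tensor the garbage collector can still reach."""
    found = []
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                found.append(obj)
        except Exception:
            continue  # some tracked objects raise on attribute access
    return found

# Even after deleting the model and calling gc.collect(), this list stays
# non-empty for me, and deleting its entries does not free the memory.
```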
Any help would be much appreciated!
torch version: 0.4.1
cuda version: 9.0.176
I was working on a reproducible example but have not been able to create a simple version yet.