I am training a model on a GPU in a Jupyter notebook.
I have a problem: whenever I interrupt training, the GPU memory is not released. So I wrote a function to release the memory before every training run:
    import gc
    import torch

    def torch_clear_gpu_mem():
        gc.collect()               # drop unreachable Python objects first
        torch.cuda.empty_cache()   # then return cached blocks to the driver
It releases some, but not all, of the memory: for example, X out of 12 GB is still occupied by something, and this X seems to grow after every training interruption. The only way I can release it fully is by restarting the kernel.
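This is roughly how I check the usage after calling the function (the exact numbers vary, and nvidia-smi shows the same residual allocation):

    # after torch_clear_gpu_mem() the allocator still reports memory in use
    print(torch.cuda.memory_allocated() / 1024**3, "GB allocated")  # tensors still referenced somewhere
    print(torch.cuda.memory_reserved() / 1024**3, "GB reserved")    # cached by the allocator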
But I found a strange workaround: trigger some error before calling torch_clear_gpu_mem(), for example by dividing by zero in another cell of the notebook. If I then call torch_clear_gpu_mem(), the memory is fully released! (The two cells below show the sequence.)
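To be concrete, the manual sequence that works for me looks roughly like this, run as two separate cells:

    # cell 1: deliberately raise an error, e.g. division by zero
    1 / 0

    # cell 2: now the cleanup actually frees everything
    torch_clear_gpu_mem()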
Can someone please explain why this happens? Is it some kind of memory leak in Jupyter?
I would like to make a function that does this automatically. Right now I have to run a special cell that raises an error before clearing the GPU memory (a rough sketch of what I mean is below).
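Something like this is what I have in mind; get_ipython().run_cell("1/0") is just my guess at how to trigger the error programmatically from inside a function, and I don't know whether it has the same effect as running the failing cell by hand:

    import gc
    import torch
    from IPython import get_ipython

    def torch_clear_gpu_mem_forced():
        # guess: reproduce the "error first" trick programmatically
        get_ipython().run_cell("1/0")  # the error is caught and displayed by IPython, not raised here
        gc.collect()
        torch.cuda.empty_cache()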