GPU memory is released only after error output in notebook

Hello!

I am training on a GPU in a Jupyter notebook.
I have a problem: whenever I interrupt training, GPU memory is not released. So I wrote a function to release memory every time before starting training:

import gc
import torch

def torch_clear_gpu_mem():
    gc.collect()               # drop unreachable Python objects that hold CUDA tensors
    torch.cuda.empty_cache()   # return cached, unused blocks to the driver

It releases some but not all of the memory: for example, X out of 12 GB is still occupied by something, and this X seems to grow after every training interruption. I can only release it fully by restarting the kernel.

But I found a strange workaround: cause some error before calling torch_clear_gpu_mem() - for example, divide by 0 in some cell of the notebook.
When I then call torch_clear_gpu_mem(), the memory is fully released!
Can someone please explain how this happens? Is it some memory leak in Jupyter?
I would like to make a function that does this automatically. Right now I have to run a special cell that causes an error before clearing GPU memory.
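
My best guess so far for automating it (just a sketch, not verified; I'm assuming the leftover memory is held by the traceback the notebook stores in sys.last_type / sys.last_value / sys.last_traceback after the interrupt, and that the dummy exception simply overwrites it; torch_clear_gpu_mem_full is a name I made up):

import gc
import sys
import torch

def torch_clear_gpu_mem_full():
    # The stored traceback of the last unhandled exception (e.g. the
    # KeyboardInterrupt from interrupting training) references the training
    # function's stack frames, which keep its CUDA tensors alive.
    for name in ("last_type", "last_value", "last_traceback"):
        if hasattr(sys, name):
            delattr(sys, name)
    gc.collect()              # collect the now-unreachable frames and tensors
    torch.cuda.empty_cache()  # return the freed blocks to the driver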

If you are working interactively in your notebook, all objects, including tensors, will be stored.
Even if you delete all tensors and clear the cache (usually not necessary), the CUDA context will still be initialized and will take some memory.
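
To see roughly how much of the usage is the context versus PyTorch's allocator, you could compare the allocator statistics with what nvidia-smi reports (just a sketch):

import torch

# Memory currently used by live tensors, as tracked by PyTorch's allocator:
print(f"allocated: {torch.cuda.memory_allocated() / 1024**2:.0f} MB")
# Memory PyTorch has reserved (cached) from the driver, whether in use or not:
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**2:.0f} MB")
# Anything nvidia-smi shows beyond "reserved" is roughly the CUDA context
# (plus other processes), which only a kernel restart releases.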

I guess the error is killing the kernel and thus the memory is completely released.

Hello @ptrblck! Thanks for answering.

Yes, I think the CUDA context takes around 900 MB for me, but after an interruption there are 3-4 GB that I cannot release (by deleting all objects / exiting the function scope).
I raise some simple exception (like ZeroDivisionError) and it doesn't kill the kernel (because I can still access all my variables from other cells after the error).
But when I then call torch_clear_gpu_mem(), CUDA memory reliably returns to 900 MB.

If you trigger the division by zero and raise the exception, all GPU memory is cleared and you can still access all CUDA tensors?

I can then start my training process on the GPU again with all the memory available (like after a kernel restart).

OK, this would mean the exception is in fact restarting the kernel.
What did you mean by “I can still access all my variables from other cells after error”?

If you can still access and e.g. print old variables, the kernel should be alive, and it's strange that all the GPU memory is cleared.
On the other hand, if you cannot print old variables and just restart the training, what would the difference be to a clean restart instead of a restart caused by an exception?

I have a train_cell, where I call my training function (the function is defined in that cell and called right after the definition).
But this cell requires all previous cells (prep_cells) to be executed first (imports, data preparation, etc.).

I also have an exception_cell where I do 1/0 (raising a ZeroDivisionError),
and a mem_clear_cell where I call torch_clear_gpu_mem().

  1. I run prep_cells, then train_cell.
  2. I interrupt training; GPU memory usage: 9 GB.
  3. I run mem_clear_cell; GPU memory usage: 4 GB (no matter how many times I run it, the usage doesn't go down).
  4. I run exception_cell, then mem_clear_cell; GPU memory usage: 935 MB.
  5. Then I can just run train_cell again without running prep_cells.

So with the exception approach to clearing GPU memory, I don't need to restart the kernel and rerun prep_cells before running train_cell.
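
If you want to skip the exception_cell entirely, another option (a sketch of an alternative, assuming the cause really is the stored traceback; run_interruptible is a made-up helper) is to catch the interrupt yourself, so the notebook never stores its traceback in the first place:

import gc
import torch

def run_interruptible(train_fn, *args, **kwargs):
    # Catching KeyboardInterrupt here means the notebook never stores its
    # traceback, so nothing keeps train_fn's local CUDA tensors alive after
    # the interrupt.
    try:
        return train_fn(*args, **kwargs)
    except KeyboardInterrupt:
        print("Training interrupted, releasing GPU memory...")
    finally:
        gc.collect()
        torch.cuda.empty_cache()

Then train_cell would just call run_interruptible(train_fn) with whatever arguments the training function takes.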