Profiling memory consumption of forward and backward pass

I’m trying to profile the memory PyTorch uses for the forward and backward pass of one minibatch for various CNN layers. While doing this I found that if a certain layer configuration doesn’t fit on the GPU and an out-of-memory error is raised, the GPU memory is not fully freed, so if I keep going with new layer configurations the GPU eventually runs out of memory completely.
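
The structure of my profiling loop is roughly the following; the layer configurations, input size, and batch size here are just placeholders for the ones I actually sweep over:

```python
import torch
import torch.nn as nn

device = torch.device("cuda")

# Placeholder configurations: (in_channels, out_channels, kernel_size)
configs = [(64, 128, 3), (128, 256, 3), (256, 512, 3)]

for in_c, out_c, k in configs:
    layer = nn.Conv2d(in_c, out_c, k, padding=1).to(device)
    x = torch.randn(32, in_c, 224, 224, device=device, requires_grad=True)

    torch.cuda.reset_peak_memory_stats()
    try:
        out = layer(x)        # forward pass of one minibatch
        out.sum().backward()  # backward pass
        torch.cuda.synchronize()
        peak = torch.cuda.max_memory_allocated() / 1024**2
        print(f"config {(in_c, out_c, k)}: peak {peak:.1f} MiB")
    except RuntimeError:      # CUDA OOM surfaces as a RuntimeError
        print(f"config {(in_c, out_c, k)}: OOM")

    # This is where I see memory from a failed configuration sticking around:
    alloc = torch.cuda.memory_allocated() / 1024**2
    print(f"still allocated after this config: {alloc:.1f} MiB")
```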

Is there a way to deallocate all the tensors PyTorch allocated for that layer, so that the only memory left on the device is the default CUDA context?
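
Concretely, what I’d like to do after a failed configuration is something along these lines (a standalone toy example rather than my actual code; `empty_cache` is my guess at the relevant call):

```python
import gc
import torch

device = torch.device("cuda")

# Stand-in for whatever the failed layer configuration left behind.
x = torch.randn(1024, 1024, device=device)
print("allocated before:", torch.cuda.memory_allocated() / 1024**2, "MiB")

# Drop the Python references, collect, and return cached blocks to the driver.
del x
gc.collect()
torch.cuda.empty_cache()

# Ideally only the CUDA context itself would still occupy the device here.
print("allocated after: ", torch.cuda.memory_allocated() / 1024**2, "MiB")
print("reserved after:  ", torch.cuda.memory_reserved() / 1024**2, "MiB")
```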

Do you have a reproducible code snippet for this behavior?
If PyTorch encounters an OOM, it should delete the current allocation, clear the cache and retry the allocation.
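
For reference, here is a minimal standalone check of that allocator behavior for a single deliberately oversized allocation (the tensor size is just an arbitrary value that should fail on any GPU); in your layer sweep the picture might differ if Python references to earlier tensors are still alive:

```python
import torch

device = torch.device("cuda")
print("baseline allocated:", torch.cuda.memory_allocated() / 1024**2, "MiB")

try:
    # Deliberately far too large, to trigger the allocator's OOM path.
    huge = torch.empty(1 << 40, device=device)
except RuntimeError as e:
    print("caught:", str(e).splitlines()[0])

# The failed allocation itself should not leave anything behind.
print("allocated:", torch.cuda.memory_allocated() / 1024**2, "MiB")
print("reserved: ", torch.cuda.memory_reserved() / 1024**2, "MiB")
```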