How to detect and release leaked GPU memory after a CUDA OOM error

I am trying to set up a few forward-only (inference) models as a daemon process. After loading a model (Inception v3, for instance), nvidia-smi shows that x MB of GPU memory is consumed. After one forward pass the number becomes (x + 6) MB and stays constant for subsequent requests (apart from some momentary spikes).
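For context, this is roughly how I watch the numbers from inside the process (just a sketch; the helper only uses `torch.cuda.memory_allocated()`, which exists in 0.4.x):

```python
import torch

def log_allocated(tag):
    # memory_allocated() reports bytes currently held by live tensors.
    # nvidia-smi additionally counts the CUDA context and the caching
    # allocator's cached-but-free blocks, so its number is larger.
    mb = torch.cuda.memory_allocated() / 1024.0 ** 2
    print("[%s] allocated: %.1f MB" % (tag, mb))
    return mb
```

I call this before and after each request; the "after" value is the one that creeps up when OOMs happen.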

So far, everything looks fine.

But when performing concurrent forwarding, if the GPU runs out of memory, the total memory my N models consume grows beyond (x + 6) * N MB. This behavior is rather random: sometimes the memory stays the same even when an OOM occurs, while other times it keeps growing slowly as OOMs continue to happen.

What I want to do is detect and release this memory without killing the Python process, so that the models can keep serving forward requests.
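Concretely, the kind of recovery I'm hoping for looks something like this (just a sketch of the idea; `model` and `batch` are placeholders, and in 0.4.x a CUDA OOM surfaces as a plain `RuntimeError` whose message contains "out of memory"):

```python
import gc
import torch

def forward_with_recovery(model, batch):
    """Forward once; on CUDA OOM, try to reclaim memory and retry a single time."""
    with torch.no_grad():  # inference only, so skip autograd bookkeeping
        try:
            return model(batch)
        except RuntimeError as err:
            # A CUDA OOM is a plain RuntimeError; match on the message.
            if "out of memory" not in str(err):
                raise
            del err                   # the exception's traceback can keep tensors alive
            gc.collect()              # free unreachable Python-side tensors
            torch.cuda.empty_cache()  # hand cached blocks back to the driver
            return model(batch)
```

This catches the OOM, but the leaked memory described above is exactly what `gc.collect()` + `empty_cache()` fails to reclaim.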

Some notes:

  1. I have tried gc.collect() and torch.cuda.empty_cache(), but they do not seem to help.
  2. The leak can be observed with torch.cuda.memory_allocated().
  3. I've tried to track the leak using the dump_tensors() helper, but its output stays constant even when the leak happens, which probably means gc cannot see the leaked memory?
  4. I'm using PyTorch 0.4.1.
  5. My guess is that the leaked memory is some sort of per-process CUDA context overhead?

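For note 3, the dump_tensors() helper I mean is roughly along these lines (a sketch: it walks everything the Python garbage collector tracks and prints the tensors it finds, which is why a constant output suggests the leaked memory is not held by any gc-visible tensor):

```python
import gc
import torch

def dump_tensors():
    """Print every gc-tracked tensor with its shape and size in bytes."""
    count = 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                print(type(obj).__name__, tuple(obj.size()),
                      obj.element_size() * obj.nelement(), "bytes")
                count += 1
        except Exception:
            # Some gc-tracked objects raise on attribute access; skip them.
            pass
    print("Total tensors found:", count)
    return count
```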
Any help would be greatly appreciated. Thanks guys.