Strange timing with JIT-compiled module during memory allocation

Hi!

I'm seeing some strange behavior that I cannot understand. My setup: Ubuntu 16.04.5, PyTorch 1.1.0, Tesla V100 SXM2.

I have a simple C++ module, empty_module, that allocates memory on the GPU and can be called from Python (see example code here). A boolean flag controls whether the allocated memory should be freed afterwards (which is the correct behavior).
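
As a rough illustration only (not the exact code from the link; the byte count, the function name `allocate`, and the omitted error handling are my own simplifications), such a module can be JIT-compiled inline with `torch.utils.cpp_extension.load_inline`:

```python
# Sketch only: a JIT-compiled inline extension that allocates raw device
# memory with cudaMalloc and optionally frees it again, depending on a flag.
import torch
from torch.utils.cpp_extension import load_inline

cpp_source = r"""
#include <cuda_runtime.h>

void allocate(int64_t n_bytes, bool deallocate) {
    void* ptr = nullptr;
    cudaMalloc(&ptr, n_bytes);  // raw allocation, bypasses PyTorch's caching allocator
    if (deallocate) {
        cudaFree(ptr);          // correct behaviour: return the memory to the driver
    }                           // otherwise the pointer is dropped and the block leaks
}
"""

empty_module = load_inline(
    name="empty_module",
    cpp_sources=cpp_source,
    functions=["allocate"],
    with_cuda=True,  # add CUDA include paths and link against cudart
)

empty_module.allocate(2 ** 20, True)   # allocate 1 MiB and free it
empty_module.allocate(2 ** 20, False)  # allocate 1 MiB and leak it
```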

The default execution time on my machine is roughly 1.5 ms. When I set the flag to 0 (the memory is not freed), the execution time drops by roughly two orders of magnitude to about 12 µs, and the memory leaks, which can be seen in nvidia-smi. However, if I then run the same function again with the flag set to 1 (the memory is freed and does not leak), without relaunching the Python interpreter, the execution time stays at the same low level as if the memory were not freed.
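
For reference, a minimal timing sketch (assuming the empty_module sketch above; not necessarily the exact benchmark behind the numbers quoted) looks like this:

```python
import time
import torch

def time_allocate(deallocate, n_iter=100, n_bytes=2 ** 20):
    # Warm-up call so the first, more expensive invocation is excluded.
    empty_module.allocate(n_bytes, deallocate)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iter):
        empty_module.allocate(n_bytes, deallocate)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iter

print("deallocate=True:  %.1e s per call" % time_allocate(True))
print("deallocate=False: %.1e s per call" % time_allocate(False))  # leaks n_iter * n_bytes
```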

Perhaps this is a question for NVIDIA developers, but I'm new to this and cannot devise an example without PyTorch.
How does this happen?

How do you check that the memory is leaking?
nvidia-smi might report memory as used when it is in fact memory cached by PyTorch to avoid expensive re-allocations.
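
For example (assuming PyTorch 1.1.0, where the cache counter is still called torch.cuda.memory_cached()), you can compare what PyTorch itself holds with what nvidia-smi reports:

```python
import torch

# These counters only track PyTorch's own caching allocator; memory taken
# directly via cudaMalloc inside an extension does not show up here.
print("allocated by tensors: %d MiB" % (torch.cuda.memory_allocated() >> 20))
print("cached by PyTorch:    %d MiB" % (torch.cuda.memory_cached() >> 20))

# If the number reported by nvidia-smi is only cache, it should shrink after this:
torch.cuda.empty_cache()
print("cached after empty_cache(): %d MiB" % (torch.cuda.memory_cached() >> 20))
```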

Each time I run the cell with deallocate=False, the memory usage on my device increases slightly. Calling torch.cuda.empty_cache() does not release it.