Memory leak in custom CUDA extension

Hi,

I have a relatively large memory leak in my training script. The leaked memory is not allocated on the GPU but in host RAM, and the leak only occurs when I use a network that contains a custom CUDA kernel; another version of the network without the CUDA extension does not have this issue. As far as I know, the extension does not hold any internal state and simply computes the forward/backward pass of a mathematical equation.
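For context, the extension is wrapped in the usual autograd.Function pattern, roughly as in the sketch below. The module name `my_cuda_ext` and its entry points are placeholders, not the real binding:

```python
import torch
import my_cuda_ext  # placeholder name for the compiled extension (built via torch.utils.cpp_extension)


class MyEquation(torch.autograd.Function):
    """Thin wrapper around the kernels; no Python-side buffers are kept beyond ctx."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return my_cuda_ext.forward(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return my_cuda_ext.backward(grad_out.contiguous(), x)
```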

I have tried the following:
1) Check the tensors tracked by the garbage collector
I used an edited version of this gist to analyze the tensors held in RAM and did not find any irregularities there.
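The scan essentially walks gc.get_objects() and reports every live tensor, roughly like this (the actual gist and my edits carry more detail):

```python
import gc
import torch


def dump_live_tensors():
    """Print every tensor the garbage collector currently knows about."""
    for obj in gc.get_objects():
        try:
            t = obj if torch.is_tensor(obj) else getattr(obj, "data", None)
            if torch.is_tensor(t):
                print(type(obj).__name__, t.device, tuple(t.size()))
        except Exception:
            pass  # some tracked objects raise on attribute access
```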

2) Use tracemalloc
I again could not find the source of the leak. I did, however, notice the custom module allocating ~50 MB of RAM after an epoch, even though the encompassing network and all of its inputs were on the GPU. At that snapshot, the leak was already ~5 GB.
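For reference, the kind of snapshot comparison I mean looks roughly like this:

```python
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames per allocation

before = tracemalloc.take_snapshot()
# ... one training epoch runs here ...
after = tracemalloc.take_snapshot()

# Largest growth in Python-level allocations between the two snapshots
for stat in after.compare_to(before, "lineno")[:20]:
    print(stat)
```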

The leak can grow to 60 GB over a day, and given that tracemalloc cannot find anything, I assume the allocation is happening outside the Python interpreter (e.g. inside the extension). Is there any way to pinpoint the problem further?
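What makes me suspect native allocations is the gap between the process RSS and what tracemalloc accounts for; a minimal way to make that gap visible (psutil is just one option):

```python
import os
import tracemalloc

import psutil

tracemalloc.start()
# ... a few training steps run here ...

traced_now, _traced_peak = tracemalloc.get_traced_memory()
rss = psutil.Process(os.getpid()).memory_info().rss

# RSS far above the traced total points to allocations tracemalloc
# cannot see, e.g. inside a C++/CUDA extension.
print(f"process RSS: {rss / 2**20:.0f} MiB, "
      f"traced by tracemalloc: {traced_now / 2**20:.0f} MiB")
```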

Thanks in advance