Is there a host memory leak in cudnn_convolution_backward?

I’m creating a custom convolution layer (using PyTorch 10.2) and experiencing a memory leak (on the host, not the device) whenever I use stride=2. So far, I’ve isolated the leak to the call to cudnn_convolution_backward (basically this code) with the output mask (True, True). If I use the mask (False, True) (i.e. don’t compute the backward pass for the input), the leak goes away. It also goes away when I use stride=1. I’m calling the function from Python via torch.utils.cpp_extension.load. Is this a known issue? I couldn’t find anything on GitHub.
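
For reference, the Python side looks roughly like this. This is only a simplified sketch: the extension name, the source file, and the backward entry point are placeholders, and the argument order just mirrors at::cudnn_convolution_backward, whose exact signature differs between PyTorch versions.

import torch
from torch.utils.cpp_extension import load

# Hypothetical extension; the C++ side is a thin wrapper that forwards
# its arguments to at::cudnn_convolution_backward.
ext = load(name="conv_backward_ext", sources=["conv_backward.cpp"])

x = torch.randn(8, 16, 32, 32, device="cuda")
w = torch.randn(32, 16, 3, 3, device="cuda")
grad_out = torch.randn(8, 32, 16, 16, device="cuda")  # output shape for stride=2, padding=1

# output_mask = (grad_input, grad_weight); the leak only shows up with
# (True, True) and stride > 1, not with (False, True) or with stride=1.
grad_input, grad_weight = ext.backward(
    x, grad_out, w,
    (1, 1),        # padding
    (2, 2),        # stride
    (1, 1),        # dilation
    1,             # groups
    False, False,  # benchmark, deterministic
    True,          # allow_tf32
    (True, True),  # output_mask
)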

I’m not aware of a memory leak in any cuDNN call at the moment. Are you seeing the same “leak” with plain PyTorch calls or only with your custom extension?

There’s no leak if I use the regular PyTorch Conv2d layer.

It seems like the leak is not in the cuDNN function itself but rather somewhere in transferring the data to the host.
There’s no leak if I do

grads_host[:,:,top:bottom,:].copy_(grad_input_from_cudnn)

But there’s a leak if I instead execute

grads_host[:,:,top:bottom,:] += grad_input_from_cudnn.to(torch.device('cpu'))

I need to use the latter but I don’t know what’s wrong with this line. Any tips?

The second snippet accumulates the CPU copy of grad_input_from_cudnn into grads_host and thus also keeps the entire attached computation graph alive with it, which would be expected behavior rather than a leak.
You might want to check if that is the case and whether you need to detach the tensor before the accumulation (which depends on the actual use case).
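
For example, a minimal sketch (reusing your variable names) of detaching before the accumulation:

# Detach first, so only the values are accumulated into the CPU buffer
# and the autograd graph attached to grad_input_from_cudnn is not kept alive.
grads_host[:,:,top:bottom,:] += grad_input_from_cudnn.detach().to(torch.device('cpu'))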

I thought that inside a torch.autograd.Function’s backward I wouldn’t need to detach anything, but I tried it and it doesn’t solve the issue; the memory still keeps growing with each step.
Also, I’m still not seeing any leak for stride=1, only when the stride is greater than 1.
My temporary workaround is to transfer the grads to the GPU, perform the addition there, and then use copy_ to transfer them back. But that is of course computationally more expensive.
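
Roughly like this (just a sketch of the workaround, with the same variable names as above):

# Workaround: bring the slice of the host buffer to the GPU, accumulate there,
# and copy the result back into the pre-allocated CPU tensor.
tile = grads_host[:,:,top:bottom,:].to(grad_input_from_cudnn.device)
tile += grad_input_from_cudnn
grads_host[:,:,top:bottom,:].copy_(tile)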

I just found out that the leak grows more slowly if I do

grads_host = grads_host.detach()

but I still have no clue what the actual problem is.

Can I somehow turn off computation graph construction in the forward and backward functions of torch.autograd.Function? I thought this happened automatically in those functions, but apparently not.
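
What I have in mind is something along these lines (just a sketch):

# Disable graph recording locally while accumulating into the host buffer.
with torch.no_grad():
    grads_host[:,:,top:bottom,:] += grad_input_from_cudnn.to(torch.device('cpu'))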