Is there a host memory leak in cudnn_convolution_backward?

I’m creating a custom convolution layer (using PyTorch 10.2) and experiencing a memory leak (on the host, not the device) whenever I use stride=2. So far, I’ve isolated the leak to the call to cudnn_convolution_backward (basically this code) with the output mask (True, True). If I use the mask (False, True) (i.e. don’t compute the backward pass for the input), the leak goes away. It also goes away when I use stride=1. I’m calling the function from Python via torch.utils.cpp_extension.load. Is this a known issue? I couldn’t find anything on GitHub.
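
For reference, the Python side looks roughly like this. This is only a simplified sketch: the extension name, the source file, and the backward entry point are placeholders, and the argument order just mirrors at::cudnn_convolution_backward, whose exact signature differs between PyTorch versions.

import torch
from torch.utils.cpp_extension import load

# Hypothetical extension; the C++ side is a thin wrapper that forwards
# its arguments to at::cudnn_convolution_backward.
ext = load(name="conv_backward_ext", sources=["conv_backward.cpp"])

x = torch.randn(8, 16, 32, 32, device="cuda")
w = torch.randn(32, 16, 3, 3, device="cuda")
grad_out = torch.randn(8, 32, 16, 16, device="cuda")  # output shape for stride=2, padding=1

# output_mask = (grad_input, grad_weight); the leak only shows up with
# (True, True) and stride > 1, not with (False, True) or with stride=1.
grad_input, grad_weight = ext.backward(
    x, grad_out, w,
    (1, 1),        # padding
    (2, 2),        # stride
    (1, 1),        # dilation
    1,             # groups
    False, False,  # benchmark, deterministic
    True,          # allow_tf32
    (True, True),  # output_mask
)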

I’m not aware of a memory leak in any cuDNN call at the moment. Are you seeing the same “leak” with plain PyTorch calls or only with your custom extension?

There’s no leak if I use the regular PyTorch Conv2d layer.

It seems like the leak is not in the cuDNN function itself but rather somewhere in transferring the data to the host.
There’s no leak if I do

grads_host[:,:,top:bottom,:].copy_(grad_input_from_cudnn)

But there’s a leak if I instead execute

grads_host[:,:,top:bottom,:] += grad_input_from_cudnn.to(torch.device('cpu'))

I need to use the latter but I don’t know what’s wrong with this line. Any tips?

The second snippet accumulates the CPU copy of grad_input_from_cudnn into grads_host and thus also keeps the entire attached computation graph alive with it, which would be expected behavior rather than a leak.
You might want to check if that is the case and whether you need to detach the tensor before the accumulation (which depends on the actual use case).
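
For example, a minimal sketch (reusing your variable names) of detaching before the accumulation:

# Detach first, so only the values are accumulated into the CPU buffer
# and the autograd graph attached to grad_input_from_cudnn is not kept alive.
grads_host[:,:,top:bottom,:] += grad_input_from_cudnn.detach().to(torch.device('cpu'))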

I thought that inside a torch.autograd.Function’s backward I wouldn’t need to detach anything, but I tried it and it doesn’t solve the issue; the memory still keeps growing with each step.
Also, I’m still not seeing any leak for stride=1, only when the stride is greater than 1.
My temporary workaround is to transfer the grads to the GPU, perform the addition there, and then use copy_ to transfer them back. But that is of course computationally more expensive.
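
Roughly like this (just a sketch of the workaround, with the same variable names as above):

# Workaround: bring the slice of the host buffer to the GPU, accumulate there,
# and copy the result back into the pre-allocated CPU tensor.
tile = grads_host[:,:,top:bottom,:].to(grad_input_from_cudnn.device)
tile += grad_input_from_cudnn
grads_host[:,:,top:bottom,:].copy_(tile)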

I just found out that the leak grows more slowly if I do

grads_host = grads_host.detach()

but I still have no clue what the actual problem is.

Can I somehow turn off computation graph construction in the forward and backward functions of torch.autograd.Function? I thought this happened automatically in those functions, but apparently not.
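
What I have in mind is something along these lines (just a sketch):

# Disable graph recording locally while accumulating into the host buffer.
with torch.no_grad():
    grads_host[:,:,top:bottom,:] += grad_input_from_cudnn.to(torch.device('cpu'))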