Difference b/w F.grad.conv2d_input and cudnn_convolution_backward

I’m facing a memory leak (or, more likely, a growing number of objects) in my custom autograd function’s backward pass when I use a wrapper around cudnn_convolution_backward. If I switch to F.grad.conv2d_input, the CPU memory increase per epoch is gone, but I can’t use it because it’s much slower than the wrapper around the cudnn function.
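For context, both variants compute the same quantity, the gradient of the loss w.r.t. the convolution input. A minimal sketch of the pure-PyTorch variant (shapes and padding are placeholders; I’m not showing my cudnn wrapper here because the exact binding depends on the PyTorch version):

```python
import torch
import torch.nn.grad
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"

# placeholder shapes: batch 8, 3 -> 16 channels, 3x3 kernel, padding 1
x = torch.randn(8, 3, 32, 32, device=device, requires_grad=True)
w = torch.randn(16, 3, 3, 3, device=device)

out = F.conv2d(x, w, padding=1)
grad_out = torch.randn_like(out)
out.backward(grad_out)

# pure-PyTorch gradient of the loss w.r.t. the conv input;
# this should match what autograd (and the cudnn backward op) produces
grad_in = torch.nn.grad.conv2d_input(x.shape, w, grad_out, padding=1)
print(torch.allclose(x.grad, grad_in, atol=1e-4))
```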
So the question is: what is at::cudnn_convolution_backward_input() doing that could lead to the increased memory usage?
I’m keeping the output of the layer and the gradient w.r.t. the input on CPU and transferring them to the GPU for the forward and backward passes.
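Roughly, the offloading pattern looks like this (simplified sketch; the layer parameters are placeholders, the weight gradient is omitted, and in my real code the conv2d_input call in backward is replaced by the cudnn wrapper):

```python
import torch
import torch.nn.grad
import torch.nn.functional as F


class OffloadedConv2d(torch.autograd.Function):
    """Sketch: the input, output, and input gradient live on CPU between passes
    and are moved to the GPU only for the actual convolution computation."""

    @staticmethod
    def forward(ctx, input_cpu, weight):
        input_gpu = input_cpu.to(weight.device)
        output_gpu = F.conv2d(input_gpu, weight, padding=1)
        ctx.save_for_backward(weight)
        ctx.input_shape = input_gpu.shape
        # keep only a CPU copy of the layer output between forward and backward
        return output_gpu.cpu()

    @staticmethod
    def backward(ctx, grad_output_cpu):
        (weight,) = ctx.saved_tensors
        grad_output_gpu = grad_output_cpu.to(weight.device)
        # slow-but-stable variant; the faster variant calls the cudnn backward op here
        grad_input_gpu = torch.nn.grad.conv2d_input(
            ctx.input_shape, weight, grad_output_gpu, padding=1
        )
        # the gradient w.r.t. the input is kept on CPU as well
        return grad_input_gpu.cpu(), None
```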
Also, the value of len(gc.get_objects()) stays the same across epochs, so either some PyTorch objects aren’t tracked by gc(?) or an existing object is growing(?).
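A slightly finer-grained check than len(gc.get_objects()) would be to count the live tensors the GC can actually see and sum their sizes; tensors referenced only from C++ (which is what I suspect here) won’t appear:

```python
import gc
import torch


def live_tensor_stats():
    """Count tensors visible to the Python GC and their approximate data size.
    Tensors held only from C++ (e.g. saved inside autograd graph nodes or
    backend workspaces) are not Python objects and will not show up here."""
    count, nbytes = 0, 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                count += 1
                nbytes += obj.element_size() * obj.nelement()
        except Exception:
            pass  # be defensive: some objects misbehave during inspection
    return count, nbytes


print(len(gc.get_objects()), live_tensor_stats())
```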