How to fully optimize a custom convolution layer?

I have implemented a custom conv layer, much like here. However, it is still memory inefficient compared to the native PyTorch convolution layer, and slower. How can I fix both the speed and the memory consumption issues? Any drastic measure is acceptable, such as coding at the lowest level and building up from there, since these issues really matter to me. The ideal outcome is a conv layer that is as fast and memory efficient as the native one. I would appreciate any suggestions.

You could directly reuse PyTorch's native convolutions to speed up your slower approach.
The cuDNN conv implementations are closed source, so if you want to go lower level you would have to study open reference implementations in CUDA (e.g. the convs in cutlass).
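For example, you can wrap the native ops in a custom `autograd.Function` so both passes still run through the optimized kernels while staying editable. A minimal sketch (the class name and hard-coded arguments are just illustrative; `torch.nn.grad.conv2d_input` / `conv2d_weight` are the public helpers that compute the conv gradients with native kernels):

```python
import torch
import torch.nn.functional as F
from torch.nn.grad import conv2d_input, conv2d_weight

class MyConv2dFn(torch.autograd.Function):
    """Conv2d that delegates to native (cuDNN-backed) ops in both passes,
    leaving places where custom computation can be inserted."""

    @staticmethod
    def forward(ctx, input, weight, bias=None, stride=1, padding=0):
        ctx.save_for_backward(input, weight)
        ctx.stride, ctx.padding = stride, padding
        # <- insert custom forward computation here
        return F.conv2d(input, weight, bias, stride=stride, padding=padding)

    @staticmethod
    def backward(ctx, grad_output):
        input, weight = ctx.saved_tensors
        grad_input = grad_weight = grad_bias = None
        # <- insert custom backward computation here
        if ctx.needs_input_grad[0]:
            grad_input = conv2d_input(input.shape, weight, grad_output,
                                      stride=ctx.stride, padding=ctx.padding)
        if ctx.needs_input_grad[1]:
            grad_weight = conv2d_weight(input, weight.shape, grad_output,
                                        stride=ctx.stride, padding=ctx.padding)
        if ctx.needs_input_grad[2]:
            grad_bias = grad_output.sum(dim=(0, 2, 3))
        return grad_input, grad_weight, grad_bias, None, None
```

`MyConv2dFn.apply(x, w, b)` should then match `F.conv2d(x, w, b)` while running entirely on the native kernels (note that `apply` takes positional arguments only).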

Thanks! However, I need to modify the computations in the forward and backward passes of the conv layer, so I need to know exactly how functional.conv2d calls the cuDNN API.

The cuDNN backend API calls are defined here, so you might want to check this file to see how cuDNN is used.

@ptrblck Thanks! I followed the link and it was helpful. However, I have another idea for optimizing the memory consumption. In this code, we save input, weight, and bias in the ctx for the backward pass computations, which is presumably the source of the huge memory consumption, since these tensors have already been allocated elsewhere (e.g. weight is the torch.nn.Parameter of the conv layer). I wonder how we can avoid this redundancy. At least we know that the native PyTorch conv layer has already solved this problem. Do you know any alternatives for this purpose?

I haven’t checked the code in detail, but I would expect the weight and bias to be tiny compared to the stored input activation, since the spatial size of the incoming activation is usually much larger. For example, a 3x3 conv with 64 input and output channels holds 64*64*3*3 ≈ 37K weight values, while a single 64-channel 224x224 activation already holds 64*224*224 ≈ 3.2M values.
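As a side note, `ctx.save_for_backward` stores references to the tensors (plus version counters to detect in-place modifications), not copies, so saving `weight` in the ctx does not allocate a second copy of the parameter. You can verify this with a saved-tensors pack hook (available since PyTorch 1.10); the variable names here are just illustrative:

```python
import torch
import torch.nn.functional as F

w = torch.randn(8, 3, 3, 3, requires_grad=True)
x = torch.randn(1, 3, 16, 16, requires_grad=True)

saved_ptrs = []

def pack(t):
    # Called whenever autograd saves a tensor for the backward pass.
    saved_ptrs.append(t.data_ptr())
    return t

with torch.autograd.graph.saved_tensors_hooks(pack, lambda t: t):
    y = F.conv2d(x, w)

# The saved weight aliases the parameter's storage: no copy was made.
print(w.data_ptr() in saved_ptrs)  # True
```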
