I have very large kernels (from 63 x 63 to 255 x 255) and would like to perform convolutions on an image of size 512 x 512. Currently, I get OOM errors because I think that PyTorch performs an nn.Unfold operation to treat the convolution as a matrix-vector product.

Is there a way to perform such large convolutional operations, for example using a distributed or a sequential approach?

Depending on your setup and whether cuDNN is used, the convolution could be computed internally as a matrix multiplication or with any other cuDNN algorithm.
Could you post the activation shape that is causing the OOM?

Input and output have size n x 21 x 512 x 512, and I pad the image to get the correct sizes.

I can go with n = 1 if needed; however, I would prefer to keep n > 1. This should be a minor problem since the batch dimension can be parallelized.

The problem is that the unfolded input windows (n x 512^2 x k^2 elements per input channel) generated by the mini-batch are too large to fit into memory. With n = 1, k = 255 (my max kernel size), and float32, the windows for a single channel occupy ~63 GB of memory.
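For reference, the estimate above can be reproduced with a few lines of plain Python. This is only a back-of-the-envelope sketch: the helper name `unfold_bytes` is made up for illustration, and it assumes a fully materialized unfold buffer with one k x k window per output pixel and 4 bytes per float32 element.

```python
def unfold_bytes(n, channels, height, width, k, bytes_per_element=4):
    """Bytes needed to materialize every k x k window (hypothetical helper).

    Assumes one window per output pixel ('same' padding, stride 1).
    """
    return n * channels * height * width * k * k * bytes_per_element


GIB = 2 ** 30

# One channel of a single 512 x 512 image with the largest kernel (k = 255).
per_channel = unfold_bytes(n=1, channels=1, height=512, width=512, k=255)

# All 21 channels of the same image.
all_channels = unfold_bytes(n=1, channels=21, height=512, width=512, k=255)

print(f"per channel:  {per_channel / GIB:.1f} GiB")
print(f"all channels: {all_channels / GIB:.1f} GiB")
```

For k = 255 this gives about 63.5 GiB per channel, so the full 21-channel input would need roughly 21 times that if the windows were ever stored explicitly.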

I was thinking of writing a custom convolution in C++ with LibTorch that iterates over the pixels of the image and performs the convolution directly, without the time-efficient but memory-inefficient unfold trick.
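A rough Python sketch of the same idea, for anyone who wants to try it before writing C++. Note that it is a variant, not the per-pixel loop described above: it loops over the k x k kernel offsets and accumulates shifted copies of the input, so the only large buffers are the padded input and the output. The function name `shift_accumulate_conv2d` is hypothetical, and it assumes an odd, square kernel, stride 1, and zero "same" padding. It will be slow for k = 255 (k^2 iterations), so treat it as a memory-bounded baseline only.

```python
import torch
import torch.nn.functional as F


def shift_accumulate_conv2d(x, weight):
    """'Same'-size cross-correlation without an unfold buffer (sketch).

    x:      (n, c_in, H, W)
    weight: (c_out, c_in, k, k), k odd (assumption)
    """
    n, _, H, W = x.shape
    c_out, _, k, _ = weight.shape
    p = k // 2
    x_pad = F.pad(x, (p, p, p, p))  # zero padding on all four sides
    out = x.new_zeros((n, c_out, H, W))
    for i in range(k):
        for j in range(k):
            # View of the input shifted by the kernel offset (i, j).
            window = x_pad[:, :, i:i + H, j:j + W]
            # Mix input channels with the (c_out, c_in) weights at this offset.
            out += torch.einsum("oc,nchw->nohw", weight[:, :, i, j], window)
    return out
```

The result should match `F.conv2d(x, weight, padding=k // 2)` up to floating-point rounding.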

Alternatively, I can work in Python by cutting the input image into patches and performing the convolution patch by patch (I was thinking of using the groups parameter of Conv2d, but I don’t think that’s possible), taking care of the correct padding between the image patches.
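A minimal sketch of the patch-wise approach, assuming an odd, square kernel, stride 1, and zero "same" padding. The function name `tiled_conv2d` and the `tile` parameter are hypothetical. The input is padded once, and each output tile reads its own patch plus a halo of k // 2 pixels on every side, so neighboring tiles need no special handling at their borders.

```python
import torch
import torch.nn.functional as F


def tiled_conv2d(x, weight, tile=128):
    """'Same'-size cross-correlation computed one output tile at a time.

    x:      (n, c_in, H, W)
    weight: (c_out, c_in, k, k), k odd (assumption)
    tile:   edge length of each output tile (hypothetical tuning knob)
    """
    n, _, H, W = x.shape
    c_out, _, k, _ = weight.shape
    p = k // 2
    # Pad once so every tile can read its halo of p pixels on each side.
    x_pad = F.pad(x, (p, p, p, p))
    out = x.new_empty((n, c_out, H, W))
    for top in range(0, H, tile):
        for left in range(0, W, tile):
            h = min(tile, H - top)
            w = min(tile, W - left)
            # Input patch covering the output tile plus its halo.
            patch = x_pad[:, :, top:top + h + 2 * p, left:left + w + 2 * p]
            # Unpadded convolution of the patch yields exactly h x w outputs.
            out[:, :, top:top + h, left:left + w] = F.conv2d(patch, weight)
    return out
```

Peak memory for the intermediate buffers then scales with the tile area rather than the full 512 x 512 image, at the cost of re-reading the halo for every tile. With k = 255 the halo is 127 pixels per side, so very small tiles spend most of their work on overlap.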

The last resort could be to implement a pytorch_sparse unfold operator, considering that the images have a lot of pixels set to 0.