I have very large kernels (from 63 x 63 to 255 x 255) and would like to perform convolutions on an image of size 512 x 512. Currently, I get OOM errors, I believe because PyTorch performs an nn.Unfold-style (im2col) operation to treat the convolution as a matrix multiplication.
Is there a way to perform such large convolutional operations, for example using a distributed or a sequential approach?
Depending on your setup and whether cuDNN is used, a matrix multiplication or another cuDNN algorithm could be used internally.
Could you post the activation shape, which is causing the OOM?
Input and output have size n x 21 x 512 x 512, and I pad the image to get the correct sizes.
I can go with n = 1 if needed; however, I prefer to keep n > 1. This should be a minor issue, since the batch dimension can be parallelized.
The problem is that the unfolded input windows (n x 512^2 x k^2 elements) generated by the mini-batch are too large to fit into memory. With n = 1 and k = 255 (my max kernel size), and working in float32, the unfolded windows occupy ~63 GB of memory.
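As a quick sanity check on that figure, the im2col buffer size for the worst case above (n = 1, a 512 x 512 image, a 255 x 255 kernel, float32) can be computed directly:

```python
# Rough size of the unfolded (im2col) buffer for the worst case quoted
# above: n = 1, 512 x 512 spatial positions, one 255 x 255 window each,
# 4 bytes per float32 element (per input channel).
n, positions, k = 1, 512 * 512, 255
bytes_per_float32 = 4
unfold_bytes = n * positions * k * k * bytes_per_float32
print(f"{unfold_bytes / 2**30:.1f} GiB")  # ~63.5 GiB
```

With 21 input channels the buffer would be larger still, so the estimate above is a lower bound.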
I was thinking of writing a C++ implementation in LibTorch of a custom convolution that iterates over the pixels of the image and performs the convolution directly, without the time-efficient but memory-inefficient unfold trick.
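To illustrate the idea (in Python rather than C++/LibTorch, and only as a sketch): a direct convolution visits one k x k window at a time, so it never materializes the full im2col buffer. The loop below would of course need to live in C++ or CUDA to be fast; `direct_conv2d` is a hypothetical name, and it assumes odd kernels, stride 1, and "same" zero padding.

```python
import torch
import torch.nn.functional as F

def direct_conv2d(x, weight):
    """Naive direct convolution (cross-correlation, like F.conv2d).

    Only one (N, C, k, k) window is alive at a time, so no im2col
    buffer is allocated. Sketch only: assumes an odd kernel, stride 1,
    and 'same' zero padding; a real implementation would move this
    loop into C++/CUDA.
    """
    N, C, H, W = x.shape
    O, _, k, _ = weight.shape
    p = k // 2
    xp = F.pad(x, (p, p, p, p))                      # zero-pad once
    out = x.new_zeros(N, O, H, W)
    for i in range(H):
        for j in range(W):
            window = xp[:, :, i:i + k, j:j + k]      # (N, C, k, k)
            # contract the window against every output filter
            out[:, :, i, j] = torch.einsum("nckl,ockl->no", window, weight)
    return out
```

On small inputs this matches `F.conv2d(x, weight, padding=k // 2)`.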
Alternatively, I could work in Python by cutting the input image into patches and performing the convolution tile by tile, taking care of correct padding between the image patches (I was thinking of using the `groups` parameter of Conv2d, but I don't think that is possible).
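A minimal sketch of that tiling idea, assuming odd kernels, stride 1, and tile sizes that divide the image: each tile is taken with a halo of k // 2 pixels from its neighbours (zero padding at the image border), so stitching the per-tile outputs reproduces a "same"-padded `F.conv2d`. `tiled_conv2d` is a hypothetical helper, not an existing API.

```python
import torch
import torch.nn.functional as F

def tiled_conv2d(x, weight, tile=128):
    """Convolve x (N, C, H, W) with weight (O, C, k, k) tile by tile.

    Each tile carries a halo of k // 2 border pixels from its
    neighbours, so the stitched result equals F.conv2d with 'same'
    zero padding. Sketch only: assumes an odd kernel, stride 1, and
    H, W divisible by `tile`. Peak memory per step is one padded tile
    instead of the whole unfolded image.
    """
    k = weight.shape[-1]
    pad = k // 2
    xp = F.pad(x, (pad, pad, pad, pad))              # zero-pad once
    N, C, H, W = x.shape
    out = x.new_zeros(N, weight.shape[0], H, W)
    for i in range(0, H, tile):
        for j in range(0, W, tile):
            patch = xp[:, :, i:i + tile + 2 * pad, j:j + tile + 2 * pad]
            out[:, :, i:i + tile, j:j + tile] = F.conv2d(patch, weight)
    return out
```

Note the halo still costs extra compute on the overlap regions, and with a 255 x 255 kernel the halo is larger than a small tile, so tile sizes would need to stay comparable to the kernel size.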
The last resort would be to implement a pytorch_sparse unfold operator, considering that the images have a lot of pixels set to 0.
Yes! I remember contributing to a repo that performs the convolution in Fourier space, which was the solution in my case. You can find more information here.
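For reference, the FFT approach can be sketched directly with `torch.fft` (this is a minimal illustration, not the linked repo's implementation): pad both image and kernel to H + k - 1 to avoid circular wrap-around, multiply the spectra, and crop the "same" region. The cost depends on H * W * log(H * W), not on k, so a 255 x 255 kernel is no more expensive than a small one. Assumes odd kernels and stride 1; `fft_conv2d` is a hypothetical name.

```python
import torch
import torch.nn.functional as F

def fft_conv2d(x, weight):
    """'Same' convolution via FFT. x: (N, C, H, W), weight: (O, C, k, k).

    Matches F.conv2d(x, weight, padding=k // 2) for odd k and stride 1.
    Runtime is independent of the kernel size, so very large kernels
    become cheap; no im2col buffer is ever built.
    """
    k = weight.shape[-1]
    p = k // 2
    H, W = x.shape[-2:]
    fh, fw = H + k - 1, W + k - 1        # linear-conv size: no wrap-around
    # F.conv2d is cross-correlation, so flip the kernel before the FFT
    wf = torch.flip(weight, dims=(-2, -1))
    Fx = torch.fft.rfft2(x, s=(fh, fw))          # (N, C, fh, fw//2 + 1)
    Fw = torch.fft.rfft2(wf, s=(fh, fw))         # (O, C, fh, fw//2 + 1)
    # multiply spectra and sum over input channels: (N, O, fh, fw//2 + 1)
    Fy = (Fx.unsqueeze(1) * Fw.unsqueeze(0)).sum(dim=2)
    y = torch.fft.irfft2(Fy, s=(fh, fw))
    # crop the 'same' region of the full linear convolution
    return y[..., p:p + H, p:p + W]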