PyTorch slow for larger convolutions

I’ve run into an issue where larger convolutions are slow in PyTorch (e.g. a 128x128 kernel convolved with a 512x512 image). Is there something I can do to speed it up, or am I doing something incorrectly?

import torch
import numpy as np
from skimage.color import rgb2gray
from skimage.data import astronaut
import matplotlib.pyplot as plt
from scipy.signal import fftconvolve
import time

# 512x512 grayscale test image
img_now = rgb2gray(astronaut())

# build an asymmetric psf so no specific tricks (e.g. separable, DCT, etc.) can be used
psf_extent = 128
x = np.linspace(-1, 1, psf_extent)
y = np.linspace(-1, 1, psf_extent).reshape((-1, 1))
sigma_sqd = (x + 2)**5 + (y + 2)**5
psf_now = np.exp(-((x + .1)**2 + (y + .1)**2) / (1e-3 * sigma_sqd))

# conv2d expects (N, C, H, W) inputs and (out_channels, in_channels, kH, kW) weights
tp = torch.from_numpy(img_now).view(1, 1, *img_now.shape)
tp.requires_grad_(False)
pp = torch.from_numpy(psf_now).view(1, 1, *psf_now.shape)
pp.requires_grad_(False)

start = time.time()
blur_img = torch.nn.functional.conv2d(tp, pp, padding=psf_now.shape[0] // 2 - 1)
print('pytorch conv2d elapsed: %fs' % (time.time() - start))

start = time.time()
blur_scipy = fftconvolve(img_now, psf_now, mode='same')
print('scipy fftconvolve elapsed: %fs' % (time.time() - start))

I get ~25s for the PyTorch implementation vs ~0.04s for SciPy. I could understand that if PyTorch were summing an actual sliding window, but my understanding is that PyTorch (or at least the latest v1.0) infers which type of convolution to use. Even for smaller kernels (changing psf_extent above), the SciPy version still does better. Moving to the GPU helps (down to ~2s on my machine), but this doesn’t seem like the type of operation I should have to force onto the GPU.
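
For reference, a sketch of how the GPU path can be timed (this assumes a CUDA device is available; CUDA launches are asynchronous, so torch.cuda.synchronize() is needed before reading the clock):

tp_gpu = tp.cuda()
pp_gpu = pp.cuda()

torch.cuda.synchronize()                 # make sure the host-to-device copies have finished
start = time.time()
blur_gpu = torch.nn.functional.conv2d(tp_gpu, pp_gpu, padding=psf_now.shape[0] // 2 - 1)
torch.cuda.synchronize()                 # wait for the kernel to finish before reading the clock
print('pytorch conv2d (GPU) elapsed: %fs' % (time.time() - start))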

Thanks for the report!
Could you have a look at your memory usage during the PyTorch conv operation?
Currently it tries to allocate 31GB on my machine (using the CPU).

Thanks for taking a look, ptrblck. I can also confirm that on my machine the PyTorch conv operation wants to use 31GB of memory (using the CPU).

Thanks for the info.
It certainly looks like unwanted behavior to me, so maybe @albanD could give some input on this.

Hi,

I’m afraid that on the CPU we have a single convolution implementation, which is based on matrix-matrix multiplication. This is very good for small kernels but will create huge matrices if the kernel is big. I don’t remember any plan to add more implementations.
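
As a rough back-of-the-envelope sketch (assuming the matrix-multiplication path materializes one 128x128 patch per output pixel, i.e. an im2col-style unfold; this is only an estimate, not PyTorch’s actual allocation logic), the numbers from the snippet above line up with the ~31GB that ptrblck observed:

H = W = 512                           # image size
kH = kW = 128                         # kernel size
pad = kH // 2 - 1                     # padding used in the snippet above
out_h = H + 2 * pad - kH + 1          # 511
out_w = W + 2 * pad - kW + 1          # 511
elems = (kH * kW) * (out_h * out_w)   # one column of 128*128 values per output pixel
print('unfolded matrix: %.1f GiB (float64)' % (elems * 8 / 2**30))
# -> roughly 32 GiB, in line with the ~31GB allocation observed above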

If you’re using the GPU, setting torch.backends.cudnn.benchmark=True should let cuDNN pick an algorithm that is efficient for your use case.
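
For example (a sketch building on the GPU timing snippet above; the first call after enabling benchmark mode is a warm-up while cuDNN tries its available algorithms, and subsequent calls with the same input shape reuse the fastest one):

torch.backends.cudnn.benchmark = True    # let cuDNN autotune the algorithm for this shape

_ = torch.nn.functional.conv2d(tp_gpu, pp_gpu, padding=psf_now.shape[0] // 2 - 1)  # warm-up
torch.cuda.synchronize()

start = time.time()
blur_gpu = torch.nn.functional.conv2d(tp_gpu, pp_gpu, padding=psf_now.shape[0] // 2 - 1)
torch.cuda.synchronize()
print('cudnn (benchmark=True) conv2d elapsed: %fs' % (time.time() - start))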


Thanks for the info, just now getting back to this after the holidays. The cudnn benchmark certainly helped successive calls, but even on the GPU the larger convolutions are slower than SciPy’s CPU implementation. My understanding is that the cuDNN kernels are supposed to switch to an FFT algorithm when appropriate, as discussed in this other forum post on fft convolutions. @kevinj22 did you ever figure out your timing discrepancies?

Also, just to confirm: if I want to do my own convolutions with larger kernels, I’ll need to create my own functions within the PyTorch framework so that autograd can keep track of the operations (i.e. there’s no way to preserve autograd tracking when going to numpy to use SciPy’s FFT implementation)?

Hi,

You will indeed have to create a new Function, but your forward and backward methods can simply call into SciPy’s FFT implementation. Also keep in mind that moving data between the CPU and the GPU is not free and can impact performance if you do it a lot.
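
As an illustration, here is a minimal sketch of such a Function (the name FFTConv2dSame is made up for this example; it treats the PSF as a fixed, non-learned kernel and assumes 'same'-mode alignment, which is only exact for odd-sized kernels):

import numpy as np
import torch
from scipy.signal import fftconvolve

class FFTConv2dSame(torch.autograd.Function):
    @staticmethod
    def forward(ctx, img, psf):
        # Run the convolution on the CPU with scipy, then wrap the result back
        # into a tensor so autograd can keep tracking downstream operations.
        img_np = img.detach().cpu().numpy()
        psf_np = psf.detach().cpu().numpy()
        ctx.save_for_backward(psf)
        out = fftconvolve(img_np, psf_np, mode='same')
        return torch.from_numpy(out).to(img.device, img.dtype)

    @staticmethod
    def backward(ctx, grad_output):
        # d(img conv psf)/d(img) is correlation with psf, i.e. convolution
        # with the flipped kernel.  The PSF is treated as a constant here.
        (psf,) = ctx.saved_tensors
        grad_np = grad_output.detach().cpu().numpy()
        psf_np = psf.detach().cpu().numpy()
        grad_img = fftconvolve(grad_np, psf_np[::-1, ::-1], mode='same')
        grad_img = torch.from_numpy(grad_img).to(grad_output.device, grad_output.dtype)
        return grad_img, None

# usage with the arrays from the first snippet in this thread
img = torch.from_numpy(img_now).requires_grad_(True)
psf = torch.from_numpy(psf_now)
blur = FFTConv2dSame.apply(img, psf)
blur.sum().backward()        # gradients flow back to img through scipy's FFT path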