Thanks for the info, just now getting back to this after the holidays. The cuDNN benchmark flag certainly helped successive calls, but even on the GPU, larger convolutions are still slower than scipy's CPU implementation. My understanding was that cuDNN is supposed to switch to an FFT-based algorithm once the kernel gets large enough, as discussed in this other forum post on FFT convolutions. @kevinj22, did you ever figure out your timing discrepancies?
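In case the discrepancy is a measurement artifact: CUDA kernels launch asynchronously, so naive wall-clock timing can badly misreport GPU work. A minimal timing sketch (the shapes here are arbitrary) that warms up the cuDNN autotuner and synchronizes before reading the clock:

```python
import torch
import torch.nn.functional as F

torch.backends.cudnn.benchmark = True  # let cuDNN pick the fastest algorithm per shape

x = torch.randn(1, 1, 4096, 4096, device='cuda')
w = torch.randn(1, 1, 63, 63, device='cuda')

# Warm-up so the benchmark autotuning runs outside the timed region
for _ in range(3):
    F.conv2d(x, w)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
y = F.conv2d(x, w)
end.record()
torch.cuda.synchronize()  # wait for the kernel before reading the events
print(f"conv2d took {start.elapsed_time(end):.2f} ms")
```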
Also, just to confirm: if I want to do my own convolutions with larger kernels, I'll need to write a custom function inside the PyTorch framework so that autograd can keep track of the operations (i.e. there's no way to preserve autograd tracking when dropping out to numpy to use scipy's FFT implementation)?
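Something like the following is what I have in mind: a custom `torch.autograd.Function` that calls `scipy.signal.fftconvolve` in the forward pass and supplies the gradients by hand (the gradient of a convolution is a correlation, i.e. a convolution with the flipped operand). This is only a 1-D sketch, assuming CPU float tensors with no batch dimensions:

```python
import torch
from scipy.signal import fftconvolve

class FFTConv1d(torch.autograd.Function):
    """Full 1-D convolution via scipy's FFT path, with hand-written gradients.
    Sketch only: assumes CPU tensors, no batching, no device handling."""

    @staticmethod
    def forward(ctx, signal, kernel):
        ctx.save_for_backward(signal, kernel)
        out = fftconvolve(signal.detach().numpy(),
                          kernel.detach().numpy(), mode='full')
        return torch.as_tensor(out, dtype=signal.dtype)

    @staticmethod
    def backward(ctx, grad_output):
        signal, kernel = ctx.saved_tensors
        g = grad_output.detach().numpy()
        s = signal.detach().numpy()
        k = kernel.detach().numpy()
        # grad wrt signal: correlate grad_output with the kernel,
        # i.e. convolve with the flipped kernel ('valid' restores the length)
        grad_signal = fftconvolve(g, k[::-1], mode='valid')
        grad_kernel = fftconvolve(g, s[::-1], mode='valid')
        return (torch.as_tensor(grad_signal, dtype=signal.dtype),
                torch.as_tensor(grad_kernel, dtype=kernel.dtype))

s = torch.randn(10_000, requires_grad=True)
k = torch.randn(500, requires_grad=True)
out = FFTConv1d.apply(s, k)
out.sum().backward()  # gradients flow through the custom Function
```

That said, if a recent PyTorch is an option, the `torch.fft` ops are themselves differentiable, so an FFT convolution written entirely in torch operations would keep autograd tracking without any custom `Function`.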