PyTorch Convolutions with Large Batch Sizes

Hi, I am currently working on a project where I have 10,000 images of size 100x100, and I have to convolve each of them with a filter of size m x m, where m can range from 5 to 100.

I have found that when my kernel is small (m <= 10), convolving all 10,000 images takes tens of seconds, and when my kernel is large (m = 100), it takes several minutes.

Is there any way to speed up these computations for small or large m? I have tried FFT convolutions, but they also don't give me the desired speedup because of how large my batch size is. My goal is to convolve all 10,000 images in milliseconds, if possible.
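
For concreteness, the two approaches mentioned above (direct convolution and FFT convolution) might look roughly like the sketch below. This is not my exact code: the function names `direct_conv2d_full` and `fft_conv2d_full`, the (N, H, W) batch layout, and the choice of "full" output size (H + m - 1) are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def direct_conv2d_full(images, kernel):
    """Full 2D convolution of a batch (N, H, W) with one (mh, mw) kernel.

    Hypothetical helper; the "full" output size is an assumption.
    """
    mh, mw = kernel.shape
    # F.conv2d computes cross-correlation, so flip the kernel
    # in both dimensions to obtain a true convolution.
    weight = torch.flip(kernel, dims=(0, 1))[None, None]
    out = F.conv2d(images[:, None], weight, padding=(mh - 1, mw - 1))
    return out[:, 0]


def fft_conv2d_full(images, kernel):
    """Same result as direct_conv2d_full, computed via real FFTs."""
    h, w = images.shape[-2:]
    mh, mw = kernel.shape
    size = (h + mh - 1, w + mw - 1)
    # Zero-pad both operands to the full output size, multiply the
    # spectra (the kernel spectrum broadcasts over the batch), invert.
    images_f = torch.fft.rfft2(images, s=size)
    kernel_f = torch.fft.rfft2(kernel, s=size)
    return torch.fft.irfft2(images_f * kernel_f, s=size)


if __name__ == "__main__":
    # Small sizes so the demo runs quickly; the real problem is
    # 10,000 images of 100x100 with m between 5 and 100.
    imgs = torch.randn(8, 100, 100)
    k = torch.randn(5, 5)
    print(direct_conv2d_full(imgs, k).shape)
    print(fft_conv2d_full(imgs, k).shape)
```

Both functions process the whole batch in one call, with no Python loop over images. Moving `images` and `kernel` to a GPU with `.to("cuda")` before calling them should work unchanged, assuming a CUDA device is available.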

P.S. If this is not possible, I also know that each of the 10,000 images I convolve my kernel with is a rank-one matrix, in case that is of any use.
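
One way the rank-one structure could be exploited, sketched here under assumptions: if each image is available as an outer product X = u v^T, and the kernel is decomposed by SVD as K = sum_r s_r p_r q_r^T, then X * K = sum_r s_r (u * p_r)(v * q_r)^T, so the 2D convolution reduces to batched 1D convolutions followed by one batched matrix product. The function name `rank_one_conv2d_full`, the assumption that the factors `u` and `v` are known, and the "full" output size are illustrative choices, and I have not benchmarked this.

```python
import torch
import torch.nn.functional as F


def rank_one_conv2d_full(u, v, kernel):
    """Full 2D convolution of images X_n = outer(u_n, v_n) with one kernel.

    Hypothetical helper. Shapes: u (N, H), v (N, W), kernel (mh, mw);
    returns (N, H + mh - 1, W + mw - 1).
    """
    mh, mw = kernel.shape
    # kernel = P @ diag(s) @ Qh, with columns of P and rows of Qh
    # acting as 1D filters along the two image axes.
    P, s, Qh = torch.linalg.svd(kernel, full_matrices=False)
    # F.conv1d is cross-correlation, so flip each 1D filter.
    # Weight layout is (out_channels, in_channels, length).
    w_rows = torch.flip(P.T, dims=(1,))[:, None, :]
    w_cols = torch.flip(Qh, dims=(1,))[:, None, :]
    A = F.conv1d(u[:, None, :], w_rows, padding=mh - 1)  # (N, r, H + mh - 1)
    B = F.conv1d(v[:, None, :], w_cols, padding=mw - 1)  # (N, r, W + mw - 1)
    # Sum over r of s_r * A[n, r, i] * B[n, r, j] as a batched matmul.
    return torch.matmul((A * s[:, None]).transpose(1, 2), B)


if __name__ == "__main__":
    u = torch.randn(8, 100)
    v = torch.randn(8, 100)
    k = torch.randn(5, 5)
    print(rank_one_conv2d_full(u, v, k).shape)
```

If only the images themselves are available, the factors would first have to be recovered (for example from a batched SVD of the images), which adds its own cost.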