What's faster: one tall convolution or many short convolutions stacked?

this is strictly about inference rather than training

let’s say i want to produce a stack of 50 filtered images from a single 1-channel input image. is it faster to convolve with a stack of 50 filters, or to convolve with (e.g.) 10 stacks of 5 filters and then torch.stack the results?

right now i’m performing the former - building a convolutional unit with 50 filters and then convolving with the image. i profiled and i’m surprised by how slow it is: ~3s for one inference pass. does that seem right?

just in case anyone XYs me: i have a very specific reason for doing this and i do in fact need 50 filtered images.


Edit: I should probably mention that the kernels are pretty wide as well, roughly 50x50 to 100x100.
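
For reference, the two options in the question could be sketched roughly as below. This is only an illustration under assumed sizes: the 64x64 image, the 15x15 kernels, and the names `image`, `filters`, `out_single` and `out_chunked` are made up for the example, not taken from the actual setup. It uses torch.cat rather than torch.stack because each chunk's output already has a channel axis, so concatenating along it gives the same 50-channel result as the single convolution.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes, kept small so the sketch runs quickly on CPU.
image = torch.randn(1, 1, 64, 64)      # (batch, channels, height, width)
filters = torch.randn(50, 1, 15, 15)   # 50 single-channel filters

# Option A: one convolution with all 50 filters at once.
out_single = F.conv2d(image, filters)

# Option B: ten convolutions with 5 filters each,
# joined along the existing channel axis.
chunks = [F.conv2d(image, f) for f in filters.split(5, dim=0)]
out_chunked = torch.cat(chunks, dim=1)

print(out_single.shape, out_chunked.shape)
print(torch.allclose(out_single, out_chunked, atol=1e-3))
```

Both options compute the same output, so the only open question is which one runs faster for a given kernel size and device, and that has to be measured.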

Maybe just %time both approaches if you are using jupyter? As it is taking just 3s, you can certainly experiment.
Interested in your findings as well!

%time actually led me astray - it was nvprof that revealed the real timings.

And which one of them turned out to be faster?

i haven’t compared them head to head. what i’m saying is that when i initially timed my implementation using just %timeit, i was misled into believing that much of my time cost was the host-to-device copy and back, whereas nvprof reveals that it’s actually the convolution.
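
One likely reason a plain wall-clock timer points at the copies: CUDA kernels are launched asynchronously, so the convolution call can return before the GPU has finished, and the wait shows up in whatever blocks next (often the copy back to the host). A minimal timing sketch that synchronizes before reading the clock is below. The helper `time_conv` and the tensor sizes are hypothetical, and it falls back to CPU when no GPU is available.

```python
import time
import torch
import torch.nn.functional as F

def time_conv(image, filters, repeats=3):
    """Return mean wall-clock seconds per conv2d call (hypothetical helper)."""
    use_cuda = image.is_cuda
    F.conv2d(image, filters)           # warm-up call, excluded from the timing
    if use_cuda:
        torch.cuda.synchronize()       # make sure the warm-up has finished
    start = time.perf_counter()
    for _ in range(repeats):
        F.conv2d(image, filters)
    if use_cuda:
        torch.cuda.synchronize()       # wait for queued kernels before stopping the clock
    return (time.perf_counter() - start) / repeats

device = "cuda" if torch.cuda.is_available() else "cpu"
image = torch.randn(1, 1, 64, 64, device=device)
filters = torch.randn(50, 1, 15, 15, device=device)

elapsed = time_conv(image, filters)
print(f"{elapsed * 1e3:.3f} ms per call on {device}")
```

Timing the single 50-filter convolution and the chunked variant with the same helper would give the head-to-head comparison asked about above.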