I was wondering how the convolution operation is implemented for a batch of N images, i.e. for a tensor of shape (N, C, H, W). Is it fully vectorized, or is there a for-loop over the N images? Also, if we have more than one filter, is there an additional for-loop over the filters?
There are a variety of convolution algorithms.
Some perform an (implicit) im2col and use a matrix multiplication afterwards; others might use an FFT or Winograd approach, etc.
Also, if you are using torch.backends.cudnn.benchmark = True, the first iteration for a new input shape will run some benchmarking and select the fastest kernel.
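As a minimal sketch (assuming a PyTorch install with a CUDA device), enabling the autotuner is a one-line configuration change; the kernel selection then happens transparently on the first forward pass per input shape:

```python
import torch

# Let cuDNN benchmark candidate convolution kernels for each new input
# shape and cache the fastest one. The first iteration per shape pays
# the benchmarking overhead; later iterations reuse the selected kernel.
torch.backends.cudnn.benchmark = True
```

Note this mainly helps when input shapes are static; with frequently changing shapes, the re-benchmarking cost can outweigh the gains.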
Thanks ptrblck! In the case of im2col (I couldn't find the source code, but even if I found it, I'm afraid it's implemented in C++ and I wouldn't be able to understand it), if we pass a batch of N images, is it run N times, once per image, using a for-loop? Also, if we have more than one filter, will it require another for-loop? Or is im2col somehow cleverly vectorized so that it can handle tensors of shape (N, C, H, W) without for-loops (just a single matrix multiplication)?
im2col can be applied via unfold. There should be no loops in the matrix multiplication, and the unfolding should be implicit for performance reasons.
Using loops would slow down execution in a lot of use cases, but there might be use cases where it would make sense to use a loop instead of a dense operation.
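To make the idea concrete, here is a hedged NumPy sketch of the explicit im2col approach (the function names `im2col_conv` and `naive_conv` are mine, not from any library, and real frameworks typically build the patch matrix implicitly rather than materializing it): all (C, KH, KW) patches of the whole batch are gathered into one matrix, and a single matrix multiplication applies every filter to every patch, with no Python loop over N or over the filters. A naive looped version is included only to check the result.

```python
import numpy as np

def im2col_conv(x, w):
    """'Valid' cross-correlation for a batch via explicit im2col.

    x: (N, C, H, W) input, w: (F, C, KH, KW) filters.
    Returns (N, F, H - KH + 1, W - KW + 1). Patches are gathered with
    stride tricks and all filters are applied in one matmul.
    """
    N, C, H, W = x.shape
    F, _, KH, KW = w.shape
    OH, OW = H - KH + 1, W - KW + 1
    sN, sC, sH, sW = x.strides
    # Zero-copy view of every patch: patches[n, i, j, c, kh, kw] == x[n, c, i+kh, j+kw]
    patches = np.lib.stride_tricks.as_strided(
        x, shape=(N, OH, OW, C, KH, KW),
        strides=(sN, sH, sW, sC, sH, sW))
    cols = patches.reshape(N * OH * OW, C * KH * KW)  # the im2col matrix
    out = cols @ w.reshape(F, -1).T                   # one matmul for all filters
    return out.reshape(N, OH, OW, F).transpose(0, 3, 1, 2)

def naive_conv(x, w):
    """Reference implementation with explicit for-loops, for comparison."""
    N, C, H, W = x.shape
    F, _, KH, KW = w.shape
    OH, OW = H - KH + 1, W - KW + 1
    out = np.zeros((N, F, OH, OW))
    for n in range(N):              # loop over images ...
        for f in range(F):          # ... and over filters
            for i in range(OH):
                for j in range(OW):
                    out[n, f, i, j] = np.sum(x[n, :, i:i+KH, j:j+KW] * w[f])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 8, 8))
w = rng.standard_normal((4, 3, 3, 3))
assert np.allclose(im2col_conv(x, w), naive_conv(x, w))
```

In PyTorch itself, the explicit variant of this patch extraction is exposed as torch.nn.functional.unfold; the trade-off of the explicit form is the memory cost of the (N*OH*OW, C*KH*KW) patch matrix, which is why production kernels keep it implicit.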