I was going through the Python source code for the ConvNd modules and seem to have hit a dead end.
What I am wondering is how the convolutions are being calculated. I.e. if we have the weights and input image, how are these being shaped/multiplied?
Is it done by reshaping the kernel to a classic Toeplitz matrix and then doing matmul?
Or some other method?
The matmul approach is used for the native implementations, as given here. Different backends (e.g. cuDNN) can call different algorithms internally, depending on the workload shape etc.
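A small sketch of that matmul (im2col) approach, assuming stride 1 and no padding, using `F.unfold` to flatten patches into columns (the shapes here are arbitrary examples):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)   # (batch, in_channels, H, W)
w = torch.randn(4, 3, 3, 3)   # (out_channels, in_channels, kH, kW)

ref = F.conv2d(x, w)          # reference convolution

# im2col: each column holds one flattened 3x3x3 input patch
cols = F.unfold(x, kernel_size=3)   # (1, 3*3*3, L), L = 6*6 output positions
out = w.view(4, -1) @ cols          # (1, 4, L) via batched matmul
out = out.view(1, 4, 6, 6)

print(torch.allclose(ref, out, atol=1e-5))  # True
```

The same idea extends to stride, dilation, and padding by passing those arguments to both `F.conv2d` and `F.unfold`.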
I found this article helpful in detailing the inner workings of the Conv2d operation:
So each kernel slice is first applied via matmul, with the set stride, dilation, etc., to its corresponding input channel. The resulting per-channel outputs are then summed to form one output channel. That is repeated for each kernel.
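The per-channel sum described above can be verified directly: convolving each input channel with its matching kernel slice and summing the results reproduces `F.conv2d` (example shapes are my own choice):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)   # 3 input channels
w = torch.randn(4, 3, 3, 3)   # 4 kernels, one per output channel

ref = F.conv2d(x, w)

out = torch.zeros_like(ref)
for o in range(4):  # for each kernel / output channel
    # single-channel convs, summed across input channels
    out[:, o:o+1] = sum(
        F.conv2d(x[:, c:c+1], w[o:o+1, c:c+1])
        for c in range(3)
    )

print(torch.allclose(ref, out, atol=1e-5))  # True
```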
In PyTorch, the weights are of size `(out_channels, in_channels, kernel_dim0, kernel_dim1)`.
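That layout is easy to confirm by inspecting the `weight` attribute of an `nn.Conv2d` module (the channel and kernel sizes below are just examples):

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=(5, 5))

# weight is stored as (out_channels, in_channels, kH, kW)
print(conv.weight.shape)  # torch.Size([16, 3, 5, 5])
```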