What would be used for torch.matmul with transposed inputs?
I think in theory both should go through the same gemm path, which accepts and produces transposed values (ideally, the choice should depend on the tensors' actual memory contiguity?).
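A minimal sketch of the contiguity point: in PyTorch, `.t()` returns a strided view rather than a copy, so the transposed operand is not contiguous, yet `torch.matmul` still handles it (under the hood the underlying gemm can take transpose flags instead of requiring a materialized copy). The shapes here are arbitrary, just for illustration:

```python
import torch

a = torch.randn(1024, 512)
b = torch.randn(1024, 512)

# b.t() is a view with swapped strides, so it is not contiguous in memory.
bt = b.t()
print(bt.is_contiguous())  # False

# matmul accepts the transposed view directly; no explicit .contiguous() needed.
out = torch.matmul(a, bt)                    # (1024, 1024)
out_ref = torch.matmul(a, bt.contiguous())   # same result, modulo float rounding
print(torch.allclose(out, out_ref, atol=1e-5))
```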
My expectation is that historically the conv2d kernel was the most optimized of them all, so it made for the simplest and overall fastest fallback.
But if there are easy-to-identify cases where matmul is faster, then we can definitely add a branch there to call matmul instead!
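As a rough illustration of what such a branch could look like (this is a hypothetical sketch, assuming the fallback in question is a 1x1 / pointwise convolution, where conv2d and matmul compute the same contraction; `pointwise_apply` and the `use_matmul` flag are made up here, and the flag could be replaced by a heuristic on shapes, contiguity, or device):

```python
import torch
import torch.nn.functional as F

def pointwise_apply(x, weight, use_matmul=False):
    """Hypothetical dispatch: apply a 1x1 (pointwise) convolution either
    via F.conv2d or via an equivalent matmul."""
    # x: (N, C_in, H, W), weight: (C_out, C_in, 1, 1), no bias, stride 1
    if not use_matmul:
        return F.conv2d(x, weight)
    n, c_in, h, w = x.shape
    c_out = weight.shape[0]
    # Flatten spatial dims and contract over channels with a batched matmul.
    x_flat = x.reshape(n, c_in, h * w)        # (N, C_in, HW)
    w_flat = weight.reshape(c_out, c_in)      # (C_out, C_in)
    out = torch.matmul(w_flat, x_flat)        # broadcasts to (N, C_out, HW)
    return out.reshape(n, c_out, h, w)

# Sanity check that the two paths agree on a small example.
x = torch.randn(2, 16, 8, 8)
w = torch.randn(32, 16, 1, 1)
print(torch.allclose(pointwise_apply(x, w),
                     pointwise_apply(x, w, use_matmul=True), atol=1e-5))
```

If the matmul path does turn out faster for some shapes, the flag above would just become that shape/contiguity check.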