See the example in "CPU implementation of Conv1d seems to work non-deterministically" · Issue #116369 · pytorch/pytorch · GitHub. In both cases there, conv2d kernels are used.
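For context, a conv1d can be expressed as a conv2d by inserting a dummy spatial dimension, which is consistent with conv2d kernels showing up for both cases. A minimal sketch (shapes are just for illustration, not taken from the issue):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 4, 100)  # (N, C_in, L)
w = torch.randn(8, 4, 3)    # (C_out, C_in, K)

out1d = F.conv1d(x, w)

# Equivalent conv2d formulation: add a height dimension of 1 to the
# input and the kernel, run conv2d, then drop the dummy dimension.
out2d = F.conv2d(x.unsqueeze(2), w.unsqueeze(2)).squeeze(2)

print(torch.allclose(out1d, out2d))  # True
```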
What kernel would be used for torch.matmul with transposed inputs?
In theory, I think both should use the same GEMM path that accepts and produces transposed operands (ideally, the dispatch should depend on the tensors' "actual" memory contiguity rather than their nominal layout?).
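To make the question concrete, here is a minimal sketch (illustrative only, not a claim about PyTorch internals): transposing a 2-D tensor just swaps strides, so in principle the same GEMM could serve both layouts via a BLAS-style trans flag instead of a copy, but if the backend picks different kernels or reduction orders for the two layouts, the results can differ at floating-point rounding level:

```python
import torch

torch.manual_seed(0)
a = torch.randn(64, 128)
b = torch.randn(32, 128)

# b.t() is a view: same storage, swapped strides, NOT contiguous.
bt = b.t()                   # shape (128, 32)
print(bt.is_contiguous())    # False

# Both calls compute a @ b.T; whether they take the same GEMM path
# (trans flag vs. materializing a contiguous copy) is up to the backend.
out_view = torch.matmul(a, bt)                # transposed view, no copy
out_copy = torch.matmul(a, bt.contiguous())   # explicit contiguous copy

# Any difference here would come from different kernels / reduction
# orders being chosen for the two memory layouts.
print(torch.equal(out_view, out_copy))
print((out_view - out_copy).abs().max().item())
```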