Given two square matrices A and B, my naive assessment would be that the multiplication A * B should be much slower than A * B^T, since in the first case the sum
C_ij = sum_k A_ik * B_kj
walks down the j-th column of B and so does not respect the row-major ordering of B,
whereas when I take the transpose of B, the same sum reads B^T row by row, which does respect it. Nevertheless I cannot measure a significant difference:
import torch
from timeit import default_timer as timer

a = torch.empty(5000, 5000).normal_()
b = torch.empty(5000, 5000).normal_()

start = timer()
torch.matmul(a, b)
end = timer()
print(end - start)

start = timer()
torch.matmul(a, b.T)
end = timer()
print(end - start)
prints
0.7230800580000505
0.6886540420236997
Why is that so?
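(To rule out one explanation myself: I assumed b.T might silently copy the data into a row-major layout, which would make the two products equivalent. That does not seem to be the case; b.T appears to be a zero-copy view over the same storage, just with swapped strides, so the second multiplication really is handed a column-major operand:)

```python
import torch

b = torch.empty(5000, 5000).normal_()

# b is row-major, so its rows are contiguous in memory.
print(b.is_contiguous())                # True

# b.T is not contiguous: its "rows" stride across b's memory.
print(b.T.is_contiguous())              # False

# Same underlying storage, so no copy was made by .T.
print(b.T.data_ptr() == b.data_ptr())   # True
```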