Why doesn't matrix multiplication speed depend on row/column major ordering?

Given two square matrices A and B, my naive expectation was that the multiplication A * B should be much slower than A * B^T, since in the first case the sum

C_ij = sum_k A_ik * B_kj

walks down a column of B and therefore does not respect B's row-major ordering, whereas with the transpose, (A * B^T)_ij = sum_k A_ik * B_jk, both operands are read row by row. Nevertheless, I cannot measure a significant difference:

import torch
from timeit import default_timer as timer

a = torch.empty(5000, 5000).normal_()
b = torch.empty(5000, 5000).normal_()

start = timer()
torch.matmul(a, b)
end = timer()
print(end - start)

start = timer()
torch.matmul(a, b.T)
end = timer()
print(end - start)

prints

0.7230800580000505
0.6886540420236997
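For context, the access pattern I had in mind is the textbook triple loop, sketched below in pure Python (`naive_matmul` is just my own illustrative name, not anything PyTorch actually runs):

```python
def naive_matmul(A, B):
    """Textbook C = A @ B over row-major nested lists.

    The inner loop reads B[k][j] for k = 0..n-1, i.e. it walks
    down a *column* of B -- a strided access in row-major storage.
    For C = A @ B^T the same loop would read B[j][k] instead,
    traversing B row by row.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]  # column-wise walk through B
            C[i][j] = s
    return C
```

My assumption was that torch.matmul would pay for this strided traversal of B, which the measurements above do not show.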

Why is that so?