Here is the code:

import numpy as np
import torch
w_np = np.random.randn(4096).astype(np.float32)
R_np = np.random.randn(4096, 64).astype(np.float32)
R_np_c = R_np.copy(order='C')
R_np_f = R_np.copy(order='F')
w_torch = torch.from_numpy(w_np)
R_torch_c = torch.from_numpy(R_np_c)
R_torch_f = torch.from_numpy(R_np_f)
%timeit w_np@R_np_c
# 15.9 µs ± 76.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%timeit w_np@R_np_f
# 4.64 µs ± 23.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%timeit w_torch@R_torch_c
# 26.4 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit w_torch@R_torch_f
# 28 µs ± 19.7 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
It is clear that the NumPy version speeds up with Fortran order as desired, while the Torch version does not. Is there a problem with my code? How can I achieve the same performance gain using torch?
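One way to see what PyTorch is actually working with is to inspect the strides of the tensors: `torch.from_numpy` shares the NumPy buffer, so a Fortran-order array comes through as a non-contiguous (column-major) tensor. A minimal diagnostic sketch (smaller shapes for brevity, same idea):

```python
import numpy as np
import torch

# Same setup as above, reduced dimensions
R_c = np.random.randn(4096, 64).astype(np.float32).copy(order='C')
R_f = R_c.copy(order='F')

t_c = torch.from_numpy(R_c)
t_f = torch.from_numpy(R_f)

# Row-major layout: moving one row skips 64 elements
print(t_c.stride())          # (64, 1)
# Column-major layout: moving one row skips 1 element
print(t_f.stride())          # (1, 4096)
# The Fortran-order tensor is not contiguous in torch's sense
print(t_f.is_contiguous())   # False
```

If the strides confirm a non-contiguous layout, torch's matmul path may not exploit it the way NumPy's BLAS dispatch does, which would explain why both timings land in the same range.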