Why is torch.mm for sparse CSR tensor slower in 1.10.2 than 1.9.1?

Hi, I have noticed that torch.mm with a tensor in sparse_csr layout is significantly slower in PyTorch 1.10.2 than in PyTorch 1.9.1.

Here is a small code sample I’ve used to test it:

import timeit

import torch


crows = torch.randint(0, 50, (1000000,)).cumsum(0).int()
crows[0] = 0
cols = torch.randint(0, 100000, (int(crows[-1]),)).sort()[0].int()
mock_vals = torch.rand_like(cols, dtype=torch.float)

csr_mat = torch._sparse_csr_tensor(crow_indices=crows, col_indices=cols, values=mock_vals)
dense_mat = torch.rand(csr_mat.shape[-1], 20)

print(timeit.timeit("torch.mm(csr_mat, dense_mat)", globals=globals(), number=100))

In pytorch version 1.9.1, this code yields:


In pytorch version 1.10.2 (on the same machine), nearly the same code yields (except this time I use the public torch.sparse_csr_tensor instead of torch._sparse_csr_tensor):


As you can see, there is a significant difference in timings between the two versions. I would like to upgrade to PyTorch 1.10.2, and I would also like to keep using the sparse CSR matrix format, since it provides a speedup in my project. However, I'd expect matrix multiplication to be faster in the newer PyTorch version. Have I overlooked something?

I guess the implementation might depend on the CPU used. On my setup I get:

  • ~1.85s in 1.9.1
  • ~2.02s in 1.11.0.dev20220108

which shows a regression, but not as large as the one you observed.
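As a side note, plain timeit measures everything including Python overhead and can be noisy; torch.utils.benchmark.Timer handles warmup and reports statistics, which makes version-to-version comparisons a bit more trustworthy. A minimal sketch using a tiny hand-built CSR matrix (the sizes and values here are illustrative, not the benchmark above):

```python
import torch
from torch.utils import benchmark

# Tiny illustrative 2x3 CSR matrix: rows [1.0, 0, 2.0] and [0, 3.0, 0]
crow = torch.tensor([0, 2, 3], dtype=torch.int32)   # row pointer, length nrows + 1
col = torch.tensor([0, 2, 1], dtype=torch.int32)    # column index per stored value
vals = torch.tensor([1.0, 2.0, 3.0])
csr = torch.sparse_csr_tensor(crow, col, vals, size=(2, 3))
dense = torch.rand(3, 20)

# Timer runs the stmt repeatedly and returns a Measurement with timing stats
t = benchmark.Timer(
    stmt="torch.mm(csr, dense)",
    globals={"torch": torch, "csr": csr, "dense": dense},
)
print(t.timeit(100))
```

Running the same script under both PyTorch versions (in separate environments) gives directly comparable numbers.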

I will check whether 1.11.0 fixes it, but my project needs to depend on a stable release, so I think I have to investigate the difference.

My setup has 2x Intel(R) Xeon(R) E5-2696 v4 CPUs with AVX2 support.
Does the sparse torch.mm method have a different implementation in 1.9.1 vs. 1.10.2, or is it perhaps some difference in default variables / build configuration?

You could use print(torch.__config__.show()) to check whether the same libraries were used in both builds (e.g. MKL etc.).
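For example, running the snippet below in each environment and diffing the output would reveal build-level differences; parallel_info and the intra-op thread count are also worth comparing, since threading settings can affect sparse mm performance:

```python
import torch

print(torch.__version__)

# Build-time configuration: MKL / MKL-DNN versions, compiler, CPU capability flags, etc.
print(torch.__config__.show())

# Threading backend (e.g. OpenMP) and configured thread counts
print(torch.__config__.parallel_info())
print("intra-op threads:", torch.get_num_threads())
```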