This is expected. Item by item, the two results differ by a typical fp32 floating-point error:

(r1-r2).abs().max()
tensor(6.4850e-05)

Since you have 4.2e5 items, these per-item differences add up to the total difference you see. So everything is as expected here, but if you depend on that difference, you probably want to use fp64 for at least part of your computation.
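To illustrate how small per-item rounding errors grow with the number of items, and why accumulating in fp64 helps, here is a minimal sketch in plain Python. It is not taken from the original computation: the helper `f32`, the item value 0.1, and the count `n` are assumptions chosen for illustration, and fp32 arithmetic is emulated by rounding after every addition.

```python
import math
import struct

def f32(x):
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

n = 100_000        # illustrative item count (assumption)
item = f32(0.1)    # illustrative fp32 item value (assumption)

acc32 = 0.0  # accumulator rounded to fp32 after every addition
acc64 = 0.0  # accumulator kept in fp64
for _ in range(n):
    acc32 = f32(acc32 + item)
    acc64 = acc64 + item

# math.fsum returns the correctly rounded sum of the same fp32 inputs.
reference = math.fsum([item] * n)
err32 = abs(acc32 - reference)
err64 = abs(acc64 - reference)
print("fp32 accumulator error:", err32)
print("fp64 accumulator error:", err64)
```

The inputs are identical in both cases; only the precision of the accumulator changes, which is the "fp64 for at least part of your computation" idea.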

What I want to know is: what mechanism underlying torch.sparse.mm(), compared to torch.mm(), causes this error?

As far as I know, if a1 = a2 (dtype = float32, all 32 bits identical) and b1 = b2 (dtype = float32, all 32 bits identical), and the FPU performs the same floating-point operation on both pairs to get c1 = a1*b1 and c2 = a2*b2, then we expect c1 = c2, with all 32 bits of c1 identical to those of c2.

Well, a single operation is reproducible, but floating-point addition is not associative, so the result of a summation depends on the order in which the terms are added. The guarantees that you get the same order are very limited, and non-existent between different implementations (or different numbers of available cores, or …).
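The order dependence can be seen with just three terms. The following is a small sketch in plain Python, not the actual torch.mm() or torch.sparse.mm() code path: the helpers `f32` and `add32` are hypothetical and emulate fp32 addition by rounding each fp64 result to fp32.

```python
import struct

def f32(x):
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def add32(a, b):
    """Emulate one fp32 addition: add in fp64, then round to fp32."""
    return f32(a + b)

big, small = f32(1e8), f32(1.0)

# The same three terms (big, small, -big), summed in two different orders.
order_a = add32(add32(big, small), -big)  # (big + small) - big
order_b = add32(add32(big, -big), small)  # (big - big) + small

print(order_a, order_b)  # prints 0.0 1.0
```

Each individual addition is deterministic and bit-reproducible, yet the two orders give different sums, because near 1e8 the spacing between adjacent fp32 values is 8, so adding 1.0 is rounded away. Two matrix-multiply implementations that accumulate the same products in different orders can disagree in the same way.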