Same deep network, same input and code, but the results after the decimal point differ across GPUs

I am running the same deep network with the same input and code, but the results after the decimal point are inconsistent on different graphics cards. At first I thought the random seed might be the cause, but I still get inconsistent results with an identical random seed. The difference only appears after the decimal point (2070 vs. 2080 Ti or 3060). Is this normal?
What should I do to make the results the same?

It can be expected, depending on the relative error and the operation used.
E.g., a change in the order of operations creates mismatches due to the limited floating-point precision:

import torch

x = torch.randn(100, 100)
s1 = x.sum()            # one global reduction over all elements
s2 = x.sum(0).sum(0)    # column sums first, then a sum of the sums
print((s1 - s2).abs())
# tensor(3.0518e-05)

But I get the same result using your code on different GPUs with an identical random seed. This only shows that once the random numbers are fixed, the random results are consistent. I don't think this answers my question; I still cannot figure out why my results are different.

No, it doesn't, and it would still depend on the algorithm used. I.e., if the sum kernel produces deterministic results, your comparison is valid; if not, different results are expected. Different GPU architectures can also run these reduction kernels with different launch configurations, so the accumulation order (and therefore the rounding) can differ between devices.
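
If you want reproducible results, these are the usual knobs to try (a minimal sketch; note that even these flags only make runs reproducible on the same hardware and software stack, not bitwise-identical across different GPU models):

import os
import torch

# Required by some deterministic CUDA ops (see the PyTorch reproducibility
# notes); must be set before any CUDA work is done.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.manual_seed(0)                        # seed the CPU RNG
torch.cuda.manual_seed_all(0)               # seed all GPU RNGs
torch.backends.cudnn.benchmark = False      # disable autotuning (it can pick different kernels)
torch.backends.cudnn.deterministic = True   # restrict cuDNN to deterministic kernels
torch.use_deterministic_algorithms(True)    # raise an error on nondeterministic ops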

I think I understand what you mean; the differences do exist.

However, this difference starts around the third decimal place and cannot be seen if the value is too large. But when I use 10,000 random values, the difference can be observed, which is the same situation as my results.
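
For example, a sketch along the lines of the experiment described above (my own illustration, reusing the earlier s1/s2 comparison) shows the mismatch growing with the element count:

import torch

# Compare two summation orders at increasing sizes; the absolute mismatch
# tends to grow as more values are accumulated.
torch.manual_seed(0)
for n in (10, 100, 1000):
    x = torch.randn(n, n)
    s1 = x.sum()            # single global reduction
    s2 = x.sum(0).sum(0)    # column sums first, then a sum of the sums
    print(f"{n * n:>9} elements: |s1 - s2| = {(s1 - s2).abs().item():.3e}")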