Based on the error of ~1e-5, you are most likely running into small errors caused by the limited floating point precision.
- Using a wider `dtype` is not a magic fix, but it will give you more precision and thus reduce the error (a small `float64` sketch is included at the end of this post).
- On GPUs you would expect to see poor performance using `float64` (see the timing sketch at the end).
- It’s not necessarily only visible between CPU and GPU calculations, but depends on the order of operations, which could also change on the same device, as seen e.g. here:
```python
import torch

x = torch.randn(100, 100)
s1 = x.sum()          # sum all elements in one call
s2 = x.sum(0).sum(0)  # sum over dim 0 first, then over the remaining dim
print((s1 - s2).abs())
# tensor(1.9073e-05)  (example output; the exact value depends on the random input)
```
- That’s not the case, as both are using the IEEE floating point standard (unless you are using TF32 on Ampere GPUs, which you can rule out as shown in the last sketch below). Take a look at this Wikipedia article for more information.
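To illustrate the first point, here is a minimal sketch (just reusing the same random-sum comparison as above) that casts the tensor to `float64` before summing; the mismatch should drop by several orders of magnitude:

```python
import torch

x = torch.randn(100, 100)

for dtype in (torch.float32, torch.float64):
    xd = x.to(dtype)
    s1 = xd.sum()          # sum all elements at once
    s2 = xd.sum(0).sum(0)  # sum dim 0 first, then the remaining dim
    print(dtype, (s1 - s2).abs().item())
```

The absolute values will differ from run to run, but the `float64` difference should consistently be far below the `float32` one.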
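For the performance point, a rough timing sketch you could adapt (it assumes a CUDA device is available; the matrix size and iteration count are arbitrary choices and there is no warmup, so treat the numbers as a ballpark only):

```python
import time
import torch

if torch.cuda.is_available():
    for dtype in (torch.float32, torch.float64):
        a = torch.randn(4096, 4096, device="cuda", dtype=dtype)
        b = torch.randn(4096, 4096, device="cuda", dtype=dtype)
        torch.cuda.synchronize()  # make sure the setup is done before timing
        t0 = time.perf_counter()
        for _ in range(10):
            c = a @ b
        torch.cuda.synchronize()  # wait for the async kernels to finish
        print(dtype, f"{time.perf_counter() - t0:.3f}s")
```

The gap depends heavily on the device, as many consumer GPUs have far lower `float64` throughput than data center parts.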
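And if you want to rule out TF32 as the source of the mismatch on an Ampere (or newer) GPU, you can disable it via the `torch.backends` flags, e.g.:

```python
import torch

# TF32 only affects matmuls/convolutions on Ampere+ GPUs; disabling it
# makes these ops use full float32 precision again (at some performance cost)
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```

If the mismatch disappears with TF32 disabled, that was the cause; otherwise it’s the usual float32 rounding discussed above.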