Should the following behaviour be considered a bug, or the kind of numerical error one must expect from floating-point arithmetic?

Code:

```
import torch

X1 = torch.rand((10000, 500))
X2 = torch.rand((10000, 500))
K = X1.mm(X2.t())  # 10000 x 10000 matrix of pairwise inner products
print("CPU", K.mean(1).mean(), K.mean(0).mean(), K.mean())

X1 = X1.cuda()
X2 = X2.cuda()
Kcuda = X1.mm(X2.t())
print("CUDA", Kcuda.mean(1).mean(), Kcuda.mean(0).mean(), Kcuda.mean())
```

It samples two sets of 10,000 random vectors of dimension 500, computes all 10,000 × 10,000 pairwise inner products, and then computes the empirical mean of those inner products in three mathematically equivalent ways: row means first, column means first, and a single global mean. On GPU the three results agree to the seventh significant digit; on CPU the global mean already differs from the other two in the third significant digit.
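Any discrepancy between these three expressions must come from the order of the floating-point reductions, since addition in float32 is not associative. A minimal stdlib-only sketch of this (the `f32` helper is an illustrative name that simulates float32 rounding via `ctypes`, not anything from torch):

```
import ctypes

def f32(x):
    # round a Python float (float64) to float32 precision
    return ctypes.c_float(x).value

big, small = 1e8, 1.0
# float32 has a 24-bit significand, so 1e8 + 1 rounds back to 1e8:
a = f32(f32(big + small) - big)  # small is absorbed before the subtraction
b = f32(f32(big - big) + small)  # reassociated: exact
print(a, b)  # mathematically equal expressions, different float32 results
```

Here `a` comes out as 0.0 and `b` as 1.0, so simply regrouping a sum can change its float32 value.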

Output:

```
CPU tensor(124.9895) tensor(124.9895) tensor(126.1465)
CUDA tensor(124.9895, device='cuda:0') tensor(124.9895, device='cuda:0') tensor(124.9895, device='cuda:0')
```
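One plausible explanation for the size of the CPU discrepancy is that `mean()` runs one long naive float32 accumulation over all 10^8 entries, while the two-stage means keep each partial accumulator small. A stdlib-only sketch of that effect on a smaller scale (the `f32` helper simulates float32 rounding via `ctypes`; the reduction strategies are assumptions for illustration, not PyTorch's actual kernels):

```
import ctypes
import math
import random

def f32(x):
    # round a Python float (float64) to float32 precision
    return ctypes.c_float(x).value

random.seed(0)
xs = [random.random() for _ in range(100_000)]

# one long left-to-right float32 accumulation, like a naive global mean()
seq = 0.0
for v in xs:
    seq = f32(seq + v)

# two-stage reduction: sum 1000-element chunks, then combine the partial
# sums -- loosely analogous to mean(1).mean() keeping accumulators small
chunk = 0.0
for i in range(0, len(xs), 1000):
    part = 0.0
    for v in xs[i:i + 1000]:
        part = f32(part + v)
    chunk = f32(chunk + part)

exact = math.fsum(xs)  # correctly rounded float64 reference
print(seq, chunk, exact)  # the float32 sums drift from the reference
```

On this scale the drift only shows up several significant digits in, but the error of the single-pass accumulation grows with the length of the reduction, so a 10^8-element reduction can plausibly lose more digits.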