I am not sure if this is PyTorch related… apologies if not.

In [1]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
import numpy as np
import torch
a = torch.from_numpy(np.random.rand(5000,100000).astype(np.float32))
b = torch.from_numpy(np.random.rand(5000,100000).astype(np.float32))
c = a.cuda()
d = b.cuda()
print(a.dot(b))
print(c.dot(d))
:<EOF>
124996952.0
124997016.0

That is very likely due to the limited numerical precision of float32. If you use np.float64, the values should be much closer to each other (though tensors of that size might not fit in GPU memory).
I believe PyTorch hands the CPU calculation to a numerical library (with various backend options), so the method used to compute the dot product can differ slightly between the CPU and the GPU.
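To see the float32 effect in isolation (independent of PyTorch), here is a small NumPy sketch; the sizes and seed are arbitrary:

```python
import numpy as np

# Two large random float32 vectors, as in the snippet above (flattened).
rng = np.random.default_rng(0)
a = rng.random(1_000_000).astype(np.float32)
b = rng.random(1_000_000).astype(np.float32)

# Reference dot product accumulated in float64 (~16 significant digits).
ref = np.dot(a.astype(np.float64), b.astype(np.float64))

# Same data accumulated in float32 (~7 significant digits): the result
# drifts away from the float64 reference as rounding errors pile up.
f32 = np.dot(a, b)

print(ref, float(f32))
```

The difference between the two printed values is tiny relative to the total, which is why the two float32 results in the original post agree in their leading digits but not the trailing ones.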

(I have also run into this e.g. when computing the mean over all images on a dataset with float32.)

Thanks, you are right about float64. The number of differing digits is similar (it depends on the run), but the values are much closer.

import numpy as np
import torch
a = torch.from_numpy(np.random.rand(5000,100000).astype(np.float64))
b = torch.from_numpy(np.random.rand(5000,100000).astype(np.float64))
c = a.cuda()
d = b.cuda()
print(a.dot(b))
print(c.dot(d))
:<EOF>
125000868.65247717
125000868.65247723

There's also a possible difference in the execution order of the operations. I suppose the dot product on the CPU is computed sequentially, while on the GPU it must be computed with a parallel reduction.
Example:
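A minimal sketch of the two accumulation orders in plain NumPy (both functions are illustrative models, not the actual BLAS or CUDA kernels):

```python
import numpy as np

def dot_sequential(a, b):
    # One running float32 accumulator, like a scalar CPU loop:
    # rounding error grows with the number of terms added.
    acc = np.float32(0.0)
    for x, y in zip(a, b):
        acc = np.float32(acc + x * y)
    return acc

def dot_tree(a, b):
    # Pairwise (tree) reduction, like a GPU parallel reduction:
    # partial sums are combined level by level in log2(n) rounds.
    p = (a * b).astype(np.float32)
    while p.size > 1:
        if p.size % 2:  # pad odd-length levels with a zero
            p = np.append(p, np.float32(0.0))
        p = (p[0::2] + p[1::2]).astype(np.float32)
    return p[0]

rng = np.random.default_rng(0)
a = rng.random(100_000).astype(np.float32)
b = rng.random(100_000).astype(np.float32)

print(dot_sequential(a, b), dot_tree(a, b))
```

Both results are valid float32 dot products of the same data; they differ only because float32 addition is not associative, so changing the reduction order changes the rounding.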

@kmichaelkills Thanks a lot for the answer. Out of interest, what is the reason for this difference in computing the dot product? Why must there be a reduction on the GPU while the CPU computes it sequentially?