Why do I get different results when multiplying on the CPU than on the GPU?

I am not sure if this is PyTorch-related; apologies if not.

In [1]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
import numpy as np
import torch

a = torch.from_numpy(np.random.rand(5000,100000).astype(np.float32))
b = torch.from_numpy(np.random.rand(5000,100000).astype(np.float32))

c = a.cuda()
d = b.cuda()

print(a.dot(b))
print(c.dot(d))
:<EOF>
124996952.0
124997016.0

Hi,

That is very likely due to the limited numerical precision of float32. If you use np.float64, the values should be much closer to each other (but it might be a bit large for the GPU memory).
I believe that torch hands the CPU calculation to a numerical library (with various options), so the method to compute the dot product can be slightly different between the two.

(I have also run into this e.g. when computing the mean over all images on a dataset with float32.)
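To make the precision point concrete, here is a minimal sketch in plain Python (no torch needed). It assumes that rounding every intermediate value through `struct` is a fair stand-in for float32 arithmetic; the helper names `to_f32` and `dot` are made up for this illustration and are not part of any library. It is not how torch computes the dot product, just a way to compare accumulation error at the two precisions.

```python
import math
import random
import struct

def to_f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def dot(xs, ys, round_fn):
    """Sequential dot product; round_fn is applied after every operation."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = round_fn(acc + round_fn(x * y))
    return acc

random.seed(0)
n = 100_000  # much smaller than in the thread, to keep the run short
xs = [to_f32(random.random()) for _ in range(n)]
ys = [to_f32(random.random()) for _ in range(n)]

# math.fsum tracks partial sums so the reference is accurately rounded.
reference = math.fsum(x * y for x, y in zip(xs, ys))

dot32 = dot(xs, ys, to_f32)  # emulated float32 accumulation
dot64 = dot(xs, ys, float)   # ordinary float64 accumulation

err32 = abs(dot32 - reference)
err64 = abs(dot64 - reference)
print("float32 error:", err32)
print("float64 error:", err64)
```

The float32 error should come out several orders of magnitude larger than the float64 one, which is consistent with the two float32 results above disagreeing from the sixth or seventh significant digit onward.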


Thanks, you are right about float64. The number of differing digits is similar (it depends on the experiment), but the values are much closer to each other.

import numpy as np
import torch

a = torch.from_numpy(np.random.rand(5000,100000).astype(np.float64))
b = torch.from_numpy(np.random.rand(5000,100000).astype(np.float64))

c = a.cuda()
d = b.cuda()

print(a.dot(b))
print(c.dot(d))
:<EOF>
125000868.65247717
125000868.65247723

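There’s also a possible difference in the execution order of the operations. I suppose the dot product on the CPU is computed in sequence, while on the GPU there must be a reduction. Floating-point addition is not associative, so the two groupings can round differently.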
Example:

CPU:

(((a + b) + c) + d)

GPU:

((a+b) + (c+d))
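The two groupings above can be tried directly in plain Python. The values are chosen by hand to exaggerate the effect (they are not taken from the experiment in this thread), and Python floats are float64, but the same thing happens in float32:

```python
# Hand-picked values: 1.0 is too small to survive being added to 1e16.
a, b, c, d = 1.0, 1e16, -1e16, 1.0

cpu_style = ((a + b) + c) + d    # sequential, left to right
gpu_style = (a + b) + (c + d)    # pairwise, as in a reduction

print(cpu_style)  # 1.0
print(gpu_style)  # 0.0
```

The exact sum is 2.0, so neither order is "right" here. They simply lose different bits, which is why two correct implementations can still disagree in the last few digits.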

What @kmichaelkills said is the reason for this difference.


@kmichaelkills Thanks a lot for the answer. Out of interest, what is the reason for this difference in computing the dot product? Why must there be a reduction on the GPU, while the computation is sequential on the CPU?

It is because GPUs have thousands of cores, and a map-reduce style computation best exploits that parallelism.
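As a rough illustration of the reduce step, here is a sketch in plain Python that sums a list in rounds. The function name `tree_reduce` is made up for this example, and real GPU kernels are organised differently in detail; the point is only that the additions within one round do not depend on each other, so they could all run at the same time.

```python
def tree_reduce(values):
    """Sum values in rounds of independent pairwise additions.

    Returns (total, rounds), where rounds is the number of dependent steps.
    """
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # Every pair sum in this round is independent of the others,
        # so on parallel hardware each could go to a separate thread.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element is carried to the next round
        level = nxt
        rounds += 1
    total = level[0] if level else 0.0
    return total, rounds

total, rounds = tree_reduce([float(i) for i in range(1, 1025)])
print(total, rounds)  # 524800.0 10
```

For 1024 elements the tree needs 10 dependent rounds, whereas a sequential loop needs 1023 dependent additions. With enough cores the tree is far faster, at the cost of a different grouping of the additions and therefore slightly different rounding.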
