Why do I get different results when multiplying on the CPU than on the GPU?

tunante · March 25, 2017, 11:36pm

I am not sure if this is PyTorch related…apologies if not.

In [1]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
import numpy as np
import torch

a = torch.from_numpy(np.random.rand(5000,100000).astype(np.float32))
b = torch.from_numpy(np.random.rand(5000,100000).astype(np.float32))

c = a.cuda()
d = b.cuda()

print(a.dot(b))
print(c.dot(d))
:<EOF>
124996952.0
124997016.0

tom · March 26, 2017, 8:52am

Hi,

That is very likely due to the limited numerical precision of float32 (about 7 significant decimal digits). If you use np.float64, the values should be much closer to each other (though the arrays might be a bit large for the GPU memory).
I believe that torch hands the CPU calculation to a numerical library (with various options), so the way the dot product is computed can differ slightly between CPU and GPU.

(I have also run into this e.g. when computing the mean over all images on a dataset with float32.)
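To put a number on "limited precision", here is a small check of how coarse float32 is near the two results posted above. It is only a sketch using NumPy; the variable names are made up for illustration.

```python
import numpy as np

# The two float32 results from the first post (CPU and GPU).
cpu_result = np.float32(124996952.0)
gpu_result = np.float32(124997016.0)

# Distance from cpu_result to the next representable float32 value.
ulp = float(np.spacing(cpu_result))

# How many representable steps apart the two results are.
diff_in_ulps = (float(gpu_result) - float(cpu_result)) / ulp

# Relative difference, for comparison with float32 machine epsilon.
relative_diff = (float(gpu_result) - float(cpu_result)) / float(cpu_result)

print(ulp)            # 8.0
print(diff_in_ulps)   # 8.0
print(relative_diff)
print(float(np.finfo(np.float32).eps))
```

Near 1.25e8, float32 can only represent multiples of 8, so the two answers are just 8 representable steps apart even though they differ by 64.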

tunante · March 26, 2017, 9:13am

Thanks, you are right about float64. The number of differing digits is similar (it depends on the experiment), but the values are much closer to each other.

import numpy as np
import torch

a = torch.from_numpy(np.random.rand(5000,100000).astype(np.float64))
b = torch.from_numpy(np.random.rand(5000,100000).astype(np.float64))

c = a.cuda()
d = b.cuda()

print(a.dot(b))
print(c.dot(d))
:<EOF>
125000868.65247717
125000868.65247723

kmichaelkills · March 26, 2017, 10:24am

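There’s also a possible difference in the execution order of the operations. Floating-point addition is not associative, so a different order can give a slightly different result. I suppose the dot product on the CPU is computed in sequence, while on the GPU there must be a reduction.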
Example:

CPU:

(((a + b) + c) + d)

GPU:

((a + b) + (c + d))
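A tiny float32 example of the two groupings above giving different answers. The four values are picked by hand to make the effect obvious; they are not taken from the original experiment.

```python
import numpy as np

# Hand-picked float32 values: two large terms that cancel, two small ones.
a, b, c, d = (np.float32(v) for v in (1e8, 1.0, -1e8, 1.0))

# Sequential order, as in the "CPU" grouping.
sequential = ((a + b) + c) + d

# Pairwise order, as in the "GPU" grouping.
pairwise = (a + b) + (c + d)

print(float(sequential))  # 1.0
print(float(pairwise))    # 0.0
```

The exact sum is 2, so both orders lose information here, just in different places: each `1.0` is rounded away whenever it is added to a term of magnitude 1e8.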

smth · March 26, 2017, 7:33pm

What @kmichaelkills said is the reason for this difference.

tunante · March 26, 2017, 7:39pm

@kmichaelkills Thanks a lot for the answer. Out of interest, what is the reason for this difference in computing the dot product? Why must there be a reduction on the GPU, while it is sequential on the CPU?

smth · March 26, 2017, 7:44pm

It is because GPUs have thousands of cores, and a map-reduce style computation best exploits that parallelism.
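A rough sketch of what a map-reduce style dot product looks like, simulated on the CPU with NumPy. This is only an illustration of the idea, not how PyTorch or any GPU library actually implements it; `tree_reduce_dot` is a hypothetical helper and assumes non-empty 1-D inputs of equal length.

```python
import numpy as np

def tree_reduce_dot(x, y):
    """Dot product as an elementwise map followed by a pairwise tree reduction."""
    partial = x * y  # "map": every product is independent of the others
    steps = 0
    while partial.size > 1:
        if partial.size % 2:
            # Pad odd lengths with a zero so the elements pair up.
            partial = np.append(partial, partial.dtype.type(0))
        # "reduce": all of these pair sums could run at the same time.
        partial = partial[0::2] + partial[1::2]
        steps += 1
    return partial[0], steps

x = np.arange(1, 9, dtype=np.float32)
y = np.ones(8, dtype=np.float32)

result, steps = tree_reduce_dot(x, y)
print(float(result), steps)  # 36.0 3
```

With 8 elements the tree needs 3 rounds of additions, whereas a sequential loop needs 7 additions one after another. The number of rounds grows like log2(n), which is why this shape suits hardware with many cores, and it is also why the additions end up grouped differently than in a sequential sum.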