Significance of the difference between CPU and GPU results

emreyolcu · March 11, 2018, 2:20am

Consider the following setup (we can suppose the matrix a is a grayscale image):

In [3]: a = (255 * np.random.random([5, 5])).astype(np.uint8)

In [4]: b = torch.cuda.FloatTensor(a.astype(np.float32) / 255)

In [5]: c = torch.cuda.FloatTensor(a) / 255

In [6]: b - c
Out[6]: 

1.00000e-08 *
 -5.9605 -2.9802 -5.9605  0.0000  0.0000
 -5.9605 -1.4901  0.0000 -5.9605 -5.9605
  0.0000 -5.9605 -5.9605  0.0000  0.0000
  0.0000 -1.4901 -1.4901 -5.9605  0.0000
  0.0000 -2.9802  0.0000  0.0000 -5.9605
[torch.cuda.FloatTensor of size 5x5 (GPU 0)]

I know that the difference is due to the limited precision of 32-bit floating-point numbers. My question: is there a sense in which b is a more accurate result than c? Computing c is a little faster than computing b when a is large, so is there any reason to prefer b despite the speed difference?
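A side note on the magnitudes: the nonzero entries printed above (about 5.96e-8, 2.98e-8 and 1.49e-8) match the spacing between adjacent float32 values in the ranges [0.5, 1), [0.25, 0.5) and [0.125, 0.25), which suggests that b and c differ by one unit in the last place where they differ at all. A minimal NumPy sketch to check those spacings follows; the sample values 0.75, 0.3 and 0.2 are arbitrary picks, one from each range, and are not taken from the original matrix.

```python
import numpy as np

# One arbitrary sample value from each of [0.5, 1), [0.25, 0.5), [0.125, 0.25).
samples = (0.75, 0.3, 0.2)

# np.spacing(x) is the distance from x to the next representable value
# of the same dtype, i.e. one unit in the last place (ulp) at x.
ulps = {v: np.spacing(np.float32(v)) for v in samples}

for v, ulp in ulps.items():
    print(v, ulp)
```

Every float32 value in the same power-of-two range has the same spacing, so the three values above cover all entries of a / 255 that are at least 0.125.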

albanD · March 12, 2018, 10:24am

Both implement the IEEE 754 floating-point standard. So they are both correct (even though different), and you cannot say that one is closer to the “real” answer than the other.
I think the only real difference is speed.

emreyolcu · March 13, 2018, 4:42pm

Answering my own question, and for future reference: it seems there is a sense in which the CPU result (the division done in NumPy before the data is moved to the GPU) is more accurate. While it is correct that both implement the standard, the CPU result may be closer to the actual value of the expression. This document explains the difference very well and is worth a read in my opinion: Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs

Here is an example:

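import numpy as np
import torch

FloatTensor = torch.cuda.FloatTensor
DoubleTensor = torch.cuda.DoubleTensor

A = (255 * np.random.random([1000, 1000])).astype(np.uint8)

# float32: divide in NumPy on the CPU and upload, or upload and divide on the GPU
R_cpu_32 = FloatTensor(A.astype(np.float32) / 255)
R_gpu_32 = FloatTensor(A) / 255

# float64 versions of the same two computations, used as references
R_cpu_64 = DoubleTensor(A.astype(np.float64) / 255)
R_gpu_64 = DoubleTensor(A) / 255

# mean absolute error of each float32 result against the CPU float64 reference
print(torch.abs(R_cpu_64 - R_cpu_32.double()).mean(),
      torch.abs(R_cpu_64 - R_gpu_32.double()).mean())
# the same comparison against the GPU float64 reference
print(torch.abs(R_gpu_64 - R_cpu_32.double()).mean(),
      torch.abs(R_gpu_64 - R_gpu_32.double()).mean())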

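For me, the above fragment prints

9.961657723698238e-09 2.970265274337254e-08
9.96165772978204e-09 2.970265274945634e-08

The first column is the mean absolute error of the float32 CPU result and the second that of the float32 GPU result, measured against the float64 CPU reference (first row) and the float64 GPU reference (second row). Even though neither result violates the standard, the CPU result is more accurate: its mean error is about 1.0e-8, roughly a third of the 3.0e-8 error of the GPU result.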
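One possible explanation is that the GPU path computes x / 255 as x * (1 / 255) with a float32 reciprocal, which adds a second rounding step. This is an assumption, not something this thread or the linked document confirms for PyTorch. The sketch below imitates that on the CPU with NumPy only (no GPU needed) and compares both variants against a float64 reference; the names q_div, q_mul and recip are made up for illustration.

```python
import numpy as np

# All 256 possible uint8 pixel values.
x = np.arange(256, dtype=np.uint8)
x32 = x.astype(np.float32)

# float64 reference for x / 255.
ref = x.astype(np.float64) / 255.0

# Correctly rounded float32 division (a single rounding).
q_div = x32 / np.float32(255.0)

# Assumed shortcut: round 1/255 to float32 first, then multiply
# (two roundings: one for the reciprocal, one for the product).
recip = np.float32(1.0) / np.float32(255.0)
q_mul = x32 * recip

# Mean absolute error of each variant against the float64 reference.
err_div = np.abs(ref - q_div.astype(np.float64)).mean()
err_mul = np.abs(ref - q_mul.astype(np.float64)).mean()
print(err_div, err_mul)
print(np.count_nonzero(q_mul != q_div), "of", x.size, "values differ")
```

In this sketch the reciprocal variant is never smaller than the true division, because float32(1/255) is rounded up slightly. That is consistent with the sign of b - c in the first post, though it does not prove that this is what the GPU kernel does.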