Handling GPU/CPU compute differences

Hi All,

I am porting a computational graph from TensorFlow v1 to PyTorch and have hit an issue with my float32 data.
The illustrative example below demonstrates the issue: depending on the hardware used for the calculation with a float32 tensor, I get a result vector whose entries are either 37059.996 or 37061.0.

import numpy as np
import torch

# set up arrays in numpy:
A = np.repeat(np.array([[1.0], [33.0], [0.0], [1089.0], [0.0], [0.0],
                        [35937.0], [0.0], [0.0], [0.0], [0.9991], [0.0]]),
              1000, axis=1).T.astype('float32')
B = np.ones((12, 1), dtype='float32')

# show the dot product of the arrays on the CPU:
cpu_result = np.dot(A, B)
print(cpu_result)

# take the arrays to the GPU and do the same math:
gpu_result = torch.mm(torch.from_numpy(A).to('cuda'),
                      torch.from_numpy(B).to('cuda')).cpu().numpy()
print(gpu_result)

# show the difference between the two approaches:
print(cpu_result - gpu_result)

I’m completely open to going through my code and switching to torch.float64. However, before I do, could you advise:

  • is switching to float64 the best solution, or is there a ‘magic fix’ that newer users such as myself might not be aware of?
  • does anyone have experience with the kind of performance impact the float32-to-float64 change incurs?
  • is this inconsistency between GPU and CPU computation understood, and is there any documentation on the error bounds a person might expect? (I get that the CUDA representation of a number is not actually the same as a numpy representation, and I am happy to learn more and accommodate if the documentation is available.)

Thanks and regards,

Simon

Based on the relative error of ~1e-5, you are most likely running into small errors caused by the limited floating-point precision of float32.
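
To make the diagnosis concrete, here is a small numpy sketch built from the row values in your example. The exact row sum, 1 + 33 + 1089 + 35937 + 0.9991 = 37060.9991, is not exactly representable in float32, and the gap between adjacent float32 values at that magnitude is about 0.004, so differently ordered reductions can legitimately land on different nearby values:

import numpy as np

exact = 37060.9991
x = np.float32(exact)
print(x)                                  # 37061.0, the nearest float32 value
print(np.nextafter(x, np.float32(0.0)))  # 37060.996, the adjacent value below
print(np.spacing(x))                     # 0.00390625, the float32 gap at this magnitude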

  1. It’s not a magic fix, but using a wider dtype will give you more precision and thus reduce the error (see the float64 sketch after the code below).

  2. On GPUs you would expect to see poor performance using float64, since most consumer GPUs provide only a small fraction of their float32 throughput for float64 math.

  3. It’s not necessarily only visible between CPU and GPU calculations, but depends on the order of operations, which can also change on the same device, as seen e.g. here:

import torch

x = torch.randn(100, 100)
s1 = x.sum()          # one reduction over all elements
s2 = x.sum(0).sum(0)  # two chained reductions: a different summation order
print((s1 - s2).abs())
# tensor(1.9073e-05)  (sample output; the exact value varies per run)
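
As a minimal sketch of point 1 (assuming a CUDA device is available), repeating the matmul from the question in float64 should shrink the CPU/GPU discrepancy to (near) zero for these inputs:

import numpy as np
import torch

# rebuild the inputs from the question, but in float64 (numpy's default)
A = np.repeat(np.array([[1.0], [33.0], [0.0], [1089.0], [0.0], [0.0],
                        [35937.0], [0.0], [0.0], [0.0], [0.9991], [0.0]]),
              1000, axis=1).T
B = np.ones((12, 1))

cpu64 = np.dot(A, B)
gpu64 = (torch.from_numpy(A).to('cuda')
         @ torch.from_numpy(B).to('cuda')).cpu().numpy()

# with float64 headroom the two results should agree to ~1e-11 or better
print(np.abs(cpu64 - gpu64).max())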

Regarding the representation: that’s not the case, as both are using the IEEE 754 floating-point standard (unless you are using TF32 on Ampere GPUs). Take a look at the Wikipedia article on IEEE 754 for more information.
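
As a quick way to convince yourself of this (a sketch, assuming a CUDA device is available), you can compare the raw bit pattern of the same float32 value before and after a round trip through the GPU:

import torch

x = torch.tensor([0.9991], dtype=torch.float32)
y = x.to('cuda').to('cpu')  # round trip through the GPU

# reinterpret the float32 storage as int32 to inspect the raw bits
print(x.view(torch.int32), y.view(torch.int32))  # identical bit patterns
print(torch.equal(x, y))                         # True: the transfer does not change the value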