Today I realized that the argmax operation, and consequently max, can output different results when running on CPU and GPU. How is this possible? In the problem I am working on, the outputs differ on 10 samples, which in my opinion is quite a lot.
It can happen if there are values that are equal: one of the argmax indices will be returned, but not necessarily the same one on CPU and GPU (nor even on two different CPUs or two different GPUs).
Well, I suppose this happens because the reduction algorithms differ between CPU and GPU. However, from my experience programming GPUs directly in CUDA, I do not remember NVIDIA providing an API for reduction operators. So could they not be implemented identically on GPU and CPU?
As an example, I checked what you say:

import torch

a = torch.zeros((100,)).uniform_()
a[a <= 0.5] = 0
b = a.cuda()
a.argmax()  # returns 1
b.argmax()  # returns 16
The thing is that, for performance reasons, these reduction algorithms are not even deterministic; running on a different architecture can also give you a different result.
These operations cannot be implemented the same way on CPU and GPU because the hardware is quite different: one uses OpenMP-based reductions while the other uses CUDA's block-wide reductions. They split the Tensor differently, and the order in which the partial results are aggregated can change.
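The order-dependence of the aggregation step can be shown without any GPU at all: floating-point addition is not associative, so a sequential loop and a tree-shaped (block-wide style) reduction over the same data can disagree. A minimal pure-Python sketch:

```python
# Two reduction orders over identical data give different floating-point
# results, because addition of doubles is not associative.
vals = [1e16, 1.0, -1e16, 1.0]

# Sequential left-to-right reduction, like a single-threaded CPU loop.
seq = 0.0
for v in vals:
    seq += v  # ((1e16 + 1.0) - 1e16) + 1.0 -> the 1.0 is absorbed once

# Pairwise reduction, like a tree-shaped parallel reduce.
pair = (vals[0] + vals[1]) + (vals[2] + vals[3])  # both 1.0s are absorbed

print(seq, pair)  # 1.0 0.0
```

The 1.0 terms vanish whenever they are added directly to 1e16, because 1e16 is too large for a double to resolve a difference of 1.0; which terms vanish depends entirely on the grouping, which is exactly what changes between CPU and GPU reduction strategies.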
Ok, thanks for your answer. Even so, I think this has to be taken into consideration, especially when doing research.
Other edge cases, like the handling of NaNs, have been made consistent in recent commits. But for equal values there is no easy solution that would make them consistent without losing (a lot of) performance.
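If consistency matters more than speed, one workaround is to exploit the fact that the max *value* itself is order-independent (unlike the index): compute the max first, then take the first index that attains it. A pure-Python sketch of this tie-breaking rule (the name `argmax_first` is just for illustration; in PyTorch one could do something similar with `(a == a.max()).nonzero()[0]`):

```python
def argmax_first(xs):
    """Deterministic argmax: always return the FIRST index attaining the max.

    The maximum value is the same no matter how a parallel backend chunks
    the reduction, so computing it first and then scanning for the first
    match yields the same index on any device."""
    m = max(xs)
    return next(i for i, x in enumerate(xs) if x == m)

# Ties at indices 1 and 3: a parallel argmax could return either,
# but this rule always picks the first one.
xs = [0.0, 0.7, 0.0, 0.7, 0.3]
print(argmax_first(xs))  # 1
```

This costs an extra pass over the data, which illustrates the performance trade-off mentioned above.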