Different results for the same reduction operator running on GPU and CPU

Today I realized that the argmax operation, and consequently max, outputs different results when running on CPU and GPU. How is this possible? In the problem I am working on, the difference is 10 samples, which in my opinion is quite a lot.


It can happen if there are values that are equal. One of the valid argmax indices will be returned, but not necessarily the same one on CPU and GPU (not even on two different CPUs or two different GPUs).
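
To make the point about ties concrete, here is a minimal pure-Python sketch (no PyTorch involved). The helper names `argmax_candidates` and `same_up_to_ties` are made up for this illustration. It lists every index that holds the maximum and checks whether two reported indices are both valid answers, which is a more meaningful comparison between devices than comparing the raw indices.

```python
def argmax_candidates(values):
    """Return every index that holds the maximum value."""
    best = max(values)
    return [i for i, v in enumerate(values) if v == best]


def same_up_to_ties(values, i, j):
    """True if indices i and j both point at the maximum value."""
    candidates = argmax_candidates(values)
    return i in candidates and j in candidates


scores = [0.1, 0.9, 0.3, 0.9, 0.9]
print(argmax_candidates(scores))      # [1, 3, 4]
print(same_up_to_ties(scores, 1, 4))  # True: both are valid argmax results
print(same_up_to_ties(scores, 1, 2))  # False: index 2 is not a maximum
```

If two devices disagree on the index but `same_up_to_ties` holds, the difference comes only from tie-breaking.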

Well, I suppose this happens because the reduction algorithms differ between CPU and GPU. However, based on my experience programming GPUs directly in CUDA, I do not remember Nvidia providing an API for reduction operators. So, could they not be programmed the same way on both GPU and CPU?

As an example, I checked what you said:

    import torch

    a.argmax()  # returns 1
    b.argmax()  # returns 16


The thing is that, for performance reasons, these reduction algorithms are not even deterministic, and running on a different architecture can give you a different result.
These operations also cannot be programmed the same way on CPU and GPU because the two are quite different. One uses OpenMP-based reductions while the other uses CUDA's block-wide reductions. Both split the Tensor differently, and the order in which the partial results are aggregated can change.
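
As a rough illustration of why the split matters, here is a toy pure-Python simulation. It is not how the OpenMP or CUDA kernels are written; the function `argmax_chunked` and its tie rules are assumptions chosen only to show the effect. Each chunk is reduced on its own and the partial results are then merged, and simply changing the chunk size changes which of the tied indices comes out.

```python
def argmax_chunked(values, chunk_size):
    """Argmax computed per chunk, then merged across chunks.

    Tie rules are arbitrary illustration choices: inside a chunk the last
    maximum wins ('>='), across chunks the earliest chunk wins ('>').
    """
    partials = []
    for start in range(0, len(values), chunk_size):
        best = start
        for i in range(start + 1, min(start + chunk_size, len(values))):
            if values[i] >= values[best]:
                best = i
        partials.append(best)

    # Merge the per-chunk winners; a strict '>' keeps the earlier chunk on ties.
    winner = partials[0]
    for candidate in partials[1:]:
        if values[candidate] > values[winner]:
            winner = candidate
    return winner


data = [5, 1, 5, 2, 5, 0]  # the maximum 5 appears at indices 0, 2 and 4
for chunk_size in (2, 3, 6):
    print(chunk_size, argmax_chunked(data, chunk_size))
# chunk sizes 2, 3 and 6 give indices 0, 2 and 4 respectively
```

Every returned index points at the same maximum value; only the choice among the ties depends on how the data was split.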


Ok, thanks for your answer. Even so, I think this has to be taken into consideration, mostly when doing research.

Other edge cases, like the handling of NaNs, have been made consistent in recent commits. But for equal values, there is no easy solution that would make them consistent without losing (a lot of) performance.
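
For illustration only, one conceivable way to get a consistent answer is to break ties on the index inside the merge step, so that the result no longer depends on the order in which partial results arrive. The pure-Python sketch below uses made-up names (`combine`, `argmax_first`) and says nothing about how any library implements this; it ignores NaNs, and the extra index comparison in every merge is the kind of added work that could cost performance in a real kernel.

```python
import random
from functools import reduce


def combine(a, b):
    """Merge two (value, index) partial results.

    The larger value wins; on equal values the smaller index wins, so the
    outcome does not depend on which operand arrives first.
    """
    if a[0] != b[0]:
        return a if a[0] > b[0] else b
    return a if a[1] < b[1] else b


def argmax_first(values, order):
    """Reduce (value, index) pairs in the given visiting order."""
    pairs = [(values[i], i) for i in order]
    return reduce(combine, pairs)[1]


data = [5, 1, 5, 2, 5, 0]
rng = random.Random(0)  # fixed seed, only to make the demo repeatable
for _ in range(5):
    order = list(range(len(data)))
    rng.shuffle(order)
    print(argmax_first(data, order))  # 0 for every visiting order
```

Because `combine` picks the same winner regardless of operand order and grouping, any split or aggregation order yields the first index of the maximum.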
