Reproducibility of CUDAExtension

To expand a bit on Alex's answer:

  • In general, what you compute with atomicAdd is an “index add”; a sequential version would look like
      for i in range(...):
          foo[key[i]] += value[i]
    
    The non-determinism comes from the fact that the order in which the additions are performed by atomicAdd is non-deterministic (in contrast to the for loop above).
  • Now, an “obvious” way to make this deterministic (and also faster when the number of keys leads to lots of conflicts) is to sort keys and values by key and then have each thread process a segment.
    This can get involved; look at the embedding bag backward in the PyTorch source code for an example.
    I once thought I might get crowdfunding to implement deterministic kernels for PyTorch, but it didn’t work out. Kurt Mohler from Quansight has, however, implemented a mechanism to flag nondeterminism. I don’t know whether he will also write alternatives to the kernels.
  • The nondeterminism matters because floating-point addition is not associative (as in (1e30 + 1) - 1e30 = 0, while (1e30 - 1e30) + 1 = 1), so the result depends on the order of the additions.
    One conceptually easy way to avoid that is to quantize the computation to fixed precision (e.g. compute once to find the maximum modulus and then pick the precision accordingly): integer addition is associative and exact, so atomicAdd for integers is deterministic.
  • If you can live with “enhanced approximate determinism”, one option might be to do the atomic add computation in double precision and then round back to float. The extra precision of double means that the non-determinism might just be rounded away, or at least a large part of it.
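To make the sort-by-key idea from the second bullet concrete, here is a small NumPy sketch (the function name and the use of np.add.reduceat are my own illustration, not how the PyTorch kernel is written — the real CUDA version partitions the sorted segments across threads):

```python
import numpy as np

def deterministic_index_add(foo, keys, values):
    # Sort values by key so that all additions into the same slot
    # form one contiguous segment (stable sort keeps a fixed order).
    order = np.argsort(keys, kind="stable")
    sorted_keys = keys[order]
    sorted_vals = values[order]
    # Start index of each run of equal keys.
    starts = np.flatnonzero(np.r_[True, sorted_keys[1:] != sorted_keys[:-1]])
    # One sequential (hence order-fixed) sum per distinct key.
    sums = np.add.reduceat(sorted_vals, starts)
    foo[sorted_keys[starts]] += sums
    return foo
```

Because each segment is summed in a fixed order, repeated runs give bit-identical results, unlike the atomicAdd version.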
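The non-associativity and the fixed-point idea from the third bullet can be demonstrated in a few lines (the scaling scheme below is only a rough illustration of “quantize to fixed precision”, not a production recipe — it ignores overflow handling that a CUDA integer atomicAdd would need):

```python
# Floating-point addition is not associative, so summation order matters:
assert (0.1 + 0.2) + 0.3 != 0.1 + (0.2 + 0.3)

def fixed_point_sum(values, bits=40):
    """Order-independent sum: quantize to integers, then add exactly.

    The scale is derived from the maximum modulus, as suggested above;
    `bits` controls the retained precision (an illustrative choice).
    """
    m = max(abs(v) for v in values)
    if m == 0.0:
        return 0.0
    scale = (1 << bits) / m
    # Python integers are arbitrary precision, so this sum is exact
    # and therefore independent of the summation order.
    total = sum(round(v * scale) for v in values)
    return total / scale

# The same multiset of values gives bit-identical results in any order:
assert fixed_point_sum([0.1, 0.2, 0.3]) == fixed_point_sum([0.3, 0.1, 0.2])
```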

Best regards

Thomas
