Scatter_add gives different outputs at each run

el_samou_samou · April 10, 2019, 1:47pm

Hi,

If I run scatter_add twice with exactly the same inputs, I end up with a non-negligible gap between the two results. This is weird, since both calls perform exactly the same operations and should therefore give exactly the same result. Here is some code to reproduce this strange behavior:

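import torch

# Two accumulators that receive exactly the same updates.
splat_1 = torch.zeros(1, 8, 80).cuda()
splat_2 = torch.zeros(1, 8, 80).cuda()

for i in range(100):
    # One random index vector (values 0..79), repeated for each of the 8 rows.
    indices = torch.randint(0, 80, (40000,)).unsqueeze(0).repeat(8, 1).cuda()
    feat = torch.rand(8, 40000).cuda()

    # Identical scatter_add calls on identical inputs.
    splat_1[0] = splat_1[0].scatter_add(1, indices, feat)
    splat_2[0] = splat_2[0].scatter_add(1, indices, feat)

# Sum of squared differences; 0 if both accumulators were bitwise identical.
print(((splat_1 - splat_2) ** 2).sum())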

On average, I get a summed squared difference of about 0.04.

Is there any explanation for this? I assume it is due to accumulated rounding error, but since these are the exact same operations, shouldn't the error accumulate in exactly the same way for both splat_1 and splat_2?

Thanks in advance for your help.

Samuel

ratishsp · May 13, 2019, 1:43pm

I think the reason for this is explained in https://pytorch.org/docs/stable/notes/randomness.html.
There are some PyTorch functions that use CUDA functions that can be a source of non-determinism. One class of such CUDA functions are atomic operations, in particular atomicAdd, where the order of parallel additions to the same value is undetermined and, for floating-point variables, a source of variance in the result. PyTorch functions that use atomicAdd in the forward include torch.Tensor.index_add_(), torch.Tensor.scatter_add_(), torch.bincount().
An example of such an addition is 1 + 2**100 - 2**100: in floating point, (1 + 2**100) - 2**100 gives 0, while 1 + (2**100 - 2**100) gives 1, so the result depends on the order of operations.
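To make the order dependence concrete, here is a minimal sketch in plain Python (no GPU needed). It only illustrates that floating-point addition is not associative; it does not reproduce what atomicAdd does on the device, and the variable names are made up for the example.

```python
# Floating-point addition is not associative: the grouping changes the result.
big = float(2 ** 100)  # convert to float; Python ints would be exact

# 1.0 is far below the spacing between floats near 2**100, so it is rounded away.
add_small_first = (1.0 + big) - big

# Here big cancels exactly before 1.0 is added, so 1.0 survives.
cancel_big_first = 1.0 + (big - big)

print(add_small_first)   # 0.0
print(cancel_big_first)  # 1.0
```

With scatter_add on CUDA, many threads add into the same output slot, and the order in which those additions land can change from one call to the next, so the same kind of rounding difference can appear between two otherwise identical calls.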

One thing I am unsure about is why we don’t see similar behavior with torch.sum().