F.grid_sample non-deterministic backward results

I tested the backward pass of the grid_sample function and found that the gradient with respect to the input is non-deterministic, even with cudnn.deterministic = True or cudnn.enabled = False:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.backends import cudnn

cudnn.deterministic = True
l1_loss = nn.L1Loss()

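def grid_sample_test(seed):
    # Seed the RNG so two calls with the same seed build identical tensors
    torch.manual_seed(seed)
    grid = torch.randn(32, 32, 32, 2).cuda().requires_grad_()
    inputs = torch.randn(32, 3, 32, 32).cuda().requires_grad_()
    outputs = F.grid_sample(inputs, grid, mode='bilinear', padding_mode='zeros')
    targets = torch.randn(32, 3, 32, 32).cuda()
    loss = l1_loss(outputs, targets)
    # Run the backward pass so that .grad is populated on both leaf tensors
    loss.backward()
    return inputs.grad, grid.grad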

# cudnn.enabled = False
for i in range(10):
    input_grad_1, grid_grad_1 = grid_sample_test(i)
    input_grad_2, grid_grad_2 = grid_sample_test(i)
    print(i, torch.equal(input_grad_1, input_grad_2), torch.equal(grid_grad_1, grid_grad_2))

Based on my understanding, the backward of F.grid_sample is non-deterministic when cuDNN is enabled, since cudnnSpatialTfSamplerBackward is non-deterministic according to the NVIDIA cuDNN docs.
But why is it still non-deterministic even with cuDNN disabled?
The gradient of grid, however, is deterministic in both cases.

I really hope to get deterministic behavior from F.grid_sample. Any help would be appreciated.

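The backward of grid_sample uses atomicAdd from multiple threads, though. The order in which those additions land is not fixed, and floating-point addition is not associative, so the result is slightly nondeterministic with floating-point types.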
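To illustrate why the accumulation order matters, here is a minimal CPU-only sketch in plain Python. It is not the grid_sample kernel itself: the helper name accumulate and the three values are made up for illustration, and the two index orders simply stand in for two possible thread schedules.

```python
def accumulate(values, order):
    """Add values one at a time in the given index order (hypothetical helper)."""
    total = 0.0
    for i in order:
        total += values[i]
    return total

# Illustrative values only: one large pair that cancels, plus a small term.
values = [1e16, 1.0, -1e16]

# Two different "arrival orders" for the same three contributions.
sum_a = accumulate(values, [0, 1, 2])  # the 1.0 is absorbed by 1e16 and lost
sum_b = accumulate(values, [0, 2, 1])  # the large terms cancel first, 1.0 survives

print(sum_a, sum_b)  # prints: 0.0 1.0
```

The same contributions give different sums purely because of the order of addition. With real gradients the effect is far smaller, but it is enough to make a bit-exact comparison fail.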

Is there any workaround to make it deterministic?

It’s multithreaded, so unfortunately no.

I added a line in the for loop:

print(torch.sum(input_grad_1 - input_grad_2)/len(input_grad_1))

The average gradient difference is around 1e-12 to 1e-14, so I guess it is not a big problem? (For me it is.)

Also, may I ask what the gradient with respect to grid is used for?

I just want exact reproducibility, which requires removing all sources of randomness.
I use grid_sample for video prediction. In practice, the gradient variance is not a big problem for me.
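
For anyone who only needs "close enough" rather than bit-identical gradients, a tolerance check may be more useful than torch.equal. The sketch below is an assumption-laden illustration: grad_difference is a hypothetical helper, the atol value is an arbitrary choice, and the two tensors are stand-ins rather than real grid_sample gradients. It reports the maximum absolute difference, since a signed sum like the one printed above can cancel positive and negative deviations.

```python
import torch

def grad_difference(a, b, atol=1e-6):
    """Return (max absolute difference, whether a and b agree within atol).

    Hypothetical helper; atol=1e-6 is an arbitrary illustrative tolerance.
    """
    max_abs_diff = (a - b).abs().max().item()
    return max_abs_diff, torch.allclose(a, b, rtol=0.0, atol=atol)

# Stand-in tensors (float64 so that the tiny offset is representable).
base = torch.ones(4, 3, 8, 8, dtype=torch.float64)
noisy = base + 1e-9

max_diff, close = grad_difference(base, noisy)
print(torch.equal(base, noisy), max_diff, close)
```

In the loop from the first post, the two input gradients could be passed to the helper in place of base and noisy.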