RuntimeError: scatter_add_cuda_kernel does not have a deterministic implementation

I am running an RNN (GRU) network on CUDA in Colab and I am getting this error:

RuntimeError: scatter_add_cuda_kernel does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this operation if that's acceptable for your application. You can also file an issue at to help us prioritize adding deterministic support for this operation.

I also tried the solution mentioned in the official PyTorch docs:

On CUDA 10.1, set environment variable `CUDA_LAUNCH_BLOCKING=1`. This may affect performance.

On CUDA 10.2 or later, set environment variable (note the leading colon symbol) `CUBLAS_WORKSPACE_CONFIG=:16:8` or `CUBLAS_WORKSPACE_CONFIG=:4096:2`.

But I am still getting the same error.
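For reference, here is a minimal sketch of how that variable is usually set from inside a notebook. It assumes the assignment runs before the first CUDA/cuBLAS call in the process (ideally before `import torch`), and it uses one of the two values quoted above. Note that this setting only governs cuBLAS workspace behaviour, so it would not be expected to change how `scatter_add` behaves.

```python
import os

# Assumption: this runs at the very top of the notebook, before any CUDA work.
# The leading colon is part of the value.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:2"

# Read the value back to confirm it is visible to the current process.
cublas_config = os.environ.get("CUBLAS_WORKSPACE_CONFIG")
print(cublas_config)
```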

There is no error while running the same code on the CPU.

What is a possible solution to this error?

CUDA version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1

The feature request is tracked here. As a workaround, you could use the CPU implementation or, if possible, replace scatter_add with another indexing operation. In any case, feel free to update the linked issue with your use case.
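The error message itself also suggests turning determinism off just for the offending operation. Below is a minimal sketch of that idea, assuming it is acceptable for this one call to be nondeterministic. The helper name `nondeterministic_ok` is hypothetical (it is not a PyTorch API), and the tensors are toy data standing in for whatever the GRU code passes to `scatter_add`.

```python
import contextlib

import torch


@contextlib.contextmanager
def nondeterministic_ok():
    """Hypothetical helper: disable the determinism check inside the block,
    then restore whatever setting was active before."""
    previous = torch.are_deterministic_algorithms_enabled()
    torch.use_deterministic_algorithms(False)
    try:
        yield
    finally:
        torch.use_deterministic_algorithms(previous)


torch.use_deterministic_algorithms(True)

# Fall back to the CPU so the sketch also runs without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

src = torch.ones(5, device=device)
index = torch.tensor([0, 1, 0, 1, 2], device=device)
out = torch.zeros(3, device=device)

# Only this call is exempt from the determinism requirement.
with nondeterministic_ok():
    out.scatter_add_(0, index, src)

print(out)
```

Everything outside the `with` block still runs with `torch.use_deterministic_algorithms(True)`. Whether a replacement operation such as `index_add_` is deterministic on CUDA may depend on the PyTorch version, so it is worth checking the documentation for the release in use.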