RuntimeError: scatter_add_cuda_kernel does not have a deterministic implementation

I am running an RNN (GRU) network on CUDA in Colab and get this error:

RuntimeError: scatter_add_cuda_kernel does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this operation if that's acceptable for your application. You can also file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation.

I also tried the solution mentioned in the official PyTorch docs:

On CUDA 10.1, set environment variable `CUDA_LAUNCH_BLOCKING=1`. This may affect performance.

On CUDA 10.2 or later, set environment variable (note the leading colon symbol) `CUBLAS_WORKSPACE_CONFIG=:16:8` or `CUBLAS_WORKSPACE_CONFIG=:4096:2`.

But I still get the same error.
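For reference, here is a minimal sketch of setting that variable from a notebook cell. It assumes the variable has to be in place before the first CUDA call, so the safest place is before `import torch`. The variable only governs cuBLAS workspaces, so it would not be expected to change how `scatter_add` behaves, which is consistent with the error persisting.

```python
import os

# Set this before CUDA is initialized (safest: before importing torch).
# The value is one of the two quoted from the docs above; note the leading colon.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:2"

# Read it back to confirm what this process will see.
workspace_config = os.environ["CUBLAS_WORKSPACE_CONFIG"]
print(workspace_config)
```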

There is no error while running the same code on the CPU.

What is a possible solution to this error?

CUDA version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1

The feature request for a deterministic `scatter_add` CUDA kernel is tracked here. As a workaround, you could use the CPU implementation or, if possible, replace `scatter_add` with another indexing operation. In any case, feel free to update the linked issue with your use case.
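The error message itself also suggests turning off determinism just for the offending operation, if that is acceptable for your application. A minimal sketch of that idea follows. The helper name `nondeterministic_ok` is hypothetical (it is not part of PyTorch), and the tensors are toy data standing in for whatever call in your model triggers the error. It assumes a PyTorch version that provides `torch.are_deterministic_algorithms_enabled()`.

```python
import contextlib

import torch


@contextlib.contextmanager
def nondeterministic_ok():
    """Hypothetical helper: disable the determinism check inside the block,
    then restore the previous setting (the warn_only flag is not preserved)."""
    previous = torch.are_deterministic_algorithms_enabled()
    torch.use_deterministic_algorithms(False)
    try:
        yield
    finally:
        torch.use_deterministic_algorithms(previous)


torch.use_deterministic_algorithms(True)

# Falls back to the CPU when no GPU is available, so the sketch runs anywhere.
device = "cuda" if torch.cuda.is_available() else "cpu"

out = torch.zeros(5, device=device)
index = torch.tensor([0, 1, 0, 1, 2], device=device)
src = torch.ones(5, device=device)

# Only this operation runs with the determinism check switched off.
with nondeterministic_ok():
    out.scatter_add_(0, index, src)

print(out)
```

Everything outside the `with` block still runs under `torch.use_deterministic_algorithms(True)`, so only the scatter step gives up the reproducibility guarantee. On CUDA the summation order inside that step may vary between runs, which can cause small floating-point differences in the result.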