Inconsistent reproducibility

I have deterministic code that gives the same results across runs on the same machine.
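For context, the setup is the usual seeding recipe (a sketch, not my exact code, since the training itself goes through a package):

```python
import random
import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    # Seed every PRNG the training loop might touch.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # seeds both CPU and CUDA generators
    torch.cuda.manual_seed_all(seed)
    # Prefer deterministic kernels and disable cuDNN autotuning.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False

seed_everything(0)
```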

I tried to launch it on two machines:

  • my MacBook Pro M1 with Intel Python 3.8 and torch==1.9.0
  • Google Colab with Python 3.7 and torch==1.9.0+cu102

It gives different results in most cases.

However, if I set the optimizer to Adam, the results are identical for up to 10 epochs.
Even weirder: if I use CUDA on Colab (which requires torch.use_deterministic_algorithms(False), although the code still gives the same results on every rerun), then the two machines match for up to 109 epochs.
With SGD, CPU on Colab matches for up to 9 epochs, and GPU diverges as soon as a single epoch has run.
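(The way I check "matches for N epochs" is to save a checkpoint after every epoch on both machines and diff the state dicts; max_param_diff below is a hypothetical helper, not part of my package, and the file names are illustrative.)

```python
import torch

def max_param_diff(ckpt_a: str, ckpt_b: str) -> float:
    # Largest absolute elementwise difference between two saved state dicts.
    a = torch.load(ckpt_a, map_location="cpu")
    b = torch.load(ckpt_b, map_location="cpu")
    return max((t - b[k]).abs().max().item() for k, t in a.items())

# 0.0 means the two runs are still bit-identical after this epoch.
print(max_param_diff("macbook_epoch9.pt", "colab_epoch9.pt"))
```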

I couldn’t make an MRE (minimal reproducible example); here is the code. It is hard to dig into because most of the operations go through a package, but anyone can probably run it and report whether their results differ too. Also, when relaunching on the two machines, I cannot get the outputs to match, as explained above.

The notebook gives rrmse: 1.179134459111933 and my computer gives rrmse: 1.1791344531807701.
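The relative difference between the two values is tiny, which looks like a different floating-point accumulation order rather than a logic bug:

```python
a = 1.179134459111933    # Colab
b = 1.1791344531807701   # MacBook
print(abs(a - b) / abs(b))  # ≈ 5e-9
```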

Does anyone have an idea?

There is no guarantee of deterministic results across different hardware architectures, since e.g. the pseudorandom number generator implementation could differ.
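For instance, PyTorch's CPU and CUDA generators are different algorithms, so even with the same seed they produce different streams (a quick illustration; the exact values depend on your build):

```python
import torch

torch.manual_seed(0)
print(torch.randn(3))                  # CPU generator (Mersenne Twister)

torch.manual_seed(0)                   # requires a CUDA device
print(torch.randn(3, device="cuda"))   # CUDA generator (Philox): different numbers
```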

I don’t quite understand this claim. Since you are not using deterministic algorithms, you shouldn’t expect to get deterministic results.

Oh, right, I misread the reproducibility docs:

Completely reproducible results are not guaranteed across PyTorch releases, individual commits, or different platforms

About use_deterministic_algorithms, here is my traceback. It looks like linear is non-deterministic with some CUDA versions:

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/linear.py in forward(self, input)
     94 
     95     def forward(self, input: Tensor) -> Tensor:
---> 96         return F.linear(input, self.weight, self.bias)
     97 
     98     def extra_repr(self) -> str:

/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in linear(input, weight, bias)
   1845     if has_torch_function_variadic(input, weight):
   1846         return handle_torch_function(linear, (input, weight), input, weight, bias=bias)
-> 1847     return torch._C._nn.linear(input, weight, bias)
   1848 
   1849 

RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility

Anyway, you already answered my main question, thanks!

Yes, as the error message explains, you would need to set the cuBLAS workspace size to one of these fixed values to get deterministic results with newer cuBLAS releases.
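Concretely, either export the variable in the shell before launching, or set it at the very top of the script, before any cuBLAS handle is created (a sketch):

```python
import os

# Must be set before cuBLAS initializes its workspace, so do it first.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # ":16:8" is the smaller alternative

import torch
torch.use_deterministic_algorithms(True)
```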