Inconsistent reproducibility

I have deterministic code that gives the same results across runs on the same machine.
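For context, the setup is the usual seeding recipe (a sketch, not my exact code, since the training itself goes through a package):

```python
import random
import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    # Seed every PRNG the training loop might touch.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # seeds both CPU and CUDA generators
    torch.cuda.manual_seed_all(seed)
    # Prefer deterministic kernels and disable cuDNN autotuning.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False

seed_everything(0)
```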

I tried to launch it on two machines:

  • my MacBook Pro M1 with Intel Python 3.8 and torch==1.9.0
  • Google Colab with Python 3.7 and torch==1.9.0+cu102

It gives different results in most cases.

However, if I set the optimizer to Adam, the results are identical for up to 10 epochs.
Even weirder: if I use CUDA on Colab (which requires torch.use_deterministic_algorithms(False), although the code still gives the same results on every rerun), then the two machines match for up to 109 epochs.
With SGD, CPU on Colab matches for up to 9 epochs, and GPU diverges as soon as a single epoch has run.
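(The way I check "matches for N epochs" is to save a checkpoint after every epoch on both machines and diff the state dicts; max_param_diff below is a hypothetical helper, not part of my package, and the file names are illustrative.)

```python
import torch

def max_param_diff(ckpt_a: str, ckpt_b: str) -> float:
    # Largest absolute elementwise difference between two saved state dicts.
    a = torch.load(ckpt_a, map_location="cpu")
    b = torch.load(ckpt_b, map_location="cpu")
    return max((t - b[k]).abs().max().item() for k, t in a.items())

# 0.0 means the two runs are still bit-identical after this epoch.
print(max_param_diff("macbook_epoch9.pt", "colab_epoch9.pt"))
```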

I couldn’t make an MRE (minimal reproducible example); here is the code. It is hard to dig into because most of the operations go through a package, but anyone can probably run it and report whether their results differ too. Also, when relaunching on the two machines, I cannot get the outputs to match, as explained above.

The notebook gives rrmse: 1.179134459111933 and my computer gives rrmse: 1.1791344531807701.
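The relative difference between the two values is tiny, which looks like a different floating-point accumulation order rather than a logic bug:

```python
a = 1.179134459111933    # Colab
b = 1.1791344531807701   # MacBook
print(abs(a - b) / abs(b))  # ≈ 5e-9
```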

Does anyone have an idea?

There is no guarantee of deterministic results across different hardware architectures, since e.g. the pseudorandom number generator implementation could differ.
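For instance, PyTorch's CPU and CUDA generators are different algorithms, so even with the same seed they produce different streams (a quick illustration; the exact values depend on your build):

```python
import torch

torch.manual_seed(0)
print(torch.randn(3))                  # CPU generator (Mersenne Twister)

torch.manual_seed(0)                   # requires a CUDA device
print(torch.randn(3, device="cuda"))   # CUDA generator (Philox): different numbers
```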

I don’t quite understand this claim. Since you are not using deterministic algorithms, you shouldn’t expect to get deterministic results.

Oh, right, I misread the reproducibility docs:

Completely reproducible results are not guaranteed across PyTorch releases, individual commits, or different platforms

About use_deterministic_algorithms, here is my traceback. It looks like linear is non-deterministic with some CUDA versions:

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/linear.py in forward(self, input)
     94 
     95     def forward(self, input: Tensor) -> Tensor:
---> 96         return F.linear(input, self.weight, self.bias)
     97 
     98     def extra_repr(self) -> str:

/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in linear(input, weight, bias)
   1845     if has_torch_function_variadic(input, weight):
   1846         return handle_torch_function(linear, (input, weight), input, weight, bias=bias)
-> 1847     return torch._C._nn.linear(input, weight, bias)
   1848 
   1849 

RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility

Anyway, you already answered my main question, thanks!

Yes, as the error message explains, you would need to set the cuBLAS workspace size to one of these fixed values to get deterministic results with newer cuBLAS releases.
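Concretely, either export the variable in the shell before launching, or set it at the very top of the script, before any cuBLAS handle is created (a sketch):

```python
import os

# Must be set before cuBLAS initializes its workspace, so do it first.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # ":16:8" is the smaller alternative

import torch
torch.use_deterministic_algorithms(True)
```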