I read https://pytorch.org/docs/stable/notes/randomness.html and ran lots of tests already. Unfortunately, my model is not reproducible even when run in Docker containers on two different platforms. It is, of course, reproducible on the same platform.
What is the reason for the non-deterministic behavior over different platforms?
How should that be dealt with?
As currently stated, you seem to be asking “can I run my stuff anywhere and expect it to be reproducible”, and the answer to that is just “no”.
The first step to a more detailed answer would be to narrow the problem by a lot.
“I want to do X and I have platforms that are exactly identical except for Y” would be something one could work on. (Note, though, that the general interest seems to be mostly in having reproducibility in the sense of non-random behavior, rather than in actually implementing bit-exact results across platforms.)
In the end, the crux is that floating point addition is not associative (nor does the distributive law hold exactly). The other crux is that things are not guaranteed to run in the exact same order, e.g. on different hardware, because the difference in hardware might actually mean that operations are translated to different kernels (different vector instructions on CPU, different GPU kernels) or some such.
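The non-associativity part is easy to see even in plain Python, no PyTorch needed; a minimal illustration:

```python
# Floating point addition is not associative: regrouping the same
# additions changes the least significant bits of the result.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.6000000000000001
right = a + (b + c)  # 0.6

print(left == right)  # False
```

So as soon as two platforms evaluate the same reduction in a different order, the low bits of the result can differ, and a training loop will amplify that over many iterations.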
Thank you @tom.
Trying to explain my stuff more clearly now:
I am trying to reproduce a simple MNIST example on the CPU (later GPU, but that’s a different topic) with the exact same loss on two machines.
The first machine is my personal laptop, the second machine is in the cloud. They do not share the same hardware. However, BOTH machines are pretty much naked and running everything inside a Docker container, which sets up the whole environment. Therefore, the software environment is absolutely the same.
I fixed all possible seeds and get reproducible results on the respective machines, but NOT BETWEEN the machines.
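For reference, the seed-fixing I mean amounts to something like this (a sketch; the exact set of knobs depends on your PyTorch version, and if you also use NumPy you would seed `np.random` the same way):

```python
import random

import torch

def seed_everything(seed: int = 0) -> None:
    """Fix the RNGs a typical PyTorch training loop touches."""
    random.seed(seed)
    torch.manual_seed(seed)  # seeds the CPU RNG and all CUDA RNGs

# Ask PyTorch to prefer deterministic kernels. This makes runs
# repeatable on ONE machine, but does not make results bit-identical
# ACROSS different hardware.
torch.use_deterministic_algorithms(True)

seed_everything(42)
x = torch.randn(3)
seed_everything(42)
y = torch.randn(3)
print(torch.equal(x, y))  # True on the same machine
```

This is exactly the situation described above: within a machine the runs match bit for bit, between machines they don’t.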
To summarize: the platforms share everything except for the hardware (CPU, RAM, Mainboard).
(I may ask the same thing again with GPUs, which make the problem more difficult, but this post focuses on the CPU).
If researchers want to reproduce any of my findings, they should be able to at least reproduce the CPU result on their platform, no?
I don’t think reproducibility in terms of “exact same floating point number” is reasonably achievable, e.g. AVX vs. non-AVX will make a difference.
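The accumulation-order point can be demonstrated without any SIMD hardware at hand: summing the very same numbers with two different strategies (a naive left-to-right loop vs. an exactly rounded sum, analogous to how different vector units effectively regroup the additions) already yields different bits:

```python
import math

xs = [0.1] * 10

naive = sum(xs)        # left-to-right accumulation
exact = math.fsum(xs)  # correctly rounded sum

print(naive)           # 0.9999999999999999
print(exact)           # 1.0
print(naive == exact)  # False
```

An AVX implementation regroups the additions into lanes in yet another order, so its result can differ from both of these.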
If your results depend on floating point rounding divergences (sadly, a lot of DL stuff does), something bad is going on anyway.
P.S.: This isn’t a criticism of your work, but my thought regarding the general situation. In numerical analysis the goal is to get stable algorithms (a nontrivial problem e.g. for stiff ODE problems) to avoid blowup of numerical errors.