I’ve had some trouble reproducing results, and after reading a few posts there seem to be multiple causes of non-determinism, some of which are expected. I’d like to check whether I got it right.
Dataloaders with multiple threads: They seem to be problematic, even if random seeds are set beforehand [ref].
cuDNN: Apparently cuDNN has non-deterministic kernels [ref1][ref2].
GPU: Apparently, some reductions are non-deterministic on GPU, even without cuDNN. For instance, floating point addition is not associative, so the result depends on the order in which the partial results from each thread are collected. This is expected behaviour and should not affect the results dramatically [ref].
(Note: deterministic GPU reductions seem to have been added to TensorFlow only recently)
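To make the non-associativity point concrete, here is a minimal sketch in plain Python. No GPU is involved, so it only illustrates the rounding effect, not any particular kernel:

```python
# Floating point addition is not associative: grouping changes the rounding.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # one possible accumulation order
right_to_left = a + (b + c)   # another possible accumulation order

print(left_to_right == right_to_left)      # False
print(abs(left_to_right - right_to_left))  # tiny, on the order of 1e-16
```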
Is there anything I’m missing/got wrong?
Multi-threading without barriers/sync/fences is non-deterministic.
Even something as small as running a console command can steal time from one CPU core, so the thread on that core runs slower and finishes later.
Any time something consumes those results “just-in-time”, as they arrive (like data loaders and many optimized kernels), you will have these non-determinism issues.
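A rough way to picture this, assuming four hypothetical partial results that may arrive in any order and are accumulated as they come. The values are made up to exaggerate the effect, and the arrival orders are enumerated rather than produced by real threads:

```python
import itertools
import math

# Hypothetical partial results produced by four workers/threads.
partials = [1e16, 1.0, -1e16, 1.0]

def accumulate(values):
    """Add values strictly in the order they arrive."""
    total = 0.0
    for v in values:
        total += v
    return total

# Every possible arrival order, and the distinct totals they produce.
totals = {accumulate(order) for order in itertools.permutations(partials)}
print(sorted(totals))        # more than one distinct value
print(math.fsum(partials))   # 2.0, the correctly rounded sum
```

With a barrier that fixes the accumulation order, every run would land on the same one of these totals; without it, which one you get depends on timing.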
The big issue
The biggest issue with that is catastrophic cancellation, in my opinion. An example is ln(1 + x) when x << 1 (really small, say 1e-16): the result should be ~x, but 1 + x rounds to exactly 1 in double precision, so the computer evaluates ln(1) and gives you 0.
Now imagine you don’t compute ln(1 + x) but ln(1 + x1 + x2 + x3 + x4). Depending on the order of computation, each x may be added to the 1 one at a time and rounded away, so you get 0, or x1 + x2 + x3 + x4 may be summed first and be significant enough that there is no catastrophic cancellation. This may cause great disturbance in the force… sorry, results.
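Here is a small sketch of that order dependence in double precision. The four values are hypothetical, picked so that each one alone is too small to change 1.0 but their sum is not:

```python
import math

# Four tiny terms; each one alone is below half the spacing of doubles near 1.0.
xs = [4e-17, 4e-17, 4e-17, 4e-17]

# Order A: start from 1 and add the tiny terms one by one; each is rounded away.
big_first = 1.0
for x in xs:
    big_first += x

# Order B: sum the tiny terms first, then add 1; their sum survives the rounding.
small_first = sum(xs) + 1.0

print(math.log(big_first))    # 0.0
print(math.log(small_first))  # small but non-zero
```

Order B is still not accurate (the exact answer is about 1.6e-16), but it shows how the same inputs can give zero or non-zero depending only on the order of the additions.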
Why do we use non-deterministic kernels?
Because memory barriers, fences and synchronization are quite costly, and most deep learning operations are already memory-bound. You can check the “roofline model” to learn more about compute-bound vs memory-bound workloads.
Illustration: on CPU, adding two tensors will reach at most about 20% of the CPU’s theoretical peak GFLOPS, because CPUs compute faster than they can fetch data.
On GPU, the 1080 Ti is faster than the Titan X Pascal for DL workloads because its memory runs at 11 Gbps instead of “just” 10 Gbps.
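The tensor-add illustration can be put into roofline terms with a short sketch. The function name and the hardware numbers below are hypothetical, chosen only to show the shape of the calculation; the actual fraction of peak depends on the machine and on caching:

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline model: performance is capped by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)

# Element-wise add of two float32 tensors: 1 FLOP per element,
# 12 bytes moved per element (two 4-byte reads + one 4-byte write).
add_intensity = 1 / 12

# Hypothetical machine, numbers chosen only for illustration.
peak = 500.0       # GFLOP/s
bandwidth = 50.0   # GB/s

reachable = attainable_gflops(peak, bandwidth, add_intensity)
print(f"{reachable:.2f} GFLOP/s, {100 * reachable / peak:.1f}% of peak")
```

As long as bandwidth times arithmetic intensity is below the compute peak, the operation is memory-bound, and anything that adds memory traffic or stalls slows it down directly.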
Everyone compares frameworks by speed (Soumith being the first), so reducing effective memory bandwidth is not desirable, from a marketing point of view at least.
End note: by the way, there is a log1p function in C that avoids the catastrophic cancellation in ln(1 + x) ;).
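Python exposes the same function as math.log1p, so a quick comparison can be sketched like this (the value of x is just the 1e-16 example from earlier):

```python
import math

x = 1e-16
naive = math.log(1 + x)   # 1 + x rounds to 1.0, so this is log(1.0)
stable = math.log1p(x)    # computes ln(1 + x) without forming 1 + x

print(naive)   # 0.0
print(stable)  # approximately 1e-16
```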