Deterministic/non-deterministic results with PyTorch

Multi-threading without barriers/sync/fences is non-deterministic.

Something as trivial as running a console command on one CPU core can slow a thread down so it finishes later than the others.

Anytime something consumes those results "just-in-time", as they arrive (like data loaders and many optimized kernels), you will run into these non-deterministic issues.
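A minimal illustration of the underlying mechanism (plain PyTorch, not tied to any specific kernel): summing the same float32 values in a different order can already change the result in the last few bits, which is exactly what happens when thread scheduling changes the accumulation order.

```python
import torch

torch.manual_seed(0)
x = torch.rand(1_000_000, dtype=torch.float32)
perm = torch.randperm(x.numel())

# Same values, different accumulation order: the two sums will often
# differ slightly because floating-point addition is not associative.
print(x.sum().item())
print(x[perm].sum().item())
```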

The big issue

The biggest issue with that, in my opinion, is catastrophic cancellation. An example is ln(1 + x) with x << 1 (really small, say 1e-16): the result should be ~x, but because of rounding and precision loss, 1 + x is computed as exactly 1, so the computer evaluates ln(1) and gives you 0.
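You can see this with plain Python doubles:

```python
import math

x = 1e-16
print(1.0 + x)            # 1.0: x is absorbed, it is below double-precision epsilon
print(math.log(1.0 + x))  # 0.0 instead of the mathematically correct ~1e-16
```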

Now imagine you don't compute ln(1 + x) but ln(1 + x1 + x2 + x3 + x4). Depending on the order of computation, the small terms may each be absorbed by the 1 and you get 0, or x1 + x2 + x3 + x4 may be summed first and become significant enough that there is no catastrophic cancellation. This may cause great disturbance in the force… sorry, in the results.
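A quick sketch of that order dependence (the x values are made up for illustration):

```python
import math

xs = [1e-16] * 4

# Order 1: add each small term to 1.0 one at a time -- every addition is absorbed.
s1 = 1.0
for x in xs:
    s1 += x
print(math.log(s1))       # 0.0

# Order 2: sum the small terms first, then add them to 1.0 -- the sum survives.
s2 = 1.0 + sum(xs)
print(math.log(s2))       # ~4.4e-16: the small terms are no longer wiped out
```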

Why do we use non-deterministic kernels?

Because memory barriers/fences/synchronization are quite costly, and most deep learning operations are already memory-bound. You can look up the "roofline model" to learn more about compute-bound vs memory-bound.

Illustration: on CPU, adding 2 tensors will give you at most ~20% of the theoretical max GFLOPS of your CPU, because the CPU can compute faster than it can fetch the data from memory.
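A rough sketch of how you could measure this yourself (assuming PyTorch is installed; the tensor size and iteration count are arbitrary, and the exact numbers depend on your hardware):

```python
import time
import torch

# Crude throughput measurement for element-wise addition on CPU.
n = 50_000_000                       # 50M float32 elements (~200 MB per tensor)
a = torch.rand(n)
b = torch.rand(n)
out = torch.empty(n)

torch.add(a, b, out=out)             # warm-up
iters = 10
start = time.perf_counter()
for _ in range(iters):
    torch.add(a, b, out=out)
elapsed = time.perf_counter() - start

flops = n * iters / elapsed                # one FLOP (add) per element
traffic = 3 * 4 * n * iters / elapsed      # read a, read b, write out (4 bytes each)
print(f"{flops / 1e9:.1f} GFLOP/s, {traffic / 1e9:.1f} GB/s of memory traffic")
```

Typically the GFLOP/s figure comes out far below the CPU's peak, while the memory traffic sits much closer to the available memory bandwidth.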

On GPU, the 1080 Ti is faster than the Titan X Pascal for DL workloads largely because its memory is faster: 11 Gbps GDDR5X instead of "just" 10 Gbps.

Everyone compares frameworks by speed (Soumith first and foremost :wink: ), so sacrificing throughput to add synchronization is not desirable, from a marketing point of view at least.

End note: by the way, there is a log1p function in C that avoids the catastrophic cancellation in ln(1 + x) ;).
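Python and PyTorch expose the same thing as math.log1p and torch.log1p:

```python
import math

x = 1e-16
print(math.log(1.0 + x))  # 0.0: x is lost when 1.0 + x is rounded
print(math.log1p(x))      # ~1e-16: computed without forming 1 + x explicitly
```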
