Nondeterminstic behaviour of sigmoid and tanh in JIT script

I recently discovered some nondeterministic behaviour when using a jit script which is supposed to make the usage of torch.sigmoid more memory efficient, as described in Performance Tuning Guide — PyTorch Tutorials 2.2.0+cu121 documentation .

When executed on CPU both torch.tanh and torch.sigmoid lead to different results for the same input when comparing the first and second invocation. After that all invocations lead to the same result. The mean average error between first and second result is around 1e-11, but when used in a simple convolutional network with only 4 layers, this quickly becomes 1e-8.

This does not happen on GPU.

The fact that it stays the same result after the second invocation makes me think, it’s some sort of caching problem, but I wonder, why the first result wouldn’t be cached.

Has someone ever experienced something similar? Is there a way to fix this? Would it be affected by setting torch.use_deterministic_algorithms? I haven’t tried that yet…

Here is the jit script I invoke:

@torch.jit.script
def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels, debug_name: str="None"):
    n_channels_int = n_channels[0]
    in_act = input_a + input_b
    t_act = torch.tanh(in_act[:, :n_channels_int, :])
    s_act = torch.sigmoid(in_act[:, n_channels_int:, :])
    acts = t_act * s_act

And this is the MAE between first and second invocation in the same input (in_act) in the same layer (0) in network WN (forward and backward have no meaning here, except for that backward holds the second invocation):

torch.jit.script is able to optimize the computation graph and add e.g. kernel fusions, which might not output bitwise-identical results compared to the eager execution. I don’t know what exactly is used on the CPU, but note that TorchScript is in maintenance mode and the current recommendation is to use torch.compile. Also note that torch.compile might still yield the same mismatches due to optimizations.