Reproducibility and floating-point arithmetic with(out) AVX512

Hello,

I am trying to get reproducible calculations with PyTorch across multiple computers. I have almost managed it, except that I cannot get the same results on a computer that supports AVX512 and one that does not.

Basically, if I execute the following code:

import numpy as np
import torch
import torch.nn as nn
import random
# Ask PyTorch to use deterministic implementations where they exist
torch.use_deterministic_algorithms(True)

seed = 42

random.seed(seed)
torch.manual_seed(seed)
np.random.seed(seed)

device = torch.device("cpu")

# Initialize a linear layer with orthogonal weights and a constant bias
def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer

network = nn.Sequential(layer_init(nn.Linear(100, 100)),
                        layer_init(nn.Linear(100,1), std=1.0),).to(device)
with torch.no_grad():
    action = network(torch.rand(size=(100,)).to(device))

# Print the single output value with enough digits to expose any difference
print(format(action.cpu().numpy()[0].item(), '.60g'))

The result on the AVX512 computer is 0.4804637432098388671875; on the non-AVX512 computer I get 0.4804628789424896240234375, a difference on the order of 1e-6.
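
As a quick check of the magnitude of the gap (using the two values above):

# Quick check of the gap between the two printed values
a = 0.4804637432098388671875       # AVX512 machine
b = 0.4804628789424896240234375    # non-AVX512 machine
print(abs(a - b))                  # ~8.6e-7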

I understand that some difference is expected with floating-point arithmetic. My problem is that there should be zero difference when I explicitly ask for AVX512 not to be used (i.e. by setting ATEN_CPU_CAPABILITY=avx2): if the same accelerations are used on both computers, the results should match.
However, even when restricting PyTorch to AVX2, the two computers still give different results. Do you have any idea how to avoid using AVX512 on an AVX512-capable computer, so that the results are reproducible?
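
A minimal sketch of one way to set the override and check what ATen selected, assuming a recent PyTorch that exposes torch.backends.cpu.get_cpu_capability(); the variable needs to be in the environment before torch is imported:

import os
# ATEN_CPU_CAPABILITY must be in the environment before torch initializes its
# CPU dispatcher, so set it in the shell or here, before the import.
os.environ["ATEN_CPU_CAPABILITY"] = "avx2"

import torch

# Reports the CPU capability ATen actually selected (recent PyTorch releases);
# on an AVX512 machine this should print "AVX2" if the override took effect.
print(torch.backends.cpu.get_cpu_capability())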

This is a continuation of this GitHub issue, in which I already explained my problem and was advised to come here.

Thanks

Could you link to this claim in the docs, or explain why this should be the case?

It is not in the docs; it just seems obvious to me that the same code, libraries, and environment should give the same result, but maybe it is not. Reproducibility is indeed hard to achieve in general.

But then my question remains: why are the results different?

For now, the only way I found to reproduce the results was to create a VM with AVX512 deactivated; with that, I get reproducibility across several computers, so I believe it is not a hardware problem.

Based on this VM experiment, I believe AVX512 is still being used even though ATEN_CPU_CAPABILITY=avx2 is set. Is that expected?

I would expect the opposite: hardware with different specs and features can take different code paths even with the same instruction set, so unless bitwise-identical results across hardware platforms are explicitly guaranteed and documented, I would not expect them.
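
For what it's worth, a minimal sketch (not from the thread) for checking whether outputs are bitwise identical across machines, by hashing the raw tensor bytes; tensor_digest is just an illustrative helper:

import hashlib
import torch

def tensor_digest(t: torch.Tensor) -> str:
    # contiguous() + numpy().tobytes() gives the exact in-memory float32 bytes
    return hashlib.sha256(t.detach().cpu().contiguous().numpy().tobytes()).hexdigest()

# Run on each machine, print tensor_digest(action), and compare the strings:
# identical digests mean the results are bitwise identical.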