torch.save numerical differences

Saving data with torch.save() on Linux and loading it with torch.load() on OSX causes numerical discrepancies.

Example:
On Linux (RH), run the following:

import torch

x = torch.randn(100)
y = torch.sigmoid(torch.exp(x))
d = {'x': x, 'y': y}
torch.save(d, '.test.dat')

Then, on OSX, run the following:

import torch

d = torch.load('.test.dat')
x, y = d['x'], d['y']
y2 = torch.sigmoid(torch.exp(x))
print((y == y2).all())
print((y - y2).sum())

Where I would expect True and zero, I see False and a non-zero difference.

Python version: 3.7.11
PyTorch version: 1.10.2

Has anyone else seen such a discrepancy?

Depending on the magnitude of the errors you are seeing, you might be running into the expected errors due to limited floating-point precision.
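
For example, you could compare with a tolerance instead of exact equality. A minimal sketch (the tolerances here are illustrative, not prescriptive):

import torch

d = torch.load('.test.dat')
x, y = d['x'], d['y']
y2 = torch.sigmoid(torch.exp(x))

# torch.allclose checks |y - y2| <= atol + rtol * |y2| elementwise
print(torch.allclose(y, y2, rtol=1e-5, atol=1e-7))
print((y - y2).abs().max())  # inspect the worst-case deviation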

Shouldn’t the loss of precision occur in the same way across platforms (Linux, OSX, Windows, etc.)? Is it by any chance related to this discussion?

Not necessarily, as different (vectorized) code paths could be used depending on the architecture, library support, etc.
Yes, I think the linked thread is related in case the abs. errors match the expectation, e.g. ~1e-5 for these randn values:

import torch

x = torch.randn(100, 100)
x1 = x.sum()
x2 = x.sum(0).sum(0)
print((x1 - x2).abs().max())
# > tensor(3.0518e-05)

The fact that computations cannot always be reproduced across platforms even without any stochasticity is pretty surprising, especially in this day and age when scientific reproducibility is a major concern.

In any case, thank you for the illuminating MWE above. Is there any reason to think that computations may not be reproduced even on the same platform? Is there an MWE that can illustrate it? I’m currently struggling with such an issue: I train a (custom) model and save its parameters on Linux, but when I load those parameters on the same platform into the same model, I still sometimes see discrepancies in the model’s outputs. I’ve been trying to create an MWE that captures this phenomenon, but I’ve been unable to.

My code has operations like torch.repeat_interleave that, according to the documentation, may be non-deterministic, so I’ve tried creating MWEs that use such operations, but still no luck…
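
One sketch I’ve tried (the shape and repeat factor are arbitrary) runs the same repeat_interleave-based computation twice on CPU and compares bitwise:

import torch

torch.manual_seed(0)
x = torch.randn(1000)

# run the identical computation twice in the same process
a = x.repeat_interleave(3).sum()
b = x.repeat_interleave(3).sum()
print(a == b)  # prints tensor(True) for me, i.e. this MWE fails to reproduce the issue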

Just to clarify, the above MWE by @ptrblck has two “models”, one creating x1 and the other creating x2. My issue is that I have only one model in this sense, and running it with stored parameters that were generated and saved on the same platform (Linux, CPU only, no GPU) still produces discrepancies. I would like to think that I’ve done enough investigation to rule out causes in the semantics of my code.
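
To make the setup concrete, here is a minimal sketch of the kind of check I mean, with nn.Linear standing in for my (much more complex) custom model:

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 10)          # stand-in for the custom model
torch.save(model.state_dict(), 'params.pt')

model2 = nn.Linear(10, 10)
model2.load_state_dict(torch.load('params.pt'))

x = torch.randn(5, 10)
with torch.no_grad():
    out1 = model(x)
    out2 = model2(x)
print((out1 == out2).all())  # I would expect tensor(True), but my real model sometimes disagrees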

Reproducibility can be achieved for the same workload and setup. However, since different algorithms can be used on different platforms, there is no guarantee of reproducing bitwise-equal values across platforms.

If the same library setup is used (CPU, GPU, MKL, cuDNN, NCCL, cuBLAS, CUDA, PyTorch, etc.), then you should be able to reproduce the same values if you stick to the Reproducibility docs. In particular, you would need to seed the code and set torch.use_deterministic_algorithms(True). If no deterministic algorithm can be found (even at a performance hit), you should get an error.
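
A minimal sketch of those two steps (CPU-only; on CUDA the docs additionally mention setting the CUBLAS_WORKSPACE_CONFIG environment variable):

import torch

torch.manual_seed(0)                      # seed the default RNGs
torch.use_deterministic_algorithms(True)  # raise an error on nondeterministic ops

x = torch.randn(100)
y = torch.sigmoid(torch.exp(x))
# re-running this script with the same library setup should reproduce y bitwise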

As described before, take a look at the docs and make sure you are sticking to the needed steps.
