torch.save numerical differences

Saving data with torch.save() on Linux and loading it with torch.load() on OSX causes numerical discrepancies.

Example:
On Linux (RH), run the following:

import torch

x = torch.randn(100)
y = torch.sigmoid(torch.exp(x))
d = {'x': x, 'y': y}
torch.save(d, '.test.dat')

Then, on OSX, run the following:

import torch

d = torch.load('.test.dat')
x, y = d['x'], d['y']
y2 = torch.sigmoid(torch.exp(x))
print((y == y2).all())
print((y - y2).sum())

Where I would expect True and zero, I see False and a non-zero difference.

Python version: 3.7.11
PyTorch version: 1.10.2

Has anyone else seen such a discrepancy?

Depending on the magnitude of the errors you are seeing, you might be running into the expected errors due to limited floating-point precision.
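
For example, you could compare with a tolerance instead of exact equality. A minimal sketch (the tolerances here are illustrative, not prescriptive):

import torch

d = torch.load('.test.dat')
x, y = d['x'], d['y']
y2 = torch.sigmoid(torch.exp(x))

# torch.allclose checks |y - y2| <= atol + rtol * |y2| elementwise
print(torch.allclose(y, y2, rtol=1e-5, atol=1e-7))
print((y - y2).abs().max())  # inspect the worst-case deviation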

Shouldn’t the loss of precision occur in the same way across platforms (Linux, OSX, Windows, etc.)? Is it by any chance related to this discussion?

Not necessarily, as different (vectorized) code paths could be used depending on the architecture, library support, etc.
Yes, I think the linked thread is related in case the abs. errors match the expectation, e.g. ~1e-5 for these randn values:

import torch

x = torch.randn(100, 100)
x1 = x.sum()
x2 = x.sum(0).sum(0)
print((x1 - x2).abs().max())
# > tensor(3.0518e-05)

The fact that computations cannot always be reproduced across platforms even without any stochasticity is pretty surprising, especially in this day and age when scientific reproducibility is a major concern.

In any case, thank you for the illuminating MWE above. Is there any reason to think that computations may not be reproduced even on the same platform? Is there an MWE that can illustrate it? I’m currently struggling with such an issue: I train a (custom) model and save its parameters on Linux, but when I load those parameters on the same platform into the same model, I still sometimes see discrepancies in the model’s outputs. I’ve been trying to create an MWE that captures this phenomenon, but I’ve been unable to.

My code has operations like torch.repeat_interleave that, according to the documentation, may be non-deterministic, so I’ve tried creating MWEs that use such operations, but still no luck…
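
One sketch I’ve tried (the shape and repeat factor are arbitrary) runs the same repeat_interleave-based computation twice on CPU and compares bitwise:

import torch

torch.manual_seed(0)
x = torch.randn(1000)

# run the identical computation twice in the same process
a = x.repeat_interleave(3).sum()
b = x.repeat_interleave(3).sum()
print(a == b)  # prints tensor(True) for me, i.e. this MWE fails to reproduce the issue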

Just to clarify, the above MWE by @ptrblck has two “models”, one creating x1 and the other creating x2. My issue is that I have only one model in this sense, and running it with stored parameters that were generated and saved on the same platform (Linux, CPU only, no GPU) still produces discrepancies. I would like to think that I’ve done enough investigation to rule out causes in the semantics of my code.
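
To make the setup concrete, here is a minimal sketch of the kind of check I mean, with nn.Linear standing in for my (much more complex) custom model:

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 10)          # stand-in for the custom model
torch.save(model.state_dict(), 'params.pt')

model2 = nn.Linear(10, 10)
model2.load_state_dict(torch.load('params.pt'))

x = torch.randn(5, 10)
with torch.no_grad():
    out1 = model(x)
    out2 = model2(x)
print((out1 == out2).all())  # I would expect tensor(True), but my real model sometimes disagrees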

Reproducibility can be achieved for the same workload and setup. However, since different algorithms can be used on different platforms, there is no guarantee of reproducing bitwise-equal values across platforms.

If the same library setup is used (CPU, GPU, MKL, cuDNN, NCCL, cuBLAS, CUDA, PyTorch, etc.), then you should be able to reproduce the same values if you stick to the Reproducibility docs. In particular, you would need to seed the code and set torch.use_deterministic_algorithms(True). If no deterministic algorithm can be found (even at a performance hit), you should get an error.
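
A minimal sketch of those two steps (CPU-only; on CUDA the docs additionally mention setting the CUBLAS_WORKSPACE_CONFIG environment variable):

import torch

torch.manual_seed(0)                      # seed the default RNGs
torch.use_deterministic_algorithms(True)  # raise an error on nondeterministic ops

x = torch.randn(100)
y = torch.sigmoid(torch.exp(x))
# re-running this script with the same library setup should reproduce y bitwise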

As described before, take a look at the docs and make sure you are sticking to the needed steps.
