Unreproducible pseudorandomness using CPU

We have a cluster of GPU nodes at the university, so consecutive runs are most likely executed on different nodes, but all nodes have the same software and package versions.

One of my models started to produce inconsistent results between runs after a recent change.

A one-line fix finally solved it, but I have no idea what the underlying problem actually was.

The line prior to the fix:

a = torch.randn(n, m).to(b)

The line after the fix:

a = torch.randn(n, m, dtype=b.dtype, device=b.device)

(b is a float64 tensor on CUDA.)
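To spell out the mechanical difference as I understand it: the first version draws the numbers from the default CPU generator (in float32) and only then casts and moves them to match b, while the second draws directly from the CUDA generator in float64. A minimal sketch of the two paths (the shapes and values here are placeholders):

import torch

torch.manual_seed(0)
b = torch.zeros(2, 2, dtype=torch.float64, device="cuda")
n, m = 5, 3

# Before the fix: sampled by the CPU generator in float32, then cast and moved to match b
a = torch.randn(n, m).to(b)

# After the fix: sampled by the CUDA generator, directly in float64 on b's device
a = torch.randn(n, m, dtype=b.dtype, device=b.device)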

In both cases I use a seed for everything, but only the second version guarantees reproducible results.
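By “a seed for everything” I mean the usual boilerplate at the top of the script, something like this (the helper name is just illustrative):

import random

import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    # Seed Python, NumPy and PyTorch (CPU generator plus all CUDA devices).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)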

What is going on here? Shouldn’t seeded pseudorandomness on different CPUs yield the same results? Or does it have something to do with the cast to float64?

I don’t know if there is a guarantee that different CPU architectures generate the same random numbers (I would guess not, but we would need to wait for colleagues working on the CPU backends to confirm). At least for NVIDIA GPUs, we do not guarantee that different devices will create the same random values.
Which CPUs are you using, btw?

I’m not sure how to get this info. I’m guessing all CPUs are x86-64.
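I could probably dump the CPU model from inside a job, though; something like this should work on our Linux nodes:

import platform

# Report the CPU as seen by the job; on Linux, /proc/cpuinfo carries the model name.
print("machine:", platform.machine())
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("model name"):
            print(line.strip())
            break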

However, it seems this change didn’t really fix the issue; it was probably a coincidence. What happens is that there are periods in which runs reproduce and periods in which they don’t (even when the runs use the same GPU model but different nodes), without any code change in between. So it is most likely an environment-related issue.

What kinds of environment details should I log in order to debug this issue?

In the meantime, I’ll start logging environment details for every run.
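Something like this is what I have in mind for a per-run fingerprint (just a sketch; everything here comes straight from torch and the standard library):

import platform
import socket
import sys

import torch

# Per-run environment fingerprint: host, Python/PyTorch/CUDA/cuDNN versions, GPU model.
print("host      :", socket.gethostname())
print("platform  :", platform.platform())
print("python    :", sys.version.split()[0])
print("torch     :", torch.__version__)
print("cuda      :", torch.version.cuda)
print("cudnn     :", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("gpu       :", torch.cuda.get_device_name(0))
    print("capability:", torch.cuda.get_device_capability(0))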

This points to a script issue where calls into the PRNG are made without you being aware of it. You could try to re-seed at various places in the code to see which section might have created random values behind your back.
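One way to do that is to snapshot (or re-seed) the RNG state at section boundaries and compare across runs; the first point where the fingerprints diverge is where the hidden consumer sits. A rough sketch:

import hashlib

import torch

def rng_fingerprint(tag: str) -> None:
    # Print a short digest of the current CPU and CUDA RNG states so that the same
    # point in two runs can be compared; the first diverging digest tells you which
    # section consumed random numbers behind your back.
    cpu = hashlib.sha1(torch.get_rng_state().numpy().tobytes()).hexdigest()[:8]
    cuda = "-"
    if torch.cuda.is_available():
        cuda = hashlib.sha1(torch.cuda.get_rng_state().numpy().tobytes()).hexdigest()[:8]
    print(f"[rng] {tag}: cpu={cpu} cuda={cuda}")

Calling it after data loading, after model init, after the first batch, and so on, then diffing the logs of a reproducing and a non-reproducing run, usually localizes the culprit quickly.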

Let me rephrase that:

I’m able to reproduce results multiple times. Then, out of nowhere, and without any change in the code, I’m no longer able to reproduce them. Then, after a few hours, I’m able to reproduce them again. In the periods of failed reproductions, the metrics differ right from the get-go.

Now I’ve run into it again, but this time I have more information. The environments are similar in all respects except for the GPU: in one run it is an NVIDIA GeForce RTX 2080 Ti, and in the other an NVIDIA GeForce RTX 3090 (both using CUDA 12.1).

Is it possible that different GPUs produce different random sequences for the same seed?

Yes, there is no guarantee of getting the same random numbers across different platforms, CPUs, GPUs, etc., as described in the docs.
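If you want to see it concretely (and compare the two nodes directly), a quick check is to dump a few seeded values per device and diff the output between the 2080 Ti and the 3090 machines; a small sketch:

import torch

# Same seed, first few values from each generator; run on both nodes and diff the output.
torch.manual_seed(0)
print("cpu :", torch.randn(5, dtype=torch.float64).tolist())

if torch.cuda.is_available():
    torch.cuda.manual_seed_all(0)
    print("cuda:", torch.randn(5, dtype=torch.float64, device="cuda").tolist())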

Yeah, I read this a number of times, but there is an old answer by Sam Gross, which states: “You should get consistent random numbers if you’re using the same seed, PyTorch version, and CUDA version even if it’s run on a different physical GPU.” Is this statement incorrect then?

Yes, the comment from 2019 is outdated, as also seen in this recent post describing different behavior on the CPU.

Thanks for the clarification. Is it specific to PyTorch, or also related to Python’s standard PRNG?