Manual Seed Non-deterministic on Different GPUs

Hello community,

I train on a cluster of Tesla V100 GPUs; the scheduling system assigns me to the next available GPU, and I have no control over which card I get.

One issue I noticed recently is that after manually seeding both NumPy and PyTorch, I get consistent random numbers from NumPy but different random numbers from PyTorch across different GPUs. (Correction: I just realized NumPy random numbers are generated on the CPU.)
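For reference, a minimal sketch of the kind of seeding I mean (the seed value and tensor shapes are just placeholders):

import numpy as np
import torch

np.random.seed(0)
torch.manual_seed(0)   # seeds the CPU generator and all CUDA generators

print(np.random.rand(3))              # consistent for me across machines
print(torch.randn(3, device='cuda'))  # this is what differed across GPUs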

Seeding manually and checking for matching random numbers works on my local card and on one particular V100 card that I could access. So I wonder: is this behavior of PyTorch by design, or is it a bug?

Thanks!

Seeding manually and checking for matching random numbers works on my local card and on one particular V100 card that I could access. So I wonder: is this behavior of PyTorch by design, or is it a bug?

You should get consistent random numbers if you're using the same seed, PyTorch version, and CUDA version, even when running on a different physical GPU. For example:

python -c "import torch; torch.manual_seed(1); print(torch.randn(1, device='cuda'))"

The CPU and GPU random number generators are different and will generate different streams of numbers. Also, the PyTorch CPU generator is different from the NumPy generator.
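A quick way to see this, assuming a CUDA device is available (the seed value here is arbitrary):

import numpy as np
import torch

torch.manual_seed(1)
np.random.seed(1)

# Same seed, three independent generators: the three streams do not match each other.
print(np.random.randn(3))             # NumPy generator (runs on the CPU)
print(torch.randn(3))                 # PyTorch CPU generator
print(torch.randn(3, device='cuda'))  # PyTorch CUDA generator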


Thank you @colesbury. I posted a reply earlier saying that torch.randn() produces consistent results but that the initialization parameters inside my modules were inconsistent.

Then I realized I had been testing different module designs… Indeed, everything is consistent across different hardware.
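Roughly the kind of check that now passes for me (the layer sizes here are arbitrary):

import torch
import torch.nn as nn

torch.manual_seed(1)
layer = nn.Linear(4, 2)   # parameter init draws from the seeded CPU generator
print(layer.weight)       # identical across machines for the same module definition and versions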

One other question, though: previously I used the following line to check for consistent random numbers, yet without specifying device='cuda' I got inconsistent results. What do you think is the reason?

print(f'testing torch random seed {torch.randint(1, 99, (5,))}')
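For reference, the two variants I am comparing (the seed value 1 is just an example):

import torch

torch.manual_seed(1)
print(torch.randint(1, 99, (5,)))                 # default CPU generator
torch.manual_seed(1)
print(torch.randint(1, 99, (5,), device='cuda'))  # CUDA generator: a different stream from the CPU one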

Thanks again!