Randn output differs locally vs Google Colab instance with same seed

peterdn · September 14, 2022, 9:06pm

I’m very new to PyTorch, coming at it from playing with Stable Diffusion. A couple of days ago I noticed that I was no longer able to reproduce SD output on my laptop that I was getting in Google Colab, despite using the same seed and other parameters. Furthermore and confusingly, this appears to have changed on my laptop somehow: an image I was previously able to reproduce now looks different when I generate it locally.

I’ve tracked down the difference to the output of torch.randn(N, device=cuda). The local and Colab sequences diverge once N exceeds 10240: every value at indices 0…10239 is identical, the value at index 10240 differs, and the two sequences diverge from there on.
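
One way to pin down a divergence point like this, as a minimal sketch: dump the values from each machine into plain Python lists (for example with `tensor.cpu().tolist()`) and compare them element by element. The helper name `first_divergence` and its `tol` parameter are illustrative and not part of PyTorch, and the short lists below are toy stand-ins for the real dumps.

```python
from typing import Optional, Sequence


def first_divergence(
    a: Sequence[float], b: Sequence[float], tol: float = 0.0
) -> Optional[int]:
    """Return the first index where a and b differ by more than tol, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if abs(x - y) > tol:
            return i
    # One sequence is a strict prefix of the other: they part where the shorter ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Toy data standing in for the values dumped on the two machines.
local = [0.5786, -0.5248, -0.2919, 0.2480]
colab = [0.5786, -0.5248, -0.2919, -0.1346]
print(first_divergence(local, colab))  # 3
```

With the real dumps, a result of 10240 would match the behaviour described above.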

Is this expected or unusual? Are there any obvious reasons why this would suddenly change without any known change to my hardware or software? I have since upgraded my drivers to the latest and am still getting the (same) diverging output.

To verify I used the following small script:

import torch

cuda = torch.device("cuda")
torch.manual_seed(12345)
# Print explicitly so the result is also shown when run as a script, not only in a REPL.
print(torch.randn(10241, device=cuda))

Locally this outputs:

tensor([ 0.5786, -0.5248, -0.2919,  ..., -0.2040, -1.8688,  0.2480],
       device='cuda:0')

On Colab (note the last element is different):

tensor([ 0.5786, -0.5248, -0.2919,  ..., -0.2040, -1.8688, -0.1346],
       device='cuda:0')

I’m running Windows 10 x64, GeForce GTX 950M, Core i7-6700HQ.

gug · September 14, 2022, 9:54pm

Hey there,

I am not an expert in this, but one possible reason might be that Colab uses a different GPU or CUDA version.

You can check the GPU type of your Colab environment with the !nvidia-smi command.

Edit: found a similar post
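
To compare two environments from inside Python rather than with nvidia-smi, something along these lines could be run on each machine. It is only a sketch: the function name `environment_report` and the dictionary keys are made up for illustration, and the import is guarded so the report still runs where PyTorch is not installed.

```python
import platform

try:
    import torch
except ImportError:  # lets the report run where PyTorch is absent
    torch = None


def environment_report() -> dict:
    """Collect the version and device details that commonly differ between setups."""
    report = {
        "python": platform.python_version(),
        "os": platform.platform(),
        "torch": None,
        "cuda_runtime": None,
        "gpu": None,
    }
    if torch is not None:
        report["torch"] = torch.__version__
        report["cuda_runtime"] = torch.version.cuda  # None for CPU-only builds
        if torch.cuda.is_available():
            report["gpu"] = torch.cuda.get_device_name(0)
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

Diffing the two printouts shows at a glance whether the GPU model, the CUDA runtime, or the PyTorch build differs.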

KFrank · September 15, 2022, 1:28am

Hi Peter!

peterdn:

Locally this outputs:

tensor([ 0.5786, -0.5248, -0.2919,  ..., -0.2040, -1.8688,  0.2480],
       device='cuda:0')

On Colab (note the last element is different):

tensor([ 0.5786, -0.5248, -0.2919,  ..., -0.2040, -1.8688, -0.1346],
       device='cuda:0')

For what it’s worth I reproduce your Colab result running on my system.

>>> import torch
>>> torch.__version__
'1.12.0'
>>> torch.version.cuda
'11.6'
>>> torch.cuda.get_device_name()
'GeForce GTX 1050 Ti'
>>> cuda = torch.device ("cuda")
>>> torch.manual_seed (12345)
<torch._C.Generator object at 0x7fd8924da410>
>>> torch.randn (10241, device=cuda)
tensor([ 0.5786, -0.5248, -0.2919,  ..., -0.2040, -1.8688, -0.1346],
       device='cuda:0')

I don’t have an explanation, and it does look like a minor bug, although
PyTorch explicitly warns that results are not guaranteed to be exactly
reproducible across platforms.

(I do note that 10241 is one more than 10 * 2**10, although that’s hardly
a good reason to go off the rails …)

Best.

K. Frank

peterdn · September 15, 2022, 9:16pm

Thanks both!

Is it worth me reporting this as an issue on GitHub, or does the explicit lack of guarantee basically mean this is not-unexpected behaviour?

ptrblck · September 15, 2022, 9:18pm

It’s expected behavior, as there is no guarantee of deterministic results across different devices or, more generally, different setups.
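
A common workaround when the same noise is needed on different GPUs is to draw the values on the CPU with an explicitly seeded generator and then copy them to the target device. The following is a minimal sketch under that assumption. The helper name `seeded_noise` is illustrative, and PyTorch does not formally guarantee identical CPU streams across releases or platforms either, so this reduces the dependence on the GPU rather than guaranteeing reproducibility.

```python
import torch


def seeded_noise(shape, seed: int, device: str = "cpu") -> torch.Tensor:
    """Sample standard-normal noise on the CPU, then move it to `device`."""
    gen = torch.Generator(device="cpu")
    gen.manual_seed(seed)
    # Sampling happens on the CPU regardless of where the tensor ends up.
    noise = torch.randn(shape, generator=gen, device="cpu")
    return noise.to(device)


device = "cuda" if torch.cuda.is_available() else "cpu"
noise = seeded_noise((10241,), seed=12345, device=device)
print(noise.shape, noise.device)
```

Note that this yields a different sequence than torch.randn(..., device=cuda) with the same seed, because the CPU and CUDA generators are separate implementations. It therefore helps with future runs but will not recover images that were generated from GPU-sampled noise.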