PyTorch results change on the exact same settings when run sequentially

The Problem

Even after seeding all sources of randomness, PyTorch produces different results when the exact same training run is repeated sequentially.

To Reproduce

I have created two Kaggle Kernels that train & test a PyTorch model on the same MNIST dataset five times. The kernels showcase the difference between setting the random seeds only once versus before each sequential run.

  1. Setting the random seed once

  2. Setting the random seed on each sequential run

Expected behavior

I expect that running the same train & test five times in kernel #1 would produce the exact same result each time. But somehow, something changes on each iteration, and I can’t pinpoint what. Only by resetting the random seeds on each iteration (kernel #2) could I achieve the exact same result every time.
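For reference, here is a minimal sketch of what the set_seed helper in the kernels typically looks like (the exact body in the kernels may differ; the seed value below is only illustrative):

import random
import numpy as np
import torch

RANDOM_SEED = 42  # illustrative value

def set_seed(seed=RANDOM_SEED):
    # Seed every RNG the training code might touch.
    random.seed(seed)        # Python's built-in RNG
    np.random.seed(seed)     # NumPy RNG
    torch.manual_seed(seed)  # PyTorch RNG (CPU, and CUDA devices if present)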

Environment

Collecting environment information...
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: None

OS: Debian GNU/Linux 9 (stretch)
GCC version: (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
CMake version: version 3.7.2

Python version: 3.6
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip] msgpack-numpy==0.4.4.3
[pip] numpy==1.18.1
[pip] numpydoc==0.9.2
[pip] pytorch-ignite==0.3.0
[pip] pytorch-pretrained-bert==0.6.2
[pip] pytorch-transformers==1.1.0
[pip] torch==1.4.0
[pip] torchaudio==0.4.0a0+719bcc7
[pip] torchtext==0.5.0
[pip] torchvision==0.5.0
[conda] blas                      1.0                         mkl  
[conda] cpuonly                   1.0                           0    pytorch
[conda] mkl                       2019.3                      199  
[conda] mkl-service               2.0.2            py36h7b6447c_0  
[conda] mkl_fft                   1.0.12           py36ha843d7b_0  
[conda] mkl_random                1.0.2            py36hd81dba3_0  
[conda] pytorch                   1.4.0               py3.6_cpu_0  [cpuonly]  pytorch
[conda] pytorch-ignite            0.3.0                    pypi_0    pypi
[conda] pytorch-pretrained-bert   0.6.2                    pypi_0    pypi
[conda] pytorch-transformers      1.1.0                    pypi_0    pypi
[conda] torchaudio                0.4.0                      py36    pytorch
[conda] torchtext                 0.5.0                    pypi_0    pypi
[conda] torchvision               0.5.0                  py36_cpu  [cpuonly]  pytorch

Hi,

Am I reading the code correctly that the difference is between:

for counter in range(1, 6):
    set_seed(seed=RANDOM_SEED)
    ...  # build, train, and test the model

and

set_seed(seed=RANDOM_SEED)
for counter in range(1, 6):
    ...  # build, train, and test the model

?

If so, then this is expected: in the second case, when you recreate the net inside the for-loop, the seed is not set again, so you get a different initialization for your net each time, and therefore different results.
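A quick way to see this effect (a sketch, not the kernel's actual code; Net here is just a stand-in model):

import torch
import torch.nn as nn

class Net(nn.Module):
    # Tiny stand-in model; we only inspect the first layer's initial weights.
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=3)

torch.manual_seed(0)  # seed set only once, outside the loop
for counter in range(1, 4):
    net = Net()  # re-created each iteration, but the RNG has advanced
    print(counter, net.conv.weight.flatten()[:3])  # different values each time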


Hi @albanD,

Thank you for your reply.
Yes, that’s the only difference.

Hmm, that’s really interesting. After reading your reply, I tried printing the first conv2d weights at each iteration while only setting the seed once. You’re right that the initialization weights changed! Hence the results vary. Link to kernel

But I tried another experiment: this time I set the initialization weights to a fixed value. To my surprise, even though each iteration started from the same weight initialization, the results still changed on each iteration. Link to kernel
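The second experiment was roughly along these lines (a sketch with assumed names; nn.init.constant_ is just one way to pin the initial weights):

import torch.nn as nn

def init_weights_fixed(module):
    # Force every conv/linear layer to start from the same fixed values.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.constant_(module.weight, 0.01)
        if module.bias is not None:
            nn.init.constant_(module.bias, 0.0)

# net.apply(init_weights_fixed)  # applied to the model before each training run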

So what’s the real culprit?

I would guess that the order in which you get the samples from the dataloader changes as well?
Anything that uses randomness will lead to differences if you don’t seed it.
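For completeness, one way to make the shuffling reproducible is to give the DataLoader its own seeded generator (a sketch with a dummy dataset; the generator argument is available in PyTorch versions newer than the 1.4.0 listed above):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for the MNIST training set used in the kernels.
train_dataset = TensorDataset(torch.randn(100, 1, 28, 28), torch.randint(0, 10, (100,)))

g = torch.Generator()
g.manual_seed(42)  # fixed seed used only for the shuffle order

# With an explicit generator, the shuffle order is the same on every run.
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, generator=g)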

Hmm, but I do set shuffle=False on the DataLoader for the training data.

I am not sure what it is in your particular code.

Taking a step back: you should not rely on getting the exact same results, because depending on the hardware and software versions you have, the results might change.
This is due to floating-point arithmetic being non-exact and the fact that gradient descent tends to amplify any small difference very quickly, leading to completely different results.
For stable models, you will get a similar final loss and accuracy, but you won’t get the exact same solution.
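A tiny illustration of the floating-point point, independent of PyTorch: addition is not associative, so summing the same numbers in a different order can give slightly different results, and training amplifies such differences.

# Floating-point addition is not associative:
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
print((0.1 + 0.2) + 0.3)  # 0.6000000000000001
print(0.1 + (0.2 + 0.3))  # 0.6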