Training Reproducibility Problem

Question about PyTorch training reproducibility.
I set the random seeds and the cuDNN flag as below:

import random
import numpy as np
import torch

random.seed(1)
torch.manual_seed(1)
torch.cuda.manual_seed(1)
np.random.seed(1)
torch.backends.cudnn.deterministic = True
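
(For completeness: the reproducibility notes also recommend keeping the cuDNN autotuner disabled. It defaults to off, so the line below is only needed if it was enabled elsewhere; it is not part of my original setup above.)

torch.backends.cudnn.benchmark = False  # autotuned kernel selection can pick different algorithms between runs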

This setup works well with PyTorch v0.4.0, but after I upgraded to v1.0.0, the training loss became irreproducible.
My environments are as below:

  1. PyTorch v0.4.0, CUDA 8.0, cuDNN 6.0.21, Ubuntu 16.04, GPU GTX 1070.
  2. PyTorch v1.0.0, CUDA 10.0, cuDNN 7.3.1.20, Ubuntu 18.04, GPU RTX 2080 Ti.

Can anyone help me?


Just to make sure I understand: do you mean that you get different results if you run the same code on PyTorch 0.4 vs. PyTorch 1.0, or if you run the code two or more times on PyTorch 1.0?

Yes, both are irreproducible: PyTorch 0.4 vs. PyTorch 1.0, and running the code two or more times on PyTorch 1.0.

PyTorch 0.4 vs PyTorch 1.0

This should be expected.

>= 2 times on PyTorch 1.0.

This is odd, since you set the random seeds and enabled deterministic cuDNN behavior. I am currently using the same cards (RTX 2080 Tis) and haven't seen any differences when running code multiple times, so I can't explain this. One possibility: could it be that the random seeds are set somewhere later in your script, after the weights have already been initialized?
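
To illustrate the ordering point, a minimal sketch (make_model and the layer sizes are made up, not your code): the seeds have to be set before the model is constructed, otherwise the weight initialization has already drawn different random numbers.

import torch
import torch.nn as nn

def make_model(seed):
    # Seeding BEFORE constructing the model makes the weight init reproducible.
    torch.manual_seed(seed)
    return nn.Linear(10, 10)

m1 = make_model(1)
m2 = make_model(1)
print(torch.equal(m1.weight, m2.weight))  # True: same seed before init, same weights

torch.manual_seed(1)
m3 = nn.Linear(10, 10)
m4 = nn.Linear(10, 10)  # seed was not reset in between, so the init differs from m3
print(torch.equal(m3.weight, m4.weight))  # False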

My code throws an error while training, but it keeps running.

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument

Is this related to the irreproducibility?

I haven't seen this error before, but it might be related.

Hi, can you tell me about your environment?
Python? PyTorch? CUDA? cuDNN? Ubuntu?

I have three different machines with all different cards (1080 Tis, RTX 2080s, Titan V) and never encountered a problem when using conda. Before that, I had custom-compiled versions of everything, but I noticed that conda was just (a) faster to upgrade and (b) faster to run (probably because of better-compiled MKL).

  • Ubuntu 18.04
  • Python 3.7
  • PyTorch 1.0.1 (but PyTorch 0.4 and 1.0 worked similarly well not too long ago)
  • CUDA 10 (and whatever cuDNN version comes with the PyTorch conda installer; EDIT: cuDNN 7.4.0.1)

Note that there are known sources of randomness even for this case. The documentation has a section on it.

Best regards

Thomas


I got reproducibility when I replaced nn.Upsample with nn.PixelShuffle in my model.
I tried both nearest and bilinear modes in nn.Upsample, and training is irreproducible with either.

And nn.ConvTranspose2d is reproducible.

I think upsample should be part of the known irreproducible list.

But it was reproducible in v0.4.0.
What do you mean by “known irreproducible list”?

You mean here?
From https://pytorch.org/docs/stable/notes/randomness.html:

A number of operations have backwards that use atomicAdd, in particular torch.nn.functional.embedding_bag(), torch.nn.functional.ctc_loss() and many forms of pooling, padding, and sampling. There currently is no simple way of avoiding non-determinism in these functions.

Ha, yeah, I thought I should have added it already.
There also is an open issue for enhancing reproducibility. Everyone is in favour of it, but so far no one has stepped up to do it.
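
If you want to check a single op on your own setup, here is a minimal sketch (it assumes a CUDA device; the shapes and the loss are arbitrary) that runs the bilinear-upsampling backward twice from the same seed and compares the gradients bitwise. Because the backward scatters gradients with atomicAdd, the summation order can vary between runs, so the comparison may come out False on GPU.

import torch
import torch.nn.functional as F

def upsample_grad(seed):
    torch.manual_seed(seed)
    x = torch.randn(8, 16, 64, 64, device="cuda", requires_grad=True)
    y = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
    # Arbitrary scalar loss; the backward accumulates into x.grad via atomicAdd.
    (y * torch.randn_like(y)).sum().backward()
    return x.grad.clone()

g1 = upsample_grad(0)
g2 = upsample_grad(0)
print(torch.equal(g1, g2))  # may be False: atomicAdd ordering is not deterministic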

Best regards

Thomas

Which version of CUDA has this problem? From 10.0 on?
PyTorch v0.4.0 with CUDA 8.0 seems to work well.

As far as I know, it does not depend on the CUDA/cuDNN version, but on how the backward has been implemented. I couldn't say how that changed without digging through the sources.

Best regards

Thomas

Note that there are known sources of randomness even for this case. The documentation has a section on it.


On that page, it says:

Completely reproducible results are not guaranteed across PyTorch releases, individual commits or different platforms. Furthermore, results need not be reproducible between CPU and GPU executions, even when using identical seeds.

This should be expected though, because cuDNN and PyTorch's CPU variants use different algorithms for certain operations (a prominent example would be convolution operations, for which dozens of approximations exist). Also, differences between versions are to be expected; they can also be attributed to bug fixes, etc.

In practice, I find that if I run the same code multiple times with the same cuDNN and PyTorch version (even on different machines; assuming manual seeds for weight init and shuffling are set and cuDNN is set to deterministic), I always get consistent results.
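
To make the shuffling part concrete, here is a minimal sketch (toy dataset, not from any particular project): with the default sampler, DataLoader shuffling draws from the global torch RNG, so re-seeding before building and iterating the loader reproduces the sample order. (Randomness inside worker processes, e.g. numpy-based augmentations, would need separate seeding.)

import torch
from torch.utils.data import DataLoader, TensorDataset

def first_epoch_order(seed):
    # Shuffling uses the global torch RNG (via RandomSampler), so seeding
    # right before iterating pins down the sample order for this epoch.
    torch.manual_seed(seed)
    ds = TensorDataset(torch.arange(10))
    loader = DataLoader(ds, batch_size=5, shuffle=True)
    return [batch[0].tolist() for batch in loader]

print(first_epoch_order(1))
print(first_epoch_order(1))  # same order, since the same seed was set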

Hi, I ran into another irreproducibility problem.
I downgraded the RTX 2080 Ti platform from PyTorch v1.0.0 to v0.4.0, and now it is reproducible when running the code two or more times.
But I found that batch norm and nn.Linear cause irreproducibility between my two platforms below.
If I remove the batch norm and replace nn.Linear with nn.Conv2d, it is reproducible.

  1. PyTorch v0.4.0, CUDA 8.0, cuDNN 6.0.21, Ubuntu 16.04, GPU GTX 1070.
  2. PyTorch v0.4.0, CUDA 9.0, cuDNN 7.4.2.24, Ubuntu 18.04, GPU RTX 2080 Ti.

Actually, both the CPU version and the GPU version are irreproducible.

I only know about PyTorch >= 1, but if you have a tensor and get different results when sending it through nn.Linear several times, I would be most interested.
Linear and batch norm are generally believed to be reproducible. (But make sure that you're not applying updated batch norm statistics.)
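
A minimal sketch of the kind of check I mean (shapes are made up; it falls back to CPU if no GPU is available): send the same tensor through the same layer twice and compare bitwise, and keep batch norm in eval() so its running statistics are not updated between the calls.

import torch
import torch.nn as nn

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(32, 64, device=device)
lin = nn.Linear(64, 64).to(device)
print(torch.equal(lin(x), lin(x)))  # expected True: same input, same weights

xb = torch.randn(8, 16, 32, 32, device=device)
bn = nn.BatchNorm2d(16).to(device)
bn.eval()  # freeze running statistics; in train() mode every forward pass updates them
print(torch.equal(bn(xb), bn(xb)))  # expected True in eval mode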

Best regards

Thomas