Training Reproducibility Problem

Question about PyTorch training reproducibility.
I set the random seeds and the cuDNN flag as below:

import random
import numpy as np
import torch

random.seed(1)
torch.manual_seed(1)
torch.cuda.manual_seed(1)
np.random.seed(1)
torch.backends.cudnn.deterministic = True
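
(For completeness: the reproducibility notes also recommend keeping the cuDNN autotuner disabled. It defaults to off, so the line below is only needed if it was enabled elsewhere; it is not part of my original setup above.)

torch.backends.cudnn.benchmark = False  # autotuned kernel selection can pick different algorithms between runs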

This setup works well with PyTorch v0.4.0, but after I upgraded to v1.0.0, the training loss became irreproducible.
My environments are as below:

  1. PyTorch v0.4.0, CUDA 8.0, cuDNN 6.0.21, Ubuntu 16.04, GPU GTX 1070.
  2. PyTorch v1.0.0, CUDA 10.0, cuDNN 7.3.1.20, Ubuntu 18.04, GPU RTX 2080 Ti.

Can anyone help me?


Just to make sure I understand: do you mean that you get different results if you run the same code on PyTorch 0.4 vs. PyTorch 1.0, or if you run the code two or more times on PyTorch 1.0?

Yes, both are irreproducible: PyTorch 0.4 vs. PyTorch 1.0, and running the code two or more times on PyTorch 1.0.

PyTorch 0.4 vs PyTorch 1.0

This should be expected.

>= 2 times on PyTorch 1.0.

This is odd, since you set the random seeds and enabled deterministic cuDNN behavior. I am currently using the same cards (RTX 2080 Tis) and haven't seen any differences when running code multiple times, so I can't explain this. One possibility: could it be that the random seeds are set somewhere later in your script, after the weights have already been initialized?
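
To illustrate the ordering point, a minimal sketch (make_model and the layer sizes are made up, not your code): the seeds have to be set before the model is constructed, otherwise the weight initialization has already drawn different random numbers.

import torch
import torch.nn as nn

def make_model(seed):
    # Seeding BEFORE constructing the model makes the weight init reproducible.
    torch.manual_seed(seed)
    return nn.Linear(10, 10)

m1 = make_model(1)
m2 = make_model(1)
print(torch.equal(m1.weight, m2.weight))  # True: same seed before init, same weights

torch.manual_seed(1)
m3 = nn.Linear(10, 10)
m4 = nn.Linear(10, 10)  # seed was not reset in between, so the init differs from m3
print(torch.equal(m3.weight, m4.weight))  # False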

My code throws an error while training, but it keeps running.

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument

Is this related to the irreproducibility?

I haven't seen this error before, but it might be related.

Hi, can you tell me about your environment?
Python? PyTorch? CUDA? cuDNN? Ubuntu?

I have three different machines with all different cards (1080 Tis, RTX 2080s, Titan V) and never encountered a problem when using conda. Before that, I had custom-compiled versions of everything, but I noticed that conda was just (a) faster to upgrade and (b) faster to run (probably because of better-compiled MKL).

  • Ubuntu 18.04
  • Python 3.7
  • PyTorch 1.0.1 (but PyTorch 0.4 and 1.0 worked similarly well not too long ago)
  • CUDA 10 (and whatever cuDNN version comes with the PyTorch conda installer; EDIT: cuDNN 7.4.0.1)

Note that there are known sources of randomness even for this case. The documentation has a section on it.

Best regards

Thomas


I got reproducibility when I replaced nn.Upsample with nn.PixelShuffle in my model.
I tried both nearest and bilinear modes in nn.Upsample, and training is irreproducible with either.

And nn.ConvTranspose2d is reproducible.

I think upsample should be part of the known irreproducible list.

But it was reproducible in v0.4.0.
What do you mean by “known irreproducible list”?

You mean here?
From https://pytorch.org/docs/stable/notes/randomness.html:

A number of operations have backwards that use atomicAdd, in particular torch.nn.functional.embedding_bag(), torch.nn.functional.ctc_loss() and many forms of pooling, padding, and sampling. There currently is no simple way of avoiding non-determinism in these functions.

Ha, yeah, I thought I should have added it already.
There also is an open issue for enhancing reproducibility. Everyone is in favour of it, but so far no one has stepped up to do it.
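
If you want to check a single op on your own setup, here is a minimal sketch (it assumes a CUDA device; the shapes and the loss are arbitrary) that runs the bilinear-upsampling backward twice from the same seed and compares the gradients bitwise. Because the backward scatters gradients with atomicAdd, the summation order can vary between runs, so the comparison may come out False on GPU.

import torch
import torch.nn.functional as F

def upsample_grad(seed):
    torch.manual_seed(seed)
    x = torch.randn(8, 16, 64, 64, device="cuda", requires_grad=True)
    y = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
    # Arbitrary scalar loss; the backward accumulates into x.grad via atomicAdd.
    (y * torch.randn_like(y)).sum().backward()
    return x.grad.clone()

g1 = upsample_grad(0)
g2 = upsample_grad(0)
print(torch.equal(g1, g2))  # may be False: atomicAdd ordering is not deterministic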

Best regards

Thomas

Which version of CUDA has this problem? From 10.0 on?
PyTorch v0.4.0 with CUDA 8.0 seems to work well.

As far as I know, it does not depend on the CUDA/cuDNN version, but on how the backward has been implemented. I couldn't say how that changed without digging through the sources.

Best regards

Thomas

Note that there are known sources of randomness even for this case. The documentation has a section on it.


On that page, it says:

Completely reproducible results are not guaranteed across PyTorch releases, individual commits or different platforms. Furthermore, results need not be reproducible between CPU and GPU executions, even when using identical seeds.

This should be expected though, because cuDNN and PyTorch's CPU variants use different algorithms for certain operations (a prominent example would be convolution operations, for which dozens of approximations exist). Also, differences between versions are to be expected; they can also be attributed to bug fixes, etc.

In practice, I find that if I run the same code multiple times with the same cuDNN and PyTorch version (even on different machines; assuming manual seeds for weight init and shuffling are set and cuDNN is set to deterministic), I always get consistent results.
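
To make the shuffling part concrete, here is a minimal sketch (toy dataset, not from any particular project): with the default sampler, DataLoader shuffling draws from the global torch RNG, so re-seeding before building and iterating the loader reproduces the sample order. (Randomness inside worker processes, e.g. numpy-based augmentations, would need separate seeding.)

import torch
from torch.utils.data import DataLoader, TensorDataset

def first_epoch_order(seed):
    # Shuffling uses the global torch RNG (via RandomSampler), so seeding
    # right before iterating pins down the sample order for this epoch.
    torch.manual_seed(seed)
    ds = TensorDataset(torch.arange(10))
    loader = DataLoader(ds, batch_size=5, shuffle=True)
    return [batch[0].tolist() for batch in loader]

print(first_epoch_order(1))
print(first_epoch_order(1))  # same order, since the same seed was set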

Hi, I ran into another irreproducibility problem.
I downgraded the RTX 2080 Ti platform from PyTorch v1.0.0 to v0.4.0, and now it is reproducible when running the code two or more times.
But I found that batch norm and nn.Linear cause irreproducibility between my two platforms below.
If I remove the batch norm and replace nn.Linear with nn.Conv2d, it is reproducible.

  1. PyTorch v0.4.0, CUDA 8.0, cuDNN 6.0.21, Ubuntu 16.04, GPU GTX 1070.
  2. PyTorch v0.4.0, CUDA 9.0, cuDNN 7.4.2.24, Ubuntu 18.04, GPU RTX 2080 Ti.

Actually, both the CPU version and the GPU version are irreproducible.

I only know about PyTorch >= 1, but if you have a tensor and get different results when sending it through nn.Linear several times, I would be most interested.
Linear and batch norm are generally believed to be reproducible. (But make sure that you're not applying updated batch norm statistics.)
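
A minimal sketch of the kind of check I mean (shapes are made up; it falls back to CPU if no GPU is available): send the same tensor through the same layer twice and compare bitwise, and keep batch norm in eval() so its running statistics are not updated between the calls.

import torch
import torch.nn as nn

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(32, 64, device=device)
lin = nn.Linear(64, 64).to(device)
print(torch.equal(lin(x), lin(x)))  # expected True: same input, same weights

xb = torch.randn(8, 16, 32, 32, device=device)
bn = nn.BatchNorm2d(16).to(device)
bn.eval()  # freeze running statistics; in train() mode every forward pass updates them
print(torch.equal(bn(xb), bn(xb)))  # expected True in eval mode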

Best regards

Thomas