Assuming you mean irreproducible: do you get different results if you run the same code on PyTorch 0.4 vs. PyTorch 1.0, or if you run the code >= 2 times on PyTorch 1.0?
This is weird, since you set both the random seeds and deterministic cuDNN behavior. I am currently using the same cards (RTX 2080 Tis) and couldn't find a difference when running the code multiple times. Sorry, I have no way to explain this. But one possibility could be that you set the random seeds somewhere later in your script, after you have already initialized the weights?
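For reference, by "both" I mean something along these lines (a minimal sketch; the seed value and the exact place in the script are just examples):

```python
import random
import numpy as np
import torch

# Seed every RNG involved *before* any weights are created
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)           # also seeds the CUDA generators in recent versions
torch.cuda.manual_seed_all(0)  # explicit, in case of multiple GPUs

# Ask cuDNN for deterministic algorithms and disable autotuning
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

model = torch.nn.Linear(10, 10)  # weights are only initialized after the seeding
```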
I have three different machines, all with different cards (1080 Tis, RTX 2080s, Titan V), and never encountered a problem when using conda. Before, I had custom-compiled versions of everything, but I noticed that conda was just a) faster to upgrade and b) faster to run (probably because of better-compiled MKL).
Ubuntu 18.04
Python 3.7
PyTorch 1.0.1 (but PyTorch 0.4 and 1.0 not too long ago worked similarly well)
CUDA 10 (and whatever cuDNN version comes with the PyTorch conda installer EDIT: cuDNN 7.4.0.1)
I got reproducible training when I replaced nn.Upsample with nn.PixelShuffle in my model.
I tried both nearest and bilinear modes in nn.Upsample; with either of them, training was irreproducible.
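Roughly, the change looked like this (just a sketch with made-up channel numbers; the conv feeding nn.PixelShuffle has to output scale**2 times as many channels):

```python
import torch.nn as nn

scale = 2

# before: interpolation-based upsampling (training was irreproducible for me)
up_interp = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.Upsample(scale_factor=scale, mode='nearest'),  # or mode='bilinear'
)

# after: sub-pixel convolution; the conv produces scale**2 * 64 channels and
# nn.PixelShuffle rearranges them into a feature map upscaled by `scale`
up_shuffle = nn.Sequential(
    nn.Conv2d(64, 64 * scale ** 2, kernel_size=3, padding=1),
    nn.PixelShuffle(scale),
)
```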
Ha, yeah, I thought I should have added it already.
There is also an open issue for enhancing reproducibility. Everyone is in favour of it, but so far no one has stepped up to do it.
As far as I know, it doesn't depend on the CUDA/cuDNN version, but on how the backward pass has been implemented. I couldn't say how that changed without digging through the sources.
Note that there are known sources of randomness even for this case. The documentation has a section on it.
Best regards
Thomas
On that page, it says:
Completely reproducible results are not guaranteed across PyTorch releases, individual commits or different platforms. Furthermore, results need not be reproducible between CPU and GPU executions, even when using identical seeds.
This should be expected though, because cuDNN and PyTorch's CPU variants use different algorithms for certain operations (a prominent example would be convolutions, for which dozens of approximations exist). Differences between versions should also be expected and could be attributed to bug fixes etc.
In practice, I find that if I run the same code multiple times with the same cuDNN and PyTorch version (even on different machines; assuming manual seeds for weight init and shuffling are set and cuDNN is set to deterministic), I always get consistent results.
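For example, a quick check along these lines (a toy model, not my actual code) comes out bit-identical for me run after run:

```python
import torch
import torch.nn as nn

def run_once(seed=0):
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # tiny toy model, just to exercise cuDNN convolutions
    model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Conv2d(8, 1, 3)).cuda()
    x = torch.randn(4, 3, 32, 32, device='cuda')
    out = model(x)
    out.mean().backward()
    return out.detach().cpu()

print(torch.equal(run_once(), run_once()))  # expected: True
```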
Hi, I ran into another irreproducibility problem.
I downgraded the RTX 2080 Ti platform from torch v1.0.0 to torch v0.4.0, and now results are reproducible when running the code >= 2 times.
But I found that batch norm and nn.Linear cause irreproducibility between my two platforms below.
If I remove the batch norm and replace nn.Linear with nn.Conv2d, the results are reproducible.
torch v0.4.0, CUDA 8.0, cuDNN 6.0.21, Ubuntu 16.04, GPU GTX 1070.
torch v0.4.0, CUDA 9.0, cuDNN 7.4.2.24, Ubuntu 18.04, GPU RTX 2080 Ti.
Actually, both the CPU version and the GPU version are irreproducible.
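Concretely, the replacement I mean is something like this (a sketch with made-up sizes): a fully connected head on the flattened features becomes a 1x1 convolution on the (N, C, 1, 1) feature map.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 512, 1, 1)  # pooled feature map

# fully connected head (together with batch norm, irreproducible for me)
fc = nn.Linear(512, 10)
out_fc = fc(x.view(x.size(0), -1))      # shape (4, 10)

# same head expressed as a 1x1 convolution (reproducible in my runs)
conv = nn.Conv2d(512, 10, kernel_size=1)
out_conv = conv(x).view(x.size(0), -1)  # shape (4, 10)
```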
I only know about PyTorch >= 1, but if you have a tensor and get different results when sending it through nn.Linear several times, I would be most interested.
Linear and batch norm are generally believed to be reproducible. (But be certain that you're not applying updated batch norm statistics.)
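To illustrate that last point: a train()-mode forward pass updates the running statistics, so a later eval()-mode pass on the very same input no longer matches an earlier one. That's stored state, not randomness. A minimal sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(8)
x = torch.randn(4, 8, 16, 16)

bn.eval()
before = bn(x)   # uses the initial running statistics

bn.train()
_ = bn(x)        # this pass updates running_mean / running_var

bn.eval()
after = bn(x)    # same input, but the running statistics have changed

print(torch.allclose(before, after))  # False: updated statistics, not randomness
```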