Hi,
I observed that when I run my code on different GPUs my errors change. For example, for some metric I get 1.0056 on one machine and 1.0273 on the other. The first GPU is a GTX TITAN X and the second a Tesla K40c. Driver version is 390.48 and other relevant versions:
pytorch 1.0.0 py3.7_cuda9.0.176_cudnn7.4.1_1
numpy 1.15.4 py37h7e9f1db_0
numpy-base 1.15.4 py37hde5b4d6_0
python 3.7.2 h0371630_0
torchvision 0.2.1 py_2
OS: 4.13.0-36-generic #40~16.04.1-Ubuntu
I can’t find the version of cudatoolkit on this machine. Could it be that it is not installed? Is it even required?
All random seeds are set to 0, the cuDNN benchmark mode is disabled, and deterministic is set to True. If I run the model 10 times on the same GPU, I get exactly the same result all 10 times. However, this is not the case if I change the GPU. Note that the machine is the same; only the GPU model changes. If I run it twice on different TITAN X cards I also get the same result, but the results differ between the K40c and the TITAN X.
What could the reason for this behaviour be and is there something I can do to get the same results on different GPUs?
EDIT:
The model was trained only once and is now evaluated on different GPUs (the exact same weights are used). The model also performs backprop during inference (it is a generative model). I am using Adam, and the loss function contains some logarithms. Could there be some numerical instabilities that cause this big difference?
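As a minimal illustration of the kind of instability meant here, the sketch below uses plain Python with made-up values (it is not the actual loss function, and `log_gap` and `delta` are hypothetical names). It shows that the same tiny input difference barely changes `log(x)` near 1 but changes it noticeably near 0, so a small device-dependent rounding difference could in principle be amplified by a logarithm.

```python
import math


def log_gap(x, delta):
    """Absolute change in log(x) when x is perturbed by delta."""
    return abs(math.log(x + delta) - math.log(x))


# Hypothetical perturbation size, standing in for a small rounding
# difference between two devices.
delta = 1e-7

# Same perturbation, two different operating points.
gap_near_one = log_gap(1.0, delta)    # on the order of 1e-7
gap_near_zero = log_gap(1e-6, delta)  # equals log(1.1), roughly 0.095

print("gap near 1:", gap_near_one)
print("gap near 0:", gap_near_zero)
```

Since the derivative of `log(x)` is `1/x`, the amplification grows as the argument approaches zero. Repeated Adam steps during inference could then carry such a difference forward, although that is only a guess for this model.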
EDIT 2:
I also have a machine on AWS with a Tesla K80 and tested it there. The result is exactly the same as with the Tesla K40, even though the driver is much newer. So I am guessing it is due to numerical differences in the hardware implementation?
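One possible source of such hardware-dependent differences is that floating-point addition is not associative, so a reduction that adds the same numbers in a different order can give a different result. Whether the two GPU models really use different summation orders here is an assumption, not something verified. The sketch below is plain Python on the CPU with made-up values, and `accumulate` is a hypothetical helper.

```python
def accumulate(values):
    """Sum floats strictly left to right, like a sequential reduction."""
    total = 0.0
    for v in values:
        total += v
    return total


# Same three numbers, two different summation orders.
small = [0.1, 0.2, 0.3]
left_to_right = accumulate(small)        # (0.1 + 0.2) + 0.3
right_to_left = accumulate(small[::-1])  # (0.3 + 0.2) + 0.1

# With very different magnitudes the effect is no longer tiny:
# 1.0 is lost when it is added to 1e16 before the cancellation.
lost = accumulate([1e16, 1.0, -1e16])
kept = accumulate([1e16, -1e16, 1.0])

print("orders agree:", left_to_right == right_to_left)
print("lost:", lost, "kept:", kept)
```

Fixing the random seeds does not remove this effect, because the inputs are identical in both cases and only the order of operations differs.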
EDIT 3:
It seems that the K40 and K80 have double precision while the TITAN X does not? That could explain the differences. But how can this be solved? I need the same results on every GPU.