Different results with identical random seed and identical code on different GPUs

Hi,

I observed that when I run my code on different GPUs, my error metrics change. For example, for one metric I get 1.0056 on one machine and 1.0273 on the other. The first GPU is a GTX TITAN X and the second a Tesla K40c. The driver version is 390.48, and these are the other relevant versions:

pytorch 1.0.0 py3.7_cuda9.0.176_cudnn7.4.1_1
numpy 1.15.4 py37h7e9f1db_0
numpy-base 1.15.4 py37hde5b4d6_0
python 3.7.2 h0371630_0
torchvision 0.2.1 py_2
OS: 4.13.0-36-generic #40~16.04.1-Ubuntu

I can’t find the version of cudatoolkit on this machine. Could it be that it is not installed? Is it even required?

All random seeds are set to 0, the cuDNN benchmark flag is disabled, and deterministic is set to True. If I run the model 10 times on the same GPU, I get exactly the same result all 10 times. However, that is not the case if I change the GPU. Note that the machine is the same; only the GPU model changes. If I run it twice on different TITAN X cards I also get the same result, but the results differ between the K40c and the TITAN X.
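For reference, the determinism settings look roughly like this (a minimal sketch, with seed value 0 as mentioned above):

import random
import numpy as np
import torch

random.seed(0)
np.random.seed(0)
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
torch.backends.cudnn.benchmark = False      # no cuDNN autotuning
torch.backends.cudnn.deterministic = True   # pick deterministic cuDNN kernels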

What could the reason for this behaviour be and is there something I can do to get the same results on different GPUs?

EDIT:
The model was trained only once and is now evaluated on different GPUs (the exact same weights are used). The model also does backpropagation during inference (it is a generative model). I am using Adam, and the loss function contains some logarithms. Could there be some numerical instabilities that cause this big difference?
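To make that question more concrete: if the argument of one of those logarithms is small, a tiny difference can get amplified quite a bit. A toy example (not my actual loss function):

import torch

a = torch.tensor(0.00010, dtype=torch.float32)
b = torch.tensor(0.00011, dtype=torch.float32)   # differs only in the 5th decimal place
print(torch.log(a).item(), torch.log(b).item())  # roughly -9.2103 vs -9.1150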

EDIT 2:
I also have a machine on AWS with a Tesla K80 and tested it there. The result is 100% the same as with the Tesla K40, even though the driver is much newer. So I am guessing it comes down to numerical differences due to the hardware implementation?

EDIT 3:
It seems that the K40 and K80 have much stronger double-precision support than the TITAN X? That could explain the differences. But how can this be solved? I need the same results on each GPU.

I found the point in my code that introduced the first differences into the data (and those differences grew in the following layers). It is this very short piece:

self.layers = nn.Sequential(OrderedDict([
    ('lin1',  nn.Linear(in_features=2*n_coords, out_features=hidden_size)),
    ('relu1', nn.ReLU()),
    ('lin3',  nn.Linear(in_features=hidden_size, out_features=self.outsize)),
]))

There was a tiny difference around the 4th or 5th decimal place after lin3. The weights and inputs are all torch.float32 tensors. What I did is the following:

Replace

def forward(self, x):
    x = self.layers(x)
    return x

by

def forward(self, x):
    x = x.double()                      # cast the input up to float64
    self.layers = self.layers.double()  # cast the layer weights to float64 as well
    x = self.layers(x)
    return x.float()                    # cast the result back down to float32

Now I get the exact same result on both GPUs (though this result differs from both of the previous results I got on the two GPUs).
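The same workaround can also be applied once at construction time instead of on every forward call. This is just a sketch with a made-up class name; n_coords, hidden_size and outsize are the same as in my module above:

import torch
import torch.nn as nn
from collections import OrderedDict

class CoordBlock(nn.Module):   # hypothetical name, not my real class
    def __init__(self, n_coords, hidden_size, outsize):
        super().__init__()
        self.outsize = outsize
        self.layers = nn.Sequential(OrderedDict([
            ('lin1',  nn.Linear(in_features=2 * n_coords, out_features=hidden_size)),
            ('relu1', nn.ReLU()),
            ('lin3',  nn.Linear(in_features=hidden_size, out_features=self.outsize)),
        ])).double()                             # convert the weights to float64 once

    def forward(self, x):
        return self.layers(x.double()).float()  # cast input up, cast the result back down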

However, I don’t understand why this happens. When both the inputs and network weights are float32 tensors, no double precision should be used, right? So why does the result differ? Both GPUs are able to process single precision floats. Is there some part of these layers that internally “upgrades” to double precision if available?
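I suppose the answer could simply be that float32 arithmetic is not associative, so different kernels and reduction orders on different GPUs can give slightly different results for the same operation. A small illustration of that effect on a single device (same numbers summed in two different orders):

import torch

x = torch.randn(100000, dtype=torch.float32)
s1 = x.sum()
s2 = x.flip(0).sum()          # same values, accumulated in reverse order
print(s1.item(), s2.item())   # usually differs in the last few digits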

The rest of the model that follows also uses only float32 tensors, and I never convert anything else to double. And the outputs of the following layers are identical; the only part that differs is these few layers that I posted.
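In case it helps others, here is a sketch of how one can compare intermediate outputs between two machines to find the first layer that diverges (the helper below is my own sketch, not the exact code I used, and it assumes each submodule returns a single tensor):

import torch
import torch.nn as nn

def capture_activations(model, x):
    """Run one forward pass and record every submodule's output on the CPU."""
    acts, hooks = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            acts[name] = output.detach().cpu()
        return hook

    for name, module in model.named_modules():
        if name:   # skip the root module itself
            hooks.append(module.register_forward_hook(make_hook(name)))
    model(x)
    for h in hooks:
        h.remove()
    return acts

# Save the dict on each machine with torch.save(acts, 'acts_<gpu>.pt') and then
# compare entry by entry with torch.allclose(a, b, atol=1e-6) to locate the
# first layer whose outputs differ.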
