Different outputs with the same input when running inference on different machines

I need to reproduce the same output on different machines, but the outputs differ. There is no dropout layer, and I set torch.backends.cudnn.enabled = False, yet the results are still different. Any idea how to solve this?

Maybe you have different versions of CUDA and/or PyTorch on the machines?
Can you print the versions?

import sys
from subprocess import call

import numpy as np
import torch

print('__Python VERSION:', sys.version)
print('__pyTorch VERSION:', torch.__version__)
print('__CUDA VERSION:')
call(["nvcc", "--version"])  # prints the CUDA toolkit version, if nvcc is on the PATH
print('__CUDNN VERSION:', torch.backends.cudnn.version())
print('__Number CUDA Devices:', torch.cuda.device_count())
print("OS: ", sys.platform)
print("Python: ", sys.version)
print("PyTorch: ", torch.__version__)
print("Numpy: ", np.__version__)

GTX 1060

 ('__Python VERSION:', '2.7.6 (default, Oct 26 2016, 20:30:19) \n[GCC 4.8.4]')
 ('__pyTorch VERSION:', '0.2.0_1')
 __CUDA VERSION
 ('__CUDNN VERSION:', 6021)
 ('__Number CUDA Devices:', 1L) 
 ('OS: ', 'linux2')
 ('Python: ', '2.7.6 (default, Oct 26 2016, 20:30:19) \n[GCC 4.8.4]')
 ('PyTorch: ', '0.2.0_1')
 ('Numpy: ', '1.13.1')

GTX TITAN X

('__Python VERSION:', '2.7.5 (default, Sep 15 2016, 22:37:39) \n[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)]')
('__pyTorch VERSION:', '0.2.0_1')
__CUDA VERSION
('__CUDNN VERSION:', 6021)
('__Number CUDA Devices:', 1L)
('OS: ', 'linux2')
('Python: ', '2.7.5 (default, Sep 15 2016, 22:37:39) \n[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)]')
('PyTorch: ', '0.2.0_1')
('Numpy: ', '1.13.0')

TITAN XP

('__Python VERSION:', '2.7.5 (default, Nov  6 2016, 00:28:07) \n[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)]')
('__pyTorch VERSION:', '0.2.0_1')
__CUDA VERSION
('__CUDNN VERSION:', 6021)
('__Number CUDA Devices:', 1L)
('OS: ', 'linux2')
('Python: ', '2.7.5 (default, Nov  6 2016, 00:28:07) \n[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)]')
('PyTorch: ', '0.2.0_1')
('Numpy: ', '1.13.1')

K80

('__Python VERSION:', '2.7.5 (default, Sep 15 2016, 22:37:39) \n[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)]')
('__pyTorch VERSION:', '0.2.0_1')
__CUDA VERSION
('__CUDNN VERSION:', 6021)
('__Number CUDA Devices:', 1L)
('OS: ', 'linux2')
('Python: ', '2.7.5 (default, Sep 15 2016, 22:37:39) \n[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)]')
('PyTorch: ', '0.2.0_1')
('Numpy: ', '1.13.0')

It seems there's no big difference between the environments, but the outputs differ; e.g. the maximum difference in the softmax output is 0.05, which is not acceptable in my case.
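(I measure the gap roughly like this, where out_a.pth and out_b.pth stand for the softmax output saved on each of the two machines; the file names are just placeholders:)

import torch

# softmax outputs saved with torch.save() on each machine
out_a = torch.load('out_a.pth')
out_b = torch.load('out_b.pth')
print('max abs difference:', (out_a - out_b).abs().max())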

Have you tried setting a random seed?

It seems that if they are all Titan XP, the outputs are the same. I haven't tried setting any random seed yet; which seed should I set? cuda.seed or something else?

Try setting these three seeds:

np.random.seed(41)      # any fixed value works, as long as every machine uses the same one
torch.manual_seed(41)
torch.cuda.manual_seed(41)
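You can also sanity-check determinism on a single machine first: run the same input through the model twice in one process and check that the results match. A rough sketch, where net stands for whatever model you are testing (assumed here to take a single input):

import torch
from torch.autograd import Variable

x = torch.randn(1, 3, 224, 224)            # any fixed input of the right shape
out1 = net(Variable(x.cuda())).data
out2 = net(Variable(x.cuda())).data
print((out1 - out2).abs().max())           # ideally 0.0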

I tried, but the outputs are still not the same…

I'll also try to locate which layer gives the different values.

Can you share the code?


import numpy as np
import torch
from torch.autograd import Variable

import models  # the module that provides the 'alex_22' model used below

torch.backends.cudnn.enabled = False
np.random.seed(41)
torch.manual_seed(41)
torch.cuda.manual_seed(41)

model = models.models['alex_22']()
model.load_model()
net = model.cuda()

def get_input(n, c, h, w):
    return torch.randn(n, c, h, w)

load = True
# load = False
save_pth = 'tensors.pth' #'no_cudnn_tensors.pth'

saves = {}

if not load:
    a = get_input(1, 3, 127, 127)
    b = get_input(1, 3, 255, 255)
    saves['a'] = a
    saves['b'] = b
else:
    saves = torch.load(save_pth)
    a = saves['a']
    b = saves['b']

def Var(x):
    return Variable(x.cuda())

output = net(Var(a), Var(b))[1].data

b1 = net.forward_one_branch(Var(a), net.conv_r1, net.conv_cls1)[1].data
b2 = net.forward_one_branch(Var(b), net.conv_r2, net.conv_cls2)[1].data

f1 = net.features(Var(a)).data
f2 = net.features(Var(b)).data

if not load:
    saves['o'] = output
    saves['b1'] = b1
    saves['b2'] = b2
    saves['f1'] = f1
    saves['f2'] = f2
    torch.save(saves, save_pth)
    print 'saving'
else:
    o2 = saves['o']
    ob1 = saves['b1']
    ob2 = saves['b2']
    of1 = saves['f1']
    of2 = saves['f2']
    print (o2 - output).abs().max()
    print (b1 - ob1).abs().max()
    print (b2 - ob2).abs().max()
    print (f1 - of1).abs().max()
    print (f2 - of2).abs().max()

Using this test code, I ran on Titan X and a 1060; the output is:
0.0687821805477
0.0696254000068
0.0968679785728
0.415367662907
0.437078356743
The difference is relatively huge.

The model takes two inputs x, y: (1, 3, 127, 127) -> (256, 4, 4) and (1, 3, 255, 255) -> (256, 20, 20), and then correlates the two outputs (a rough sketch of that operation follows the AlexNet code below).

features is a modified AlexNet:

import torch.nn as nn
import torch.nn.functional as F

class AlexNet5(nn.Module):
    def __init__(self):
        super(AlexNet5, self).__init__()
        self.conv1 = nn.Conv2d(3, 96, kernel_size=11, stride=2)
        self.conv2 = nn.Conv2d(96, 256, kernel_size=5)
        self.conv3 = nn.Conv2d(256, 384, kernel_size=3)
        self.conv4 = nn.Conv2d(384, 384, kernel_size=3)
        self.bn1 = nn.BatchNorm2d(96)
        self.bn2 = nn.BatchNorm2d(256)
        self.bn3 = nn.BatchNorm2d(384)
        self.bn4 = nn.BatchNorm2d(384)
        self.conv5 = nn.Conv2d(384, 256, kernel_size=3)
        self.bn5 = nn.BatchNorm2d(256)

        self.feature_size = 256

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.bn1(self.conv1(x)), kernel_size=3, stride=2))
        x = F.relu(F.max_pool2d(self.bn2(self.conv2(x)), kernel_size=3, stride=2))
        x = F.relu(self.bn3(self.conv3(x)))
        x = F.relu(self.bn4(self.conv4(x)))
        x = self.bn5(self.conv5(x))
        return x
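For reference, the correlation of the two feature maps is roughly the following kind of operation. This is only a simplified sketch using the AlexNet5 above, not the actual alex_22 model (which also has the conv_r*/conv_cls* branches):

import torch
import torch.nn.functional as F
from torch.autograd import Variable

features = AlexNet5().cuda()
features.eval()  # use running BatchNorm statistics for inference

z = Variable(torch.randn(1, 3, 127, 127).cuda())  # exemplar input
x = Variable(torch.randn(1, 3, 255, 255).cuda())  # search input

fz = features(z)  # small feature map
fx = features(x)  # larger feature map

# cross-correlate by using the exemplar features as a convolution kernel
score = F.conv2d(fx, fz)  # (1, 1, H', W') response map
print(score.size())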

And are you running both on the CPU?
I don't see any GPU-related tensors, for instance:
X_tensor = Variable(torch.from_numpy(a).cuda())

Sorry, I edited my last reply; part of it was missing last time.

I can't seem to find anything strange. If you want, upload a self-contained Jupyter notebook to git with the data and I can run it locally to compare the results.

I don't know why, but right now the K80, Titan X, and Titan XP outputs differ by at most 1e-6, while the 1060 shows a huge difference of up to 0.5. It seems that on the 1060 I installed http://download.pytorch.org/whl/cu80/torch-0.2.0.post3-cp27-cp27m-manylinux1_x86_64.whl, while the others use http://download.pytorch.org/whl/cu80/torch-0.2.0.post3-cp27-cp27mu-manylinux1_x86_64.whl . Could that produce such a huge difference?
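As far as I can tell, the two wheel names differ only in the Python ABI tag: cp27mu is built for a wide-unicode (UCS-4) interpreter, cp27m for a narrow-unicode (UCS-2) one. A quick way to check which build an interpreter is:

import sys
# 1114111 (0x10FFFF) -> wide/UCS-4 build (matches cp27mu wheels)
# 65535   (0xFFFF)   -> narrow/UCS-2 build (matches cp27m wheels)
print(sys.maxunicode)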

I finally located the problem: on the 1060 I had installed PyTorch with conda. After switching to the system Python installation, the output is the same.
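For anyone hitting the same issue, a quick way to confirm which interpreter and which PyTorch installation a script actually picks up:

import sys
import torch

print(sys.executable)     # path of the Python interpreter that is running
print(torch.__file__)     # location of the torch package that was imported
print(torch.__version__)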