Different training results on different machines | With simplified test code

Art · October 28, 2019, 12:28pm

Hello,
I’ve been struggling for a few weeks to train a certain complex model on different computers.
The training runs smoothly on my local machine, with nice loss and accuracy curves, but on the more powerful multi-GPU machine the training is very bad: the accuracy, even on the training dataset, doesn’t rise beyond a certain point, and the accuracy on the validation dataset is all over the place compared to the results on my local machine.

I installed the same clean conda environment on both machines. The only relevant differences I could find between the machines are the NVIDIA driver and CUDA versions.
On the local machine: Cuda compilation tools, release 10.1, V10.1.105
Driver Version: 418.40.04
On the multi-GPU machine: Cuda compilation tools, release 9.2, V9.2.148
Driver Version: 396.54

After extensive testing I pinned the problem down to one of the models training differently. The other models trained similarly on both machines; only this one diverged, with the same code, same data, same random seeds, etc.

I wrote a simple script to isolate and reproduce the problem; it is posted below.
These are the outputs I get on the two machines with the exact same code and seeds:
Local Machine output

$ python Test_different_training.py
Results of the forward pass on the first batch is same on both machines:
Same input: tensor([[0.6349, 0.0771, 0.4478],
[0.0277, 0.4497, 0.6643],
[0.4654, 0.3515, 0.3045],
[0.1548, 0.5315, 0.2011],
[0.5183, 0.2718, 0.5145]], device='cuda:0')
Same output tensor([[-0.0634, -0.0145],
[-0.0023, 0.0131],
[-0.0695, -0.0140],
[-0.0643, -0.0045],
[-0.0902, -0.0123]], device='cuda:0', grad_fn=)

Results of the forward pass after 10 batches is different:
Same input: tensor([[0.9786, 0.8589, 0.1811],
[0.3121, 0.1688, 0.7962],
[0.5744, 0.4271, 0.6725],
[0.3887, 0.4706, 0.2278],
[0.2610, 0.0231, 0.3505]], device='cuda:0')
> Different output tensor([[-0.0148, 0.0174],
> [-0.1467, 0.1211],
> [-0.3497, 0.3899],
> [-0.0969, 0.0811],
> [ 0.0293, -0.0280]], device='cuda:0', grad_fn=)

Multi-GPU Machine output

$ python Test_different_training.py
Results of the forward pass on the first batch is same on both machines:
Same input: tensor([[0.6349, 0.0771, 0.4478],
[0.0277, 0.4497, 0.6643],
[0.4654, 0.3515, 0.3045],
[0.1548, 0.5315, 0.2011],
[0.5183, 0.2718, 0.5145]], device='cuda:0')
Same output tensor([[-0.0634, -0.0145],
[-0.0023, 0.0131],
[-0.0695, -0.0140],
[-0.0643, -0.0045],
[-0.0902, -0.0123]], device='cuda:0', grad_fn=)

Results of the forward pass after 10 batches is different:
Same input: tensor([[0.9786, 0.8589, 0.1811],
[0.3121, 0.1688, 0.7962],
[0.5744, 0.4271, 0.6725],
[0.3887, 0.4706, 0.2278],
[0.2610, 0.0231, 0.3505]], device='cuda:0')
> Different output tensor([[-1.2169, 1.2565],
> [-0.1790, 0.1972],
> [-0.7532, 0.7486],
> [-0.1011, 0.1140],
> [ 0.0452, -0.0452]], device='cuda:0', grad_fn=)

Code to test:
The input to the model is structured in a weird way, but that’s because it’s only one part of the original architecture. Maybe that strange structure, and the manipulation I perform on the input, causes the difference? I hope not.

import numpy as np
import torch
import torch.nn as nn


class RNNModel_classifier(nn.Module):
    def __init__(self, nClasses = 2):
        super(RNNModel_classifier, self).__init__()

        self.conv1_r = nn.Conv1d(in_channels=4096, out_channels=1024, kernel_size=5, padding=1, stride=1)
        self.batchnorm1_r = nn.BatchNorm1d(1024)
        self.conv2_r = nn.Conv1d(in_channels=1024, out_channels=256, kernel_size=7, padding=0, stride=1)
        self.batchnorm2_r = nn.BatchNorm1d(256)
        self.conv3_r = nn.Conv1d(in_channels=256, out_channels=128, kernel_size=9, padding=0, stride=2)
        self.batchnorm3_r = nn.BatchNorm1d(128)
        self.conv4_r = nn.Conv1d(in_channels=128, out_channels=64, kernel_size=7, padding=0, stride=1)
        self.batchnorm4_r = nn.BatchNorm1d(64)
        self.conv5_r = nn.Conv1d(in_channels=64, out_channels=64, kernel_size=2, padding=0, stride=1)
        self.batchnorm5_r = nn.BatchNorm1d(64)

        self.conv1_s = nn.Conv1d(in_channels = 32, out_channels = 32, kernel_size = 3, padding = 0, stride = 1)
        self.batchnorm1_s = nn.BatchNorm1d(32)
        self.conv2_s = nn.Conv1d(in_channels = 32, out_channels = 64, kernel_size = 7, padding = 0, stride = 2)
        self.batchnorm2_s = nn.BatchNorm1d(64)
        self.conv3_s = nn.Conv1d(in_channels = 64, out_channels = 128, kernel_size = 9, padding = 0, stride = 4)
        self.batchnorm3_s = nn.BatchNorm1d(128)
        self.conv4_s = nn.Conv1d(in_channels = 128, out_channels = 128, kernel_size = 17, padding = 0, stride = 8)
        self.batchnorm4_s = nn.BatchNorm1d(128)
        self.conv5_s = nn.Conv1d(in_channels = 128, out_channels = 64, kernel_size = 15, padding = 0, stride = 4)
        self.batchnorm5_s = nn.BatchNorm1d(64)
        self.conv6_s = nn.Conv1d(in_channels = 64, out_channels = 64, kernel_size = 12, padding = 0, stride = 1)
        self.batchnorm6_s = nn.BatchNorm1d(64)

        self.classifier2 = nn.Bilinear(64, 64, nClasses*32)
        self.batchnorm2 = nn.BatchNorm1d(nClasses*32)
        self.classifier3 = nn.Linear(nClasses*32,nClasses*32)
        self.classifier4 = nn.Linear(nClasses*32,nClasses)
        self.init_weights()

    def init_weights(self):
        initrange = 0.1
        self.classifier2.bias.data.fill_(0)
        self.classifier2.weight.data.uniform_(-initrange, initrange)
        self.classifier3.bias.data.fill_(0)
        self.classifier3.weight.data.uniform_(-initrange, initrange)
        self.classifier4.bias.data.fill_(0)
        self.classifier4.weight.data.uniform_(-initrange, initrange)

    def forward(self, hidden, batch_size_of_sample = 1):

        res_r = torch.relu(self.batchnorm1_r(self.conv1_r(hidden[-1].transpose(0, 1).contiguous().view(
            int(hidden[-1].transpose(0, 1).size(0) / batch_size_of_sample), -1, 32))))
        res_r = torch.relu(self.batchnorm2_r(self.conv2_r(res_r)))
        res_r = torch.relu(self.batchnorm3_r(self.conv3_r(res_r)))
        res_r = torch.relu(self.batchnorm4_r(self.conv4_r(res_r)))
        res_r = torch.relu(self.batchnorm5_r(self.conv5_r(res_r)).squeeze(2))
        
        res_s = torch.relu(self.batchnorm1_s(self.conv1_s(hidden[-1].transpose(0, 1).contiguous().view(
            int(hidden[-1].transpose(0, 1).size(0) / batch_size_of_sample), -1, 32).transpose(1, 2).contiguous())))
        res_s = torch.relu(self.batchnorm2_s(self.conv2_s(res_s)))
        res_s = torch.relu(self.batchnorm3_s(self.conv3_s(res_s)))
        res_s = torch.relu(self.batchnorm4_s(self.conv4_s(res_s)))
        res_s = torch.relu(self.batchnorm5_s(self.conv5_s(res_s)))
        res_s = torch.relu(self.batchnorm6_s(self.conv6_s(res_s)).squeeze(2))

        result = self.classifier2(res_r, res_s)
        result = self.classifier3(torch.relu(self.batchnorm2(result)))
        result = self.classifier4(torch.relu(result))

        return result



np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed(42)
batch_number = 10
model_classifier = RNNModel_classifier().cuda()
criterion = nn.CrossEntropyLoss().cuda()
params = model_classifier.parameters()
optimizer = torch.optim.Adam(params, lr=0.1, weight_decay=1.2e-6)
for i in range(10):
    optimizer.zero_grad()
    hidden_size = [1, 4096*batch_number, 32]
    hidden = [torch.rand(hidden_size).cuda()]
    targets = torch.randint(0, 2, [batch_number]).cuda()
    output  = model_classifier(hidden,batch_size_of_sample = 4096)
    if i == 0:
        print("Results of the forward pass on the first batch is same on both machines:")
        print("Same input: ", hidden[-1][0][0:5, 0:3])
        print("Same output", output[0:5])
    loss = criterion(output, targets)
    loss.backward()
    optimizer.step()

output  = model_classifier(hidden,  batch_size_of_sample = 4096)
print()
print("Results of the forward pass after 10 batches is different:")
print("Same input: ", hidden[-1][0][0:5, 0:3])
print("Different output", output[0:5])

JuanFMontesinos · October 28, 2019, 2:05pm

Keep in mind that multi-GPU training differs a lot from single-GPU training. Batch norm statistics are not shared across GPUs, and the gradients from each GPU are averaged. There are real differences, and other people have reported the same thing you describe.

Besides, some operations backed by NVIDIA libraries aren’t deterministic even if you set the seed, for instance interpolation. There is a flag, torch.backends.cudnn.benchmark, which is recommended to be set to False. Lastly, each machine may handle floating point operations with different hardware and thus produce slightly different results.

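import numpy as np
import torch

# Raise the same random number to the power of pi on the GPU and on the CPU.
a = torch.rand(1) * 10
b = a.cuda(1) ** np.pi  # computed on the GPU with device index 1
c = a ** np.pi          # computed on the CPU
print('Cuda {0:10f} CPU {1:10f}'.format(b.item(), c.item()))
Cuda 731.980347 CPU 731.980286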
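
For reference, the seeding and cuDNN settings mentioned in this post are often grouped into one helper. The following is only a minimal sketch: the function name `seed_everything` is made up for illustration, and these settings make runs repeatable on one machine at best; they do not promise identical numbers across different GPUs, drivers, or CUDA versions.

```python
import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    """Hypothetical helper: seed the RNGs a script uses and ask cuDNN for repeatable kernels."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    # Disable the cuDNN autotuner, which may select different kernels from run to run.
    torch.backends.cudnn.benchmark = False
    # Restrict cuDNN to deterministic algorithms (this can be slower).
    torch.backends.cudnn.deterministic = True


seed_everything(42)
first = torch.rand(3)
seed_everything(42)
second = torch.rand(3)
print(torch.equal(first, second))  # same seed, same machine, same draw
```

Calling the helper once at the top of the training script, before the model and data loaders are created, is the usual placement.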

Art · October 28, 2019, 3:26pm

Thanks for the answer, although I’m not sure this explains the problematic behavior.

The code runs on a single GPU in both cases. (The multi-GPU code didn’t train properly, so I tried single-GPU training on both machines and got different results; the test code posted here is a simple single-GPU script.)
In the full training script I checked the other, more complex model, and it trained the same on both machines. Only one model showed a difference during training, and it’s the model detailed in the test code.
The differences in the results are not insignificant (unlike differences at the level of float precision), so I think that’s not the cause either.
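
To put a number on "not insignificant", the two printed outputs from the first post can be compared directly. The sketch below copies those ten values by hand and relates the largest gap to the float32 machine epsilon; the helper name `max_abs_diff` is only illustrative.

```python
# Outputs after 10 batches, copied from the two runs in the first post.
local_machine = [
    [-0.0148, 0.0174],
    [-0.1467, 0.1211],
    [-0.3497, 0.3899],
    [-0.0969, 0.0811],
    [0.0293, -0.0280],
]
multi_gpu_machine = [
    [-1.2169, 1.2565],
    [-0.1790, 0.1972],
    [-0.7532, 0.7486],
    [-0.1011, 0.1140],
    [0.0452, -0.0452],
]

FLOAT32_EPS = 2.0 ** -23  # spacing between 1.0 and the next representable float32


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two nested lists."""
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


gap = max_abs_diff(local_machine, multi_gpu_machine)
print(f"max abs diff: {gap:.4f}")
print(f"ratio to float32 epsilon: {gap / FLOAT32_EPS:.1e}")
```

The gap after ten updates is many orders of magnitude above a single rounding error, while the first-batch outputs printed by both machines agree to all four displayed digits.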

Note: It seems that the random input in my test code changes with different PyTorch versions. The results above are for PyTorch 1.1.0.

The output of the multi-GPU machine with PyTorch 1.2.0 (bad training):

$ python Test_different_training.py
Results of the forward pass on the first batch is same on both machines:
Same input: tensor([[0.0807, 0.0398, 0.8724],
[0.3084, 0.7438, 0.3201],
[0.8189, 0.6380, 0.3528],
[0.9787, 0.5305, 0.4797],
[0.9665, 0.9392, 0.7120]], device='cuda:0')
Same output tensor([[-0.3300, 0.0380],
[-0.0558, 0.1243],
[-0.0914, 0.0408],
[-0.1176, 0.0539],
[-0.1128, 0.2583]], device='cuda:0', grad_fn=)

Results of the forward pass after 10 batches is different:
Same input: tensor([[0.8217, 0.3160, 0.9787],
[0.7174, 0.7287, 0.3107],
[0.2931, 0.7804, 0.8234],
[0.5223, 0.7749, 0.5192],
[0.2176, 0.2366, 0.8337]], device='cuda:0')
Different output tensor([[-0.1803, -0.1522],
[-0.1187, 0.0910],
[ 0.0831, -0.0918],
[-0.0440, 0.0358],
[ 0.1193, -0.3293]], device='cuda:0', grad_fn=)

albanD · October 28, 2019, 3:33pm

Hi,

In general, setting the seed will give you consistent random numbers for a given version of PyTorch.

Having reproducible results on different hardware is (almost) impossible, as different hardware handles floating point ops differently.
The problem with training a neural network is that one layer produces an error at the level of floating point rounding. The next layers amplify this error, so the loss you compute is slightly different (how much depends on the depth). The backward pass amplifies these errors again, and the computed gradients are slightly different. Finally, the new weights after the gradient update will be significantly different. If you repeat this 10 times with 10 batches of data, you get results that are significantly different.
Such differences are expected, and well-designed neural networks have stable enough training that such problems do not matter.
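
The amplification described here can be illustrated with a toy analogy. The sketch below is not a neural network: it iterates the logistic map, a simple nonlinear update chosen only because it is known to be sensitive to its starting point, from two values that differ by 1e-7, roughly the size of a float32 rounding error.

```python
def logistic_step(x: float, r: float = 3.9) -> float:
    """One step of the logistic map, standing in for a nonlinear update."""
    return r * x * (1.0 - x)


def trajectory(x0: float, steps: int) -> list:
    """Apply the update repeatedly and record every intermediate value."""
    xs = [x0]
    for _ in range(steps):
        xs.append(logistic_step(xs[-1]))
    return xs


run_a = trajectory(0.2, 100)
run_b = trajectory(0.2 + 1e-7, 100)  # same start, perturbed at rounding-error level

gaps = [abs(a - b) for a, b in zip(run_a, run_b)]
print(f"gap after 1 step:  {gaps[1]:.2e}")
print(f"largest gap seen:  {max(gaps):.2e}")
```

After one step the two runs are still indistinguishable for practical purposes; within a hundred steps they are unrelated. How strongly a real training loop amplifies such differences depends on the model and optimizer.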

Art · October 28, 2019, 3:49pm

I can totally agree with that.
But it doesn’t fit well with the fact that the full script trains well on one machine but produces significantly worse results on another machine with exactly the same code/data. So from your answer one could assume that my architecture is not stable enough.
But then again, I tried different parameters and hyper-parameters and the behavior is the same, decent training on one machine and bad training on the other.

albanD · October 28, 2019, 4:12pm

Make sure that on the first machine you don’t set the same seed for all the runs. That could make the training look stable, when you are effectively doing the same training every time on that machine.
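
One way to check this is to repeat the training with several different seeds on the same machine and look at the spread of the final metric. The sketch below shows only the bookkeeping; `train_once` is a hypothetical stand-in that returns a fake score so the example runs by itself, and it would be replaced by the real training loop.

```python
import random
import statistics


def train_once(seed: int) -> float:
    """Hypothetical stand-in for a full training run; returns a final validation accuracy.

    The fake score below exists only so that this sketch is runnable.
    """
    rng = random.Random(seed)
    return 0.80 + rng.uniform(-0.05, 0.05)


def evaluate_seeds(seeds):
    """Run training once per seed and summarise the spread of the results."""
    scores = [train_once(s) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores), scores


mean, spread, scores = evaluate_seeds([0, 1, 2, 3, 4])
print(f"accuracy: {mean:.3f} +/- {spread:.3f} over {len(scores)} seeds")
```

If the spread across seeds on the "good" machine is already as large as the gap between the two machines, the difference between machines is probably not the real issue.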

Otherwise, GPU computations can behave differently, leading to lower precision in this setting. Do the CPU versions on both machines behave similarly?
If even the CPU versions differ, I would double check that the dataset and other parameters are indeed the same on both machines.
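
A simple way to double check that the dataset really is identical on both machines is to hash it on each and compare the digests. This is a minimal sketch using only the standard library; the function name `fingerprint_directory` is illustrative, and it assumes the dataset lives in ordinary files under one directory.

```python
import hashlib
from pathlib import Path


def fingerprint_directory(root: str) -> str:
    """Return one SHA-256 digest covering the relative path and contents of every file under root."""
    root_path = Path(root)
    files = [p for p in root_path.rglob("*") if p.is_file()]
    # Sort by POSIX-style relative path so the order is the same on Windows and Linux.
    files.sort(key=lambda p: p.relative_to(root_path).as_posix())

    digest = hashlib.sha256()
    for path in files:
        digest.update(path.relative_to(root_path).as_posix().encode("utf-8"))
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large files do not have to fit in memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


# Example (the path is a placeholder):
# print(fingerprint_directory("/path/to/dataset"))
```

Running this on both machines and comparing the two hex strings catches missing, extra, renamed, or modified files.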

Ashima_Garg · June 30, 2020, 10:59am

Isn’t this problem independent of the network? How can some “well-designed neural networks” avoid it? Can you elaborate on these networks so that I know which sub-parts to avoid in my own neural network?

albanD · June 30, 2020, 2:57pm

What I mean here is that 32-bit floats have very poor precision for deep networks. The roughly 1e-6 error made all the time grows very quickly and can lead to fairly noisy gradients.
But the networks and optimizers we use are not sensitive to such noise and are still able to converge to a high-quality solution (the solutions might differ, but they are all of similar quality).

But other network structures/optimizers can actually be very sensitive to such noise and will thus have trouble converging in general (you can check the early neural computer papers, for example, which converged for only a very small number of random seeds).
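
The limited precision of 32-bit floats can be seen without a GPU. The sketch below emulates float32 rounding with the standard `struct` module and sums the same numbers in two different orders; the helper names `f32` and `sum_f32` are illustrative. Different hardware and library versions may reduce values in different orders, which is one source of the small discrepancies discussed in this thread.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_f32(values) -> float:
    """Accumulate values left to right, rounding to float32 after every addition."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total


values = [1.0] + [1e-8] * 100_000

large_first = sum_f32(values)            # each 1e-8 is rounded away next to 1.0
small_first = sum_f32(reversed(values))  # small terms accumulate before meeting 1.0

print(f"large term first: {large_first:.7f}")
print(f"small terms first: {small_first:.7f}")
print(f"difference: {small_first - large_first:.2e}")
```

The mathematical sum is the same in both cases; only the order of the additions changes the float32 result.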

CalinTimbus · November 11, 2020, 6:50am

I am experiencing exactly the same issue, but while training a segmentation network, a UNET.

I have a system with CUDA 10.1, Nvidia Driver 455, GTX 1080 Ti with PyTorch 1.6, where the training runs successfully.

On the other hand, I have another system with CUDA 10.1, Nvidia Driver 418.35 (the first driver shipped with CUDA 10.1) and an RTX 2080 Ti with PyTorch 1.6. There the training does not converge, let alone give usable results on the validation set.

I copied the exact same code from my local machine (with good results) to the other machines, which give no good results.

It is a very strange behaviour and I do not understand why it happens. I understand and know that different convergence paths can be obtained and perfect reproducibility can be obtained only by providing the same seeds, but I still do not understand how such a phenomenon is possible.

@Art did you manage to solve this problem? If so, would you be so kind as to explain how you solved it?

Art · November 11, 2020, 8:34am

Hi Calin, unfortunately I was not able to solve the problem. In my case the issue wasn’t non-convergence on one of the machines but rather just different numbers. As albanD mentions in his post, in some settings models aren’t sensitive to such noise and will still converge, while in other settings this noise can make or break a model, and sometimes the model won’t converge.
I’m not sure whether the convergence problem in your case is due to such noise or to some other difference caused by different versions.
I would definitely try installing the same driver version and the same versions of all Python packages and see if the problem persists. Then perhaps change some training hyperparameters that could help “smooth” out this noise and help the model converge in both cases (probably at the cost of training speed/efficiency).
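
To compare installed versions between machines, a small report can be printed on each and diffed. This sketch uses only the standard library (Python 3.8 or newer for `importlib.metadata`); the package list is just an example, and the function names are illustrative. It does not cover the GPU driver or CUDA toolkit, which still have to be checked separately.

```python
import json
import platform
import sys
from importlib import metadata


def package_version(name: str):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def environment_report(packages):
    """Collect interpreter, OS, and package versions into a plain dict."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
    }


# Example package list; extend it with whatever the training script imports.
report = environment_report(["torch", "torchvision", "numpy"])
print(json.dumps(report, indent=2, sort_keys=True))
```

Saving the JSON output on each machine and diffing the files shows at a glance which versions differ.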

CalinTimbus · November 11, 2020, 8:57am

In my case the dataset is extremely imbalanced (99.82% background and 3 other classes which make up the remaining 0.18%), and I think this phenomenon is due to that inherent imbalance.

It is very curious, though, that UNET + a specific encoder works on Windows with an updated driver but fails on Linux.

Art · November 11, 2020, 9:22am

That’s very interesting/frustrating/strange. The only explanation that comes to mind is this “noise” difference between different machines, but I wouldn’t be surprised if something else system/kernel related is going on. There are also some works popping up about “lottery ticket” and lucky convergence and things of that kind, but I still haven’t seen any peer-reviewed explanation of inherent differences between different machines/systems.

CalinTimbus · November 11, 2020, 5:25pm

Thing is that on Windows 10 CUDA 10.1 with PyTorch 1.6 the UNET with different backbones seems to always start to converge after a point, while on Linux it always fails miserably.

I was starting to think that something is wrong with my code.

Nice+Sad to see a similar phenomenon happening.

Thank you for the discussion and good luck in your endeavours.

Hassan_Imani · January 1, 2021, 11:28pm

Dear @albanD, @CalinTimbus and @Art, did you solve the problem?
I am having exactly the same problem.

albanD · January 6, 2021, 3:49pm

Hi,

There isn’t any problem to solve as the behavior here is expected as mentioned in the citation you have.
You can check the note on reproducibility for more details here: Reproducibility — PyTorch 1.7.0 documentation

Hassan_Imani · January 6, 2021, 7:10pm

There is a problem that is not solved, and I think there are a lot of cases of “working in Keras but not in PyTorch”!
My code works in Keras on both PCs, but in PyTorch it works on one of them and not on the other!
I’ve lost hope here!
Thank you for the useful comments.

albanD · January 6, 2021, 7:12pm

You would need to give more details about what you mean by “not working” as well as what your code is doing and what is expected or not for us to be able to help you.
I would recommend you open a brand new topic since it is most likely not related to the discussion in this one.

Hassan_Imani · January 6, 2021, 7:13pm

I created this:

toyaji · September 18, 2021, 11:48pm

I am struggling with the same issue. My non-local net converges on my local machine (Windows 10, CUDA 11.2, RTX 3060, PyTorch 1.9) but does not converge on a Linux machine (CUDA 11.2, RTX 3090, PyTorch 1.9). What is happening here? Does anybody know how to fix it?

Nurmukhamed_Ubaidull · November 5, 2021, 2:15pm

I also have the same problem. I have three identical machines with a shared home directory, where the source code lives. On two of them I get good training results, or at least low training and validation loss. But on the third machine I get low training loss and very high validation loss. The code and data are the same.