Different training results on different machines | With simplified test code

Hi,

In general, setting the seed will give you consistent random numbers for a given version of PyTorch.
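For reference, a minimal sketch of what "setting the seed" usually involves in practice; the helper name `set_seed` is just for illustration, and the exact determinism flags depend on your PyTorch/CUDA version (see the reproducibility note linked later in this thread):

```python
# Minimal sketch of single-machine seeding for run-to-run consistency.
import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    random.seed(seed)                    # Python RNG
    np.random.seed(seed)                 # NumPy RNG
    torch.manual_seed(seed)              # CPU (and current GPU) RNG
    torch.cuda.manual_seed_all(seed)     # all GPUs
    torch.backends.cudnn.deterministic = True   # prefer deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False      # disable kernel autotuning

set_seed(42)
```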

Having reproducible results on different hardware is (almost) impossible, as different hardware will handle floating-point ops differently.
The problem with neural network training is that one layer introduces an error at the level of floating-point precision. The next layers amplify this error until you compute the loss, which will be slightly different (how much depends on the depth). The backward pass amplifies these errors again, and the computed gradients will be slightly different. Finally, the new weights after the gradient update will differ noticeably. Repeat this for 10 batches of data and you get results that are significantly different.
Such differences are expected, and well-designed neural networks have stable enough training that they do not matter.


I can totally agree with that.
But that doesn’t fit well with the fact that the full script trains well on one machine but produces significantly worse results on another machine with exactly the same code/data. From your answer, one could conclude that my architecture is not stable enough.
But then again, I tried different parameters and hyper-parameters and the behavior is the same: decent training on one machine and bad training on the other.


Make sure that on the first machine you don’t set the seed for all the runs, because that could make it look stable while you are effectively doing the same training every time on that machine.

Otherwise, GPU computations can have different behaviors leading to lower precision in our setting. Do the CPU versions on both machines behave similarly?
If even the CPU versions differ, I would double-check that the dataset and other parameters are indeed the same between the machines.
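As an illustration of that CPU check, here is a minimal sketch; the tiny model and random batch are placeholders for your own. Run it on both machines with the same seed and compare the printed numbers:

```python
# Minimal sketch of the suggested CPU comparison: fixed seed, one
# forward/backward pass on the CPU, print a few checksums to compare.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
x = torch.randn(32, 64)
target = torch.randint(0, 10, (32,))

out = model(x)
loss = nn.functional.cross_entropy(out, target)
loss.backward()

print("output checksum:", out.double().sum().item())
print("loss:", loss.item())
print("grad checksum:", sum(p.grad.double().abs().sum().item() for p in model.parameters()))
```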

Isn’t this problem independent of the network? How can some “well-designed neural networks” avoid this problem? Can you elaborate on these networks so that I can avoid using those sub-parts in my neural network?

What I mean here is that 32-bit floats have very limited precision for deep networks. The ~1e-6 error they introduce at every operation grows very quickly and can lead to fairly noisy gradients.
But the networks and optimizers we use are not sensitive to such noise and are still able to converge to high-quality solutions (even though the solutions might be different, they are all of similar quality).
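A tiny demo of the kind of float32 noise meant here (the exact values will vary from run to run and machine to machine):

```python
# The same numbers reduced in a different order generally give a slightly
# different float32 result; a deep network amplifies such differences
# layer by layer during training.
import torch

torch.manual_seed(0)
x = torch.randn(100_000, dtype=torch.float32)

s_flat = x.sum()                                # one reduction order
s_chunked = x.view(1000, 100).sum(dim=1).sum()  # another reduction order
print((s_flat - s_chunked).item())              # typically non-zero
print((s_flat.double() - x.double().sum()).item())  # error vs. a float64 reference
```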

But other network structures/optimizers can actually be very sensitive to such noise and will thus have trouble converging in general (you can check the early neural computer papers, for example, which converged only for a small number of random seeds).


I am experiencing exactly the same issue as you, but when training a segmentation network (a U-Net).

I have a system with CUDA 10.1, Nvidia Driver 455, GTX 1080 Ti with PyTorch 1.6, where the training runs successfully.

On the other hand, I have another system with CUDA 10.1, Nvidia Driver 418.35 (the first driver shipped with CUDA 10.1) and an RTX 2080 Ti with PyTorch 1.6. The training does not converge, let alone produce good results on the validation set.

I copied the exact same code from my local machine (with good results) to the other machines, with no good results.

It is very strange behaviour and I do not understand why it happens. I understand that different convergence paths can occur and that perfect reproducibility can only be obtained by providing the same seeds, but I still do not understand how such a phenomenon is possible.

@Art, did you manage to solve this problem? If so, would you be so kind as to explain how you solved it?


Hi Calin, unfortunately I was not able to solve the problem. In my case the issue wasn’t non-convergence on one of the machines but rather just different numbers. Though, as albanD mentions in his post, in some settings models won’t be sensitive to such noise and will still converge, while in other settings this noise can make or break a model and could mean it won’t converge.
I’m not sure whether in your case the convergence problem is due to such noise or to some other difference caused by different versions.
I would definitely try installing the same driver versions and the same versions of all other Python packages and see if the problem persists, then perhaps change some training hyperparameters that could help “smooth” out this noise and help the model converge in both cases (probably at the cost of training speed/efficiency).


In my case the dataset is extremely imbalanced (99.82% background and 3 other classes which make up the remaining 0.18%), and I think this phenomenon takes place due to the inherent dataset imbalance.

Very curious, though, that the U-Net + a specific encoder works on Windows with the updated driver but fails on Linux.


That’s very interesting/frustrating/strange. The only explanation that comes to mind is this “noise” difference between machines, but I wouldn’t be surprised if something else system/kernel-related is going on. There are also some works popping up about the “lottery ticket” hypothesis, lucky convergence, and things of that nature, but I still haven’t seen any peer-reviewed explanation of inherent differences between machines/systems.


The thing is that on Windows 10 with CUDA 10.1 and PyTorch 1.6, the U-Net with different backbones always seems to start converging after a point, while on Linux it always fails miserably.

I was starting to think that something is wrong with my code.

Nice + sad :frowning: to see a similar phenomenon happening.

Thank you for the discussion and good luck in your endeavours.


Dear @albanD, @CalinTimbus and @Art, did you solve the problem?
I am having exactly the same problem.

Hi,

There isn’t any problem to solve, as the behavior here is expected, as mentioned in the quote you included.
You can check the note on reproducibility for more details here: Reproducibility — PyTorch 1.7.0 documentation

There is a problem that is not solved, and I think there are a lot of cases of “working in Keras but not in PyTorch”!
My code works in Keras on both PCs, but in PyTorch it works on one of them and not the other!
I have lost hope here!
Thank you for the useful comments.

You would need to give more details about what you mean by “not working”, as well as what your code is doing and what is or isn’t expected, for us to be able to help you.
I would recommend opening a brand-new topic, since it is most likely not related to the discussion in this one.

I created this:

I am struggling with the same issue. My non-local net converges on my local machine (Windows 10, CUDA 11.2, RTX 3060, PyTorch 1.9), but it does not converge on a Linux machine (CUDA 11.2, RTX 3090, PyTorch 1.9). What is happening here? Does anybody know how to fix it?

I also have the same problem. I have three identical machines with a shared home directory, where the source code lives. On two of them I get good training results, at least low training and validation loss. But on the third machine I get low training loss and very high validation loss. The code and data are the same.

That’s fascinating, thanks everyone for commenting.
Hope some official/experienced user (@ptrblck @tom @vdw) can chime in and shed some light on this issue.

@Nurmukhamed_Ubaidull Imagine if you had only one machine available and it were the machine with the high loss; you would’ve thought your training/model/data were bad, when in fact it’s just a bug/error/quantum magic…

If you have the luck of a working reference, I would try to find out when and where things diverge.
To this end, save quantities on the working reference and load them on the non-working instance and compare.

  • Start with the weights after initialization. Are they the same?
  • Grab a batch, save it on the reference, and run it through both the reference and the broken setup. Is the output of the forward pass the same? If not, output/save intermediates until you find the first intermediate result that differs.
  • Are the gradients the same? Again, save intermediates and call t.retain_grad() on them before backward to get intermediate gradients. (Personally, I like to collect intermediates in a global dict: DEBUG = {} at the top and then DEBUG['some-id-that-is-unique'] = t.) See the sketch right after this list.
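A minimal, self-contained sketch of that workflow; the toy model, the random batch, and the file name reference.pt are placeholders for your own setup:

```python
# Sketch: save weights + one batch on the working machine, reload them on the
# suspect machine, and compare outputs, loss, and an intermediate gradient.
import torch
import torch.nn as nn

DEBUG = {}  # global dict to collect intermediate tensors

def build_model():
    torch.manual_seed(0)
    return nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# --- on the working reference machine ---
model = build_model()
batch = torch.randn(8, 16)
target = torch.randint(0, 4, (8,))
torch.save({"state_dict": model.state_dict(), "batch": batch, "target": target},
           "reference.pt")

# --- on the suspect machine ---
ref = torch.load("reference.pt", map_location="cpu")
model2 = build_model()
model2.load_state_dict(ref["state_dict"])   # identical initial weights

h = model2[0](ref["batch"])                 # first intermediate result
h.retain_grad()                             # keep its gradient after backward
DEBUG["linear1_out"] = h
out = model2[2](model2[1](h))

loss = nn.functional.cross_entropy(out, ref["target"])
loss.backward()

# compare these numbers (and the saved tensors) against the reference machine
print("forward checksum:", out.double().sum().item())
print("loss:", loss.item())
print("intermediate grad checksum:", DEBUG["linear1_out"].grad.double().abs().sum().item())
```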

Most likely, you’d find a discrepancy there. If not, find out after how many batches the losses diverge and save that many batches to run them identically on both machines.

Note that dropout uses randomness. Unless you get the same random numbers from the same seed (not guaranteed across PyTorch versions, maybe not even guaranteed between machines, I don’t know), you have a bit of a headache there.
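One pragmatic workaround for the comparison itself (my suggestion, not from the post above): take dropout’s randomness out of the equation while you compare, either by switching to eval mode or by reseeding right before each compared forward pass. A toy sketch:

```python
# Toy sketch; the tiny model and batch are placeholders for your own.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 16), nn.Dropout(p=0.5), nn.Linear(16, 4))
batch = torch.randn(8, 16)

# Option 1: compare in eval mode, where dropout is disabled entirely
# (note this also switches batchnorm layers to their running statistics).
model.eval()
out_eval = model(batch)

# Option 2: keep dropout active but reseed immediately before each compared pass.
model.train()
torch.manual_seed(1234)
out_train = model(batch)
```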

Also, find out which software versions differ etc.

Best regards & good luck

Thomas


A very common reason why models train well on Windows and then train badly on Linux is the different directory-listing order that the two operating systems’ filesystems return.

Assuming that your data files are named on disk with ids like {‘001.ext’, ‘002.ext’, ‘003.ext’, …}, any use of os.listdir or glob.glob on Windows will typically produce a sorted list, because that is the order the filesystem returns. On Linux these calls return the entries in arbitrary, filesystem-dependent order. So when loading the data on the Linux servers, the training/evaluation code can match inputs and labels in a nonsensical way.
Fix: to make your Datasets/DataLoaders robust across platforms, wrap your listdir/glob calls in sorted(...).
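A minimal sketch of that fix; the directory arguments, the pairing-by-index scheme, and the class name are placeholders for your own Dataset:

```python
# Sort file listings so input/label pairing is identical on every platform.
import os
from torch.utils.data import Dataset

class SegmentationFolder(Dataset):
    def __init__(self, image_dir, mask_dir):
        # sorted() gives the same ordering on Windows and Linux, so images and
        # masks stay matched by index regardless of the filesystem.
        self.image_files = sorted(os.listdir(image_dir))
        self.mask_files = sorted(os.listdir(mask_dir))
        self.image_dir, self.mask_dir = image_dir, mask_dir

    def __len__(self):
        return len(self.image_files)

    def __getitem__(self, idx):
        image_path = os.path.join(self.image_dir, self.image_files[idx])
        mask_path = os.path.join(self.mask_dir, self.mask_files[idx])
        # load and transform the image/mask here in real code
        return image_path, mask_path
```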