Different training results on different machines | With simplified test code

That’s fascinating, thanks everyone for commenting.
Hope some official/experienced user (@ptrblck @tom @vdw ) can chime in and shed some light on this issue.

@Nurmukhamed_Ubaidull Imagine if you had only one machine available and it happened to be the one with the high loss: you would have thought your training/model/data were bad, when in fact it’s just a bug/error/quantum magic…

If you have the luck of a working reference, I would try to find out when and where things diverge.
To this end, save quantities on the working reference and load them on the non-working instance and compare.

  • Start with weights after initialization. Are they the same?
  • Grab a batch, save it on the reference, and run it through both the reference and the broken setup. Is the output of the forward pass the same? If not, output/save intermediates until you find the first intermediate result that differs.
  • Are the gradients the same? Again, save intermediates and call t.retain_grad() on them before backward to get intermediate gradients. (Personally, I like to collect intermediates in some global dict: DEBUG = {} at the top and then DEBUG['some-id-that-is-unique'] = t.) See the sketch right after this list.
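A minimal sketch of that workflow, using a hypothetical ToyModel and placeholder file names (on a real setup you would run the save half on the reference machine and the compare half on the broken one):

```python
import torch
import torch.nn as nn

DEBUG = {}  # global dict to collect intermediates, as described above

class ToyModel(nn.Module):
    # stand-in for your real model (hypothetical example)
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(16, 32)
        self.fc2 = nn.Linear(32, 4)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        h.retain_grad()        # keep the gradient of this intermediate
        DEBUG["fc1_out"] = h   # stash it for later comparison
        return self.fc2(h)

# --- on the working reference machine: save everything you want to compare ---
torch.manual_seed(0)
model = ToyModel()
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
torch.save(model.state_dict(), "ref_weights.pt")
torch.save((x, y), "ref_batch.pt")
torch.save(model(x).detach(), "ref_output.pt")

# --- on the broken machine: load the same weights and batch, then compare ---
model2 = ToyModel()
model2.load_state_dict(torch.load("ref_weights.pt", map_location="cpu"))
x, y = torch.load("ref_batch.pt", map_location="cpu")
out = model2(x)
print("forward matches:", torch.allclose(out, torch.load("ref_output.pt")))

loss = nn.functional.cross_entropy(out, y)
loss.backward()
print("fc1_out grad max:", DEBUG["fc1_out"].grad.abs().max())
```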

Most likely, you’d find a discrepancy there. If not, find out after how many batches the losses diverge and save that many batches to run them identically on both machines.

Note that dropout uses randomness. Unless you get the same random numbers from the same seed (not guaranteed across PyTorch versions, maybe not even guaranteed between machines, I don’t know), you have a bit of a headache there.
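If you want to rule the RNG out as far as possible, one common (but not bulletproof) approach is to seed everything and force the deterministic cuDNN kernels; the helper name and default seed below are just an illustration:

```python
import random
import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)              # seeds the CPU (and CUDA) RNGs
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(0)
# While comparing forward passes you can also call model.eval()
# to disable dropout entirely instead of relying on matching RNG streams.
```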

Also, find out which software versions differ etc.
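Something as simple as the following on both machines (or the built-in `python -m torch.utils.collect_env`) already tells you a lot:

```python
import torch
print(torch.__version__)               # PyTorch version
print(torch.version.cuda)              # CUDA version PyTorch was built with
print(torch.backends.cudnn.version())  # cuDNN version
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "no GPU")
```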

Best regards & good luck

Thomas


A very common reason why models train well on Windows and then train badly on Linux is the different directory-listing behaviour of the two operating systems’ filesystems.

Assuming that your data files are named on disk with ids like '001.ext', '002.ext', '003.ext', …, any use of os.listdir or glob.glob on Windows will typically return a sorted list, simply because that is the order the filesystem happens to hand back. On Linux these calls return the entries in arbitrary order, since the order is determined by the filesystem’s internal directory layout and is not guaranteed. So, when the data is loaded on the Linux servers, the training/evaluation code can end up matching inputs and labels in a nonsensical way (for example when inputs and labels are listed from separate directories).
Fix: to make your Dataset/DataLoader code robust across platforms, wrap your listdir/glob calls in Python’s sorted(…).
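For example, in a Dataset along these lines (directory name and the ".ext" suffix are placeholders for your own layout):

```python
import os
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, data_dir):
        # sorted() makes the listing order deterministic on every OS/filesystem
        self.files = sorted(
            os.path.join(data_dir, f)
            for f in os.listdir(data_dir)
            if f.endswith(".ext")
        )

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        return self.files[idx]  # load and return your actual sample here
```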