Different training results on different machines | With simplified test code

If you are lucky enough to have a working reference, I would try to find out when and where things diverge.
To this end, save quantities on the working reference, load them on the non-working instance, and compare.

  • Start with weights after initialization. Are they the same?
  • Grab a batch on the reference, save it, and run it through both the reference and the broken setup. Is the output of the forward pass the same? If not, output/save intermediates until you find the first intermediate result that differs.
  • Are the gradients the same? Again, save intermediates and call t.retain_grad() on them before backward to get intermediate gradients. (Personally, I like to collect intermediates in a global dict: DEBUG = {} at the top and then DEBUG['some-id-that-is-unique'] = t.) There is a small sketch of this workflow right after the list.
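A minimal sketch of what that comparison could look like. The tiny model, file names, and dict keys are placeholders for your own setup; the save calls would run on the reference machine and the load/compare part on the broken one:

```python
import torch
import torch.nn as nn

DEBUG = {}  # global dict to stash intermediates, keyed by a unique id


class TinyNet(nn.Module):          # stand-in for your actual model
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 16)
        self.fc2 = nn.Linear(16, 2)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        h.retain_grad()            # keep this intermediate's gradient after backward
        DEBUG['fc1-act'] = h       # stash the intermediate for comparison
        return self.fc2(h)


model = TinyNet()
batch = torch.randn(4, 8)

# on the working reference: save the initial weights and one batch
torch.save(model.state_dict(), 'ref_weights.pt')
torch.save(batch, 'ref_batch.pt')

# on the broken machine: load them and compare step by step
ref_state = torch.load('ref_weights.pt')
for name, param in model.state_dict().items():
    if not torch.allclose(param, ref_state[name]):
        print('weights differ:', name)

batch = torch.load('ref_batch.pt')
out = model(batch)
out.sum().backward()               # substitute your real loss here
# compare forward intermediates and their gradients to the reference's saved values
print(DEBUG['fc1-act'].abs().mean(), DEBUG['fc1-act'].grad.abs().mean())
```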

Most likely, you’d find a discrepancy there. If not, find out after how many batches the losses diverge and save that many batches to run them identically on both machines.
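For example, you could log the per-batch losses on both machines and look for the first index where they stop matching. The two dummy lists below stand in for logs you would save with torch.save during training and load afterwards:

```python
import torch

# stand-ins for torch.load('losses_reference.pt') / torch.load('losses_broken.pt')
ref_losses = [0.90, 0.70, 0.55, 0.40, 0.31]
bad_losses = [0.90, 0.70, 0.55, 0.47, 0.52]

ref = torch.tensor(ref_losses)
bad = torch.tensor(bad_losses)
n = min(len(ref), len(bad))
mismatch = (~torch.isclose(ref[:n], bad[:n])).nonzero()
if len(mismatch):
    print('losses first diverge at batch', mismatch[0].item())
else:
    print('losses match for the first', n, 'batches')
```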

Note that dropout uses randomness. Unless you get the same random numbers from the same seed (not guaranteed across PyTorch versions, maybe not even guaranteed between machines, I don’t know), you have a bit of a headache there.
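One way to probe this is to fix the seed on both machines, draw a few random numbers, and compare them. The file name is a placeholder; whether the two streams actually match is exactly what you would be testing:

```python
import torch

torch.manual_seed(0)
probe = torch.rand(10)   # same seed, same draw order on both machines
print(probe)
# save this on the reference, e.g. torch.save(probe, 'rng_probe.pt'),
# load it on the other machine and compare with torch.equal; if they differ,
# dropout will not be reproducible across the two setups even with identical seeds
```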

Also, find out which software versions (PyTorch, CUDA, cuDNN, Python, etc.) differ between the machines.
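A quick way to dump the relevant versions on both machines for a side-by-side comparison:

```python
import platform
import torch

print('python:', platform.python_version())
print('torch :', torch.__version__)
print('cuda  :', torch.version.cuda)
print('cudnn :', torch.backends.cudnn.version())
print('gpu   :', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'none')
```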

Best regards & good luck

Thomas
