I am using a simple LSTM network that feeds its final output into a single linear layer performing binary classification. I get stable results on any given machine. When training locally on my MacBook versus on a GCE instance, the results appear identical for the first few hundred training samples. However, after I have trained on about 800 samples, the outputs are noticeably different.
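Roughly, the model looks like this (a simplified sketch; the sizes are placeholders, not my actual hyperparameters):

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    # Sketch of the setup: an LSTM whose final hidden state feeds a
    # single linear layer producing one logit for binary classification.
    def __init__(self, input_size=16, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        _, (h_n, _) = self.lstm(x)  # h_n: (num_layers, batch, hidden)
        return self.fc(h_n[-1])     # one logit per sample
```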
I realize that a number of factors affect reproducibility, including the platform. However, from what I have seen, different results seem to be expected only between CPU and GPU. I am seeding the torch RNG, and it appears to generate the same numbers on both machines.
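For reference, this is approximately how I am seeding (a minimal sketch; the seed value is arbitrary):

```python
import random

import numpy as np
import torch

seed = 42  # arbitrary
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
# If a GPU were involved, CUDA would need seeding and deterministic
# algorithms as well:
# torch.cuda.manual_seed_all(seed)
# torch.use_deterministic_algorithms(True)
```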
Is the divergence to be attributed to an accumulation of small floating-point inaccuracies that are slightly different on each architecture?
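For example, this is the kind of order-dependent rounding I have in mind (a toy illustration):

```python
import torch

# Floating-point addition is not associative: at 1e8 the float32 spacing
# is 8, so adding 1.0 first is lost to rounding.
a = torch.tensor(1e8, dtype=torch.float32)
b = torch.tensor(1.0, dtype=torch.float32)
print((a + b) - a)  # tensor(0.)
print((a - a) + b)  # tensor(1.) -- same math, different order
```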
Thanks for the information!
The error after 200 samples is approximately 1e-7, which is right at the limit of float32 precision (machine epsilon is ~1.19e-7), so accumulated rounding differences could be the reason.
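You can check that epsilon and use it as a comparison tolerance. A sketch, where `outputs_macbook.pt` and `outputs_gce.pt` are hypothetical files holding outputs saved on the two machines:

```python
import torch

print(torch.finfo(torch.float32).eps)  # ~1.1920929e-07

# Hypothetical: outputs saved from the two machines.
out_mac = torch.load("outputs_macbook.pt")
out_gce = torch.load("outputs_gce.pt")

# Compare at a tolerance just above float32 eps.
print((out_mac - out_gce).abs().max())
print(torch.allclose(out_mac, out_gce, atol=1e-6))
```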
Just for the sake of debugging, you could use float64 and see if the differences occur a bit later in the training.
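Something along these lines (a sketch; the model and input here are placeholders for your own):

```python
import torch

# Placeholder model and input; substitute your own.
model = torch.nn.LSTM(16, 32, batch_first=True)
x = torch.randn(4, 10, 16)

# Cast all parameters/buffers and the inputs to float64 for the debug run.
model = model.double()
x = x.double()
out, _ = model(x)
print(out.dtype)  # torch.float64
```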