TensorFlow SparseCategoricalCrossentropy loss vs. PyTorch CrossEntropyLoss, and the Adam optimizer

Hi everyone,
I’m trying to reproduce the same training run in both tensorflow and pytorch. I came up with a simple model using only one linear layer, and the dataset I’m using is MNIST handwritten digits. Before testing, I assign the same weights to both models and then calculate the loss for every single input. I noticed that some of the results are really close, but not exactly the same. I think the cause of these small differences could be the frameworks’ own loss implementations and floating-point rounding error; could this be correct? I also tested a small training example using the SGD optimizer and noticed that the error becomes bigger along the iterations. So, could anyone show how the losses are implemented in both frameworks, so I can test whether the results then become the same?

The second part is about the Adam optimizer: I tried to implement it from the pseudocode I found, but the differences appear earlier and are more significant than with SGD or SGD with momentum. So, why does this happen?

Hi Peter!

Such a discrepancy could well be caused by floating-point round-off error.
Let’s assume that the tensorflow and pytorch implementations perform
calculations that are mathematically identical. It is, however, quite likely
that some of the floating-point operations are performed in different, but
mathematically equivalent orders, leading to differing round-off error.
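As a toy illustration of that point (nothing framework-specific), summing the same float32 values in two mathematically equivalent orders typically yields slightly different results:

```python
import numpy as np

# Illustration only: the same 10,000 float32 numbers summed in two
# mathematically equivalent orders give (slightly) different answers,
# because each order accumulates round-off differently.
rng = np.random.default_rng(0)
x = rng.standard_normal(10_000).astype(np.float32)

forward = np.float32(0.0)
for v in x:            # left-to-right accumulation
    forward += v

backward = np.float32(0.0)
for v in x[::-1]:      # right-to-left accumulation
    backward += v

print(forward, backward, abs(float(forward) - float(backward)))
```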

I don’t know what tensorflow supports, but in pytorch it is easy to switch
your calculations from single-precision (float) to double-precision (double)
arithmetic. If you can do this in tensorflow, you could rerun your loss
computations in double precision. You would still expect tensorflow and
pytorch to produce results that differ by round-off error, but that difference
should be reduced by several orders of magnitude.
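A minimal pytorch sketch of that check (random logits and targets stand in for your model’s outputs):

```python
import torch

# Sketch: compare the same cross-entropy loss in single and double precision.
torch.manual_seed(0)
logits = torch.randn(8, 10)             # float32 by default
targets = torch.randint(0, 10, (8,))

loss32 = torch.nn.functional.cross_entropy(logits, targets)
loss64 = torch.nn.functional.cross_entropy(logits.double(), targets)

# The float32 result should agree with the float64 one up to round-off.
print(loss32.item(), loss64.item())
```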

This makes sense. Two things contribute to this: As you run more training
iterations, your round-off errors will accumulate. But, furthermore, once
your tensorflow and pytorch training runs become (a little bit) different
because of one instance of (essentially trivial) round-off error, the paths
in parameter space traced out by the training runs will continue to wander
away from one another, even if you magically performed all subsequent
computations with no round-off error.

This also does not surprise me. I can’t give a detailed explanation, but the
Adam optimizer has the reputation of “jumping around more” and “being
less stable.” So once the round-off error causes the tensorflow and pytorch
training runs to differ a little, the Adam optimizer might well cause the training
paths to “wander away” from one another more rapidly.
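For what it’s worth, here is the textbook Adam update from the Kingma & Ba pseudocode (a sketch, not either framework’s actual implementation). One concrete difference worth knowing: the two frameworks use different default values for the stabilizing epsilon.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update, following the Kingma & Ba pseudocode (a sketch)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Note: torch.optim.Adam defaults to eps=1e-8, while
    # tf.keras.optimizers.Adam defaults to epsilon=1e-7, so even "the same"
    # optimizer diverges if you keep each framework's defaults.
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

param, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(param)  # ≈ 0.999 after one step
```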

Best.

K. Frank

Hi @KFrank, first of all, I hope everything is ok, and thanks for your reply :slight_smile:

> I don’t know what tensorflow supports, but in pytorch it is easy to switch
> your calculations from single-precision (float) to double-precision (double)
> arithmetic. If you can do this in tensorflow, you could rerun your loss
> computations in double precision. You would still expect tensorflow and
> pytorch to produce results that differ by round-off error, but that difference
> should be reduced by several orders of magnitude.

Well, my goal is to find and prove the causes of these differences, so I’m not actually looking for closer results, but for an explanation of those oscillations. Could you give me a code example of the loss implementation? I’ve tried to read the source code, but I can’t understand some parts.
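Not the frameworks’ actual source, but both `tf.keras.losses.SparseCategoricalCrossentropy` (with `from_logits=True`) and `torch.nn.CrossEntropyLoss` compute the same mathematical quantity: a numerically stabilized log-softmax followed by the negative log-likelihood of the target class. A minimal numpy reference sketch:

```python
import numpy as np

def sparse_categorical_cross_entropy(logits, targets):
    """Reference cross-entropy: stabilized log-softmax + negative log-likelihood.

    This is the quantity both frameworks compute; their internal order of
    operations (and hence their round-off) may differ from this sketch.
    """
    # subtract the per-row max so exp() cannot overflow
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # pick the log-probability of each correct class, average over the batch
    return -log_probs[np.arange(len(targets)), targets].mean()

print(sparse_categorical_cross_entropy(np.array([[1.0, 2.0, 3.0]]),
                                       np.array([2])))  # ≈ 0.4076
```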

> This also does not surprise me. I can’t give a detailed explanation, but the
> Adam optimizer has the reputation of “jumping around more” and “being
> less stable.” So once the round-off error causes the tensorflow and pytorch
> training runs to differ a little, the Adam optimizer might well cause the training
> paths to “wander away” from one another more rapidly.

If I define the optimizer exactly the same way, wouldn’t the results be the same, or would it be unstable as well?

Best regards,
Peter Vala

Hi Peter!

The cause of your differences is that even if tensorflow and pytorch are
performing the same mathematical computations – and they may well
be doing so – they are using mathematically equivalent, but numerically
different orders of operations to do so, thereby incurring differing round-off
errors.

If I were to give you a code example, you would just have a third version
performing the computation with mathematically equivalent, but numerically
different orders of operations.

There is (almost) no point in trying to chase down exactly what order of
operations is being used or reproduce some specific round-off error.
That’s just how floating-point arithmetic works (and you should rely on
pytorch to be doing something fully reasonable).

If you were to define the optimizer exactly the same way – including
ensuring that any sequences of mathematically equivalent operations
were performed in the same order so that they would be numerically
equivalent as well – you would get the same result.

(But doing so would be a huge amount of work and not really of any value.)
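As a sanity check of this last point: floating-point arithmetic is deterministic, so repeating exactly the same operations in exactly the same order reproduces the result bitwise; only *reordering* the operations changes the round-off.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000).astype(np.float32)

def accumulate(values):
    # sequential left-to-right float32 accumulation
    total = np.float32(0.0)
    for v in values:
        total += v
    return total

# Same operations in the same order: bitwise identical every time.
print(accumulate(x) == accumulate(x))  # True
```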

Best.

K. Frank