Suboptimal convergence when compared with TensorFlow model

I ported a simple model (using dilated convolutions) from TensorFlow (written in Keras) to PyTorch (latest stable version), and the convergence is very different in PyTorch: the results are good, but not even close to the results I got with TensorFlow. So I wonder if there are differences in the optimizers in PyTorch. What I have already checked:

  • Same parameters for optimizer (Adam)
  • Same loss function
  • Same initialization
  • Same learning rate
  • Same architecture
  • Same number of parameters
  • Same data augmentation / batch size

So I wonder, what else should I check? It seems that everything was covered already. Any ideas?

Example of the convergence of the loss (unfortunately the two series are the same color, but PyTorch is the lower one). For this loss, higher is better:

Thank you!

I have had similar issues with PyTorch vs Keras, but while I haven't found a simple answer, these are other things I would check:

  • Is Keras using any regularizers or constraints?
  • Is Keras using biases whilst PyTorch is not?
  • Are you computing the loss the exact same way?

Thanks for the answer Rodrigo! I'm not using any constraint or regularizer, and the biases are also the same. I'll try to double check everything again, but this is really weird.

Sometimes subtle differences in the definitions can make comparisons difficult, e.g. if one loss function takes a mean where the other takes a sum, you would need to compensate for that.
Also, quite a few optimizer parameters have different defaults between implementations.
I'd probably start both with the same weights and see where they begin to differ.
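To make the mean-vs-sum point concrete, here is a minimal pure-Python sketch (no framework needed). The toy model `w * x`, the data, and the function name `grad_squared_error` are made up for illustration and do not come from either library. It only shows that the two reductions give gradients that differ by exactly the batch size:

```python
def grad_squared_error(w, xs, ys, reduction="mean"):
    """Gradient w.r.t. w of the squared error (w * x - y)^2 over a batch."""
    total = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))
    if reduction == "mean":
        return total / len(xs)  # averaged over the batch
    if reduction == "sum":
        return total            # summed over the batch
    raise ValueError(f"unknown reduction: {reduction}")


# Toy batch (illustrative values only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

g_mean = grad_squared_error(w, xs, ys, "mean")
g_sum = grad_squared_error(w, xs, ys, "sum")

# The ratio is the batch size.
print(g_mean, g_sum, g_sum / g_mean)
```

With plain SGD this acts like a batch-size-fold change in the learning rate. Adam divides by a running estimate of the gradient magnitude, so a constant scale factor mostly cancels there and shows up mainly through epsilon, which is one more reason to compare the optimizer settings explicitly.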

Best regards

Thomas

Thanks Thomas! I'll certainly check that!

I have some news regarding this issue:

  • I initialized the model in PyTorch with the weights of a model trained in Keras (using the TensorFlow backend) and, surprisingly, this model with the same weights yields the SAME results as the Keras model;

  • However, if I train this same model in PyTorch (even using the same initialization weights), PyTorch always yields suboptimal results, like the ones seen in the image above;

My opinion is that something is really weird with the Adam optimizer in PyTorch, yielding poor results when compared to Keras/TensorFlow. So my question is: how were these optimizers tested? It is consistently yielding poor results for me. Has anyone seen this before? Is there any workaround?

Thanks!

Did you check the initialization and the loss, too?
I think Keras uses Xavier Glorot's method by default for some layers.
You could probably extend your comparison to the gradients and any regularizer, and then look at the optimizer step itself.
The good news is that you can probably just "bisect" the backward pass if you find that the gradients differ.
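A rough sketch of what that comparison could look like, assuming the gradients from both frameworks have already been exported as flat lists of floats keyed by layer name. The function name, the layer names, and the export step are all hypothetical. It is a plain linear scan in backward-pass order rather than a true bisection, but it finds the same thing: the first layer at which the two backward passes disagree.

```python
def first_divergent_layer(grads_a, grads_b, layer_order, tol=1e-5):
    """Return (layer_name, max_abs_diff) for the first layer in layer_order
    whose gradients differ by more than tol, or None if all of them agree.

    layer_order should list layers from the output back toward the input,
    i.e. in the order the backward pass visits them.
    """
    for name in layer_order:
        a, b = grads_a[name], grads_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch in layer {name!r}")
        diff = max(abs(x - y) for x, y in zip(a, b))
        if diff > tol:
            return name, diff
    return None


# Hypothetical exported gradients (illustrative values only).
keras_grads = {"out": [0.1, 0.2], "conv2": [0.3], "conv1": [0.5]}
torch_grads = {"out": [0.1, 0.2], "conv2": [0.3001], "conv1": [0.7]}

print(first_divergent_layer(keras_grads, torch_grads, ["out", "conv2", "conv1"]))
```

If every layer agrees to within the tolerance, the gradients are probably not the culprit and the optimizer step is the next thing to compare.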

Best regards

Thomas

(who is also trying to stare down a model where he cannot make sense of the apparent training deficiency)

Thanks @tom, actually I used the same initial weights (not only the same initialization method) to train it, so it really seems that something is fundamentally wrong with the Adam optimizer itself (I also checked the loss multiple times). I think I'll wait for PyTorch to stabilize, because unfortunately I don't have much time to invest in debugging it. Dissecting every aspect of the model takes a lot of time. Good luck with your model, by the way!

This is peculiar, but thanks for testing this thoroughly. I've had a quick look at PyTorch Adam (seems fine) vs. TensorFlow Adam (complex, but also seems fine) vs. Keras Adam (also seems fine), and can't spot any issues, but perhaps someone more observant will.

Just to complement, I also tried this change: https://github.com/pytorch/pytorch/issues/2060 but it didn't change anything in the convergence.

At first glance, it looks like TensorFlow is using a slightly different definition of epsilon in their Adam. They are using the "epsilon hat" version.

They replace these three lines of the algorithm:

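$\hat{m}_t \leftarrow m_t / (1 - \beta_1^t)$ (Compute bias-corrected first moment estimate)
$\hat{v}_t \leftarrow v_t / (1 - \beta_2^t)$ (Compute bias-corrected second raw moment estimate)
$\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ (Update parameters)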

with these two lines:

$\alpha_t = \alpha \cdot \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$
$\theta_t \leftarrow \theta_{t-1} - \alpha_t \cdot m_t / (\sqrt{v_t} + \hat{\epsilon})$

I also just took a glance at Keras, and it seems they do the same.

EDIT: Scratch that. We are using the same thing here: we use the bottom two lines as well.
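For anyone who wants to see how much the choice between the two formulations can matter, here is a minimal pure-Python sketch on a single scalar parameter. The function names and the `rescale_eps` flag are made up for illustration; this is not the code of any of the three libraries. Algebraically the two forms are identical when epsilon hat equals epsilon times sqrt(1 - beta2^t); they differ only when the same numeric value is used for both.

```python
import math


def adam_paper(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, theta=0.0):
    """Three-line form: bias-correct m and v, then divide by sqrt(v_hat) + eps."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


def adam_eps_hat(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps_hat=1e-8,
                 theta=0.0, rescale_eps=False):
    """Two-line form: fold the bias corrections into the step size."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        correction2 = math.sqrt(1 - beta2 ** t)
        lr_t = lr * correction2 / (1 - beta1 ** t)
        # With rescale_eps=True the update matches adam_paper exactly.
        e = eps_hat * correction2 if rescale_eps else eps_hat
        theta -= lr_t * m / (math.sqrt(v) + e)
    return theta


# Illustrative gradients: tiny ones make epsilon matter, large ones do not.
tiny = [1e-6] * 5
large = [1.0] * 5

for name, grads in (("tiny", tiny), ("large", large)):
    a = adam_paper(grads)
    b = adam_eps_hat(grads, rescale_eps=True)
    c = adam_eps_hat(grads)
    print(name, abs(a - b), abs(a - c))
```

In this toy setting the two forms agree closely for gradients of order one and drift apart when the gradients are close to epsilon in magnitude, because early in training the unscaled epsilon hat is effectively larger. Whether that is what happens in the model from this thread is untested. Since epsilon defaults are not guaranteed to match across frameworks either, it is worth passing epsilon explicitly on both sides.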

TensorFlow Adam: "Momentum decay (beta1) is also applied to the entire momentum
accumulator. This means that the sparse behavior is equivalent to the dense
behavior (in contrast to some momentum implementations which ignore momentum
unless a variable slice was actually used)."

Yeah this will cause performance differences.

But it's not that PyTorch is using a weird Adam optimizer. On the contrary, it looks like PyTorch just has the plain vanilla version. Do you guys see the same thing?

The paper says:

Note that the efficiency of algorithm 1 can, at the expense of clarity, be improved upon by changing the order of computation, e.g. by replacing the last three lines in the loop with the following lines...

So I'm surprised that it should make a noticeable difference, but maybe that is the case. @christianperone would you mind trying the altered version of Adam on your problem? Fingers crossed this might be the solution.

I would assume the bit about sparse updates with TensorFlow Adam doesn't matter in this case. I don't know how PyTorch deals with sparse modules with respect to gradient updates, but what TF claims to do sounds like the correct approach.

It was this part that made me think it could lead to a noticeable difference:

TensorFlow Adam: "The sparse implementation of this algorithm (used when the gradient is an
IndexedSlices object, typically because of `tf.gather` or an embedding
lookup in the forward pass) does apply momentum to variable slices even if
they were not used in the forward pass (meaning they have a gradient equal
to zero). Momentum decay (beta1) is also applied to the entire momentum
accumulator. This means that the sparse behavior is equivalent to the dense
behavior (in contrast to some momentum implementations which ignore momentum
unless a variable slice was actually used)."

Since I see a lot of embedding models being trained in PyTorch and then compared to TensorFlow, I bet a lot of these performance differences stem from that: TensorFlow automatically applies the momentum decay, and by default we would not.
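To illustrate the difference the quoted docstring describes, here is a small pure-Python sketch of the first-moment accumulator for a two-row embedding table. The function name and the `lazy` flag are hypothetical labels for this example. Which of the two behaviors a given PyTorch version follows for sparse gradients is not established in this thread, so treat it as an assumption to verify.

```python
def momentum_step(m, grads, beta1=0.9, lazy=False):
    """Update a per-row first-moment accumulator for an embedding table.

    grads maps row index -> gradient; rows missing from it had zero gradient.
    lazy=False: decay every row (the dense-equivalent behavior TF describes).
    lazy=True:  only touch rows that actually received a gradient.
    """
    new_m = list(m)
    for row in range(len(m)):
        if row in grads:
            new_m[row] = beta1 * m[row] + (1 - beta1) * grads[row]
        elif not lazy:
            new_m[row] = beta1 * m[row]  # zero gradient, momentum still decays
    return new_m


# Row 0 is used once, then only row 1 is used (illustrative schedule).
steps = [{0: 1.0}, {1: 1.0}, {1: 1.0}, {1: 1.0}]

dense = [0.0, 0.0]
lazy = [0.0, 0.0]
for g in steps:
    dense = momentum_step(dense, g, lazy=False)
    lazy = momentum_step(lazy, g, lazy=True)

# Row 0 has decayed in the dense variant but is frozen in the lazy one.
print(dense, lazy)
```

In the dense-equivalent variant, a row that is not looked up still has its momentum decayed (and, in a full optimizer, would still be moved by that momentum). In the lazy variant, its state stays frozen until the row is used again, so rarely seen embeddings can behave quite differently between the two.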

Anyway, I have always been able to get performance as good as or better than TensorFlow's. I mostly use custom code, but the underlying framework has shown no insufficiency in performance for me; I usually find quite the opposite.

In my experience, small differences in hyperparameters often cause disproportionate performance differences.

I'll test it. If someone has the code change at hand, that would help a lot; otherwise I'll have to come back to this in the near future due to my time constraints. Thanks for the help!

I also experienced suboptimal behaviour with Adam compared to SGD in PyTorch. Similar code in Tensorflow performed the other way around, i.e. optimizing with Adam was much easier. I have also used an Embedding layer.

I thought I was the only one! Same problem here: RNN and Adam: slower convergence than Keras

When I have time, I'll try other optimizers.

EDIT: same situation with RMSProp.

Up... Shouldn't this problem be investigated?

Same issue here: the same model architecture trained with Adam gives better results in Keras than in PyTorch.
