Suboptimal convergence when compared with TensorFlow model

Kaixhin · February 7, 2018, 4:34pm

There are many factors that can cause differences. Some people have reported things to try here.

GL_Jeff · June 13, 2018, 12:00pm

Same problem here. Cannot replicate TF Adam optimizer success in Pytorch.

Edit: Disregard. I’m getting a better loss in PyTorch than in TF with Adam now that I’m actually taking the mean of my losses.
The size_average=False found in jcjohnson’s GitHub examples can make for a long night for a newbie.
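
For readers hitting the same thing: in current PyTorch releases the `size_average` argument is deprecated in favor of `reduction`, and `size_average=False` corresponds to `reduction='sum'`. The sketch below (the tensor shapes are made up for illustration) shows how far apart the two settings are for the same predictions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical batch: 32 samples with 10 outputs each.
pred = torch.randn(32, 10, requires_grad=True)
target = torch.randn(32, 10)

# reduction='sum' is the current spelling of size_average=False.
loss_sum = nn.MSELoss(reduction='sum')(pred, target)
# reduction='mean' (the default) divides by the number of elements.
loss_mean = nn.MSELoss(reduction='mean')(pred, target)

grad_sum, = torch.autograd.grad(loss_sum, pred)
grad_mean, = torch.autograd.grad(loss_mean, pred)

# Loss and gradient both differ by a factor of pred.numel() (320 here).
ratio = (loss_sum / loss_mean).item()
grad_ratio = (grad_sum.norm() / grad_mean.norm()).item()
```

With plain SGD the effective step size changes by that same factor, so a learning rate copied from a mean-reduced setup is not directly comparable to a sum-reduced one.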

baboonga · July 23, 2018, 3:18pm

I also have the same problem.
I implemented an AE and a VAE in both Keras (TensorFlow) and PyTorch.
Using Adadelta gave me different loss values, and PyTorch performed worse on my network.
I spent 2 weeks double-checking my code until I found this post.
Thank you, guys. It is good to know that I am not the only one who experiences this issue.

zjysteven · July 25, 2018, 1:06am

Same problem here!

More specifically, it turns out that PyTorch training with Adam gets stuck at a worse level (in terms of both loss and accuracy) than TensorFlow with exactly the same settings. I came across this issue in two cases:

(1) Standard training of a VGG-16 model on CIFAR-10.
(2) Generating the CW L2 attack. See https://github.com/carlini/nn_robust_attacks/blob/master/l2_attack.py for details. I reproduced this attack method to test my model trained with PyTorch. The loss also got stuck at an undesirable level for some images, and the adversarial counterparts couldn’t be generated.

Interestingly, I solved these issues by manually halving the learning rate at scheduled steps (e.g. lr = 0.5 * lr every 20 epochs). After doing so, PyTorch could reach results comparable to TensorFlow’s (where I did not decay the learning rate), and everything works fine for me.

However, I would have expected Adam to adjust its learning rate automatically, so I still don’t know the true reason for this.
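
The manual halving described above can also be expressed with PyTorch's built-in `StepLR` scheduler. This is only a sketch: the model is a placeholder and the base learning rate of 1e-3 is an assumption, not a value from this post.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # placeholder model for illustration
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed base lr
# Multiply the learning rate by 0.5 every 20 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)

lrs = []
for epoch in range(60):
    lrs.append(optimizer.param_groups[0]['lr'])
    # ... the real training batches for this epoch would run here ...
    optimizer.step()   # stands in for the per-batch parameter updates
    scheduler.step()   # one scheduler step per epoch
```

Recording `lrs` like this makes it easy to confirm that the schedule applied is the one intended before comparing against the other framework.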

bily · November 28, 2018, 1:28pm

In general, a whole learning system consists of:

data loading (including train/val/test split, data augmentation, batching, etc)
prediction model (your neural network)
loss computation
gradient computation
model initialization
optimization
metric (accuracy, precision, etc) computation

In my experience, you should double-check every aspect of your code before concluding it is an optimizer-related issue (most of the time, it’s not…).

Specifically, you can do the following to check the correctness of your code:

[easy check] switch optimizers (SGD, SGD + momentum, etc.) and check if the performance gap persists
[easy check] disable more advanced techniques like BatchNorm, Dropout and check the final performance
use the same dataloader (therefore, both tensorflow and pytorch will get the same inputs for every batch) and check the final performance
use the same inputs, check both the forward and backward outputs

Good Luck.
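
One way to approach the "same dataloader" check from the list above is to fix the data and the sample order once, store them in a NumPy file, and have both training scripts read batches from that file. The sketch below uses made-up data, and the file name `batches.npz` is hypothetical:

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 100 samples with 8 features and integer labels.
features = rng.standard_normal((100, 8)).astype(np.float32)
labels = rng.integers(0, 3, size=100)

# Fix the sample order once and cut it into batches of 16.
order = rng.permutation(len(features))
batch_indices = [order[i:i + 16] for i in range(0, len(order), 16)]

path = os.path.join(tempfile.mkdtemp(), "batches.npz")
np.savez(path, features=features, labels=labels, order=order)

# Either framework can reload the file and iterate in the same order.
loaded = np.load(path)
first_batch = loaded["features"][loaded["order"][:16]]
```

Any random data augmentation would have to be disabled or precomputed as well, otherwise the two frameworks still see different inputs.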

bluesky314 · December 1, 2018, 5:58pm

Can anyone from the PyTorch Dev team address this issue? @ptrblck @smth

ptrblck · December 1, 2018, 6:54pm

@bily’s suggestions seem very reasonable.
If you still have some issues getting approx. the same results, I would like to dig a bit deeper.
Also, it would help if you could provide executable scripts for both implementations.

rasbt · December 2, 2018, 6:04am

Also, since the loss function is non-convex, random weight initialization can make a huge difference. I recommend repeating the experiment with ~5 different random seeds in both frameworks (TensorFlow, PyTorch) and then comparing the top ~1-3 results.
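
A minimal sketch of the multi-seed comparison on the PyTorch side. The toy regression task, the model, and the helper names `make_data` and `train_once` are all made up for illustration; the same loop would be repeated in the other framework.

```python
import torch
import torch.nn as nn

def make_data():
    # A fixed generator keeps the data identical for every run.
    g = torch.Generator().manual_seed(1234)
    x = torch.randn(64, 5, generator=g)
    y = torch.sin(x).sum(dim=1, keepdim=True)
    return x, y

def train_once(seed, steps=100):
    """Train a tiny model from one seed and return the final training loss."""
    x, y = make_data()
    torch.manual_seed(seed)  # only the weight initialization depends on the seed
    model = nn.Sequential(nn.Linear(5, 16), nn.Tanh(), nn.Linear(16, 1))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    return loss.item()

seeds = [0, 1, 2, 3, 4]
final_losses = {seed: train_once(seed) for seed in seeds}
# Compare the best few runs rather than a single one.
top3 = sorted(final_losses.values())[:3]
```

The spread of `final_losses` across seeds gives a rough sense of how large a gap between frameworks has to be before it means anything.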

Shay.G · February 12, 2019, 1:15pm

Hi,

I’m having the same problem, and spent a long time double-checking everything @bily suggested.
Here are two projects. One is the original TensorFlow code of a paper called “Fast-Slow Recurrent Neural Networks”, which had state-of-the-art results on a language modeling task.
The second is my PyTorch implementation.
I got poor results using the Adam optimizer. I also tried different optimizers on both implementations, but still got poor results. It seems that no matter what optimizer I choose, the PyTorch loss gets stuck at some level while the TF loss keeps getting smaller.

Here are the links to both projects in my GitHub account:
Pytorch implementation: https://github.com/shaygeller/Fast-Slow-LSTM.git
TF implementation: https://github.com/shaygeller/Fast-Slow-LSTM-TF.git

I removed the fancy optimizations in both implementations (like zoneout and layer normalization) but still got poor results in PyTorch compared to TF.

The architecture is not complicated at all; it’s only 3 LSTM cells. Just look at the forward method to understand it.

I’d appreciate your response on this.

tom · February 12, 2019, 3:29pm

What I usually do at that point:

Do you get the same outputs for the same inputs (I usually save a batch from TF in numpy format when I do this)?
If so, do you get the same gradients?
(again, I usually save the TF gradients in numpy to compare)
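
A sketch of what such a comparison could look like. The NumPy arrays stand in for a batch, parameters, outputs, and gradients saved from the TF run (here they are computed by hand for a single linear layer with a mean-of-squares loss, purely for illustration):

```python
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)

# Stand-ins for arrays that would be saved from the TF run with np.save.
x_np = rng.standard_normal((4, 3)).astype(np.float32)
w_np = rng.standard_normal((2, 3)).astype(np.float32)   # (out, in) layout
b_np = rng.standard_normal(2).astype(np.float32)
ref_out = x_np @ w_np.T + b_np
# For loss = mean(out ** 2): d loss / d W = (2 / N) * out^T @ x
ref_grad_w = (2.0 / ref_out.size) * ref_out.T @ x_np

layer = nn.Linear(3, 2)
with torch.no_grad():
    layer.weight.copy_(torch.from_numpy(w_np))
    layer.bias.copy_(torch.from_numpy(b_np))

out = layer(torch.from_numpy(x_np))
out.pow(2).mean().backward()

# Largest absolute deviation from the saved reference values.
out_diff = np.abs(out.detach().numpy() - ref_out).max()
grad_diff = np.abs(layer.weight.grad.numpy() - ref_grad_w).max()
```

In float32, differences around 1e-6 are ordinary rounding noise; anything much larger points at a real discrepancy in the model or the loss.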

Best regards

Thomas

Shay.G · February 13, 2019, 10:22am

Hi Tom,

No, I didn’t do a one-to-one comparison. For that, I would have to export the initial weights from the TF classifier to the PyTorch one and then run the network.
I would also have to make sure that the input is the same and in the same order, of course.
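
For a plain dense layer, exporting the weights mostly comes down to a transpose, since a Keras `Dense` kernel has shape (in, out) while `nn.Linear` stores its weight as (out, in). A minimal sketch, where `tf_kernel` and `tf_bias` are stand-ins for arrays exported from the TF side:

```python
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)

# Stand-ins for what a Keras Dense layer's get_weights() would return:
# kernel of shape (in_features, out_features), bias of shape (out_features,).
tf_kernel = rng.standard_normal((3, 2)).astype(np.float32)
tf_bias = rng.standard_normal(2).astype(np.float32)

layer = nn.Linear(3, 2)
with torch.no_grad():
    # nn.Linear stores its weight as (out_features, in_features), so transpose.
    layer.weight.copy_(torch.from_numpy(np.ascontiguousarray(tf_kernel.T)))
    layer.bias.copy_(torch.from_numpy(tf_bias))

x = rng.standard_normal((5, 3)).astype(np.float32)
tf_style_out = x @ tf_kernel + tf_bias
torch_out = layer(torch.from_numpy(x)).detach().numpy()
max_diff = np.abs(torch_out - tf_style_out).max()
```

Recurrent cells likely need more care than this (gate ordering and bias layout may differ between the two implementations), so the layouts on both sides are worth verifying before trusting a copied LSTM.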

What I did do is check that each batch contains the same samples in both implementations. It does.
But the batches don’t come in the same order, which shouldn’t be a problem.

I also checked that the gradients are pretty much on the same scale during the run. That is, after each batch I printed the gradients and looked at the numbers. In the first batches the gradients are big, and then they get smaller over the epochs. The scale is the same in both implementations.
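
Instead of eyeballing printed gradients, a single number per step (the global gradient norm) can be logged in both frameworks and compared directly. A sketch on the PyTorch side, with a placeholder model and random data; `global_grad_norm` is a hypothetical helper name:

```python
import torch
import torch.nn as nn

def global_grad_norm(model):
    """L2 norm over all parameter gradients, returned as a float."""
    squared = [p.grad.pow(2).sum() for p in model.parameters() if p.grad is not None]
    return torch.sqrt(torch.stack(squared).sum()).item()

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(6, 4), nn.ReLU(), nn.Linear(4, 1))  # placeholder
x, y = torch.randn(16, 6), torch.randn(16, 1)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()
norm = global_grad_norm(model)
```

Being on the same scale is a fairly weak check; with identical weights and inputs the norms should agree to several digits, not just in order of magnitude.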

I’ll consider your advice about trying to replicate the results of the TF network with my Pytorch one.
Thanks.

rasbt · March 3, 2019, 6:50am

Another thing to consider is that, I think, TF and PyTorch use different default weight initialization schemes, which may also have an effect (and will also affect the learning rate, etc.).
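
To make the difference concrete for a dense layer: as far as the current sources show, `nn.Linear` draws its weights from U(-1/sqrt(fan_in), 1/sqrt(fan_in)), while the Keras `Dense` default is Glorot uniform with limit sqrt(6 / (fan_in + fan_out)). The layer sizes below are arbitrary and only illustrate the two ranges:

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
fan_in, fan_out = 256, 64  # arbitrary sizes for illustration

# PyTorch default for nn.Linear: U(-1/sqrt(fan_in), 1/sqrt(fan_in)).
default_layer = nn.Linear(fan_in, fan_out)
pytorch_bound = 1.0 / math.sqrt(fan_in)

# Glorot/Xavier uniform: U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out)).
glorot_layer = nn.Linear(fan_in, fan_out)
nn.init.xavier_uniform_(glorot_layer.weight)
glorot_bound = math.sqrt(6.0 / (fan_in + fan_out))

# Largest weight magnitude actually drawn under each scheme.
default_max = default_layer.weight.abs().max().item()
glorot_max = glorot_layer.weight.abs().max().item()
```

For these sizes the Glorot range is roughly twice as wide as the PyTorch default, and the biases differ as well (zeros in Keras, a small uniform range in PyTorch).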

duygusar · March 5, 2019, 2:51pm

I am having the same problem: PyTorch’s Adam gets stuck around a certain validation loss and won’t improve no matter what. I am surprised this issue has not been handled although it dates back to Jul 2017. I mean, what could be more important than optimizer/convergence?

vitchyr · May 17, 2019, 1:44am

I know this is a long shot, but did you ever get around to testing this? I’m having similar issues reproducing TensorFlow results with PyTorch, and I currently suspect that it’s due to the Adam optimizer.

baboonga · May 21, 2019, 1:51pm

I think weight initialization is a big issue.
I tested the same implementation in both PyTorch and Keras and found that I got suboptimal results.

I once accidentally implemented the Keras version in a different way (I don’t remember exactly how I implemented it at that moment) and got suboptimal results in Keras as well.

So I concluded that it was the weight initialization that caused the suboptimal results.
This may apply to PyTorch as well.

tengerye · October 24, 2019, 3:31am

@tom Hi, have you found out the real problem? I have a similar issue as well and I don’t believe in accidents.

tom · October 24, 2019, 8:41am

Well, so I converted a few models (e.g. StyleGAN, in joint work with @ptrblck) with the above method, and it works very well for me.

It should be said that “I ported my program to X and now it’s not doing the same as before” isn’t a single thing, just like - you brought up accidents - “there was a car crash” doesn’t have a unique cause that you can find and solve.

Some people are very good at getting these things to work similarly across frameworks (see e.g. HuggingFace’s awesome transformer work), but it is a skill you have to build.

Best regards

Thomas

ChengchengWei · December 24, 2019, 1:14pm

The same.

I set optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.2) for PyTorch, and luckily the performance is now comparable with TensorFlow.

And I do not know the reason.

alviur · February 4, 2020, 5:55pm

Same issue here with a custom layer made in TF and ported to Pytorch

hrdeepak · April 19, 2020, 10:01am

In my experiment, however, I followed these two steps and ended up with similar results:

Used nn.init.xavier_uniform_ for the weights and nn.init.constant_ for the biases.
In the Adam optimizer, PyTorch uses a default of eps=1e-8 vs TensorFlow’s epsilon=1e-7. Changed it to 1e-7.
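
A minimal sketch of these two changes together. The model is a placeholder, the bias constant is assumed to be 0 (the post does not say which value was used), and `init_like_keras` is a hypothetical helper name:

```python
import torch
import torch.nn as nn

def init_like_keras(module):
    """Glorot-uniform weights and zero biases for every nn.Linear layer."""
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.constant_(module.bias, 0.0)  # assumed constant: zero

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))  # placeholder
model.apply(init_like_keras)

# PyTorch's Adam defaults to eps=1e-8; pass 1e-7 to match the Keras default.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-7)
```

`model.apply` visits every submodule, so the same helper covers all linear layers; other layer types (convolutions, recurrent cells) would need their own branches.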

Hope this helps