There are many factors that can cause differences. Some people have reported things to try here.
Same problem here. Cannot replicate TF Adam optimizer success in PyTorch.
Edit: Disregard. I’m actually getting better loss in PyTorch than in TF with Adam now that I’m taking the mean of my losses.
The size_average=False setting found in jcjohnson’s GitHub examples can make for a long night for a newbie.
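For anyone who runs into the same thing: newer PyTorch versions replace the size_average flag with a reduction argument, and the difference between summing and averaging is easy to see in a quick sketch (the tensor shapes here are arbitrary):

```python
import torch
import torch.nn as nn

pred = torch.randn(32, 10, requires_grad=True)
target = torch.randn(32, 10)

# 'mean' divides by the number of elements, 'sum' does not, so with 'sum' the
# loss (and the gradients, and hence the effective learning rate) is
# batch_size * dim times larger.
loss_mean = nn.MSELoss(reduction='mean')(pred, target)
loss_sum = nn.MSELoss(reduction='sum')(pred, target)
print(loss_sum / loss_mean)  # ~320 here (32 * 10 elements)
```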
I also have the same problem.
I implemented an AE and a VAE in both Keras (TensorFlow) and PyTorch.
Using Adadelta gave me different loss values, and PyTorch performed the worst on my network.
I spent two weeks double-checking my code until I found this post.
Thank you, guys; it’s good to know I’m not the only one experiencing this issue.
Same problem here!
More specifically, it turns out that PyTorch training with Adam gets stuck at a worse level (in terms of both loss and accuracy) than TensorFlow with exactly the same settings. I came across this issue in two cases:
(1) standard training of a VGG-16 model on the CIFAR-10 dataset.
(2) generating the CW L2 attack. See https://github.com/carlini/nn_robust_attacks/blob/master/l2_attack.py for details. I reproduced this attack method to test my model trained with PyTorch. The loss also got stuck at an undesirable level for some images, and the adversarial counterparts couldn’t be generated.
Interestingly, I solved these issues by manually halving the learning rate at scheduled steps (e.g. lr = 0.5 * lr every 20 epochs). After doing so, PyTorch reached results comparable to TensorFlow (without decaying its learning rate), and everything works fine for me.
However, I would think Adam should adjust its learning rate automatically, so I still don’t know the true reason for this.
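For reference, the scheduled halving described above can be expressed with torch.optim.lr_scheduler.StepLR; here is a minimal sketch with a placeholder model and optimizer (your actual training loop goes inside the epoch loop):

```python
import torch

# Placeholder model and optimizer; the point is the scheduled halving.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Multiply the learning rate by 0.5 every 20 epochs (i.e. lr = 0.5 * lr).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)

for epoch in range(100):
    # ... run the usual forward/backward passes over the training set here ...
    optimizer.step()   # normally called once per batch; shown once for brevity
    scheduler.step()   # decay the learning rate once per epoch
```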
In general, a whole learning system consists of:
- data loading (including train/val/test split, data augmentation, batching, etc)
- prediction model (your neural network)
- loss computation
- gradient computation
- model initialization
- metric (accuracy, precision, etc) computation
In my experience, double-check every aspect of your code before concluding it is an optimizer-related issue (most of the time, it’s not…).
Specifically, you can do the following to check the correctness of your code:
- [easy check] switch optimizers (SGD, SGD + momentum, etc.) and check if the performance gap persists
- [easy check] disable more advanced techniques like BatchNorm and Dropout and check the final performance
- use the same data loader (so that both TensorFlow and PyTorch get the same inputs for every batch) and check the final performance
- use the same inputs and check both the forward and backward outputs (a sketch of this kind of check follows this list)
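One concrete way to do that last check is to copy the weights of a single layer from one framework into the other, feed both the same fixed input, and compare the forward outputs and input gradients. A minimal sketch (assuming TF 2.x and PyTorch are both installed; the layer sizes are arbitrary):

```python
import numpy as np
import torch
import tensorflow as tf

np.random.seed(0)
x_np = np.random.randn(4, 8).astype(np.float32)  # fixed batch of inputs

# --- PyTorch side ---
torch_layer = torch.nn.Linear(8, 3)
x_t = torch.tensor(x_np, requires_grad=True)
y_t = torch_layer(x_t).sum()
y_t.backward()

# --- TensorFlow side, with the same weights ---
tf_layer = tf.keras.layers.Dense(3)
tf_layer.build((None, 8))
tf_layer.set_weights([
    torch_layer.weight.detach().numpy().T,  # Keras stores kernels as (in, out)
    torch_layer.bias.detach().numpy(),
])
x_tf = tf.constant(x_np)
with tf.GradientTape() as tape:
    tape.watch(x_tf)
    y_tf = tf.reduce_sum(tf_layer(x_tf))
grad_tf = tape.gradient(y_tf, x_tf)

# Forward and backward outputs should agree within float32 tolerance.
print(np.allclose(y_t.detach().numpy(), y_tf.numpy(), atol=1e-5))
print(np.allclose(x_t.grad.numpy(), grad_tf.numpy(), atol=1e-5))
```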
@bily’s suggestions seem very reasonable.
If you still have some issues getting approx. the same results, I would like to dig a bit deeper.
Also, it would help if you could provide executable scripts for both implementations.
Also, since the loss function is non-convex, random weight initialization can make a huge difference. I recommend repeating the experiment with ~5 different random seeds in both frameworks (TensorFlow and PyTorch) and then comparing the top ~1-3 results.
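On the PyTorch side, fixing the seed for each of those repeated runs might look like the sketch below (the seed values are arbitrary; TensorFlow has an analogous tf.random.set_seed):

```python
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    # Seed Python, NumPy, and PyTorch (CPU and all GPUs) so each run starts
    # from a reproducible weight initialization.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

for seed in (0, 1, 2, 3, 4):  # ~5 seeds, as suggested above
    set_seed(seed)
    # ... build the model, train, and record the final metric for this seed ...
```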