Note that the efficiency of algorithm 1 can, at the expense of clarity, be improved upon by changing the order of computation, e.g. by replacing the last three lines in the loop with the following lines…
So I’m surprised that it should make a noticeable difference, but maybe that is the case. @christianperone would you mind trying the altered version of Adam on your problem? Fingers crossed this might be the solution.
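For anyone who wants to try this, here is a minimal sketch of the two orderings on a single scalar parameter. It assumes the reordering quoted above is the one from the Adam paper where the bias correction is folded into the step size. The function names are made up for illustration, and this is plain Python, not either framework's actual optimizer code.

```python
import math


def adam_step_textbook(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with explicit bias-corrected moments (m_hat, v_hat)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


def adam_step_reordered(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with the bias correction folded into the step size lr_t."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    lr_t = lr * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    # eps is added to the uncorrected sqrt(v), so it acts like a different epsilon.
    theta = theta - lr_t * m / (math.sqrt(v) + eps)
    return theta, m, v


# First step (t=1) from zero moments with a small gradient.
textbook = adam_step_textbook(1.0, 1e-6, 0.0, 0.0, t=1)
reordered = adam_step_reordered(1.0, 1e-6, 0.0, 0.0, t=1)
print(textbook[0], reordered[0])
```

With `eps=0` the two updates agree up to rounding. With a nonzero `eps` and small gradients they can differ noticeably in the first steps, because `eps` is compared against a differently scaled denominator.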
The bit about sparse updates with TensorFlow Adam, I would assume, doesn’t matter in this case. I don’t know how PyTorch deals with sparse modules wrt gradient updates, but what TF claims to do sounds like the correct approach.
It was this part that made me think it could lead to a noticeable difference:
TensorFlow Adam – “The sparse implementation of this algorithm (used when the gradient is an
IndexedSlices object, typically because of `tf.gather` or an embedding
lookup in the forward pass) does apply momentum to variable slices even if
they were not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (beta1) is also applied to the entire momentum
accumulator. This means that the sparse behavior is equivalent to the dense
behavior (in contrast to some momentum implementations which ignore momentum
unless a variable slice was actually used).”
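To make the distinction in that quote concrete, here is a small sketch of the two momentum behaviours on a toy accumulator with three rows. Both function names and the dict-based sparse gradient are assumptions for illustration only; neither is an API from TensorFlow or PyTorch.

```python
def dense_momentum_update(momentum, sparse_grad, beta1=0.9):
    """Decay every row; rows missing from sparse_grad get a zero gradient.

    This mirrors the behaviour described in the quote above.
    """
    return [
        beta1 * m + (1 - beta1) * sparse_grad.get(i, 0.0)
        for i, m in enumerate(momentum)
    ]


def lazy_momentum_update(momentum, sparse_grad, beta1=0.9):
    """Only touch rows that actually appear in sparse_grad."""
    updated = list(momentum)
    for i, g in sparse_grad.items():
        updated[i] = beta1 * updated[i] + (1 - beta1) * g
    return updated


momentum = [1.0, 1.0, 1.0]
sparse_grad = {0: 0.5}  # only row 0 was looked up in the forward pass

dense = dense_momentum_update(momentum, sparse_grad)
lazy = lazy_momentum_update(momentum, sparse_grad)
print(dense)  # rows 1 and 2 have decayed
print(lazy)   # rows 1 and 2 are unchanged
```

Row 0 ends up the same either way. The difference is only in the unused rows, which keep their old momentum under the lazy variant.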
I see a lot of people training embedding models in PyTorch and comparing them to TensorFlow, so I bet a lot of these performance differences stem from that: TensorFlow automatically applies the momentum decay, and by default PyTorch would not.
Anyway, I have always been able to get performance just as good as or better than TensorFlow’s. I use custom code most of the time, but the underlying framework has shown no insufficiency in performance for me; I usually find quite the opposite.
I’ll test it. If someone has the code change on hand, that would help a lot; otherwise I’ll have to come back to this in the near future because of my time constraints. Thanks for the help!
I also experienced suboptimal behaviour with Adam compared to SGD in PyTorch. Similar code in Tensorflow performed the other way around, i.e. optimizing with Adam was much easier. I have also used an Embedding layer.
Same problem here. Cannot replicate TF Adam optimizer success in PyTorch.
Edit: Disregard. I’m actually getting better loss in PyTorch than in TF with Adam now that I’m actually taking the mean of my losses.
`size_average=False`, found in jcjohnson’s GitHub examples, can make for a long night for a newbie.
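A quick illustration of why this bites: with `size_average=False` the per-sample losses are summed instead of averaged, so the reported loss (and the gradient) scales with the batch size. The sketch below uses a hand-written squared-error loss in plain Python; the function name and the `reduction` argument are assumptions for illustration, not PyTorch code.

```python
def squared_error_loss(preds, targets, reduction="mean"):
    """Squared error over a batch, either summed or averaged."""
    errors = [(p - t) ** 2 for p, t in zip(preds, targets)]
    total = sum(errors)
    if reduction == "sum":
        return total
    if reduction == "mean":
        return total / len(errors)
    raise ValueError(f"unknown reduction: {reduction!r}")


preds = [1.0, 2.0, 3.0, 4.0]
targets = [0.0, 0.0, 0.0, 0.0]

summed = squared_error_loss(preds, targets, reduction="sum")
averaged = squared_error_loss(preds, targets, reduction="mean")
print(summed, averaged)  # the summed loss is batch_size times the mean loss
```

So before comparing loss curves between two frameworks, it is worth checking that both use the same reduction.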
I also have the same problem.
I implemented an AE and a VAE in both Keras (TensorFlow) and PyTorch.
Using Adadelta gave me different loss values, and PyTorch performed worse on my network.
I spent 2 weeks double-checking my code until I found this post.
Thank you, guys; I’m glad I am not the only one who experiences this issue.
More specifically, it turns out that PyTorch training with Adam gets stuck at a worse level (in terms of both loss and accuracy) than TensorFlow with exactly the same settings. I came across this issue in two cases:
(1) standard training of a VGG-16 model with CIFAR-10 as dataset.
(2) generating the CW L2 attack. See https://github.com/carlini/nn_robust_attacks/blob/master/l2_attack.py for details. I reproduced this attack method to test my model trained with PyTorch. The loss also got stuck at an undesirable level for some images, and the adversarial counterparts couldn’t be generated.
Interestingly, I solved these issues by manually halving the learning rate at scheduled steps (e.g. lr = 0.5 * lr, every 20 epochs). After doing so, PyTorch could reach results comparable to TensorFlow’s (where I did not decay the learning rate), and everything works fine for me.
However, I think that Adam should actually adjust its learning rate automatically. So I still don’t know the true reason for this.
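The halving schedule described above can be sketched as a small function. The helper name and the base learning rate of 1e-3 are assumptions for illustration; in PyTorch itself, `torch.optim.lr_scheduler.StepLR` with `step_size=20` and `gamma=0.5` should give the same schedule.

```python
def step_decay_lr(base_lr, epoch, step_size=20, gamma=0.5):
    """Learning rate after multiplying by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


base_lr = 1e-3  # assumed starting value, not taken from the post
schedule = [step_decay_lr(base_lr, epoch) for epoch in (0, 19, 20, 40)]
print(schedule)
```

Adam rescales each parameter's step by its gradient statistics, but the global `lr` still sets the overall step scale, so an explicit decay like this can still change where training settles.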
@bily’s suggestions seem very reasonable.
If you still have some issues getting approx. the same results, I would like to dig a bit deeper.
Also, it would help if you could provide executable scripts for both implementations.
Also, since the loss function is non-convex, random weight initialization can make a huge difference. I recommend repeating the experiment with ~5 different random seeds in both frameworks (TensorFlow, PyTorch) and then comparing the top ~1-3 results.
I’m having the same problem, and spent a long time double-checking everything @bily suggested.
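A rough sketch of that seed comparison is below. `train_once` is a hypothetical stand-in that returns a fake final loss; in practice it would seed the framework, run the full training, and return the real validation loss.

```python
import random


def train_once(seed):
    """Hypothetical placeholder for one full training run with a given seed."""
    rng = random.Random(seed)
    return 1.0 + rng.random()  # fake "final loss", for illustration only


def top_k_losses(run_fn, seeds, k=3):
    """Run once per seed and return the k lowest final losses."""
    return sorted(run_fn(seed) for seed in seeds)[:k]


seeds = [0, 1, 2, 3, 4]
best = top_k_losses(train_once, seeds, k=3)
print(best)
```

Running the same helper against each framework's training function and comparing the two `best` lists gives a fairer picture than a single run per framework.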
Here are two projects. One is the original TensorFlow code of a paper called “Fast-Slow Recurrent Neural Networks”, which had state-of-the-art results on a language modeling task.
The second is my PyTorch implementation.
I got poor results using the Adam optimizer. I also tried different optimizers on both implementations, but still got poor results. It seems that no matter what optimizer I choose, the PyTorch loss gets stuck at some level while the TF loss keeps getting smaller.
No, I didn’t do a one-to-one comparison. For that, I would have to export the initial weights from the TF classifier to the PyTorch one and then run the network.
And also make sure that the input is the same and in the same order of course.
What I did do is I checked that each batch contains the same samples in both implementations. It does.
But the batches don’t come in the same order, which shouldn’t be a problem.
I also checked that the gradients are pretty much on the same scale during the run. That is, after each batch I printed the gradients and looked at the numbers. In the first batches the gradients are big, and then they get smaller over the epochs. Same scale in both implementations.
I’ll consider your advice about trying to replicate the results of the TF network with my PyTorch one.
Thanks.
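Instead of eyeballing printed gradients, one option is to reduce each batch's gradients to a single global norm and compare those numbers between the two runs. This is only a sketch in plain Python: gradients are assumed to be given as one flat list of floats per parameter tensor, and both helper names are made up.

```python
import math


def global_grad_norm(grads):
    """L2 norm over all gradient values, given one flat list per parameter tensor."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))


def relative_gap(a, b):
    """Relative difference between two norms, safe when both are near zero."""
    return abs(a - b) / max(abs(a), abs(b), 1e-12)


tf_grads = [[3.0], [4.0]]     # made-up numbers standing in for one batch
torch_grads = [[2.4], [3.2]]  # made-up numbers standing in for the same batch

tf_norm = global_grad_norm(tf_grads)
torch_norm = global_grad_norm(torch_grads)
print(tf_norm, torch_norm, relative_gap(tf_norm, torch_norm))
```

Logging this per batch makes it easier to spot when the two implementations start to drift apart, which is hard to see from raw printouts.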
Another thing to consider is that I think TF and PyTorch use different default weight initialization schemes, which may also have an effect (and will also affect the learning rate, etc.).
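To give a feel for how large that gap can be, here is a sketch comparing the uniform sampling bounds of two schemes. It assumes, as far as I know, that Keras `Dense` defaults to Glorot uniform and that `torch.nn.Linear` uses Kaiming uniform with `a=sqrt(5)`; both are worth checking against the versions actually in use. The helper names are made up.

```python
import math


def glorot_uniform_limit(fan_in, fan_out):
    """Bound of U(-limit, limit) for Glorot/Xavier uniform initialization."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def kaiming_uniform_limit(fan_in, a=math.sqrt(5)):
    """Bound of U(-limit, limit) for Kaiming uniform with leaky-ReLU slope a."""
    gain = math.sqrt(2.0 / (1.0 + a * a))
    return gain * math.sqrt(3.0 / fan_in)


fan_in, fan_out = 512, 512
glorot = glorot_uniform_limit(fan_in, fan_out)
kaiming = kaiming_uniform_limit(fan_in)
print(glorot, kaiming)  # with a=sqrt(5), the Kaiming bound reduces to 1/sqrt(fan_in)
```

For a square 512x512 layer the Glorot bound is noticeably wider under these assumptions, so the initial weights differ in scale even before the optimizer takes a step.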