Suboptimal convergence when compared with TensorFlow model


(Kai Arulkumaran) #21

There are many factors that can cause differences. Some people have reported things to try here.


(Jeff) #22

Same problem here. Cannot replicate TF Adam optimizer success in Pytorch.

Edit: Disregard. I’m now getting a better loss in Pytorch than in TF with Adam, now that I’m taking the mean of my losses.
The size_average=False found in jcjohnson’s GitHub examples can make for a long night for a newbie.
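
For anyone who trips over the same thing: size_average=False sums the per-element losses instead of averaging them (reduction='sum' in current PyTorch), which inflates both the reported loss and its gradients by the number of elements per batch. A rough sketch with made-up shapes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pred = torch.randn(32, 10, requires_grad=True)   # made-up batch of predictions
target = torch.randn(32, 10)

mean_loss = nn.MSELoss(reduction='mean')(pred, target)  # averaged per element (usual TF setup)
sum_loss = nn.MSELoss(reduction='sum')(pred, target)    # what size_average=False did

# The summed loss (and its gradients) is larger by roughly batch_size * n_features,
# so the reported loss is not comparable to a run that averages, and the
# effective step size of the optimizer changes as well.
print(mean_loss.item(), sum_loss.item())
```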


(baboonga) #23

I also have the same problem.
I implemented an AE and a VAE in both Keras (TensorFlow) and PyTorch.
Using Adadelta gave me different loss values, and PyTorch performed the worst on my network.
I spent two weeks double-checking my code until I found this post.
It’s good to know I’m not the only one experiencing this issue.
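
In case it helps anyone else: one possible culprit with Adadelta is that the default hyperparameters (lr, rho, epsilon) are not guaranteed to match between Keras and PyTorch. A sketch of pinning them explicitly on the PyTorch side (the model and the numbers below are placeholders; copy whatever your Keras/TensorFlow run actually uses):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(784, 32)   # placeholder model

# Placeholder values -- look up the defaults used by your Keras/TensorFlow
# version and mirror them here, rather than relying on each framework's own defaults.
optimizer = optim.Adadelta(model.parameters(), lr=1.0, rho=0.95, eps=1e-7)
```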


#24

Same problem here!

More specifically, it turns out that PyTorch training with Adam gets stuck at a worse level (in terms of both loss and accuracy) than TensorFlow with exactly the same settings. I ran into this issue in two cases:

(1) standard training of a VGG-16 model on CIFAR-10.
(2) generating the CW L2 attack. See https://github.com/carlini/nn_robust_attacks/blob/master/l2_attack.py for details. I reproduced this attack to test my model trained with PyTorch. The loss also got stuck at an undesirable level for some images, and the adversarial counterparts couldn’t be generated.

Interestingly, I solved these issues by manually decaying the learning rate to half its value at scheduled steps (e.g. lr = 0.5 * lr every 20 epochs). After doing so, PyTorch reached results comparable to TensorFlow (without decaying its learning rate), and everything works fine for me.
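
For reference, the manual halving can be expressed with a built-in scheduler; a minimal sketch (the model and training loop are just placeholders):

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(128, 10)                       # placeholder model
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# Halve the learning rate every 20 epochs (lr = 0.5 * lr).
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)

for epoch in range(100):
    # ... real training loop goes here; placeholder single step:
    optimizer.zero_grad()
    loss = model(torch.randn(4, 128)).sum()
    loss.backward()
    optimizer.step()
    scheduler.step()                             # decay once per epoch
```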

However, I thought Adam was supposed to adjust its learning rate automatically, so I still don’t know the true reason for this.


#25

In general, a whole learning system consists of:

  1. data loading (including train/val/test split, data augmentation, batching, etc)
  2. prediction model (your neural network)
  3. loss computation
  4. gradient computation
  5. model initialization
  6. optimization
  7. metric (accuracy, precision, etc) computation

In my experience, double-check every aspect of your code before concluding it is an optimizer-related issue (most of the time, it’s not…).

Specifically, you can do the following to check the correctness of your code:

  • [easy check] switch optimizers (SGD, SGD + momentum, etc.) and check if the performance gap persists
  • [easy check] disable more advanced techniques like BatchNorm, Dropout and check the final performance
  • use the same dataloader (therefore, both tensorflow and pytorch will get the same inputs for every batch) and check the final performance
  • use the same inputs and check both the forward and backward outputs (see the sketch below)
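
As a concrete version of the last check, a rough sketch of dumping the forward/backward results for one fixed batch in PyTorch (names and shapes are placeholders; the TensorFlow side would dump the same quantities after the weights have been copied over):

```python
import numpy as np
import torch
import torch.nn as nn

# Fixed batch shared by both implementations, e.g. saved to disk with np.save.
x = np.random.RandomState(0).randn(8, 32).astype(np.float32)
y = np.random.RandomState(1).randint(0, 10, size=(8,))

model = nn.Linear(32, 10)            # placeholder model
criterion = nn.CrossEntropyLoss()

inputs = torch.from_numpy(x)
targets = torch.from_numpy(y).long()

out = model(inputs)
loss = criterion(out, targets)
loss.backward()

print('forward:', out.detach().numpy()[:2])
print('loss:', loss.item())
print('grad norm:', model.weight.grad.norm().item())
# Compare these numbers against the same dump from the TensorFlow model.
```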

Good Luck.


(Rahul Deora) #26

Can anyone from the PyTorch Dev team address this issue? @ptrblck @smth


#27

@bily’s suggestions seem very reasonable.
If you still have some issues getting approx. the same results, I would like to dig a bit deeper.
Also, it would help if you could provide executable scripts for both implementations.


(Sebastian Raschka) #28

Also, since the loss function is non-convex, random weight initialization can make a huge difference. I recommend repeating the experiment with ~5 different random seeds in both frameworks (TensorFlow and PyTorch) and then comparing the top ~1-3 results.
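
Roughly along these lines (the model, data, and training loop here are just stand-ins for your own experiment):

```python
import numpy as np
import torch
import torch.nn as nn

def run_experiment(seed):
    # Seed everything that affects weight initialization and data order.
    torch.manual_seed(seed)
    np.random.seed(seed)

    # Placeholder model/data -- substitute your real training loop here.
    model = nn.Linear(20, 1)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x, y = torch.randn(256, 20), torch.randn(256, 1)
    for _ in range(100):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

# Repeat with ~5 seeds per framework and compare the best few runs,
# not a single run per framework.
results = sorted(run_experiment(seed) for seed in range(5))
print(results[:3])
```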