I am training the same LSTM network architecture in Caffe and PyTorch, but they give very different results.
The Caffe model's accuracy is about 98%, while the PyTorch version reaches only about 50%. Why?
The optimizers might be subtly different; one's learning rate or momentum scaling may differ slightly from the other's…
Hi, I've met a similar problem: the PyTorch result is worse than Caffe's. Have you solved it? Thank you!
I finally got a result close to the Caffe version after I clarified some differences between Caffe and PyTorch:
- The SGD implementation (the momentum update differs between the two frameworks).
- Dropout is not applied if there is only one RNN layer.
- The data preprocessing in my PyTorch version was slightly different from the Caffe version.
After I fixed these problems, I got comparable results.
Hope this helps!
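On the dropout point: in PyTorch, the `dropout` argument of `nn.LSTM` is only applied between stacked layers, so with `num_layers=1` it has no effect (PyTorch emits a warning). A minimal sketch of the workaround, assuming you want dropout on the outputs of a single-layer LSTM (the layer sizes here are arbitrary placeholders):

```python
import torch
import torch.nn as nn

# With num_layers=1, passing dropout= to nn.LSTM would be a no-op,
# so apply an explicit nn.Dropout to the outputs instead.
lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=1)
post_dropout = nn.Dropout(p=0.5)

x = torch.randn(5, 3, 8)   # (seq_len, batch, input_size)
out, _ = lstm(x)           # out: (seq_len, batch, hidden_size)
out = post_dropout(out)    # dropout applied manually after the layer
```

This mirrors what a Caffe model with an explicit Dropout layer after the LSTM would do.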
Thanks for your reply! Could you please describe the first factor in more detail? Which difference matters?
You could refer to here: torch.optim — PyTorch master documentation
In the note under SGD:
"The implementation of SGD with Momentum/Nesterov subtly differs from Sutskever et al. and implementations in some other frameworks."
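To make the note concrete, here is a sketch of the two momentum conventions as usually described (these update rules are my paraphrase of the documented difference, not code from either framework): PyTorch accumulates raw gradients in the velocity and applies the learning rate at the end, while Caffe folds the learning rate into the velocity itself.

```python
def pytorch_style_step(p, g, v, lr, mu):
    # PyTorch convention: velocity accumulates gradients,
    # lr scales the whole velocity at update time.
    v = mu * v + g
    p = p - lr * v
    return p, v

def caffe_style_step(p, g, v, lr, mu):
    # Caffe/Sutskever-style convention: lr is baked into the velocity.
    v = mu * v + lr * g
    p = p - v
    return p, v
```

With a constant learning rate the two are equivalent (the Caffe velocity is just `lr` times the PyTorch one), but the stored velocities differ by a factor of `lr`, so the moment you change the learning rate mid-training the two schemes diverge.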
Ok, I’ll check it. Thank you very much!
I noticed this difference too. My training works fine in Caffe, but in PyTorch, if I change the learning rate at the same stages/iterations where Caffe changes it (step-wise), I suddenly get NaN loss values.
Looking at the difference, I thought I could change `adjust_learning_rate` to also rescale the momentum, like:

```python
def adjust_learning_rate(optimizer, lr, momentum):
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
        param_group['momentum'] = momentum / lr
```

But still, as soon as `lr` is changed, the loss becomes NaN.
@Nick_Young, how did you solve the SGD discrepancy with Caffe?
As mentioned above.